Voice interfaces become more reliable for entertainment and Internet of Things (IoT) devices

April 05, 2016 // By EDN Europe
Huw Geddes, Director of Marketing, XMOS
Systems such as Amazon’s Echo and Microsoft’s Xbox have shown the convenience of talking to devices around the home to get information and even buy products. These devices connect to increasingly cost-effective cloud processing that enables more powerful natural language interface (NLI) algorithms, which can recognise different voices and more complex instructions. However, both systems have also shown some of the problems of voice interfaces. The Echo was recently reported to have been triggered by a radio programme that used its keyword, Alexa, and ended up changing the setting on a thermostat. And Xboxes have been known to switch on after Microsoft’s TV ads triggered them.

One of the ways around this problem is to use an array of MEMS microphones to create a steerable ‘beam’ that can listen in a chosen direction. The array can listen out for instructions, and then narrow in on the location of the voice by adjusting the phase (the relative delay) of each microphone’s signal so that the channels line up for sound arriving from that direction. Summing the aligned channels reinforces the voice and reduces noise arriving from elsewhere.
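The phase adjustment described above can be illustrated with a delay-and-sum beamformer, the simplest member of this family. This is only a minimal sketch under stated assumptions, not the method used in any particular product: it assumes a linear array, a far-field source, whole-sample delays, and a speed of sound of 343 m/s. The function names `steering_delays` and `delay_and_sum` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)


def steering_delays(mic_positions_m, angle_deg, sample_rate_hz):
    """Whole-sample delay per microphone that aligns a far-field source.

    Assumes a linear array; angle_deg is measured from broadside
    (perpendicular to the array), positive towards increasing position.
    """
    sin_a = math.sin(math.radians(angle_deg))
    # Relative arrival times: microphones nearer the source hear it first.
    arrivals = [-x * sin_a / SPEED_OF_SOUND for x in mic_positions_m]
    latest = max(arrivals)
    # Delay every channel so that it lines up with the latest arrival.
    return [round((latest - t) * sample_rate_hz) for t in arrivals]


def delay_and_sum(channels, delays):
    """Average the channels after shifting each one by its delay in samples."""
    length = len(channels[0])
    out = [0.0] * length
    for samples, delay in zip(channels, delays):
        for n in range(delay, length):
            out[n] += samples[n - delay]
    return [v / len(channels) for v in out]
```

Sound from the steered direction adds up coherently across the channels, while sound from other directions is spread over different sample positions and is attenuated by the averaging. A practical design would typically use fractional delays and per-frequency weighting rather than whole-sample shifts.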

This has a number of benefits that give users more confidence in the correct operation of the system, including:

  • Increased accuracy of the voice input as other sounds are less likely to interfere, allowing different voices to be separated not only by their sound but also by their location in the room.
  • Less inadvertent triggering in a voice user interface (VUI). Over time, the controller can identify the position of audio sources such as the TV, a radio or a hi-fi system and either discount or double-check any inputs from these positions in the room.
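The second benefit can be sketched as a small bookkeeping structure that remembers the directions from which false triggers have arrived. This is a hypothetical illustration only: the class name `DirectionFilter`, the 15-degree bins and the three-hit threshold are all assumptions, and how a system decides that a trigger was false is left outside the sketch.

```python
from collections import Counter


class DirectionFilter:
    """Remembers directions that have produced false triggers (hypothetical)."""

    def __init__(self, bin_width_deg=15, min_hits=3):
        self.bin_width_deg = bin_width_deg  # angular resolution (assumed)
        self.min_hits = min_hits            # hits before a direction is suspect
        self._false_triggers = Counter()

    def _bin(self, angle_deg):
        # Wrap the angle into 0..360 and map it to a direction bin.
        return int((angle_deg % 360) // self.bin_width_deg)

    def record_false_trigger(self, angle_deg):
        self._false_triggers[self._bin(angle_deg)] += 1

    def needs_double_check(self, angle_deg):
        # True when this direction has repeatedly produced false triggers.
        return self._false_triggers[self._bin(angle_deg)] >= self.min_hits
```

A controller could call `record_false_trigger` whenever a detection is later judged to be spurious, and call `needs_double_check` before acting on a new detection from the same part of the room.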

Voice capture demands significant local processing power at the point of capture to accommodate speech enhancement processing such as beamforming and noise reduction. But it also requires another key element: determinism. The timing of each channel from every microphone in the array is a vital part of the processing, particularly for arrays of more than two or four microphones.
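To see why channel timing matters, consider how far the apparent direction of a sound moves when one channel is skewed relative to its neighbour. The sketch below is a back-of-envelope estimate near broadside, assuming two microphones, a far-field source and a speed of sound of 343 m/s; the function name `angle_error_deg` is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)


def angle_error_deg(skew_s, spacing_m):
    """Apparent shift in arrival angle caused by a timing skew.

    skew_s is the unintended delay between two microphones that are
    spacing_m apart; the result is the angle (from broadside) that a
    genuine source would need to produce the same delay.
    """
    ratio = SPEED_OF_SOUND * skew_s / spacing_m
    if abs(ratio) > 1.0:
        raise ValueError("skew exceeds the largest physical delay for this spacing")
    return math.degrees(math.asin(ratio))
```

For example, with microphones 40 mm apart, a skew of a single sample at 48 kHz (`angle_error_deg(1 / 48000, 0.04)`) corresponds to an error of roughly 10 degrees, which suggests why unpredictable jitter between channels can undermine a beamformer.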

Technologies such as XMOS xCORE multicore microcontrollers (www.xmos.com) have a proven ability to manage multichannel audio streams with true deterministic timing. With integrated signal processing and the ability to backhaul the captured voice over standard interfaces (including USB and I2S), voice capture and natural language interfaces are now realistic options for smart TVs and soundbars, which would otherwise face significant challenges from the audio they themselves produce and from other sources of sound.

And then there are whole new classes of interactive IoT devices, such as Rokid (www.rokid.com), that are using the technology to develop new types of artificial intelligence for the home.

More technical information on voice capture, including a white paper on the subject, is available at http://www.xmos.com/products/voice.

About the author

Huw Geddes has an extensive background in the delivery of technology to designers, developers and engineers. Prior to joining XMOS as an Information/Documentation Manager, Huw worked as Technology Transfer Manager at the 3D graphics company Superscape Ltd, and Technical Author at VideoLogic Ltd. Huw also has a strong background and interest in fine art and exhibition management.

http://www.xmos.com

YouTube: https://www.youtube.com/user/xmosmmcu

LinkedIn: https://www.linkedin.com/company/97385?trk=vsrp_companies_res_name&trkInfo=VSRPsearchId%3A70233611459851269108%2CVSRPtargetId%3A97385%2CVSRPcmpt%3Aprimary

Twitter: https://twitter.com/xmos