Voice interfaces to unlock the potential of speech recognition
Voice interfaces are often called conversational interfaces for good reason - they allow users to talk to an embedded device instead of using touch screens and buttons to execute a command. Conversations take place between two or more people who may be next to each other, but are often a couple of metres or more apart.
If we're going to have conversations with our new digital friends - Siri, Google, Alexa, Moneypenny, Cortana, Viv and the rest - we must be able to talk to them at a distance as well as up close.
By Huw Geddes, Director of Marketing, XMOS
Far-field microphones that can capture individual voices several metres away already exist, but we don't hear much about them. All the chat is about AI engines like Viv, home hubs and chatbots, but the key to unlocking the potential of these applications is the voice interface and its integrated far-field microphone.
Voice interfaces and microphone arrays
At the root of many far-field microphone solutions is a beamformer, which is created by combining the input from two or more microphones into a lobe that focuses on the voice source while excluding surrounding sounds. The beam can be steered to follow a voice source as it moves around a room or to capture a different voice, without moving the voice interface itself. Sounds great but as usual the devil is in the detail.
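As a rough illustration, the sketch below shows a delay-and-sum beamformer - the simplest form of the technique - for a hypothetical four-microphone linear array. The sample rate, microphone spacing and integer-sample delays are illustrative assumptions; a production design would use fractional delays and adaptive weighting.

/* Minimal delay-and-sum beamformer sketch for a uniform linear array.
 * Assumed (not from the article): 16 kHz sample rate, 33 mm spacing,
 * integer-sample delays, 1024-sample frames. */
#include <math.h>

#define NUM_MICS       4
#define FRAME          1024
#define SAMPLE_RATE    16000.0
#define MIC_SPACING_M  0.033
#define SPEED_OF_SOUND 343.0     /* m/s */

/* Sum the microphone signals with per-channel delays so that a wavefront
 * arriving from 'angle_rad' (0 = broadside) adds coherently, forming the
 * steerable lobe described above. */
void delay_and_sum(const float mics[NUM_MICS][FRAME], float out[FRAME],
                   double angle_rad)
{
    int delays[NUM_MICS];
    for (int m = 0; m < NUM_MICS; m++) {
        /* Arrival-time difference for mic m relative to mic 0. */
        double tau = m * MIC_SPACING_M * sin(angle_rad) / SPEED_OF_SOUND;
        delays[m] = (int)lround(tau * SAMPLE_RATE);
    }
    for (int n = 0; n < FRAME; n++) {
        float acc = 0.0f;
        for (int m = 0; m < NUM_MICS; m++) {
            int idx = n - delays[m];
            if (idx >= 0 && idx < FRAME)
                acc += mics[m][idx];
        }
        out[n] = acc / NUM_MICS;  /* unity gain toward the steered lobe */
    }
}

Steering the beam to follow a moving talker is then just a matter of recomputing the delays for a new angle - no hardware moves.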
Beamformers vary significantly depending on the target product. Many products use a linear or circular array, but sometimes the array is more complicated. Generally, an array with more microphones delivers a narrower, more precise lobe with better gain, but it also means more processing and more power.
Microphone arrays involve complicated trigonometry and direction of arrival calculations, running to millions of cycles per second.
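To make the direction-of-arrival idea concrete, here is a hedged sketch for a single microphone pair: the lag of the peak cross-correlation gives the time difference of arrival, which trigonometry converts to an angle. The frame size, spacing and sample rate are assumed values; practical systems typically use GCC-PHAT in the frequency domain across several pairs.

/* Direction-of-arrival sketch for one microphone pair (TDOA via the lag
 * of the peak cross-correlation). Illustrative assumptions: 512-sample
 * frames, 50 mm spacing, 16 kHz sample rate. */
#include <math.h>

#define FRAME          512
#define MIC_SPACING_M  0.05
#define SPEED_OF_SOUND 343.0
#define SAMPLE_RATE    16000.0
#define MAX_LAG        3   /* spacing/c * fs is about 2.3 samples */

double estimate_doa(const float a[FRAME], const float b[FRAME])
{
    int best_lag = 0;
    float best = -1e30f;
    for (int lag = -MAX_LAG; lag <= MAX_LAG; lag++) {
        float sum = 0.0f;
        for (int n = 0; n < FRAME; n++) {
            int m = n + lag;
            if (m >= 0 && m < FRAME)
                sum += a[n] * b[m];     /* time-domain cross-correlation */
        }
        if (sum > best) { best = sum; best_lag = lag; }
    }
    /* Convert the time delay to an angle: sin(theta) = tau * c / d. */
    double tau = best_lag / SAMPLE_RATE;
    double s = tau * SPEED_OF_SOUND / MIC_SPACING_M;
    if (s > 1.0) s = 1.0;
    if (s < -1.0) s = -1.0;
    return asin(s);                      /* radians, 0 = broadside */
}

Run that over every frame, for several pairs, alongside the beamformer and filtering stages, and the "millions of cycles per second" adds up quickly.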
The captured PDM signal is a 1-bit digital stream that must be converted to a standard PCM signal and decimated down to a useful sample rate, using multiple filter stages and a lot more compute.
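As a minimal sketch of the first decimation stage, the code below counts the 1s in a window of the 1-bit PDM stream - in effect a first-order CIC/boxcar filter that decimates by 64 (for example, 3.072 MHz PDM down to 48 kHz PCM). The rates and bit packing are assumptions; real front-ends cascade several CIC and FIR stages and compensate for passband droop.

/* First-stage PDM-to-PCM sketch: boxcar (first-order CIC) decimation by 64.
 * 'pdm' holds 64 one-bit samples packed LSB-first; each call consumes one
 * word and emits one PCM sample. */
#include <stdint.h>

#define DECIMATION 64

int16_t pdm_to_pcm(uint64_t pdm)
{
    int sum = 0;
    for (int i = 0; i < DECIMATION; i++)
        sum += (int)((pdm >> i) & 1);    /* count the 1s in the window */
    /* Map the count 0..64 to a signed 16-bit range centred on zero. */
    return (int16_t)((sum - DECIMATION / 2) * (32767 / (DECIMATION / 2)));
}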
Microphone arrays must be able to differentiate between voice sources and music sources in order to capture a clear voice input.
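A trained voice/music classifier is well beyond a short example, but a very crude first-pass discriminator can be sketched from short-term energy and zero-crossing rate, on the assumption that voiced speech sits in a moderate ZCR band while broadband music and noise push it higher. The thresholds below are purely illustrative.

/* Crude speech-likelihood sketch using frame energy and zero-crossing
 * rate. Thresholds are illustrative assumptions; real products use
 * trained classifiers over spectral features. */
#include <stdbool.h>

#define FRAME 256

bool frame_is_voice(const float x[FRAME])
{
    float energy = 0.0f;
    int crossings = 0;
    for (int n = 0; n < FRAME; n++) {
        energy += x[n] * x[n];
        if (n > 0 && ((x[n] >= 0.0f) != (x[n - 1] >= 0.0f)))
            crossings++;
    }
    energy /= FRAME;
    float zcr = (float)crossings / FRAME;
    /* Voiced speech: appreciable energy, moderate zero-crossing rate. */
    return energy > 1e-4f && zcr > 0.02f && zcr < 0.25f;
}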
The architecture on which the microphone array is built must have low latency, especially where the product features bi-directional communication. Buffering introduces lag - every 256-sample buffer at a 16 kHz sample rate, for example, adds 16 ms of delay - which affects the overall performance.
And before you reach for an array of any old microphones, think again; microphones are built and calibrated to many different levels of performance, and beamforming assumes closely matched sensitivity across the array, so if you choose the wrong microphone at the outset you'll spend all your time trying to fix a fundamentally broken solution.
Anything else?
Captured voice streams often suffer from echo and reverberation caused by signals bouncing off the hard surfaces in the surrounding environment, effects that change from one environment to another. Additional echo cancellation and de-reverberation DSP must be applied before the audio can be passed to the speech engine, and some gain control is usually necessary to boost the signal as well.
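The standard building block for the echo-cancellation stage is an adaptive filter that learns the path from loudspeaker to microphone and subtracts the predicted echo. Below is a hedged NLMS (normalised least mean squares) sketch; the filter length and step size are assumptions, and a real canceller adds double-talk detection and non-linear processing on top.

/* NLMS adaptive echo canceller sketch: predicts the loudspeaker signal as
 * heard at the microphone and subtracts it. Tap count and step size are
 * illustrative assumptions. */
#define TAPS 256
#define MU   0.1f        /* NLMS step size */
#define EPS  1e-6f       /* avoids division by zero */

static float w[TAPS];    /* adaptive estimate of the echo path */
static float xbuf[TAPS]; /* history of the loudspeaker reference */

/* x: current loudspeaker sample, d: current microphone sample.
 * Returns the echo-cancelled sample. */
float aec_process(float x, float d)
{
    /* Shift the reference history and insert the new sample. */
    for (int i = TAPS - 1; i > 0; i--)
        xbuf[i] = xbuf[i - 1];
    xbuf[0] = x;

    /* Predict the echo and compute the residual. */
    float y = 0.0f, norm = EPS;
    for (int i = 0; i < TAPS; i++) {
        y += w[i] * xbuf[i];
        norm += xbuf[i] * xbuf[i];
    }
    float e = d - y;

    /* NLMS update: step toward a better echo-path estimate. */
    float g = MU * e / norm;
    for (int i = 0; i < TAPS; i++)
        w[i] += g * xbuf[i];

    return e;
}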
Voice interfaces need to send the audio stream to a cloud service over a secure, robust WiFi or Bluetooth connection, or to an application processor over a standard interface like USB or TDM. In response, the device must be able to play back a reply, or any clarifying questions the digital assistant needs to ask.
Above all, voice controllers should only be active when they are explicitly addressed, and should clearly indicate their current state. The user must be able to use a keyword to activate the device and get it to lock onto their voice when they want to talk, but must also be able to shut the device down when necessary.
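That behaviour amounts to a small state machine: idle until the keyword detector fires, active while the user is talking, and always interruptible by an explicit mute. The sketch below captures the idea; the frame timing, timeout and detector inputs are placeholder assumptions.

/* Activation state machine sketch. The keyword detector, voice-activity
 * flag and timeout are assumed inputs supplied by earlier stages. */
#include <stdbool.h>

typedef enum { VUI_IDLE, VUI_ACTIVE } vui_state_t;

typedef struct {
    vui_state_t state;
    int silent_frames;        /* frames since the user stopped talking */
} vui_t;

#define SILENCE_TIMEOUT 100   /* ~1 s of silence at 10 ms frames */

void vui_step(vui_t *v, bool keyword_detected, bool voice_present,
              bool user_mute)
{
    if (user_mute) {          /* an explicit shutdown always wins */
        v->state = VUI_IDLE;
        return;
    }
    switch (v->state) {
    case VUI_IDLE:
        if (keyword_detected) {   /* lock onto the speaker's voice */
            v->state = VUI_ACTIVE;
            v->silent_frames = 0;
        }
        break;
    case VUI_ACTIVE:
        v->silent_frames = voice_present ? 0 : v->silent_frames + 1;
        if (v->silent_frames > SILENCE_TIMEOUT)
            v->state = VUI_IDLE;  /* release the beam, stop streaming */
        break;
    }
}

Whatever the implementation, the state should be surfaced to the user - typically with an LED ring or similar indicator - so it is always clear when the device is listening.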
Putting it all together
So that's the shopping list for a voice controller: lots of compute, DSP, very low latency, power efficiency, audio playback, WiFi/Bluetooth/USB connectivity and flexible GPIO support. Each of these features can be implemented using discrete devices, but remember that each device adds to the complexity of the final design - more timing issues, more PCB real estate, more cost.
When all these features are integrated into a single voice interface, we’ll start to see new product categories of voice-enabled products that unlock the real potential of speech recognition.
Author biography
Huw Geddes has an extensive background in the delivery of technology to designers, developers and engineers. Prior to joining XMOS as an Information/Documentation Manager, Huw worked as Technology Transfer Manager at the 3D graphics company Superscape Ltd, and as Technical Author at VideoLogic Ltd. Huw also has a strong background and interest in fine art and exhibition management.