Technology to Make Smart Speakers Even Smarter

27th February, 2019

Even at the noisiest of social gatherings, people are able to follow a single conversation while filtering out irrelevant chatter. That’s because the human brain has a remarkable ability to focus auditory attention. Using our original "Maisart" artificial intelligence (AI) technology, we developed a tool that can separate speech in a similar fashion, which presents exciting possibilities for voice-activated devices.

One Microphone Can Distinguish Multiple Talkers

Voice-activated devices are steadily making daily life more convenient. Being able to dictate and send text messages, control music, or even surf the web while driving or doing something else saves time and effort. Current speech recognition technologies have a major drawback, however: they filter voices poorly. For example, if someone in a car wants to ask a smartphone for directions, everyone else in the car must stay quiet, a tall order on long trips with kids in the back seat. That’s because the typical microphone in a voice-activated device picks up everything as a single blend of sound, including irrelevant noises and background voices.

To address this problem, our Mitsubishi Electric engineers created the world’s first technology that separates, then reconstructs, the speech of up to three people talking into a single microphone at the same time with a high degree of accuracy. In a demonstration with two speakers talking at the same time in quiet conditions, speech recognition accuracy was more than 90%, compared with 51% for conventional technologies. Beyond its accuracy, our technology has a clear competitive edge because it needs only one microphone, whereas voice recognition devices on the market today require several.

Deep Learning Mimics the "Cocktail Party Effect"

When developing the technology, our researchers took a cue from the human auditory system, which possesses an extraordinary ability to focus on a single conversation in a noisy throng and filter out other stimuli—a phenomenon known as the "cocktail party effect."

The speech-separation technology uses a method exclusive to Mitsubishi Electric called "Deep Clustering." This method first clusters voices together using deep learning, a common AI technique. Deep Clustering then separates mixed voices by identifying their unique qualities, including different gender and spoken language combinations. Finally, each person’s speech is reconstructed by resynthesizing the previously separated speech components.
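
The clustering and reconstruction steps described above can be pictured with a small toy sketch. It is not Mitsubishi Electric's implementation. It assumes, as in published descriptions of deep clustering, that a trained network assigns an embedding vector to every time-frequency bin of the mixed recording, so that bins dominated by the same talker land close together. Here the network is replaced by hand-made embeddings, the "talkers" are random spectrograms, and every name (`kmeans`, `src_a`, `embeddings`, and so on) is hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest existing center.
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
T, F, D = 6, 8, 4  # frames, frequency bins, embedding size (toy values)

# Two toy "talkers" as magnitude spectrograms, and their single-microphone mix.
src_a = rng.random((T, F))
src_b = rng.random((T, F))
mixture = src_a + src_b
dominant = (src_b > src_a).astype(int)  # which talker is louder in each bin

# ASSUMPTION: stand-in for the trained network. Each bin gets a noisy unit
# vector pointing toward the direction of its dominant talker.
anchors = np.eye(D)[:2]
embeddings = anchors[dominant.ravel()] + 0.05 * rng.standard_normal((T * F, D))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Step 1: cluster the embeddings, one cluster per talker.
labels = kmeans(embeddings, k=2).reshape(T, F)

# Step 2: turn each cluster into a binary mask over the spectrogram.
masks = [(labels == j).astype(float) for j in range(2)]

# Step 3: apply each mask to the mixture to estimate each talker's spectrogram.
estimates = [m * mixture for m in masks]

# Fraction of bins assigned to the same group as the ground truth
# (cluster numbering is arbitrary, so take the better of the two labelings).
agreement = max((labels == dominant).mean(), (labels != dominant).mean())
print(f"bins grouped consistently with the dominant talker: {agreement:.0%}")
```

In a real system the embeddings would come from a network trained on many mixed recordings, and the masked spectrograms would then be converted back into audio, a step this sketch leaves out.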

We're Making Life More Convenient in a Voice-activated World

While our speech-separation technology has yet to be commercialized, we believe its potential to make life more convenient is enormous. It could make everyday voice-activated devices more precise: no more asking the kids to keep it down when telling a smart speaker to stream a movie. It could also make transcribing business meetings a piece of cake, especially the heated ones where several people talk at once. And it could help people with hearing difficulties stay on top of fast-moving conversations.

That’s something to look forward to—voice recognition technologies that make sense of our busy, and sometimes boisterous, lives.
