Researchers at the University of California have used implanted electrodes to read brain signals, then combined a virtual vocal tract with machine learning algorithms to translate those signals into speech. Using the system, the researchers were able to generate entire spoken sentences from an individual's brain activity.
Speech scientists and neurologists at the University of California San Francisco trained a powerful computer system to recognize brain signals associated with speech. The system was developed using epilepsy patients whose skulls had been opened to map the source of their seizures. Electrodes were temporarily implanted as part of that mapping process, and those that had been implanted in areas of the brain associated with language production were used to identify brain signals associated with speech.
The researchers tested the system on five individuals, who each read hundreds of sentences. Participants were asked to read the sentences but received no specific instructions about mouth movements.
Although it was possible to generate speech directly from brain activity, the researchers improved the system's accuracy by creating a virtual vocal tract for each patient, controlled by that patient's brain activity. Audio recordings of the patients speaking were reverse-engineered to determine the vocal tract movements required to produce the sounds.
The system included two machine learning neural networks: a decoder that transformed brain activity into movements of the virtual vocal tract, and a speech synthesizer that turned those movements into natural-sounding speech. The system was far more accurate when the virtual vocal tract was incorporated as an intermediate step.
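The two-stage design can be sketched in code. The following is a minimal, hypothetical illustration in plain Python/NumPy, not the authors' actual architecture: the paper used recurrent neural networks trained on real data, whereas the layer sizes, feature counts, and names here are invented, and the random weights merely stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes):
    """Random-weight feedforward network; a stand-in for a trained model."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for w, b in layers:
        x = np.tanh(x @ w + b)
    return x

# Illustrative feature sizes (assumptions, not values from the paper).
N_ELECTRODES, N_ARTICULATORS, N_ACOUSTIC = 256, 33, 32

# Stage 1: the decoder maps neural recordings to articulatory
# kinematics (trajectories of the lips, tongue, jaw, and larynx).
decoder = make_mlp([N_ELECTRODES, 128, N_ARTICULATORS])

# Stage 2: the synthesizer maps those kinematics to acoustic features,
# which a vocoder would then render as audible speech.
synthesizer = make_mlp([N_ARTICULATORS, 128, N_ACOUSTIC])

def decode_speech(ecog_frames):
    """(time, electrodes) -> (time, acoustic features)."""
    kinematics = forward(decoder, ecog_frames)    # brain -> vocal tract
    acoustics = forward(synthesizer, kinematics)  # vocal tract -> sound
    return acoustics

frames = rng.standard_normal((100, N_ELECTRODES))  # 100 time steps
print(decode_speech(frames).shape)  # (100, 32)
```

The point of the intermediate stage is that articulatory movements are a much lower-dimensional, smoother target than raw audio, which is why the researchers found the system far more accurate with the virtual vocal tract in the loop.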
The system is not perfect. In tests, individuals listening to the machine-generated speech identified the spoken words 69% of the time when provided with lists of 25 alternative words, and transcribed 43% of sentences with complete accuracy. When the list of alternative words was increased to 50, only 47% of words were correctly identified and 21% of sentences were transcribed accurately.
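Closed-vocabulary intelligibility of this kind is straightforward to score: a word counts as correct only if it matches the target, and a sentence only if every word does. The helper below is a simplified, hypothetical sketch of that scoring (the actual study used pools of crowd-sourced listeners; the function name and example sentences are invented):

```python
def score_transcriptions(targets, transcripts):
    """Return (word accuracy, exact-sentence accuracy) for paired lists
    of target sentences and listener transcriptions."""
    word_hits = word_total = sentence_hits = 0
    for target, transcript in zip(targets, transcripts):
        t_words, h_words = target.split(), transcript.split()
        word_total += len(t_words)
        # Count position-by-position word matches against the target.
        word_hits += sum(a == b for a, b in zip(t_words, h_words))
        # A sentence scores only if transcribed perfectly.
        sentence_hits += (t_words == h_words)
    return word_hits / word_total, sentence_hits / len(targets)

# Example: one sentence fully correct, one with a single word wrong.
targets = ["the cat sat", "dogs bark loudly"]
heard   = ["the cat sat", "dogs bark softly"]
word_acc, sent_acc = score_transcriptions(targets, heard)
print(round(word_acc, 2), sent_acc)  # 0.83 0.5
```

Under this metric, sentence accuracy is always the stricter number, which matches the reported figures: word accuracy of 69% alongside sentence accuracy of 43%.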
The technique represents a major step forward. Current systems that generate speech by translating facial or eye movements are slow and laborious, producing only around 10 words per minute. This system is far faster, although the technology still needs considerable refinement.
“We’re quite good at synthesizing slower speech sounds like ‘sh’ and ‘z’ as well as maintaining the rhythms and intonations of speech and the speaker’s gender and identity, but some of the more abrupt sounds like ‘b’s and ‘p’s get a bit fuzzy,” said Josh Chartier, co-author of the paper. “Still, the levels of accuracy we produced here would be an amazing improvement in real-time communication compared to what’s currently available.”
The proof-of-concept device demonstrates that the technology now exists to build a device that could let individuals with speech loss regain their ability to speak. Such a device could help people with brain injuries, motor neurone disease, neurodegenerative diseases such as multiple sclerosis, or throat cancer, among many others. However, the research is still at an early stage, and it is likely to be some time before a clinically viable device is created. The researchers are currently working on higher-density electrode arrays and more powerful machine learning algorithms to improve the system's accuracy.
Tests also need to be performed on individuals who can no longer speak, to determine whether the system can be trained without recordings of the user's own voice.
The research is detailed in the paper "Speech synthesis from neural decoding of spoken sentences," recently published in the journal Nature. DOI: 10.1038/s41586-019-1119-1