
Meta claims its AI improves speech recognition quality by reading lips



People perceive speech both by listening to it and by watching the lip movements of speakers. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly — or entirely — on audio. And they require a substantial amount of training data, typically tens of thousands of hours of recordings.

To investigate whether visuals — specifically footage of mouth movement — can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best audiovisual speech recognition systems using the same amount of transcriptions. Moreover, the company says, AV-HuBERT outperforms the former best audiovisual speech recognition system using one-tenth of the labeled data — making it potentially useful for languages with little audio data.

“In the future, AI frameworks like AV-HuBERT could be used to improve the performance of speech recognition technology in noisy everyday conditions — for example, interactions at a party or in a bustling street market,” Meta AI research scientist Abdelrahman Mohamed told VentureBeat in an interview. “And assistants in smartphones, augmented reality glasses, and smart speakers equipped with a camera — e.g., Alexa Echo Show — could benefit from this technology, too.”

AV-HuBERT

Meta isn’t the first to apply AI to the problem of lip-reading. In 2016, researchers at the University of Oxford created a system that was nearly twice as accurate as experienced lip readers in certain tests and could process video in close-to-real-time. And in 2017, Alphabet-owned DeepMind trained a system on thousands of hours of TV shows to correctly translate about 50% of words without errors on a test set, far better than a human expert’s 12.4%.

But the University of Oxford and DeepMind models, as with many subsequent lip-reading models, were limited in the range of vocabulary that they could recognize. The models also required datasets paired with transcripts in order to train, and they couldn’t process the audio of any speakers in the videos.

Somewhat uniquely, AV-HuBERT leverages unsupervised, or self-supervised, learning. With supervised learning, algorithms like DeepMind’s are trained on labeled example data until they can detect the underlying relationships between the examples and particular outputs. For instance, a system might be trained to write the word “dog” (the output) when shown a picture of a Corgi (the example). However, AV-HuBERT teaches itself to classify unlabeled data — processing the data to learn from its inherent structure.
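To make the distinction concrete, here is a minimal sketch of a HuBERT-style masked-prediction objective in PyTorch: part of the input is hidden, and the model is trained to predict pseudo-labels derived from the data itself (e.g., cluster IDs) for the hidden frames, rather than human-written transcripts. The shapes, the single linear predictor, and the 30% masking rate are simplifying assumptions for illustration, not Meta’s actual training code.

```python
import torch
import torch.nn as nn

# Schematic masked-prediction objective (HuBERT-style), simplified for illustration.
# Pseudo-labels stand in for cluster IDs derived from the data itself, not human transcripts.
frames = torch.randn(8, 100, 256)                # batch of per-frame feature sequences
pseudo_labels = torch.randint(0, 100, (8, 100))  # e.g., k-means cluster ID per frame

mask = torch.rand(8, 100) < 0.3                  # hide roughly 30% of frames
masked = frames.clone()
masked[mask] = 0.0                               # replace masked frames with a placeholder

predictor = nn.Linear(256, 100)                  # stands in for a full encoder
logits = predictor(masked)                       # predict a cluster ID for every frame

# The loss is computed only on the masked positions: the model must infer the
# hidden content from the surrounding (audio and visual) context.
loss = nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])
loss.backward()
```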


AV-HuBERT is also multimodal in the sense that it learns to perceive language through a series of audio and lip-movement cues. By combining cues like the movement of the lips and teeth during speaking, along with auditory information, Meta says that AV-HuBERT can capture “nuanced associations” between the two data types.
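To make the idea of fusing the two streams concrete, here is a minimal, hypothetical PyTorch-style sketch of early fusion: per-frame audio features and lip-region video features are each projected into a shared space, concatenated, and passed through a transformer encoder that attends over both modalities at once. The dimensions, module names, and early-fusion design are illustrative assumptions on our part, not Meta’s released AV-HuBERT code.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Illustrative early-fusion encoder: NOT Meta's AV-HuBERT implementation."""
    def __init__(self, audio_dim=80, video_dim=512, hidden_dim=256, num_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)   # e.g., log-mel filterbank frames
        self.video_proj = nn.Linear(video_dim, hidden_dim)   # e.g., lip-region CNN embeddings
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, audio_frames, video_frames):
        # audio_frames: (batch, time, audio_dim); video_frames: (batch, time, video_dim)
        fused = torch.cat([self.audio_proj(audio_frames),
                           self.video_proj(video_frames)], dim=-1)
        return self.encoder(fused)  # contextual features over both modalities

model = AudioVisualFusion()
features = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
print(features.shape)  # torch.Size([2, 100, 512])
```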

The initial AV-HuBERT model was trained on 30 hours of labeled English-language TED Talk videos, substantially less than the 31,000 hours on which the previous state-of-the-art model was trained. But despite training on less data, AV-HuBERT’s word error rate (WER), a measure of speech recognition performance, was slightly better at 32.5% versus the old model’s 33.6% in cases where a speaker could be seen but not heard. (WER is calculated by dividing the number of incorrectly recognized words by the total number of words; a WER of 32.5% translates to roughly one error in every three words.) Training on 433 hours of TED Talks further reduced AV-HuBERT’s WER to 28.6%.
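For readers who want to reproduce the metric, below is a small, self-contained Python sketch of a word-level WER computation. In practice, WER is typically computed as the word-level edit distance (substitutions, insertions, and deletions) divided by the length of the reference transcript; the function name and example sentences are ours, not Meta’s.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six: WER of about 0.167.
# A WER of 0.325 would mean roughly one word in three is wrong.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```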

Once AV-HuBERT had learned the structure of the data and the correlations between the two modalities, the researchers were able to further train it on unlabeled data: 2,442 hours of English-language videos of celebrities uploaded to YouTube. Not only did this bring the WER down to 26.9%, but Meta says that it demonstrates that only a small amount of labeled data is needed to adapt the framework to a particular application (e.g., when multiple people are speaking simultaneously) or a different language.

Indeed, Meta claims that AV-HuBERT is about 50% better than audio-only models at recognizing a person’s speech while loud music or noise is playing in the background. And when the speech and background noise are equally loud, AV-HuBERT manages a 3.2% WER versus the previous best multimodal model’s 25.5%.

Potential shortcomings

In many ways, AV-HuBERT is emblematic of Meta’s growing investment in unsupervised, multimodal technology for complex tasks. The company recently detailed a new multimodal system designed to tackle harmful content on its platforms, called Few-Shot Learner, and released models that can learn to recognize speech, segment images, copy the style of text, and recognize objects from unlabeled data. As opposed to supervised systems, unsupervised systems can be significantly more flexible and cheaper to deploy; the labels in labeled datasets come from human annotators who have to painstakingly add each one.

Because it requires less labeled data for training, Meta says that AV-HuBERT could open up possibilities for developing conversational models for “low-resource” languages, such as Susu in the Niger-Congo family. AV-HuBERT could also be useful in creating speech recognition systems for people with speech impairments, the company suggests, as well as in detecting deepfakes and generating realistic lip movements for virtual reality avatars.

But Os Keyes, an AI ethicist at the University of Washington, expressed concerns that AV-HuBERT has limitations around class and disability baked in. “If you’re trying to assess people’s speech patterns from ‘the movement of lips and teeth,’ how does that work for people with distorted facial speech patterns as a result of disability?” they told VentureBeat via email. “It seems kind of ironic to manage to build software for speech recognition that depends on lip reading, and is likely to have inaccuracies when pointed at … deaf people.”

In a Microsoft and Carnegie Mellon paper proposing a research roadmap toward fairness in AI, the coauthors point out that aspects of facial analysis systems akin to AV-HuBERT may not work well for people with Down syndrome, achondroplasia (which impairs bone growth), and “other conditions that result in characteristic facial differences.” Such systems might also fail for people who’ve had a stroke, the researchers note, or who have Parkinson’s disease, Bell’s Palsy, autism, or Williams syndrome — who may not use (or be able to use) the same facial expressions as neurotypical people.

In an email, Mohamed emphasized that AV-HuBERT only focuses on the lip region to capture lip movements — not the whole face. Similar to most AI models, the performance of AV-HuBERT will be “proportional to the number of representative samples of different populations in the training data,” he added.

“For evaluating our approach, we used the publicly available LRS3 dataset, which consists of TED Talk videos that were made publicly available in 2018 by the University of Oxford researchers. Since this dataset doesn’t represent speakers with disabilities, we do not have a specific percentage for the expected performance degradation,” Mohamed said. “[But this] newly proposed technology is not limited by the current speaker distribution in the training dataset. We anticipate that different training datasets with coverage of broader and diverse populations would bring considerable performance gains.”

Meta says that it will “continue to benchmark and develop approaches that improve audio-visual speech recognition models in everyday scenarios where background noise and speaker overlap are commonplace.” Beyond this, it plans to extend AV-HuBERT — which Meta doesn’t plan to put into production — to multilingual benchmarks beyond English.


Xbox Game Pass adds Death’s Door, Danganronpa in second half of January


Xbox today announced the second batch of games coming to Game Pass in January. It’s not as exciting as the news that Microsoft is buying Activision Blizzard, but hopefully it’s an indicator of more Game Pass additions to come. January’s games include Death’s Door, Danganronpa: Trigger Happy Havoc, and Pupperazzi. Subscribers will also get The Hitman Trilogy and Rainbow Six Extraction this month.

Danganronpa: Trigger Happy Havoc and Nobody Saves the World launched today for Game Pass subscribers. The former is a murder mystery visual novel game, while the latter is a cartoony RPG. Nobody Saves the World is a day-one launch on Game Pass.

Most of the other games launching on Game Pass this month will launch on January 20. These include Death’s Door, the indie action-adventure title, and dog photography game Pupperazzi. Another January 20 launch is Windjammers 2, a disc-throwing game with hand-drawn graphics. Microsoft previously revealed that Rainbow Six Extraction would be a day-one release on Game Pass. The Hitman Trilogy will also launch on Game Pass. Yes, both of those will launch on January 20.

The final Game Pass launch of the month is Taiko no Tatsujin: The Drum Master. This drum-based rhythm game is also the only game in this batch that’s not available for all of Game Pass’s platforms. It’ll be available for console and cloud but not PC.


Game Pass is having a banner day. Microsoft revealed in its announcement of the Activision Blizzard acquisition that Game Pass had surpassed 25 million subscribers. It also added that it would bring “as many Activision Blizzard games as we can within Xbox Game Pass and PC Game Pass, both new titles and games from Activision Blizzard’s incredible catalog.”

The games leaving Game Pass to make way for this batch are Cyber Shadow, Nowhere Prophet, Prison Architect, and Xeno Crisis. They’ll leave the service on January 31.

