SIT’s Professor Ian McLoughlin and his team have spent the last two years training AI frameworks to cut noisy ambient sounds out of recorded speech, leaving only the sweet sound of a clear voice.
Prof McLoughlin (left) and Dr Ding Zhongqiang from SIT’s Infocomm Technology cluster. (SIT Photo: Keng Photography/Tan Eng Keng)
You are riding the train at peak hour, and a colleague calls you on your mobile phone. He has a question, but with the background noise – the announcements playing, the opening and closing of the doors, and the movement of the train – he cannot really hear your reply. He asks you to speak louder and then louder. By the time the call ends, you both have a minor headache, and everyone on the train knows your business.
Not being able to hear someone well on the phone sounds like a minor inconvenience, but it is a universal experience for city dwellers, one that plays out across many scenarios in daily life.
But Professor Ian McLoughlin, Cluster Director of Infocomm Technology at the Singapore Institute of Technology (SIT), has developed a technology that can literally cut through the noise, with help from artificial intelligence (AI).
Known as deep denoising, the technology relies on deep learning models that can identify and track both background noise and speech during a call. An AI system is then trained to remove the former while enhancing the latter.
“Noise corrupts speech. If you know what the noise is, you could remove it using a signal processing technique from 20 years ago, without AI. But the problem comes when the noise is unpredictably changing, such as wind sounds, or when you don’t know what the next noise sounds like,” explained Prof McLoughlin.
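For noise that is known and steady, a venerable technique of the kind he mentions is spectral subtraction, which needs no AI at all: measure the noise’s average spectrum during a speech-free stretch, then subtract it from every frame of the call. Below is a minimal sketch in Python with NumPy; the function and parameter names are illustrative, not any particular product’s code.

```python
import numpy as np

def spectral_subtraction(noisy, noise_sample, frame_len=512, hop=256):
    """Classic pre-AI denoising: estimate the noise's average magnitude
    spectrum from a speech-free recording, then subtract it from every
    frame of the noisy signal. Works only if the noise is roughly
    stationary and known in advance."""
    window = np.hanning(frame_len)
    # Average magnitude spectrum of the known, speech-free noise recording
    noise_frames = [noise_sample[i:i + frame_len] * window
                    for i in range(0, len(noise_sample) - frame_len, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len, hop):
        frame = noisy[i:i + frame_len] * window
        spec = np.fft.rfft(frame)
        # Subtract the noise estimate; clamp at zero so magnitudes stay valid
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        # Rebuild the frame with the original phase, then overlap-add
        out[i:i + frame_len] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out
```

Its limitation is exactly the one Prof McLoughlin points to: the moment the noise drifts away from the pre-measured estimate, say a gust of wind, the subtraction no longer matches what is actually corrupting the speech.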
This is where AI comes in. “AI is very good at identifying and modelling things, even ones it has never ‘heard’ before. If you train an AI to recognise 100 different bell sounds, the AI can model the sound of almost every possible bell,” he added. “We train our denoising AI with thousands of types of speech and noise.”
Filter Out the Noise
Prof McLoughlin and his team have spent the last two years training multiple AI frameworks to identify different types of noise pollution and remove them from speech recordings.

This is how it works: AI models designed by the team are trained by iteratively presenting over 50,000 recordings of speech corrupted by different kinds of noise. Alongside each noisy recording, the team presents the same speech free of noise, the gold standard, which is used to adjust the model’s output to become a little more like the clean speech. This process, called backpropagation, is repeated millions of times, and the output improves steadily until the AI can consistently turn noisy speech into clean speech.
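SIT’s actual model is proprietary, but the training loop described above, pairing each noisy recording with its clean gold standard and backpropagating the difference, is standard practice in deep learning. Here is a minimal sketch in PyTorch, with a deliberately tiny stand-in model and random placeholder data rather than real recordings.

```python
import torch
import torch.nn as nn

# A deliberately tiny stand-in for a denoising model: it maps a noisy
# magnitude spectrogram to a 0..1 mask that suppresses the noisy bins.
class TinyDenoiser(nn.Module):
    def __init__(self, n_bins=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 128), nn.ReLU(),
            nn.Linear(128, n_bins), nn.Sigmoid(),  # per-bin suppression mask
        )

    def forward(self, noisy_spec):
        return noisy_spec * self.net(noisy_spec)  # masked = estimated clean

model = TinyDenoiser()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Each batch pairs noisy speech with the same speech free of noise
# (the "gold standard"). Real training iterates over tens of thousands
# of recordings, millions of times; these tensors are placeholders.
for step in range(1000):
    noisy = torch.rand(32, 257)          # placeholder noisy spectra
    clean = noisy * torch.rand(32, 257)  # placeholder clean targets
    estimate = model(noisy)
    loss = loss_fn(estimate, clean)      # how far from the gold standard?
    optimiser.zero_grad()
    loss.backward()                      # backpropagation computes the fix...
    optimiser.step()                     # ...and nudges the model towards it
```

At scale, the random placeholders would be replaced by spectrogram frames computed from the 50,000 real recordings.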
Prof McLoughlin and team developed and implemented real-time denoising AI technology to restore clarity to noisy speech. (SIT Photo: Keng Photography/Tan Eng Keng)
The research was done in collaboration with the Singapore branch of a Taiwanese electroacoustic company and supported by AI Singapore, a national programme aimed at enhancing the country’s AI capabilities.
In all, the team trained nearly a hundred AI frameworks before arriving at the final one. The next challenge was ensuring that the technology could operate on a tiny embedded system, such as the one inside portable audio equipment, and run in real time with no perceptible delay between input and output.
“The AI model worked very well on a desktop computer, which was plugged into a power outlet and had a high-performing GPU (graphics processing unit),” he said. “But we needed to reduce the complexities of the technology and make the algorithm more compact to implement it onto something small.”
Typically, when such technology is pared down, its quality suffers. But Prof McLoughlin needed the system to keep performing well. That meant the team had to rewrite and simplify several mathematical equations within the AI to reduce the number of operating instructions it used to denoise the speech, a process he described as “performing mathematical tricks”.
This made the AI’s calculations run faster, so that even when implemented on the small embedded system, it could still denoise speech in real time, keeping the delay between input and output, known as latency, imperceptible.
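Prof McLoughlin did not specify which equations were rewritten, but one classic trick of this kind, folding a batch-normalisation layer into the convolution that precedes it, gives the flavour: the fused layer produces exactly the same output using fewer multiply-adds per audio frame. A PyTorch sketch follows; the layer types and names are assumptions for illustration, not SIT’s implementation.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv1d, bn: nn.BatchNorm1d) -> nn.Conv1d:
    """Fold a batch-normalisation layer into the preceding convolution.
    The fused layer computes exactly the same output with fewer operations
    per audio frame -- the kind of simplification that keeps latency low
    on a small embedded chip. (Illustrative only; the team's actual
    optimisations are not public.)"""
    fused = nn.Conv1d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    # BatchNorm computes gamma * (x - mean) / sqrt(var + eps) + beta; for a
    # trained model this is a fixed per-channel scale and shift, which can
    # be absorbed into the convolution's weights and bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1)
    conv_bias = (conv.bias.data if conv.bias is not None
                 else torch.zeros(conv.out_channels))
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Example: two layers collapse into one at inference time
conv, bn = nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.BatchNorm1d(16)
bn.eval()  # inference mode: use the trained running statistics
fused = fuse_conv_bn(conv, bn)
```

Fusing adjacent layers is only one option; quantising weights to lower-precision integers is another common way of shrinking a model for embedded hardware, though which tricks the team actually used has not been disclosed.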
Music to the Ears
Having worked on speech and audio technology since 1991, across industries and in several countries, Prof McLoughlin said the denoising field has advanced rapidly with the AI revolution.
In the 1990s, research in the field focused primarily on making speech intelligible: ensuring, for instance, that frontline responders could hear emergency callers clearly and communicate effectively with their teammates via walkie-talkies.
“Back then, intelligibility was key, not quality of sound. Police officers needed to understand what people were saying when they called in about an emergency. The technology back then was able to do that. But the technology didn’t care if the police officers got a headache after listening to those recordings,” he laughed.
Today, with cutting-edge technology, intelligibility is generally a given, he explained. “And we are now concerned more with quality – how clear something sounds.”
He is in talks with industry partners, ranging from rail companies to makers of emergency responder communications equipment, to explore how SIT’s proprietary AI-based denoising technology could be licensed for adoption. If it takes off, it could be music to customers’ ears.