By Lom Seunbane, Chief Strategy Officer
Most of us remember from high school that sounds are just movements of compressed air that travel to our ears. If we dig a little deeper into our high school science memories, we might even remember that sounds are waves (those aforementioned air compressions) that can travel as long as they have a medium. In the realm of digital technology, a great deal of research and development has gone into how best to recreate those real sounds as digital signals. At MBS, we recently came across a project that required us to take it a step further and ask whether it is possible to distinguish and categorize different sounds.
The problem is challenging, but there are precedents to guide us. Many researchers have used machine learning (ML) to identify bird calls, separate human voices, and recognize speech patterns. So how is sound identification done, and why does it even need ML?
Before we tackle that problem, let’s first learn a bit about digital sound signals. At their core, sound waves can be described by two properties: frequency and amplitude. Those waves need to be converted into computer (binary) data so that they can be stored on CDs (if we’re still using them) or any other medium, and then played back through your preferred output (headphones or speakers) into your ears. In essence, you take real sound, convert it into a digital signal, and then convert it back into real sound. For obvious reasons, many people in the music industry want to ensure high fidelity to the original sound source.
Anatomy of a Typical Wave
In terms of computer data, there is the sampling rate (SR), which is how many times per second you measure the amplitude of the sound signal. The higher the sampling rate, the closer the computer data will be to the original audio signal. However, a higher SR also means bigger files, and again, people in the music industry (particularly headphone manufacturers) work on achieving high fidelity without such big files. Along with SR, there is also audio bit depth (ABD), which determines the number of possible amplitude values we can record for each sample.
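To make the trade-off between fidelity and file size concrete, here is a minimal sketch of the arithmetic. It assumes uncompressed (raw PCM) audio and uses CD-style values of 44.1 kHz, 16 bits, and two channels; the function names are made up for this illustration.

```python
def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values one sample can take."""
    return 2 ** bit_depth


def uncompressed_size_bytes(sample_rate_hz: int, bit_depth: int,
                            channels: int, seconds: int) -> int:
    """Raw PCM size: samples per second x bits per sample x channels x duration."""
    total_bits = sample_rate_hz * bit_depth * channels * seconds
    return total_bits // 8


# Assumed CD-style settings: 44.1 kHz, 16-bit, stereo, one minute of audio.
levels = quantization_levels(16)
one_minute = uncompressed_size_bytes(44_100, 16, 2, 60)

print(levels)      # 65536 possible amplitude values per sample
print(one_minute)  # 10584000 bytes, roughly 10 MB per minute
```

Doubling either the sampling rate or the bit depth doubles the file size, which is why both numbers get so much attention.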
Without getting into the math or theory (look up the Nyquist theorem), I will tell you that the audio SR should be at least twice the highest frequency in the original sound so that the wave of that signal can be reconstructed. Most humans can hear sound in the 20 Hz to 20 kHz range. Thus, an audio file should have an SR of at least 40 kHz so that the recreated sound covers the full range of human hearing. That is why CDs are recorded at 44.1 kHz (as for why it’s 44.1 kHz instead of just 40 kHz, that is a remnant from the video recording days).
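A small experiment shows what goes wrong when the SR is too low. The sketch below is only an illustration with made-up helper names: it samples a 30 kHz tone at 40 kHz (less than twice the tone's frequency) and shows that the resulting samples are indistinguishable from those of a 10 kHz tone. This effect is called aliasing.

```python
import math


def sample_cosine(freq_hz: float, sample_rate_hz: float, n_samples: int) -> list:
    """Measure a pure cosine tone at evenly spaced sample times."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def apparent_frequency(freq_hz: float, sample_rate_hz: float) -> float:
    """Frequency a sampled pure tone appears to have (folded into 0..SR/2)."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


too_high = sample_cosine(30_000, 40_000, 8)   # above SR/2: cannot be captured
alias = sample_cosine(10_000, 40_000, 8)      # what the samples look like instead

# The two lists match to within floating-point rounding.
print(all(abs(a - b) < 1e-9 for a, b in zip(too_high, alias)))
print(apparent_frequency(30_000, 40_000))   # 10000
print(apparent_frequency(20_000, 44_100))   # 20000, preserved at the CD rate
```

Once the samples are taken, there is no way to tell the two tones apart, which is why the sampling rate has to be chosen with the highest frequency of interest in mind.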
We can visualize sound by plotting amplitude against time. In a time-domain (TD) representation, we can see the loudness of the audio, but we don’t really know much else about it. To get more information, we can also look at sound through a frequency-domain (FD) representation, which is obtained with a Fourier Transform (FT). In a frequency-domain chart, we get not only the frequencies present in the signal, but also the magnitude of each. So how does this all fit together to achieve sound identification? Let’s take a look at it through a speech recognition task.
A time-domain (TD) representation of sound
A frequency-domain (FD) representation of sound
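The two views can be compared on a synthetic signal. The sketch below is an assumption-laden toy: it mixes a 440 Hz tone and a quieter 1000 Hz tone (values chosen only for illustration) and computes a plain discrete Fourier transform with the standard library. In practice you would reach for an FFT routine from a numerical library instead of this slow loop.

```python
import cmath
import math

SAMPLE_RATE = 8_000   # assumed sampling rate for this toy signal
N = 400               # 0.05 s of audio, giving 20 Hz per frequency bin


def dft_magnitudes(samples: list) -> list:
    """Magnitude of each frequency bin from 0 Hz up to SR/2 (naive DFT)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        # Scaled so a sine of amplitude A shows up as roughly A.
        # (The 0 Hz and SR/2 bins are overstated by this scaling; unused here.)
        mags.append(2 * abs(acc) / n)
    return mags


# Time domain: one amplitude value per sample, nothing about pitch.
signal = [1.0 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
          for i in range(N)]

# Frequency domain: which frequencies are present, and how strong each is.
mags = dft_magnitudes(signal)
top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peaks = sorted(k * SAMPLE_RATE / N for k in top_bins)

print(peaks)  # [440.0, 1000.0]
```

The time-domain list tells you how loud the signal is at each instant, while the magnitudes recover the two tones and their relative strengths but say nothing about when each tone occurred.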
In a TD chart, we only know how loud something is at each moment. An FD chart has more information, but it loses the time information. Why do these limitations matter? Let’s imagine a speech recognition task where someone says “how are you?” When someone says “how are you,” our ears pick up the sound signal and then our brain processes the meaning (more accurately though, all of this is processed simultaneously in quite a complex way). When a computer “hears” the words “how are you,” it will need both the time and frequency information in order to output those words to a computer screen. The frequency information would help the computer recognize the words, but without the time information, it won’t know the order of the words. This is where a spectrogram is important.
A spectrogram is a visual representation that shows which frequencies are present in a sound at each point in time, while the colors represent magnitudes (amplitudes). You can think of it as a kind of 3-dimensional chart. The two axes are two dimensions, with one axis representing time and the other frequency. A third dimension, the amplitude of a particular frequency at a particular time, is represented by the intensity of the color at each point in the image. Spectrograms are also known as sonographs, voiceprints, or voicegrams, and when drawn as a 3D plot, they may be called “waterfalls.” The most important part in all of this is that you can think of a spectrogram as a fingerprint of sound. Over any specific time range, a particular sound will have a specific spectrogram representation. By feeding this sound data through an ML algorithm, we can train the algorithm to recognize specific sounds and identify them. That is how a computer can recognize and output the phrase “how are you.”
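A bare-bones version of a spectrogram can be sketched by cutting a signal into short frames and taking a Fourier transform of each one. The example below is a simplified illustration, not a production recipe: the signal (a 500 Hz tone followed by a 1500 Hz tone), the frame size, and the function names are all assumptions, and real spectrograms usually use overlapping, windowed frames computed with an FFT library.

```python
import cmath
import math

SAMPLE_RATE = 8_000   # assumed sampling rate for this toy signal
FRAME_SIZE = 160      # 20 ms per frame, giving 50 Hz per frequency bin


def frame_magnitudes(frame: list) -> list:
    """Magnitude of each frequency bin (0 Hz up to SR/2) for one frame."""
    n = len(frame)
    return [2 * abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(frame))) / n
            for k in range(n // 2 + 1)]


def spectrogram(samples: list, frame_size: int) -> list:
    """Rows are time frames, columns are frequency bins, values are magnitudes."""
    return [frame_magnitudes(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]


# 0.1 s of a 500 Hz tone followed by 0.1 s of a 1500 Hz tone.
signal = ([math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(800)]
          + [math.sin(2 * math.pi * 1500 * n / SAMPLE_RATE) for n in range(800, 1600)])

spec = spectrogram(signal, FRAME_SIZE)

# Strongest frequency in each time frame: the "when" and the "what" together.
dominant = [max(range(len(row)), key=row.__getitem__) * SAMPLE_RATE / FRAME_SIZE
            for row in spec]
print(dominant)
# [500.0, 500.0, 500.0, 500.0, 500.0, 1500.0, 1500.0, 1500.0, 1500.0, 1500.0]
```

Unlike a single Fourier transform of the whole signal, the grid in `spec` keeps the order of events: the low tone comes first and the high tone second. A grid like this, treated as an image, is the kind of input an ML model can be trained on.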
In another article, I’ll outline the general steps on how this is achieved, but I think that’s enough theory for this piece. Once that article is written, I’ll include the link down below so navigation between the two articles will be easier. Until next time, I hope you enjoyed this primer on sound theory and how we can use ML to identify specific sounds.