Speech Detection Hotspots Explained: Endpoint Detection, Noise Reduction, and Compression

As a mode of human-computer interaction, speech is significant for freeing up human hands, and endpoint detection is essential to it. Real working environments, however, contain all sorts of background noise, and that noise seriously degrades audio quality and hurts voice applications, for example by lowering recognition rates. Uncompressed audio data also inflates network traffic in interactive applications, further reducing the success rate of speech applications. For these reasons, endpoint detection, noise reduction, and audio compression have always been focal points of speech front-end processing, and they remain active research topics.

To help you understand the basic principles of endpoint detection and noise reduction, and to offer a glimpse into the mysteries of audio compression, this open class invited Li Hongliang, a senior engineer at iFLYTEK, to give a keynote: a detailed look at hot topics in speech detection technology, covering endpoint detection, noise reduction, and compression.


Guest introduction

Li Hongliang graduated from the University of Science and Technology of China. He is a senior research engineer at iFLYTEK, working on speech engines and voice cloud computing, and is one of the founders of the iFLYTEK voice cloud. He led the development of the speech codec library on the voice cloud platform, which has been used more than 2 billion times, and he has led or participated in the drafting of several national standards for speech. His talk today is divided into two parts: the first covers endpoint detection and noise reduction, the second audio compression.

▎ Endpoint detection

First, a look at endpoint detection (Voice Activity Detection, VAD). Audio endpoint detection means locating the effective speech within a continuous audio stream. It consists of two parts: detecting the starting point (front endpoint) of the effective speech, and detecting its ending point (rear endpoint).

Endpoint detection is necessary in voice applications. First, and very simply, in storage or transmission scenarios, isolating the effective speech from the continuous stream reduces the amount of data to be stored or transmitted. Second, in some scenarios endpoint detection simplifies human-computer interaction; in a recording scenario, for instance, detecting the rear endpoint of speech lets the explicit stop-recording operation be omitted.

[Figure: waveform of a short audio clip containing two words]

To illustrate the principle of endpoint detection more clearly, let us first analyze a piece of audio. Pictured above is a simple clip containing only two words. The picture shows very intuitively that the amplitude of the silent portions is very small, while the amplitude of the effective speech is large. A signal's amplitude visually indicates its energy: the silent parts have small energy values, the effective speech parts large ones. A speech signal is a one-dimensional continuous function with time as the independent variable; what a computer processes are the sample values taken from that signal in chronological order, and the magnitudes of these samples likewise represent the signal's energy at the sampling points.


Sample values can be positive or negative, and sign does not matter when computing energy. In this sense, using the absolute value of a sample to represent its energy is the natural idea, but since the absolute-value sign is inconvenient in mathematical manipulation, the energy of a sampling point is usually taken as the square of its sample value, and the energy of a stretch of speech containing n sampling points can be defined as the sum of the squares of its sample values.

Defined this way, the energy of a stretch of speech depends both on the sample magnitudes and on how many sampling points it contains. To study how the energy varies over time, the speech signal is split into segments of a fixed length, for example 20 milliseconds. Each segment is called a frame; every frame contains the same number of sampling points, and an energy value can then be computed per frame.
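The framing and per-frame energy computation above can be sketched in a few lines. The function name, frame length, and toy signal here are illustrative assumptions, not code from the lecture.

```python
# Sketch: split a signal into fixed-length frames and compute per-frame
# energy as the sum of squared sample values.

def frame_energies(samples, frame_len):
    """Return the energy (sum of squared samples) of each complete frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies

# At a 16 kHz sampling rate, a 20 ms frame would hold 320 samples;
# the toy signal below uses 4-sample frames just for illustration.
signal = [0, 0, 0, 0, 3, -4, 3, -4]   # silence, then "speech"
print(frame_energies(signal, 4))       # [0, 50]
```

Note that squaring makes the sign of each sample irrelevant, exactly as described above.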

If the energy values of the leading frames of the audio all fall below a preset energy threshold E0, and then M0 consecutive frames all have energy values greater than E0, the point where the energy rises is the front endpoint of the speech. Similarly, if the energy values of consecutive frames are large and the subsequent frames' energies then stay small for a certain length of time, the point where the energy drops can be taken as the rear endpoint of the speech.

The question now is: how should the energy threshold E0 be set, and how large should M0 be? Ideal silence has energy 0, so E0 in the algorithm above would ideally be 0. Unfortunately, the scenarios in which audio is gathered tend to contain background noise of some intensity; this pure background sound counts as silence, yet its energy value is not 0, so collected audio usually carries a certain baseline energy in its background.

We always assume that the collected audio begins with a short stretch of silence, typically a few hundred milliseconds long; this short silence is the basis on which we estimate the threshold E0. Yes, we always assume the audio starts with a short silent stretch before speech begins, and this assumption is very important! It is used again in the introduction to noise reduction below. To estimate E0, select a certain number of leading frames, say 100 (these are "silent"), compute their average energy value, then add an empirical offset or multiply by a factor greater than 1 to obtain E0. This E0 is our yardstick for judging whether a frame is silence: energy above it means effective speech, below it means silence.

As for M0, it is easier to understand: its magnitude determines the sensitivity of endpoint detection. The smaller M0 is, the higher the sensitivity, and vice versa. Different applications call for different sensitivities. For example, in a voice-activated remote control, the voice commands are generally short control instructions, and long mid-utterance pauses (at a comma or full stop, say) are highly unlikely, so it is reasonable to raise the detection sensitivity by setting M0 to a small value, corresponding to roughly 200-400 ms of audio. In a voice dictation application, by contrast, there will be longer pauses at commas or full stops, so it is preferable to lower the sensitivity and set M0 to a large value, corresponding to roughly 1500-3000 ms of audio. The value of M0, and hence the detection sensitivity, should be made adjustable in practice, with its value chosen according to the application scenario.
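Putting the E0 and M0 ideas together, a minimal front-endpoint detector might look like the sketch below. The helper names, the silence-frame count, and the factor of 2 are assumptions for illustration, not values from the lecture.

```python
# Minimal energy-based VAD sketch: estimate the silence threshold E0 from
# leading frames assumed silent, then find the first run of M0 consecutive
# frames whose energy exceeds E0.

def estimate_threshold(energies, n_silence_frames=10, factor=2.0):
    """E0 = mean energy of the leading 'silent' frames times a factor > 1."""
    lead = energies[:n_silence_frames]
    return factor * sum(lead) / len(lead)

def find_speech_start(energies, e0, m0):
    """Index of the first of M0 consecutive frames with energy > E0, or None."""
    run = 0
    for i, e in enumerate(energies):
        run = run + 1 if e > e0 else 0
        if run == m0:
            return i - m0 + 1
    return None

energies = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 50, 60, 55, 70, 40, 2, 1]
e0 = estimate_threshold(energies)          # based on the first 10 frames
print(find_speech_start(energies, e0, 3))  # 10
```

A smaller `m0` fires on shorter bursts of energy (higher sensitivity); a larger one demands sustained speech before declaring an endpoint, matching the remote-control versus dictation trade-off described above.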

The above is merely the general principle of voice activity detection; the algorithms used in practice are far more complicated. As a widely used speech processing technology, audio endpoint detection remains an active research area. iFLYTEK has been using recurrent neural network (Recurrent Neural Network, RNN) technology for voice activity detection, with the practical results visible in its products.

▎ Noise reduction

Noise reduction is also called noise suppression (Noise Reduction). As mentioned earlier, audio collected in practice usually carries background sound of some intensity. This background sound is generally background noise, and when its intensity is large it has an obvious effect on voice applications: speech recognition rates drop, endpoint detection sensitivity declines, and so on. Noise suppression is therefore necessary in speech front-end processing.

There are many kinds of noise: white noise with a stable spectrum, fluctuating noise with an unstable spectrum, impulse noise, and so on. In speech applications, steady background noise is the most common, the technology for handling it is the most mature, and the results are the best. This course discusses steady noise; that is, it always assumes the spectrum of the background noise is steady or quasi-steady.

The voice activity detection described earlier is carried out in the time domain, whereas noise reduction is carried out in the frequency domain. To that end, we first introduce, or review, an important tool for converting between the time domain and the frequency domain: the Fourier transform.

To make this easier to follow, recall the Fourier theory from advanced mathematics: a periodic function f(t) with period 2T satisfying the Dirichlet conditions can be expanded into a Fourier series:

f(t) = a₀/2 + Σ_{n=1}^{∞} [ aₙ cos(nπt/T) + bₙ sin(nπt/T) ]

aₙ = (1/T) ∫_{−T}^{T} f(t) cos(nπt/T) dt  (n = 0, 1, 2, …)
bₙ = (1/T) ∫_{−T}^{T} f(t) sin(nπt/T) dt  (n = 1, 2, …)

For a general continuous time-domain signal f(t) defined on [0, T], performing an odd extension gives the following Fourier (sine) series:

f(t) = Σ_{n=1}^{∞} bₙ sin(nπt/T),  with  bₙ = (2/T) ∫₀^T f(t) sin(nπt/T) dt

With bₙ computed as above, the expansion shows that any continuous time-domain signal f(t) can be represented as a linear superposition of a set of trigonometric functions; in other words, f(t) can be approximated by a linear combination of an infinite sequence of sine functions. The Fourier series of a signal exhibits which frequencies the signal contains and the amplitude at each frequency, so the right-hand side of the equation can be regarded as the spectrum of the signal f(t). Put more plainly, a signal's spectrum tells which frequency components the signal contains and the amplitude of each. Going from left to right in the equation is the process of obtaining the spectrum of a known signal; going from right to left is the process of reconstructing the signal from its spectrum.

While the Fourier-series view of a signal's spectrum is easy to understand, in practice the spectrum is obtained using a generalized form of the Fourier series: the Fourier transform.

The Fourier transform is a large family, taking different forms in different fields of application. Here we give only two: the continuous Fourier transform and the discrete-time Fourier transform:

F(ω) = ∫_{−∞}^{∞} f(t) e^{−jωt} dt
X(e^{jω}) = Σ_{n=−∞}^{∞} x[n] e^{−jωn}

where j is the imaginary unit, i.e. j·j = −1. The corresponding inverse Fourier transforms are:

f(t) = (1/2π) ∫_{−∞}^{∞} F(ω) e^{jωt} dω
x[n] = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω

In practical applications, applying the Fourier transform to a digital signal yields its frequency spectrum. Once processing in the frequency domain is complete, the inverse Fourier transform converts the signal from the frequency domain back to the time domain. In short, the Fourier transform is an important tool for completing the transformation from time domain to frequency domain: Fourier-transform a signal and you obtain its spectrum.

The above is a brief introduction to the Fourier transform. Readers whose mathematics is rusty need not worry if they cannot follow the details; it is enough to understand that Fourier-transforming a time-domain signal yields the signal's spectrum, that is, it completes the following conversion:

[Figure: a time-domain signal (left) and its spectrum (right)]

On the left is the time-domain signal, and on the right is its corresponding spectrum. A time-domain signal answers "what value at what time", while a frequency-domain signal answers "what amplitude at what frequency".
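As a quick sanity check of this "time domain in, spectrum out" idea, the sketch below recovers the frequency of a pure tone from its spectrum. Using NumPy's FFT is an assumption here; the lecture does not name any particular library.

```python
# Sketch: the magnitude spectrum of a pure 50 Hz sine should show a single
# dominant component at 50 Hz. Sampling rate and tone are arbitrary choices.
import numpy as np

fs = 1000                        # sampling rate in Hz
t = np.arange(fs) / fs           # 1 second of samples
signal = np.sin(2 * np.pi * 50 * t)

spectrum = np.abs(np.fft.rfft(signal))        # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), 1 / fs)  # frequency of each bin
peak = freqs[np.argmax(spectrum)]
print(peak)                      # 50.0
```

Going the other way, `np.fft.irfft` reconstructs the time-domain signal from its spectrum, which is exactly the inverse transform used after frequency-domain processing.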

With these theories as a basis, the principle of noise reduction is much easier to understand. The key to noise reduction is extracting the noise spectrum, then performing a reverse compensation operation on the noisy speech based on that noise spectrum, resulting in denoised speech. This sentence is important; everything that follows is built around it.

The noise suppression flow is shown in the following figure:

[Figure: noise suppression processing flow]

Similar to endpoint detection, we assume that the short stretch at the beginning of the audio contains no speech, only background noise. This assumption is very important, because this small piece of background, being pure background noise, is the basis for extracting the noise spectrum.

The noise reduction process: first, the short leading stretch of background is divided into frames, and the frames are grouped in sequence; each group may contain 10 frames, or some other number, and the number of groups should be no less than 5. The Fourier transform is then applied to the background-noise frames of each group to obtain their spectra, and these spectra are averaged to give the background-noise spectrum.
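A minimal sketch of this noise-spectrum extraction follows, assuming NumPy. For brevity it averages frame spectra directly rather than group by group, which is a simplification of the procedure above; the frame length is also an illustrative choice.

```python
# Sketch: estimate the noise spectrum by averaging the magnitude spectra
# of the leading frames, which are assumed to contain only background noise.
import numpy as np

def noise_spectrum(noise_frames):
    """Average the magnitude spectra of equal-length background-noise frames."""
    spectra = [np.abs(np.fft.rfft(f)) for f in noise_frames]
    return np.mean(spectra, axis=0)

rng = np.random.default_rng(0)
frames = [rng.normal(0, 1, 256) for _ in range(10)]  # stand-in noise frames
n_spec = noise_spectrum(frames)
print(n_spec.shape)   # (129,)
```

Averaging over many frames smooths out the random fluctuation of individual frames, which is why a steady (or quasi-steady) noise spectrum is assumed.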

Once the noise spectrum is obtained, the noise reduction process itself is very simple. In the lower-left part of the figure above, the red portion is the noise spectrum and the black line is the spectrum of the effective speech signal; together they constitute the spectrum of the noisy speech. Subtracting the noise spectrum from the noisy-speech spectrum gives the spectrum of the denoised speech, and applying the inverse Fourier transform takes it back to the time domain, yielding the denoised speech data.
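The subtraction-and-inverse-transform step can be sketched as basic spectral subtraction. Flooring the magnitude at zero and reusing the noisy phase are standard choices assumed here, not details given in the lecture.

```python
# Sketch of spectral subtraction: subtract the estimated noise magnitude
# spectrum from each noisy frame's magnitude spectrum (floored at zero),
# keep the noisy phase, and invert back to the time domain.
import numpy as np

def denoise_frame(noisy_frame, noise_mag):
    spec = np.fft.rfft(noisy_frame)
    mag = np.abs(spec) - noise_mag           # subtract the noise spectrum
    mag = np.maximum(mag, 0.0)               # magnitudes cannot go negative
    phase = np.angle(spec)                   # reuse the noisy phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(noisy_frame))

frame = np.sin(2 * np.pi * 8 * np.arange(64) / 64)   # clean tone, no noise
out = denoise_frame(frame, np.zeros(33))             # zero noise estimate
print(np.allclose(out, frame))                       # True
```

With a zero noise estimate the frame passes through unchanged; with a nonzero estimate, energy at every frequency is reduced by the estimated noise level, which is the "reverse compensation" described above.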

The figure below shows the noise reduction effect.

[Figure: speech waveforms before (left) and after (right) noise reduction]

The picture compares the signal in the time domain before and after noise reduction. On the left is the noisy speech signal, in which the noise is clearly apparent; on the right is the denoised speech signal, in which, as can be seen, the background noise has been suppressed.

The following two images show the same comparison in the frequency domain.

[Figure: spectrograms before (left) and after (right) noise reduction]

The horizontal axis represents time and the vertical axis represents frequency. On the left is the noisy speech: the bright red parts are the effective speech, and the sand-like purple parts are the noise. As the diagram shows, the noise is not only "ever-present" but also "everywhere", distributed across all frequencies. On the right is the denoised speech; it is clear that the sand-like purple noise is much lighter than before, meaning the noise has been effectively suppressed.

In practical applications, a variant is often used in which the noise spectrum is amended during the noise reduction process, making the noise reduction adaptive. This is because, on one hand, the stretch of silence at the front of the audio is sometimes not long enough, so there is not enough background-noise data and the extracted noise spectrum is often not accurate enough; on the other hand, the background noise is not absolutely steady but drifts, or even mutates from one steady background noise to another.


For these reasons, the noise spectrum must be corrected in a timely manner during the noise reduction process to obtain a better result. The correction method is to take subsequent silent stretches of the audio and repeat the noise-spectrum extraction algorithm on them; the new noise spectrum obtained is then used to correct the one in use. This is where endpoint detection is used within the noise reduction process: to judge what counts as silence. The correction either takes a weighted average of the old and new noise spectra, or replaces the noise spectrum in use entirely with the new one.
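Both correction strategies can be sketched in a few lines; the smoothing factor `alpha` is an illustrative assumption, not a value from the lecture.

```python
# Sketch: when a new stretch of silence is detected, refresh the noise
# spectrum either by a weighted average with the old estimate or by
# replacing the old estimate outright.

def update_noise_spectrum(old_spec, new_spec, alpha=0.75, replace=False):
    """Blend old and new noise spectra, or fully replace the old one."""
    if replace:
        return list(new_spec)
    return [alpha * o + (1 - alpha) * n for o, n in zip(old_spec, new_spec)]

old = [4.0, 8.0]
new = [0.0, 0.0]
print(update_noise_spectrum(old, new))                # [3.0, 6.0]
print(update_noise_spectrum(old, new, replace=True))  # [0.0, 0.0]
```

A weighted average tracks slow drift in the background noise smoothly; full replacement responds faster when the background noise mutates abruptly.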

Described above is a very simplified principle of noise reduction. Practical noise reduction algorithms are more complex than this, because real noise sources are varied and their mechanisms and features are more complicated. Noise reduction therefore remains an active area of research today, with new technologies emerging one after another; for example, microphone arrays are already being used in practice for noise suppression.

▎ Audio compression

The need for audio compression is well known and will not be repeated here. Every audio compression system requires two corresponding algorithms: an encoding algorithm that runs at the source side, and a decoding algorithm that runs at the receiving or user terminal.

Encoding and decoding algorithms show a certain asymmetry. This asymmetry is reflected, first, in the fact that their efficiencies may differ. When audio or video data is stored, it is usually encoded only once but decoded thousands of times, so a complex, less efficient, costly encoding algorithm can be accepted, while the decoding algorithm must be fast, simple, and cheap. The asymmetry also shows in the fact that encoding and decoding are usually not inverses of each other: the data obtained after decoding may differ from the original data before encoding, as long as it sounds or looks the same. Such codec algorithms are called lossy; correspondingly, if decoding yields data identical to the original, the codec is called lossless.

Audio and video codec algorithms are generally lossy, because a small loss of information can often be traded for a large increase in compression ratio. Audio signal encoding employs a number of techniques, such as waveform coding, parametric coding, perceptual coding, and entropy coding.

This lecture focuses on perceptual coding. Relative to the other coding algorithms, perceptual coding exploits the characteristics of human hearing (psychoacoustics) to remove redundancy from audio signals, thereby achieving compression. Compared with lossless audio coding algorithms, it can reach compression ratios of ten times or more under the condition that the ears perceive no obvious distortion.

First, the psychoacoustic basis of perceptual coding. The core of audio compression is removing redundancy. The so-called redundancy is the information in a speech signal that cannot be perceived by the ear, information that contributes nothing to our perception of timbre, pitch, and so on. For example, the human ear can hear frequencies in the range 20 Hz-20 kHz; it cannot perceive infrasound below 20 Hz or ultrasound above 20 kHz. Perceptual coding makes use of such features of the human hearing system to remove redundant audio information.

The psychoacoustic phenomena used in perceptual coding are: frequency masking, temporal masking, and the audibility threshold.

[Figure: frequency masking illustration]

Frequency masking. Frequency masking is visible everywhere in daily life. Suppose you are sitting on the sofa at home quietly watching TV when, suddenly, the harsh sound of a neighbor drilling into a wall comes through. All you can hear now is the powerful noise of the drill; although the sound from the TV is still stimulating your eardrums, you are deaf to it. In other words, a sound of high intensity can completely mask a sound of lower intensity. This phenomenon is called frequency masking.

[Figure: temporal masking illustration]

Temporal masking. To continue the previous example: not only can the ear not hear the TV while the drill is sounding, but for a short period just after the drill stops, the ear still cannot hear the TV. This phenomenon is known as temporal (time-domain) masking. Temporal masking arises because the human auditory system behaves like a system with adjustable gain: when listening to intense sound the gain is lowered, and for weak sound it is raised. We sometimes even change the gain of the auditory system by external means: covering the ears to keep very loud noise from damaging the eardrums, or holding one's breath and cupping a hand behind the ear, are common behaviors when straining to hear weaker sounds. In the example above, when the intense sound suddenly disappears, the auditory system needs a short period of time to raise its gain again, and it is within this short period that temporal masking occurs.

The audibility threshold, described below, is very important for audio compression.

Imagine a quiet room with a loudspeaker that can emit a computer-controlled tone at a given frequency. At first the speaker's power is low, and a listener at some distance cannot hear it. The power is then gradually increased, and at the moment the listener can just hear the tone, the speaker's power (sound intensity level, in dB) is recorded. This power is the audibility threshold at that frequency.

The speaker's frequency is then changed and the experiment repeated, eventually yielding the curve of audibility threshold versus frequency shown in the following figure:

[Figure: audibility threshold versus frequency curve]

The diagram clearly shows that the human hearing system is most sensitive to frequencies in the range 1000-5000 Hz; the further a frequency lies toward either side of this range, the duller the hearing response.

Returning to the frequency masking case: in this experiment, a signal of frequency 150 Hz and strength 60 dB is added to the room, and the experiment is repeated. The resulting audibility threshold curve is shown in the following figure:

[Figure: audibility threshold curve in the presence of a 150 Hz, 60 dB masking tone]

The figure plainly shows that the audibility threshold curve is strongly distorted around 150 Hz, being raised considerably there. This means that a sound at some nearby frequency which originally sat just above the audibility threshold may, in the presence of the stronger 150 Hz signal, become inaudible: it has been masked.

The basic rule of perceptual coding is: never encode what the human ear cannot hear. Put simply, signals that cannot be heard need not be coded, and this seemingly obvious statement is one focus of research on speech compression. The truism is easy to state; applying it correctly is another matter. Back to the point: what exactly can't be heard? Signals or components whose power falls below the audibility threshold, and signals or components that are masked. The ear cannot hear them, and they constitute the "redundancy" referred to above.

That covers the psychoacoustics. To understand audio compression well, one more important concept is needed: the subband. A subband is a frequency range. When the frequencies of two tones fall within the same subband, the ear hears them as one tone. More generally, if the frequency components of a complex signal all fall within one subband, what the human ear perceives is equivalent to a simple signal at the subband's centre frequency; this is the core of the idea. Simply put, a subband is a frequency range such that all frequency components falling within it can be replaced by a single frequency component.

[Figure: subband equivalence illustration]

In general, the equivalent frequency is the subband's centre frequency, and the equivalent amplitude is the amplitude-weighted sum of the subband's frequency components. A simpler method is to add the amplitudes of the frequency components directly and use the sum as the amplitude of the equivalent signal; that single component then replaces all the frequency components falling in the range.
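The simpler direct-sum variant can be sketched as follows; the function name and the half-open subband interval are assumptions for illustration.

```python
# Sketch: collapse the components inside one subband into a single
# equivalent component at the subband centre frequency, with amplitude
# equal to the direct sum of the component amplitudes (the simpler option
# in the text; a weighted variant would scale each amplitude first).

def subband_equivalent(components, lo, hi):
    """components: list of (freq, amplitude) pairs; [lo, hi) is the subband."""
    inside = [(f, a) for f, a in components if lo <= f < hi]
    centre = (lo + hi) / 2.0
    amplitude = sum(a for _, a in inside)
    return centre, amplitude

comps = [(110, 1.0), (120, 2.0), (140, 3.0), (300, 5.0)]
print(subband_equivalent(comps, 100, 160))   # (130.0, 6.0)
```

Three components inside the 100-160 range collapse into one at the centre frequency; the component at 300 belongs to some other subband and is untouched.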

Suppose the spectrum of a signal has minimum frequency W0 and maximum frequency W1. Subband coding divides the range between W0 and W1 into several subbands, then replaces the components within each subband with one equivalent frequency component. In this way, a signal with a complex spectrum is reduced to an equivalent signal whose spectrum is very simple; the spectrum is greatly simplified and requires very little storage.

From the above procedure it is not hard to see that how the subbands are divided has a great effect on the quality of the compressed audio (on how faithful the equivalent is). Subband division is a very important research topic in subband coding; approaches can be roughly divided into fixed-width subband coding and variable-width subband coding.

After subband division, different compression grades use different numbers of subbands. It is easy to see that fewer subbands mean a lower bit rate and a higher compression ratio, but poorer sound quality; the opposite is just as easy to understand.

Having understood subband coding, audio compression becomes very easy to understand. A signal is passed through a set of triangular filters (equivalent to a set of subbands) and thereby reduced to a small number of frequency components. These frequency components are then examined: any whose energy or amplitude falls below the audibility threshold curve is ignored (the component is deleted, because it cannot be heard). The remaining components are then examined pairwise: if one of two adjacent components is frequency-masked by the other, it is deleted. After this process, the spectrum of a complex signal contains very few frequency components, and the information can be stored or transmitted with very little data.
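The two pruning steps (threshold first, then masking between adjacent components) can be sketched as below. The crude `mask_ratio` rule stands in for a real psychoacoustic masking model and is purely illustrative, as are the names and numbers.

```python
# Sketch: drop components below the audibility threshold, then drop any
# component masked by a much stronger adjacent component.

def prune(components, threshold, mask_ratio=10.0):
    """components: list of (freq, amplitude) pairs, sorted by frequency."""
    # Step 1: discard components below the audibility threshold.
    audible = [(f, a) for f, a in components if a >= threshold]
    kept = []
    # Step 2: discard a component if an adjacent one is far stronger.
    for i, (f, a) in enumerate(audible):
        neighbours = audible[max(0, i - 1):i] + audible[i + 1:i + 2]
        if any(na >= mask_ratio * a for _, na in neighbours):
            continue
        kept.append((f, a))
    return kept

comps = [(100, 0.01), (150, 5.0), (160, 0.2), (1000, 3.0)]
print(prune(comps, 0.1))   # [(150, 5.0), (1000, 3.0)]
```

Here the 100 Hz component is inaudible (below threshold), and the weak 160 Hz component is masked by the strong neighbour at 150 Hz; both are redundancy and are removed.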

When decoding, the inverse Fourier transform is used to reconstruct a time-domain signal from this simple spectrum, yielding the decoded speech.

That is the simplified principle of audio compression. Now let's talk about audio codec libraries.

Many open-source audio codec libraries are publicly available, with differing features and capabilities, as shown below:

[Figure: comparison of common audio codec libraries]

As the figure shows, AAC and MP3 are the "high end", used to encode music at high sampling rates, while AMR and Speex take the low road: they can handle speech signals at sampling rates of 16 kHz and below, which is sufficient for speech applications such as speech synthesis and speech recognition.

iFLYTEK's speech products use Speex; information about the algorithm is shown in the following figure:

[Figure: Speex algorithm details]

The Speex codec library offers a wide range of compression levels to choose from, which makes it well suited to mobile applications operating under complex and variable network conditions.

And that is the entire content of this class.


Audio endpoint detection, noise reduction, and speech compression strike many people as mysterious, difficult to understand, and difficult to master. But as the teacher explained, speech processing technologies that usually feel imposing are also approachable. It turns out that no advanced theoretical foundation is needed to grasp the keys to these technologies: the key to endpoint detection is using the leading silence to establish the yardstick that separates silence from effective speech; the key to noise reduction is using the short leading stretch of background noise to extract the noise spectrum; and one approach to audio compression is to make full use of human psychoacoustics, divide the spectrum into subbands, and remove the redundancy.

Let us keep watching for the latest developments in speech processing technology in these areas.

(If you are interested in iFLYTEK's products and technologies, you can visit iFLYTEK's website.)

