What is Voice Recognition?
Voice recognition is the ability of a machine or program to receive and interpret dictation or to understand and carry out spoken commands. Voice recognition has gained prominence and use with the rise of AI and intelligent assistants, such as Amazon’s Alexa, Apple’s Siri and Microsoft’s Cortana.
1950s and 1960s
The first speech recognition systems were focused on numbers, not words. In 1952, Bell Laboratories designed the “Audrey” system which could recognize a single voice speaking digits aloud. Ten years later, IBM introduced “Shoebox” which understood and responded to 16 words in English.
Across the globe, other nations developed hardware that could recognize sound and speech, and by the end of the 1960s the technology could support words with four vowels and nine consonants.
1970s
Speech recognition made several meaningful advances in this decade, thanks largely to the US Department of Defense and DARPA. Their Speech Understanding Research (SUR) program was one of the largest of its kind in the history of speech recognition. Carnegie Mellon’s “Harpy” speech system came out of this program and could understand over 1,000 words, roughly the vocabulary of a three-year-old.
1980s
The ’80s saw speech recognition vocabulary grow from a few hundred words to several thousand. One of the breakthroughs was a statistical method known as the Hidden Markov Model (HMM). Instead of just matching words to sound patterns, the HMM estimated the probability that unknown sounds were actually words.
1990s
Speech recognition was propelled forward in the ’90s largely because of the personal computer: faster processors made it possible for software like Dragon Dictate to become widely used. BellSouth introduced the voice portal (VAL), a dial-in interactive voice recognition system that gave birth to the myriad phone-tree systems still in existence today.
2000s
By 2001, speech recognition technology had achieved close to 80% accuracy. For most of the decade there weren’t many advancements, until Google arrived with the launch of Google Voice Search. Because it was an app, it put speech recognition into the hands of millions of people. It was also significant because the processing could be offloaded to Google’s data centers. Moreover, Google was collecting data from billions of searches, which helped it predict what a person was actually saying; at the time, Google’s English Voice Search system included 230 billion words from user searches.
2010s
In 2011, Apple launched Siri, which was similar to Google’s Voice Search. The early part of the decade saw an explosion of other voice recognition apps, and with Amazon’s Alexa and Google Home, consumers have become more and more comfortable talking to machines.
Today, some of the largest tech companies are competing for the speech-accuracy crown. In 2016, IBM achieved a word error rate of 6.9 percent. In 2017, Microsoft overtook IBM with a claimed 5.9 percent; shortly afterward, IBM improved its rate to 5.5 percent. Google, however, claims the lowest rate at 4.9 percent.
Why are voice assistants female?
Many users imagine an AI assistant as helpful, supportive, and trustworthy, one that answers politely whatever the tone or topic, and a female voice matches that expectation. Amazon chose a female voice because it wanted to bring the “most pleasing” sound into the living room; the company also ran a study on women’s voices and found them more “sympathetic” and better received.
How Speech Recognition Systems Work
There is not much information in the Russian-language segment of the internet about how speech recognition systems are built. In this article we, the team behind the Amvera Speech project (Clarity LLC, ООО «Клэрити»), will walk through the nuances of the technology and describe the path to building our own solution. At the end of the article there is a free Telegram bot for testing a speech recognition system built on the architecture described here.
Challenges faced by developers of speech recognition systems
There is a widespread belief that speech recognition is a long-solved problem, but that is not quite true. The task has indeed been solved for certain scenarios, but no universal solution exists yet. This is due to a number of problems that developers face:
Domain dependence
a) Different speakers
b) The “acoustic” recording channel: codecs, distortions
c) Different environments: noise on the phone, in the city, background speakers
d) Different speech tempo and degrees of preparation
e) Different styles and topics of speech
Large and “inconvenient” datasets
A quality metric that is not always intuitive
Recognition quality is measured with the WER (Word Error Rate) metric: WER = (Insertions + Substitutions + Deletions) / N, where N is the number of words in the reference transcript.
Insertions – words inserted that are not in the source audio
Substitutions – words replaced with incorrect ones
Deletions – words the system failed to recognize and skipped
Reference: «Стационарный (unintelligible speech) телефон зазвонил поздней ночью» (“The landline phone rang late at night”)
Hypothesis: «Стационарный синий [I] айфон [S] прозвонил [S] поздней ночью» (“The landline blue iPhone chimed late at night”)
WER = 100 × (1 + 2 + 0) / 5 = 60% (i.e., the error rate is 60%): one insertion («синий»), two substitutions («айфон», «прозвонил»), zero deletions, five reference words.
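The WER computation above can be sketched in a few lines of Python: a standard word-level Levenshtein alignment counts the minimum number of insertions, substitutions, and deletions, and divides by the reference length. The strings are the reference/hypothesis pair from the example above.

```python
# Word Error Rate via word-level Levenshtein alignment, a minimal sketch.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("стационарный телефон зазвонил поздней ночью",
          "стационарный синий айфон прозвонил поздней ночью"))  # 0.6
```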
There are both simple and more subtle misconceptions in computing this metric.
Examples of simple mistakes when computing the metric:
The letters Е and Ё. The system outputs text using Е everywhere, without the dots, while the reference transcript it is compared against contains Ё. This unfairly inflates the number of Substitutions and raises WER by about 1%.
different spellings of words such as алло, але, алле (“hello”), etc.
lowercase versus uppercase letters
averaging over texts instead of counting total words. WER may be 0.6 on one text, 0.5 on another, and 0.8 on a third; reporting the arithmetic mean of these values is wrong. The correct approach is to count the total number of words across all texts and compute WER from that.
More subtle misconceptions arise because WER differs across test sets. Sometimes the solution under test performs better or worse than competitors on one particular audio track, but drawing a general conclusion about quality from that is incorrect: results on a large volume of data are required.
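The normalization and averaging pitfalls described above can be illustrated directly. The per-text error counts below are made-up numbers chosen to match the 0.6/0.5/0.8 example; the point is that the pooled WER differs from the mean of per-text WERs.

```python
# Sketch of two scoring pitfalls: normalize text before comparison,
# and pool errors over all texts instead of averaging per-text WERs.
def normalize(text: str) -> str:
    # fold case and collapse ё/Ё to е so the metric is not inflated
    return text.lower().replace("ё", "е")

# (error_count, reference_word_count) per text, e.g. from an aligner
per_text = [(6, 10), (5, 10), (40, 50)]   # per-text WERs: 0.6, 0.5, 0.8

mean_of_wers = sum(e / n for e, n in per_text) / len(per_text)
pooled_wer = sum(e for e, _ in per_text) / sum(n for _, n in per_text)

print(normalize("Ещё"))        # еще
print(round(mean_of_wers, 4))  # 0.6333  (misleading average)
print(round(pooled_wer, 4))    # 0.7286  (correct pooled value)
```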
Types of speech recognition systems
Speech recognition systems come in two kinds: hybrid and end-to-end (end2end). End-to-end systems map a sequence of sounds directly to a sequence of letters. Hybrid systems contain an acoustic model and a language model that operate independently. The Amvera Speech solution is built on a hybrid architecture.
How a hybrid speech recognition system is organized
The operating principle of a hybrid speech recognition system:
A neural network classifies each individual frame of audio,
an HMM models the “dynamics” and the “lexicon”, relying on the neural network’s posteriors,
the Viterbi algorithm (Viterbi decoder, beam search) searches the HMM for the optimal path, taking the classifier’s posteriors into account.
The first step in a hybrid speech recognition model is feature extraction; as a rule, these are MFCC coefficients. The acoustic model then solves the frame classification task, after which a Viterbi decoder (beam search) is applied: it combines the acoustic model’s predictions with the statistics of the language model, whose n-grams give the probability of particular sounds and words occurring. Finally, rescoring is performed and the most likely words are output.
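A toy version of the decoding step can make this pipeline concrete. The sketch below assumes a three-state HMM with invented transition probabilities, and invented per-frame posteriors standing in for a real acoustic model's output; it is illustrative only, not Amvera Speech's actual decoder.

```python
import math

# Toy hybrid decoding: a "neural network" has produced per-frame posteriors
# over three phoneme states; an HMM supplies transition probabilities.
# All state names and probabilities here are invented for illustration.
states = ["d", "a", "sil"]
trans = {                       # P(next_state | state)
    "d":   {"d": 0.5, "a": 0.4, "sil": 0.1},
    "a":   {"d": 0.1, "a": 0.6, "sil": 0.3},
    "sil": {"d": 0.3, "a": 0.1, "sil": 0.6},
}
start = {"d": 0.6, "a": 0.2, "sil": 0.2}
frames = [                      # acoustic-model posteriors, one dict per frame
    {"d": 0.7, "a": 0.2, "sil": 0.1},
    {"d": 0.3, "a": 0.6, "sil": 0.1},
    {"d": 0.1, "a": 0.7, "sil": 0.2},
    {"d": 0.1, "a": 0.2, "sil": 0.7},
]

def viterbi(frames):
    # best[s] = (log-prob of best path ending in state s, the path itself)
    best = {s: (math.log(start[s]) + math.log(frames[0][s]), [s]) for s in states}
    for obs in frames[1:]:
        new = {}
        for s in states:
            prev = max(states, key=lambda p: best[p][0] + math.log(trans[p][s]))
            score = best[prev][0] + math.log(trans[prev][s]) + math.log(obs[s])
            new[s] = (score, best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(frames))  # ['d', 'a', 'a', 'sil']
```

A production decoder prunes this search with a beam and layers a word-level language model on top, but the dynamic-programming core is the same.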
The hybrid architecture in detail
Below is an illustration of frame classification: the phonemes present in each frame for the words «да» and «нет» (“yes” and “no”), with the probability of each phoneme written in the corresponding cell.
Voice recognition and digitization of speech
Voice recognition is a technology that enables an acoustic signal coming from a human voice to be automatically recognized as certain words, activating the program that corresponds with those words. It can be used in many fields, such as in accessibility aids for people who are mute or physically disabled, video games, or other computer programs as an automated input device. Moreover, voice is becoming one of the most popular input devices for smartphones and tablets.
It has been widely used in public places such as airports or hospitals to make announcements and solve customer queries about specific services. Many companies also use voice recognition as a way to improve their customer service.
Voice digitization is the process of converting a voice into a digital signal using an analog-to-digital converter so that it can be manipulated by the computer. Speech recognition is the process of analyzing and translating the speech signal into text.
In this sense, the first thing that needs to be done is to record someone’s voice using a microphone. The recording can be made with any sound recorder whose output passes through an analog-to-digital converter (ADC), and it should be in 16-bit PCM format. To access the microphone directly, the code needs to be written for that particular operating system (OS) and hardware.
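The digitization step itself is easy to sketch: sample an “analog” waveform and quantize each sample into the signed 16-bit range that an ADC would produce. The sample rate and tone frequency below are arbitrary illustrative choices.

```python
import math

# Sketch of the A/D step: sample a 440 Hz tone and quantize each
# sample to a signed 16-bit integer, as a sound card's ADC would.
SAMPLE_RATE = 16_000          # samples per second (illustrative choice)
AMPLITUDE = 0.5               # "analog" level in [-1.0, 1.0]

def digitize(duration_s: float, freq_hz: float = 440.0) -> list[int]:
    n = int(SAMPLE_RATE * duration_s)
    samples = []
    for i in range(n):
        analog = AMPLITUDE * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        samples.append(int(analog * 32767))   # scale into the 16-bit range
    return samples

pcm = digitize(0.01)          # 10 ms of audio
print(len(pcm))               # 160 samples
print(max(pcm) <= 32767 and min(pcm) >= -32768)  # True
```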
Usually, the main part of a voice recognition program is organized as a pipeline. The first stage obtains a numerical value from the input signal that represents its energy, using adaptive filtering; with this information it is possible to determine what type of sound it is: voice, music, or noise. A Hidden Markov Model (HMM) then converts this energy representation into a sequence of phonemes that can be interpreted by a finite state machine, which is why the recognizer must have sufficient information about the nature of the sounds to interpret them accurately. Once the phonemes have been obtained, they are converted into words using statistical language models. This stage is known as speech recognition, and it is very similar to language recognition, which is not specific to voice.
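The energy-based first stage can be sketched as follows: split the signal into fixed-size frames and threshold each frame's mean energy to label it speech-like or silence. The signal and the 0.01 threshold here are invented for illustration.

```python
# Sketch of the first pipeline stage: frame the signal and use each
# frame's mean energy to separate speech-like segments from silence.
def frame_energy(samples, frame_size=160):
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    return [sum(x * x for x in f) / len(f) for f in frames]

# one silent frame, one loud frame, one near-silent frame
signal = [0.0] * 160 + [0.4, -0.4] * 80 + [0.01, -0.01] * 80

for e in frame_energy(signal):
    print("speech" if e > 0.01 else "silence", round(e, 4))
```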
The two differ mainly in the units and models involved.
During the word-recognition stage, a recognizer can take several approaches.
A simpler way of getting some results is to use an “n”-gram model, which works with user-specific dictionaries to gather statistics about the words in them, such as their frequencies and phone/word transition probabilities. These probabilities are stored in a table called an n-gram model.
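A minimal word-bigram version of such a model can be built from a toy corpus. The corpus below is invented for illustration, and the maximum-likelihood estimates stand in for the smoothed counts a production model would use.

```python
from collections import Counter

# A tiny word-bigram "n-gram model": transition probabilities are
# simple relative frequencies over an invented toy corpus.
corpus = "the phone rang the phone rang late the cat slept".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word: str, prev: str) -> float:
    """P(word | prev) as a maximum-likelihood estimate."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("phone", "the"))  # "the" is followed by "phone" 2 times out of 3
print(p("cat", "the"))    # ... and by "cat" 1 time out of 3
```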
Voice recognition has several applications, some more advanced than others. In the case of ADR (Automatic Dialogue Replacement), for example, the objective is to match each word spoken by a character in a movie or TV series with a transcription, which can be text from a script or output generated by an automated dialogue transcription system. This enables the new audio to be edited and synced automatically to the video footage.
To edit ADR, all that’s needed is an actor’s voice recording and the original video file of the scene being edited. Voice recognition can also be used to digitize a variety of business processes and statistics, such as financial transactions, customer records, product pricing, and inventory.
Voice recognition may have an important role in the future of home automation. The main purpose of such systems is to let users control home appliances with their voices, benefiting from the technology without having to use a remote control.
Voice recognition systems were first developed for use in dictation applications. In these speech digitization systems, the user speaks into a microphone to record the original analog audio stream digitally on a computer. It is then converted into text so that it can be edited or processed by other software programs. This type of speech digitization system has many advantages over regular transcription systems:
In the past, many research groups built speech recognition systems. IBM’s “Shoebox” in the early 1960s recognized 16 English words, and Carnegie Mellon’s “Harpy”, developed in the 1970s under DARPA’s Speech Understanding Research program, handled a vocabulary of about 1,000 words. Other systems were developed in Germany, France, and the UK. Commercial dictation arrived in the 1990s with products from Dragon Systems, Kurzweil Applied Intelligence, and the Belgian firm Lernout & Hauspie.
The technology of voice recognition has advanced over the last decade and is now becoming more sophisticated. A simple system with limited capabilities, such as that used previously by mobile phone operators to check phone bills, can now be improved to allow users to take dictation or control their home appliances through voice commands. As recent advances in technology have made it possible for computers to replicate human voices using computer-generated speech synthesis engines, it has become possible for homeowners to use a personal computer without having to learn the use of specialty interfaces or keyboards.
Early voice recognition systems used a small set of pre-programmed words. More recently, developers have used machine learning techniques to create speech recognition systems that can recognize the words used by real people. The most common techniques are hidden Markov models, dynamic time warping, and, increasingly, deep neural networks.
For a voice recognition system to work, it must capture a series of words at certain points in time, or its output will be unpredictable and unreliable. Usually, the key to capturing those words is recording them on a frame-by-frame basis, but some speech recognizers can also work with sound waves taken directly from the microphone and analyzed by dedicated software.
The key difficulty is that a speech recognizer must accurately recognize a word in the presence of background noise. Background noise poses a significant challenge because it carries no symbolic value of its own, yet it overlaps the speech that does; the recognizer must therefore cope with noise representative of what surrounds normal conversation. For early systems to succeed, recognition accuracy had to stay above roughly 70%, and the vocabulary needed for accurate performance was kept to approximately 1,000 words.
Before speech recognition, it was simply not possible to infer what someone had said to a computer from a recording alone; a user could not, for example, hold an interactive conversation with a computer over the telephone. Voice recording technology existed decades earlier, but only as computing became available and affordable could users begin interacting with machines by voice. Speech digitization is the process of converting human speech into digital signals for use on electronic devices such as computers, usually by converting the audio signal into discrete digital samples; speech synthesis works in the opposite direction, generating audible speech that is played through a speaker.
Therefore, the technology of speech recognition has evolved alongside that of speech synthesis. Some speech recognition systems can even be trained to recognize any voice, as long as the user is willing to spend the time recording their voice to create a speech model. The enrollment is a bit like recording a telephone voicemail greeting: the user reads a short passage aloud, and the system builds a voice profile from it that is then applied to everything they say afterward.
In most cases, a user’s recorded voice must be analyzed by an automated system for it to be converted into text; manually transcribing the words is tedious and time-consuming. This is why automatic speech recognition is useful to mobile phone operators as well: a user can dial their phone and talk directly into the machine without typing anything, and a speaker-verification system can then check that the caller’s voice matches the customer’s stored voice print.
Voice recognition systems are currently in use in various industries such as banking, healthcare, legal, home appliance control, and education. They are also used in many new products such as smart watches, voice-activated home appliances, personal computers, and smartphones. Voice recognition systems can also be useful for blind and visually impaired people to access computer-based services.
Numerous companies have released speech recognition software that transcribes a word or phrase as it is spoken, in most cases accurately. Apple shipped its Siri assistant with the iPhone 4S in 2011, and its VoiceOver screen reader, a related accessibility technology, has been built into iOS since 2009.
Google also offers speech recognition that converts spoken words into text, and it provides an API (application programming interface) for developers to access its voice recognition tools. Among these are language models that can process different languages and dialects, built from spoken-language data gathered on the web and through Google’s own products. Similar capabilities reached the browser with the Web Speech API, which lets web applications accept voice input, while screen readers such as JAWS work in the opposite direction, reading information aloud for users who cannot see the screen.
In 2014, Microsoft introduced “Cortana”, its attempt at a real-time personal assistant, and it has integrated speech recognition into other products such as Skype, whose Translator feature transcribes and translates spoken conversations.
Several other voice recognition products and technologies are available today. In 2012, Samsung introduced its S Voice assistant on the Galaxy S III smartphone, which could be used for functions such as making phone calls or sending texts; the feature carried over to newer models such as the Galaxy S4 and S5.
Notable instances of voice transcription, speech recognition, and voice recognition technology can be found in current media such as television and film:
Fictional voice interfaces long predate practical ones: HAL 9000 converses with the crew in 2001: A Space Odyssey, the shipboard computer in Star Trek answers spoken queries, KITT in Knight Rider talks with its driver, and J.A.R.V.I.S. in the Iron Man films runs an entire household by voice.
Future developments in voice-transcription software will make it easier for users to interact with their surroundings, including the ability of a computer to understand the meaning of what the user is saying. For example, Apple’s Siri acts as a personal assistant that helps users manage e-mail and organize calendars through speech recognition. In 2014, Microsoft introduced Cortana, an intelligent personal assistant that takes note of e-mails and appointment information, and Google’s Google Now provides similar speech-driven assistance on Android smartphones.
Many services have speech recognition technology, such as Amazon, where a user can search for an item by saying the name of what they are looking for. With a microphone, there is no need for a physical keyboard for the computer to understand what you are saying, which is especially useful for blind users who do not have easy access to one. Voice recognition is widely used for online banking and online shopping as well.
Listening through recorded voice information or text can be very laborious, even with speech recognition software, so the analysis of what has been spoken needs to be made easier and more efficient. One possibility is a deep neural network trained to listen to user speech, with dialogue read aloud so it can be analyzed. The ideal would be a hybrid system combining speech recognition with deep learning, letting users interact with their surroundings without a tiring physical keyboard or mouse. Deep neural networks are made up of many layers of nodes connected by one-way links, and the deep learning algorithms developed for speech recognition are what is making voice recognition more common today.
Voice recognition has been criticized for its unreliability and inaccuracy, particularly in automated telephone answering systems, where it is not expected to perform well; commentators have long noted that the same system can be accurate in some situations and much less accurate in others.
However, there are many tasks where voice recognition software performs quite well. If the task involves listening to short questions and answers (e.g., a PIN) or proofreading text, good results can be achieved. Speech recognition is also used for processing spoken language in large untranscribed databases, and it is often used in call centers where the primary task is typing up information from voice input.
A common criticism of voice-based communication between a driver and a car is that it may distract the driver from the road. Traffic-safety research has repeatedly found that conversing on a phone while driving, whether handheld or hands-free, increases crash risk severalfold, and the voice commands that drive-by-wire technology relies on may put the driver at risk if the vehicle’s systems fail.
Voice recognition software can also be problematic for elderly people. Many seniors want to carry out tasks on their computers without using a physical keyboard or mouse. One study found that hands-free access improved their ability to use a computer, but it did not provide the same support for tasks that were simply easier to type, such as finding information and navigating the web by entering URLs or keywords.
The main issue with speech recognition is its inaccuracy in environments where background noise is present; noise is a large problem for the accuracy of the recognition process. It can be partially addressed with a two-microphone array: comparing the signals captured by the two microphones makes it easier to separate the speaker’s voice from ambient noise, much as a listener’s two ears do.
Speech recognition also struggles with accents and poor voice quality. In noisy or echo-prone environments, acoustic echo cancellation can help: filters estimate and subtract the echoed sound, which gives better results than simply ignoring it.
Another issue is that when a person says something the software does not recognize, they are prompted to say it again until the software hears it accurately, which can lead to high data-entry error rates. To avoid this, the system has to be very accurate. One way is to use a neural network with hidden layers of nodes, trained on many examples of clear speech alongside examples of poor or missing speech from the user.
The accuracy of voice recognition depends on many factors including voice quality and accent as well as background noise, which can affect the detection or understanding of what is being said. The accuracy of voice recognition is also affected by the training process of the neural network that is being used. Speech recognition software can be tested to see how well it can distinguish between words, as well as how fast it can recognize them after hearing them only once.
The main issue with speech recognition and learning is that it takes a significant amount of time for a neural network to learn new accents and words in the language, which can cause problems when many different users are speaking at the same time. Another consequence is that users who speak in a way the software does not recognize may have to repeat themselves several times before it correctly recognizes what was said.
One problem with a hybrid system is that it can be difficult to tell whether the user is speaking to it or merely listening, and users who are not comfortable with the system will tend to avoid it. Addressing this requires evaluating how well the system recognizes all types of voices, rather than just one person’s. A phone that signals its state differently when it is listening versus being touched will likely be better at distinguishing the two modes, and the ability to tell a user who is speaking from one who is only making other sounds, such as quietly eating, also matters for accuracy.
Another problem with a hybrid system is that it can be difficult to distinguish between different people’s voices and how they pronounce words. Some of the main issues regarding voice recognition are that it has a poor understanding of accents and language, and monotone users are not as easy to distinguish from one another.
One of the most important features of hybrid systems is making sure the phone application always listens for the user’s speech, not just for their touch. The ideal would be simultaneous handling of both, but this is rarely possible. Most voice recognition software switches the phone between listening and touch modes depending on whether the user is speaking or touching it; however, this is not as accurate as simultaneous operation, because it can be difficult to determine which the user is doing.
The main issue with voice recognition systems for phones is that the phone’s built-in microphone may not pick up the user’s voice very well. External microphones can do a much better job of picking up the user’s voice, making the user easier for the system to understand.
One benefit of speech recognition software is that it can be trained to understand speakers across many different accents and languages. Another is that it lets a person carry out tasks without using their hands, which may help them avoid carpal tunnel syndrome, repetitive strain injury, or similar conditions. Such systems also learn languages and accents over time, reducing the likelihood that they will need retraining when a new user starts using them, and they are often faster and more convenient than typing because there are no keys to hit.
One benefit for people with physical disabilities who cannot use a keyboard or mouse is that speech recognition software does not require arm or hand movement to operate, unlike other known alternatives. Users with mobility disabilities may also find wireless technology easier to use than wired, because no cables connect the computing device to the input device: the user can move around freely while dictating or listening, worries less about losing or snagging a cord, and can use the software independently rather than needing someone else to assist them.
A disadvantage of speech recognition software is that spoken directions alone may not be understood well enough. Users with visual impairments may also have trouble seeing the screen and can become confused by written directions they cannot see clearly enough to understand.
One of voice recognition programs’ benefits is that it does not use a keyboard or mouse, simplifying access for people with mobility limitations. However, voice recognition programs are more difficult to use than other alternatives for people who have cognitive impairments. This is because it requires more focus and mental effort to remember what to say and when to say it. If the user has a learning disability, they may find this type of software hard to use because it relies on the user’s memory and concentration skills. If the user is easily distracted by background noise or doesn’t know how to ask for help, then this type of software may not be ideal for them.
Voice recognition software could potentially lead users into trouble because if the users were doing something wrong (such as violating an internet usage policy) then they could be banned from using the device. If they are banned from using the device then it could become more difficult for them to communicate with family, friends, and co-workers.
One of voice recognition programs’ disadvantages is that even voice-driven software can be hard for people with upper-extremity disabilities to use on their own, because directing and correcting the program still tends to require good vision and fine motor control: someone must see the screen clearly and move a mouse or touch the screen. In some cases this requires input from another user or caregiver, and people who do not know how to ask for help may be left unable to use the software.
Another disadvantage of voice recognition software is that it requires some mental effort and concentration to control, which means that people with cognitive impairments may have trouble using it.
One disadvantage of speech recognition over other forms is that the user must direct their energy towards speaking instead of thinking about what they want to say. This method also requires them to be able to speak clearly and audibly so that the microphone can pick up their voice. If someone is not able to speak loudly then it can be difficult or impossible for them to communicate using this type of software.
Another disadvantage is that heavy use of voice recognition software may strain the user’s voice. A further concern is privacy: if the software is always listening, someone who hacks into it could gain access to personal information or files.
Finally, some people have trouble with speech recognition programs when background noise is present, which can make them hard to use in environments where other people are talking, such as a classroom or workplace.
Developing Speech Recognition Applications with Python
Speech recognition means the automatic recognition of speech or voice: producing meaningful output from speech signals using sampling, artificial neural networks, and machine learning.
We all know these techniques well. Applications such as Apple Siri, Google Assistant, and Amazon Alexa are available to most of us. They are, of course, ahead of their peers, with serious engineering behind them. Beyond making sense of audio signals, they also rely on NLP (natural language processing) algorithms.
Converting speech to text is the first step toward building a voice assistant. After all, we expect the application to understand what we are saying, don’t we? If it can turn the sounds it captures into text, the first step in processing that audio data is already done. In this article we will build two Python applications using a few libraries, focusing on the speech-to-text problem.
1. An application that converts audio files to text
In this application we will try to convert audio files to text. Audio-to-text conversion is one of the topics data science studies: for example, you can build a chatbot around voice processing, or classify the requests arriving at a call center by running NLP on the transcripts.
We will do the recognition with the SpeechRecognition library. As the name suggests, it is built for speech recognition, and it interfaces with many APIs.
APIs supported by the library:
- CMU Sphinx (works offline)
- Google Speech Recognition
- Google Cloud Speech API
- Wit.ai
- Microsoft Bing Voice Recognition
- Houndify API
- IBM Speech to Text
- Snowboy Hotword Detection (works offline)
Let’s start coding!
1. First, install the SpeechRecognition module:
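The install is a single pip command (assuming Python 3 and pip are already available):

```shell
pip install SpeechRecognition
```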
2. Now import the library:
3. Assign the recognizer to a variable; it will drive the recognition process:
4. Create an audio file.
Before creating an audio file, we need to find out which file types the library supports. Here is the list of supported formats:
- WAV (PCM/LPCM encoding)
- AIFF and AIFF-C
- FLAC (native FLAC encoding)
You can also work with mp3, m4a, and other file types. To do that, run the recording through an online audio converter; I chose zamzar. Upload the file and convert it to .wav format.
Assign the audio file we created to a file-type variable:
5. Now we can convert the audio to text:
To recognize the data we used the recognize_google method, which is backed by Google’s free Web Speech API. We also set language = ‘tr’ so that the program handles Turkish sounds better.
6. Run the code.
We tested the code on a short two-second audio file saying “deneme deneme” (Turkish). The result fully met our expectations.
Note that this application can just as easily transcribe longer and more complex audio files. Record audio of varying length and complexity, assign it to the file variable, and observe the result.
2. An application that converts speech from the microphone to text in real time
In this application we will try to convert sound to text in real time using the computer’s microphone. We will again use the SpeechRecognition library, along with a new module, PyAudio, which is needed to handle the audio input.
Let’s start coding!
1. Install the modules:
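Both packages install with pip (PyAudio needs the PortAudio system library to build, see the note below for Mac):

```shell
pip install SpeechRecognition pyaudio
```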
If you are working on a Mac, you need to install portaudio first. Use brew for that:
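Assuming Homebrew is installed, install the system library and then the Python binding:

```shell
brew install portaudio
pip install pyaudio
```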
If you already installed SpeechRecognition for the previous application, you don’t need to install it again.
2. Import the library and create the recognizer object:
Again, if you completed these steps in the previous application, you don’t need to repeat them.
3. Convert the audio to text.
In the previous application we used an audio file as the sound source. Here the source is the microphone, and we capture sound with sr.Microphone(). As before, we specify Turkish as the language to listen for. You can also configure how long to listen; in this example it is five seconds, but you can extend it if needed.
4. Run the code.
We repeated “Merhaba Dünya” (Turkish for “Hello World”) for five seconds and got the following result. The program handles longer sentences just as well:
To understand how the module works, let’s see how the code produces alternative outputs using the show_all = True option.
We built two simple applications that clearly demonstrate how audio is converted to text, using Google’s Web Speech API.
These applications can be improved considerably. If you want to use audio in your projects in any way, these modules are worth trying because they are easy to work with. They can be especially useful to data scientists, who can apply these libraries in the data-preprocessing stages of machine-learning projects that involve audio files.
I hope these two applications gave you a useful starting point.