Categories
Python Artificial Intelligence

Python: The Complete Guide To Speech Recognition

Speech Recognition is a hit in the market. It’s a flashy technology that is used mainly in voice assistants like Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana, Google’s Allo, etc. It can massively increase the user interaction in your project and will make it look amazing. Writing a program that uses speech recognition is far easier than you think. We will be covering everything from installing to implementation of speech recognition.
Though this program will also work with Python 2, we recommend using python 3.

There’s a lot that you can do with this package but I’ll try to keep it short and cover only the most important things needed to make a simple program or to implement vocal recognition in your own program.
NOTE: If you are looking just for the code then the final code is in the Conclusion section.
You can view the code for Working with microphone and Working with Audio Files directly. But if you want to know about the working of the code then you should read the article.

Introduction to Speech Recognition

Exactly What is speech recognition? In simple words, it is a technology that takes voice as an input and converts it into computer understandable language.

Setting Up

There are many modules that can be used for speech recognition like google cloud speech, apiai, SpeechRecognition, watson-developer-cloud, etc., but we will be using Speech Recognition Module for this tutorial because it is easy to use since you don’t have to code scripts for accessing audio devices also, it comes pre-packaged with many well-known API’s so you don’t have to signup for any kind of service which you may have to while using any other module. And, it gets the job done pretty well.

Requirements

  • Python 3.3+
  • Speech Recognition
  • *PyAudio 0.2.11
  • *PocketSphinx (offline use)
  • FLAC encoder (required only if the system is not x86-based Windows/Linux/OS X)

We will be using SpeechRecognition and PyAudio Module.
*PyAudio: This module is only required if you want to take the user’s voice as an input and not use pre-recorded audio files.
*PocketSphinx: Only use PocketSphinx if you have to use your program offline. I don’t personally recommend using PocketSphinx because of low accuracy.

The Speech Recognition Module

The Speech Recognition engine has support for various APIs. The most common API is Google Speech Recognition because of its high accuracy. In this tutorial though, we will be making a program using both Google Speech Recognition and CMU Sphinx so that you will have a basic idea as to how offline version works as well. However, if you want to use any other API, its pretty easy to switch, you just have to change the recognizer method(we will discuss it later in this tutorial)

Installation

I’m assuming you have python 3 properly installed. Open command prompt and type

­­
pip install speechrecognition
­pip install pyaudio
pip install pocketsphinx
­

NOTE: PyAudio is not available for python versions greater than 3.6.
If you are using python 3.7 or greater then download PyAudio wheel from here.

After downloading you can install wheel file as:

­
pip install C:/some-dir/some-file.whl
­

here some-dir and some-file.whl is the directory and the file name of that wheel respectively.
After installing, its always better to check the installation, open python terminal, and type

>>import speech_recognition as sr
>>sr.__version__
'3.6.0'

If it returns ‘3.6.0’ or higher then you are good to go (3.6.0 is the version that I was using because some of the libraries were not compatible with 3.7+ that I needed in my other project)

Working with Speech Recognition

After installing all the packages we are finally ready to start writing our first voice-based program. We will first be discussing the working of the speech recognition package and then we will start coding the program. You can skip to the code directly but I recommend reading the working as well because that will give you the idea about the options available that you can experiment with.

How Does it Work?

Speech Recognition has an instance named recognizer and as the name suggests it recognizes the speech(whether from an audio file or microphone). You can create a recognizer instance as follows:

import speech_recognition as sr
r = sr.Recognizer()

Note that we imported the Speech Recognition package as speech_recognition whereas we installed it as SpeechRecognition. This is a common mistake that many users make while importing this package.

Now Each Recognizer instance has eight methods by which it can recognize speech those are:

Out of these if you want to make a trigger word for your program so that it only starts and stops listening speech at a particular word(like Hey Siri or Ok Google) then you can go for Snowboy Hotword Detection.
Now with speech recognition, you can do two things:

  1. Take voice as an input from the user or
  2. Use a pre-recorded audio file

Now for most of the people option two would be pretty useless but we will be discussing both of them in this tutorial. You can skip to the desired part of the article from the sidebar.

Using Microphone

The most awaited and most fun section finally. Writing the above part was kind of boring but it was a complete guide afterall so i had to do it.

Now for working with microphones, you need PyAudio as I mentioned above.
After installing everything, you just need to type in this code.

import speech_recognition as sr
r = sr.Recognizer()

with sr.Microphone() as source:
	print("start")
	audio = r.listen(source)

	try:
		data = r.recognize_google(audio)
		print(data)
	except:
		print("Please try again")

Note: If you are using microphones other than the default one then please refer to configuring microphone settings at the bottom.

Advertisements

Now what we did here is we made the microphone as a source using the Microphone() method. I print start so that I would know when to start speaking. Though a symbol of mic appears in the taskbar when the program is using microphones, I did it just out of convenience.

Then we declared new variable audio that records the input. Till now the audio is only recorded through the microphone and not converted into text.

Now here comes the recognize_google() method. The recognize google method sends the audio data to the google web speech server and retrieves response.

Please note that recognize_google() uses google web speech api which is free but it is provided for testing purposes only. So I advice you not to use it for real world project.

Now there might be a time when the audio recorded just couldn’t be converted to text because of noise or other problems. If that happens then you will encounter a traceback error like this.

Traceback (most recent call last):
speech_recognition.UnknownValueError

To prevent this from happening we put a try and except block. This checks if there is some data in the audio variable that can be converted into text. If there is some data then it completes the try block and skips the except block. If there is no data or there is some problem in parsing the data then it runs the except block.

Also, you can raise multiple exception blocks for different errors but the detailed working of try and except blocks is beyond the scope of this tutorial, so we are going to skip it.

If everything is clear to this point then changing methods is simple like if you want to use CMU Sphinx instead of google then all you have to do is replace the recognize_google() with the recognize_sphinx() method. The code would be

import speech_recognition as sr
r = sr.Recognizer()

with sr.Microphone() as source:
	print("start")
	audio = r.listen(source)

	try:
		data = r.recognize_sphinx(audio)
		print(data)
	except:
		print("Please try again")

The main reason why I recommend google web speech instead of CMU sphinx is because of the accuracy

For the same sentence “test speech recognition” both outcomes are different and it is clear the google does it better.

Using Audio Files

Working with audio files is quite similar to working with microphones. The only difference is that instead of taking voice input from the users all you have to do is pass an audio file as a source.

We will be using the sample audio that you can download from here

After downloading, place the audio file in the same folder as your program file. After that, all you have to do is instead of sr.Microphone() use sr.AudioFile(‘filename.extension’) and change the r.listen() method to r.record() method.
That’s it, that’s all there is to it. You can check out the code below

import speech_recognition as sr
r = sr.Recognizer()

with sr.AudioFile('english.wav') as source:
	audio = r.record(source)
	try:
		data = r.recognize_google(audio)
		print(data)
	except:
		print("Please try again")

Conclusion

Congratulations on making it this far! This has been a long journey( at least for me haha :D)
Below is the quick recap of what we did so far( cause who’s got the time to read the freakin thing again and again)

If PyAudio doesn’t work properly, read the installation part again

The Speech Recognition has an instance named recognizer and as the name suggests it recognizes the speech(whether from an audio file or microphone). Depending on what method you use, it process audio data accordingly. There are both online and offline methods. In online methods like recognize_google() it sends data to Google Web Speech Server whereas in recognize_sphinx() method it process data within your device.

In order to take user’s voice as an input, you need PyAudio installed. Below is the complete code for processing audio data with microphone. Note that it uses sr.Microphone() and r.listen() methods

import speech_recognition as sr
r = sr.Recognizer()

with sr.Microphone() as source:
        print("start")
        r.adjust_for_ambient_noise(source, duration=0.5)­
	audio = r.listen(source)
	try:
		data = r.recognize_google(audio)
		print(data)
	except:
		print("Please try again")

Working with Audio files is similar to working with Microphones. Place the audio file that you want to process in the same directory as the program file.Below is the complete code for processing audio file. Note that it uses sr.AudioFile() and r.record() methods. Suppose you have a file name ‘english.wav’ then the code would be

import speech_recognition as sr
r = sr.Recognizer()
 
with sr.AudioFile('english.wav') as source:
    audio = r.record(source)
    try:
        data = r.recognize_google(audio)
        print(data)
    except:
        print("Please try again")

I tried to keep it as short and informative as possible but if I missed anything please tell me in the comments section below also if you have any doubts you can comment them down as well. I will try my best to answer you!

Advertisements

Miscellaneous

Configuring Microphone Settings

Note: If you are going to use your system’s default microphones then you can skip this section as there is no need to configure any microphone settings. If you want to use the microphone other than the default one then you have to select it manually.To get a list of microphones available type:

>>sr.Microphone.list_microphone_names() 
['Microsoft Sound Mapper - Input', 
'Microphone (2- Realtek Audio)',
 'Microsoft Sound Mapper - Output', 
'Speakers / Headphones (2- Realt', 'Primary Sound Capture Driver',
 'Microphone (2- Realtek Audio)', 
'Primary Sound Driver',
 'Speakers / Headphones (2- Realtek Audio)',
 'Realtek ASIO', 'Speakers / Headphones (2- Realtek Audio)',
 'Microphone (2- Realtek Audio)',
 'Stereo Mix (Realtek HD Audio Stereo input)',
 'Microphone (Realtek HD Audio Mic input)',
 'Speakers (Realtek HD Audio output)']

Note that the list of available microphones may vary. Now that you have a list of microphones all you have to do is select the index number of microphone. For example if you want to use Microsoft Sound Mapper which is at index number zero then all you have to do is

import speech_recognition as sr
r = sr.Recognizer()

with sr.Microphone(device_index=0) as source:
	print("start")
	audio = r.listen(source)

Reducing Noise Levels

Though you can not remove the noise completely from the background, there is adjust_for_ambient_noise() which can help you lower the noise levels in audio. For Example:

with sr.Microphone() as source: #You can also use this with audio files
	r.adjust_for_ambient_noise()
	print("start")
	audio = r.listen(source)

adjust_for_ambient_noise() reads the audio file for one second and adjusts the recognizer for the noise of that level. You can reduce the time it takes to read the audio file but it is recommended to let it read file for at least one second. But if one second seems like too much time then you can lower it by using the duration keyword. Example:

­
r.adjust_for_ambient_noise(source, duration=0.5)
­

In the above example, I reduced the duration to 0.5 seconds. You can increase or decrease according to your needs. The more time you give it the better it does its work. But if you are reducing the time limit then keep it at least 0.5 seconds (the SpeechRecognition documentation recommends that)

Working with Multiple Languages

If you are using Google Web Speech then you don’t need to do anything else as it already works for almost all languages. But if you are using CMU Sphinx and need any other language pack installed then you can install addition language packs from here.
All instructions for installation are mentioned on that page as well.

Writing to an Audio File

If you want to record and store user’s voice for further use(for building neural networks or maybe some other purposes) then this option is very useful for you. And this step is really easy. All you have to do is add the write command after using the r.listen() method. For example:

import speech_recognition as sr
r = sr.Recognizer()

with sr.Microphone() as source:
	print("start")
	audio = r.listen(source)

with open("microphone-results.wav","wb") as f:
	f.write(audio.get_wav_data())

This will create a file named microphone-results.wav in the same directory as the program file. With some string formatting, you can save different audio files with different names but we will cover string formatting in some other articles.

So this sums it up. Thank you for reading it till here. This article took a lot of time and energy so if you like it please tell me in the comments below. I’ll be posting more awesome articles so in order to get notified sign up for our newsletter below.

Have a nice day!

By Suyash Agarwal

Entrepreneur | Blogger | Programmer
I help people learn how to code and scale business with content marketing.
Check out my AI Writing SaaS - https://flackedai.com