How to Build a Voice Assistant with OpenAI GPT-3 and Google Speech API

Introduction

Voice assistants have become increasingly popular in recent years, with companies like Amazon, Google, and Apple releasing their own voice assistant devices. These assistants can perform a wide range of tasks, such as playing music, setting reminders, answering questions, and much more. In this tutorial, we will learn how to build our own voice assistant using OpenAI GPT-3 and the Google Speech API.

Prerequisites

To follow along with this tutorial, you will need the following:

Basic knowledge of programming and Python
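Access to the Google Speech API – create a project on the Google Cloud Platform, enable the Speech-to-Text API, and download a service account key (a JSON file)
An OpenAI GPT-3 API key – you can obtain this key by signing up for the OpenAI GPT-3 API
A Linux machine with a microphone and the arecord tool (part of the alsa-utils package)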

Setting Up the Project

To start, create a new Python project and set up a virtual environment. This will ensure that our dependencies are isolated from the global Python installation.

$ mkdir voice-assistant
$ cd voice-assistant
$ python3 -m venv env
$ source env/bin/activate

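Next, we need to install the required packages. We will use google-cloud-speech to call the Google Speech API, openai to call the OpenAI GPT-3 API, gTTS to convert text to speech, and playsound to play the resulting audio. The code in this tutorial uses the legacy openai.Completion interface, which was removed in version 1.0 of the openai package, so we pin an earlier release.

$ pip install google-cloud-speech "openai<1" gTTS playsound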

Using the Google Speech API

The Google Speech API allows us to convert spoken language into written text. To use it, enable the Speech-to-Text API for your project on the Google Cloud Platform and download a service account key as a JSON file.

Once you have the key file, create a new Python script, speech_to_text.py, and import the necessary modules.

from google.cloud import speech
import os
import io

Next, we need to authenticate with the Google Cloud Platform. The client library reads the GOOGLE_APPLICATION_CREDENTIALS environment variable to locate our service account key file.

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/your/api/key.json'

Replace '/path/to/your/api/key.json' with the actual path to your service account key JSON file.
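
A wrong path here tends to surface later as a confusing authentication error, so it can help to validate the setting up front. The following is a minimal sketch, not part of the Google client library: the helper name check_google_credentials is hypothetical, and it only assumes that a service account key file is JSON with a "type" field set to "service_account".

```python
import json
import os


def check_google_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the key file path if it looks like a service account key, else raise.

    Hypothetical helper for this tutorial; it does not contact Google.
    """
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"{env_var} points to a missing file: {path}")

    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)

    # Service account key files carry a "type" field set to "service_account".
    if data.get("type") != "service_account":
        raise RuntimeError(f"{path} does not look like a service account key")
    return path
```

Calling check_google_credentials() once at the top of speech_to_text.py, after setting the environment variable, would fail fast with a readable message if the file is missing or is the wrong kind of key.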

Now, let’s create a function that will convert spoken language into written text using the Google Speech API.

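def speech_to_text(audio_path):
    # The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
    client = speech.SpeechClient()

    # Read the raw bytes of the recording.
    with io.open(audio_path, 'rb') as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    # These settings must match the recording: 16-bit PCM at 16 kHz.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code='en-US'
    )

    response = client.recognize(config=config, audio=audio)

    # Each result covers a consecutive portion of the audio; join the top
    # alternative of each. An empty string means no speech was recognized.
    return ' '.join(
        result.alternatives[0].transcript.strip() for result in response.results
    )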

The above function takes an audio file path as input and returns the transcribed text. We create a SpeechClient instance and read the contents of the audio file into memory. We then wrap those bytes in a RecognitionAudio instance and build a RecognitionConfig that specifies the audio encoding, sample rate, and language code. Finally, we call the recognize method with the configuration and audio, and return the transcript. The synchronous recognize call is intended for short clips of roughly a minute or less, which is enough for short voice commands.
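
Because the encoding and sample rate in the configuration have to agree with the audio file itself, a quick header check can save a failed API call. The sketch below uses only Python's standard wave module; the helper name wav_matches_config and its default values are assumptions chosen to mirror the configuration above (16-bit samples, 16000 Hz, one channel).

```python
import wave


def wav_matches_config(path, sample_rate_hertz=16000, channels=1,
                       sample_width_bytes=2):
    """Return True if the WAV header matches the expected LINEAR16 settings.

    Hypothetical helper: it inspects the file header only and never
    contacts the Speech API.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == sample_rate_hertz
            and wav.getnchannels() == channels
            # A sample width of 2 bytes corresponds to 16-bit samples.
            and wav.getsampwidth() == sample_width_bytes
        )
```

If the check returns False, either re-record with matching settings or adjust the values passed to RecognitionConfig.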

Using the OpenAI GPT-3 API

OpenAI GPT-3 is a language model that generates human-like text by continuing a prompt. To use the OpenAI GPT-3 API, you will need the API key listed in the prerequisites.

Once you have your API key, create a new Python script, text_generation.py, and import the necessary modules.

import openai

Next, we need to authenticate with the OpenAI GPT-3 API using our API key.

openai.api_key = 'your_openai_api_key'

Replace 'your_openai_api_key' with your actual API key.
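
Hard-coding the key makes it easy to leak through version control. As an alternative, here is a minimal sketch that reads the key from an environment variable. The helper name load_openai_api_key is hypothetical; the variable name OPENAI_API_KEY follows the convention the openai package itself uses.

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper; raises RuntimeError when the variable is unset
    or empty so that the problem shows up before the first API call.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage in text_generation.py (assuming the openai module is imported):
# openai.api_key = load_openai_api_key()
```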

Now, let’s create a function that will generate text based on a given prompt using the OpenAI GPT-3 API.

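def generate_text(prompt):
    response = openai.Completion.create(
        engine="text-davinci-003",  # completion model to use
        prompt=prompt,
        max_tokens=100,             # upper bound on the length of the reply
        temperature=0.8,            # higher values give more varied output
        n=1,                        # request a single completion
        stop=None,                  # no custom stop sequence
        frequency_penalty=0.0,
        presence_penalty=0.0,
    )

    # The completion text often starts with whitespace; strip it.
    return response.choices[0].text.strip()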

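The above function takes a prompt as input and returns the generated text. We call the Completion.create method with the engine, the prompt, and a set of sampling parameters: max_tokens caps the length of the reply, temperature controls how random the output is, and n sets the number of completions to request. Finally, we return the text of the first completion. OpenAI has since retired text-davinci-003, so you may need to substitute a completions model that is currently available to your account.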
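
A completion model continues whatever text it is given, so passing a bare transcript can yield a continuation of the question rather than an answer. One common way around this is to wrap the transcript in a short instruction. The sketch below is only an illustration: the helper name build_prompt, the template wording, and the 500-character cap are all assumptions, not part of the OpenAI API.

```python
def build_prompt(transcript, max_chars=500):
    """Wrap a transcript in a short instruction so the model answers it.

    Hypothetical helper: the template text is an example, not a
    recommended or official prompt.
    """
    # Collapse runs of whitespace and cap the length of the user text.
    question = " ".join(transcript.split())[:max_chars]
    return (
        "You are a helpful voice assistant. "
        "Answer briefly in plain spoken English.\n\n"
        f"User: {question}\n"
        "Assistant:"
    )
```

With this in place, the assistant would call generate_text(build_prompt(text)) instead of generate_text(text).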

Building the Voice Assistant

Now that we have set up the Google Speech API and the OpenAI GPT-3 API, we can start building our voice assistant.

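Create a new Python script, voice_assistant.py, in the same directory as the two scripts above, and import the necessary modules, including the two functions we have already written.

import os
import tempfile
import subprocess
import playsound
from gtts import gTTS

from speech_to_text import speech_to_text
from text_generation import generate_text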

Next, let’s define a function that will record audio using the microphone.

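def record_audio():
    # delete=False keeps the file on disk until the caller removes it.
    temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    temp_file_path = temp_file.name
    temp_file.close()

    # Record 5 seconds of 16-bit, 16 kHz mono audio to match RecognitionConfig.
    subprocess.call(
        ["arecord", "-D", "plughw:0,0", "-f", "S16_LE", "-c", "1",
         "-r", "16000", "-t", "wav", "-d", "5", temp_file_path],
        stderr=subprocess.DEVNULL,
    )

    return temp_file_path

The above function uses the arecord command-line tool to record five seconds of audio from the microphone as 16-bit, 16 kHz mono, which matches the encoding and sample rate we passed to the Google Speech API. We use the plughw device so that ALSA converts the audio if the hardware does not support this format natively; the card and device numbers 0,0 are an assumption about your sound card, and arecord -l lists the devices that are available. The temporary WAV file is created with delete=False so that it still exists after the function returns, and the caller is responsible for removing it.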
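
Since the recording step depends on an external program, it can be worth confirming that the program is installed before the assistant starts. A minimal sketch using the standard shutil module follows; the helper name require_tool is hypothetical.

```python
import shutil


def require_tool(name="arecord"):
    """Return the full path of a command-line tool, or raise if it is missing.

    Hypothetical helper: it only checks that the tool can be found, not
    that a microphone is attached.
    """
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} was not found on PATH; on Linux, arecord ships with alsa-utils"
        )
    return path
```

Calling require_tool() once at startup turns a silent recording failure into an explicit error message.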

Next, let’s define a function that will convert text to speech using gTTS, a library that calls the text-to-speech service behind Google Translate.

def text_to_speech(text, language='en'):
    tts = gTTS(text=text, lang=language)
    # delete=False keeps the file on disk until the caller removes it.
    temp_file = tempfile.NamedTemporaryFile(suffix=".mp3", delete=False)
    temp_file_path = temp_file.name
    temp_file.close()

    tts.save(temp_file_path)

    return temp_file_path

The above function uses the gTTS module to generate an MP3 audio file from the given text. It saves the audio file to a temporary file and returns the file path.

Now, let’s define the main function of our voice assistant.

def voice_assistant():
    while True:
        audio_file = record_audio()
        text = speech_to_text(audio_file)
        os.remove(audio_file)

        # Skip this round if nothing was recognized.
        if not text:
            continue

        response = generate_text(text)

        temp_file_path = text_to_speech(response)
        playsound.playsound(temp_file_path)

        os.remove(temp_file_path)


if __name__ == "__main__":
    voice_assistant()

The above function runs in an infinite loop until you stop it with Ctrl+C. It records audio, converts it to text using the Google Speech API, generates a response using the OpenAI GPT-3 API, converts the response to speech using gTTS, and plays the response using the playsound module. Finally, it removes the temporary audio files.
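
If any step in the loop raises an exception, the temporary files are left behind. One possible restructuring, shown here as a sketch rather than a definitive design, moves a single cycle into a function that receives each step as an argument and cleans up in finally blocks. The name run_once and its parameter names are hypothetical; passing the steps in also makes the cycle testable without a microphone or network access.

```python
import os


def run_once(record, transcribe, generate, synthesize, play):
    """Run one listen-and-reply cycle.

    Returns the reply text, or None if nothing was transcribed.
    Hypothetical helper: each argument is a callable for one step.
    """
    audio_path = record()
    try:
        text = transcribe(audio_path)
    finally:
        # Remove the recording even if transcription raised.
        if os.path.exists(audio_path):
            os.remove(audio_path)

    if not text:
        return None

    reply = generate(text)
    speech_path = synthesize(reply)
    try:
        play(speech_path)
    finally:
        # Remove the spoken reply even if playback raised.
        if os.path.exists(speech_path):
            os.remove(speech_path)
    return reply
```

With the functions from this tutorial, one cycle would be run_once(record_audio, speech_to_text, generate_text, text_to_speech, playsound.playsound), and the main loop would call it repeatedly.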

Conclusion

In this tutorial, we have learned how to build a voice assistant using OpenAI GPT-3 and the Google Speech API. We have seen how to transcribe spoken language into written text, generate text based on prompts, and convert text to speech. With these capabilities, we have a working foundation for a voice assistant that responds to spoken prompts with spoken replies.