IoT
09 March 2021

How to develop your own voice assistant?



Developing your own voice assistant? Yes, it is possible!

As with the iPhone in 2007, which was the precursor to the advent of smartphones, voice assistants are in the process of changing the daily lives of millions of people.

A small, discreet device that slips into our living rooms, the voice assistant is rapidly winning people over thanks to its ease of use. With a simple voice command, it becomes easy to place an order or ask your assistant for information.

Because voice is a natural medium of communication for humans, this technology is easily accessible, even to non-technophiles.

As a result, several major players have emerged in this market over the years: Google, Apple and Amazon. All of them are trying to leverage these assistants to deliver value to their customers and capture some of it in return.

Between connected speakers and smartphones, it’s easy to use a voice assistant. But how do you find your way around all the assistants available on the market? How do you make your connected object compatible? Should you trust the big players and hand over your data?

All these questions should help guide you in designing your connected object.

To help you make sense of the world of voice assistants, here are a few tips to guide you.

How does a voice assistant work?

To fully understand the ins and outs, it is important to know the basic principles behind a voice assistant. To do this, we will break its operation down into four main stages:

1. Capturing the Voice

As you will have gathered, capturing the voice is the first step in the process. This phase is crucial because how well your voice assistant understands you will depend above all on the quality of the audio it receives. Today, the most common setup is to use a connected speaker or a smartphone, two products designed specifically around voice. So if, tomorrow, your own product is to listen to and record voice itself, its audio system will have to be perfectly mastered. To activate the voice assistant, you use a wake word, for example “OK Google” or “Hey Siri”. This activation then takes us to the second step, the analysis of the audio stream.
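To make this stage more concrete, here is a minimal sketch of the capture step in Python, assuming the sounddevice package is available. The wake-word detector is a deliberately hypothetical placeholder: in a real product it would be handled by a dedicated on-device engine.

import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16_000      # 16 kHz mono is a common format for speech pipelines
RECORD_SECONDS = 5

def wake_word_detected() -> bool:
    # Placeholder for an on-device wake-word engine ("OK Google", "Hey Siri"...).
    return True           # hypothetical: assume the keyword was just spoken

def capture_utterance() -> np.ndarray:
    # Record a short utterance once the assistant has been activated.
    audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()             # block until the recording has finished
    return audio.squeeze()

if wake_word_detected():
    pcm = capture_utterance()   # raw samples handed over to the speech-to-text stage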


2. Turning Audio into Text

This step, often called speech-to-text, transcribes the recorded audio into text, which can then be analysed further.
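As an illustration, here is a minimal speech-to-text sketch, assuming the Python SpeechRecognition package and a cloud transcription backend (here Google’s free web API) are acceptable for your use case.

import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone(sample_rate=16_000) as source:
    recognizer.adjust_for_ambient_noise(source)   # calibrate against background noise
    audio = recognizer.listen(source)             # record a single utterance

try:
    text = recognizer.recognize_google(audio, language="en-US")
    print("Transcription:", text)
except sr.UnknownValueError:
    print("Speech was not understood")
except sr.RequestError as error:
    print("Transcription service unavailable:", error)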


3. Natural Language Understanding (NLU)

The NLU component analyses and interprets incoming text to determine its meaning and extract structured information that a machine can understand and act on. The NLU produces one or more semantic analyses based on the meaning of the words, and the best analysis is then selected on the basis of confidence.
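To give an idea of what this stage produces, here is a deliberately naive, rule-based sketch in Python. Real NLU engines (Rasa, Snips NLU, Dialogflow and the like) are trained on example sentences rather than hand-written rules, so this only illustrates the shape of the output.

import re

def understand(text: str) -> dict:
    # Hypothetical rule-based NLU for a tiny smart-home domain.
    text = text.lower()
    if re.search(r"\b(turn|switch) off\b.*\blight\b", text):
        room = re.search(r"\b(kitchen|bedroom|living room)\b", text)
        return {"intent": "light_off",
                "entities": {"room": room.group(1) if room else None}}
    return {"intent": "unknown", "entities": {}}

print(understand("Turn off the kitchen light"))
# {'intent': 'light_off', 'entities': {'room': 'kitchen'}}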


4. The Business Application

Thanks to the structuring provided by the NLU, we can then develop a business application that will react to the different inputs.

For example: “Show me the flights from Lyon to Paris for February 6th”.

The business application, through the NLU, will receive a structured message of the following form:

  • Area: Flight
  • Intent: Show me
  • Entity 1 (departure city): Lyon
  • Entity 2 (arrival city): Paris
  • Entity 3 (date): 06/02/2020


This structuring allows the developer to understand the user’s needs and to manage actions accordingly.
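As a sketch of how a business application might react to that structured message (the field names and the flight-search stub below are illustrative assumptions, not a specific platform’s API):

from dataclasses import dataclass

@dataclass
class NluMessage:
    area: str
    intent: str
    departure_city: str
    arrival_city: str
    date: str                      # e.g. "2020-02-06"

def search_flights(origin: str, destination: str, date: str) -> str:
    # Hypothetical stub: a real product would query a booking or flight API here.
    return f"Here are the flights from {origin} to {destination} on {date}."

def handle(message: NluMessage) -> str:
    # Route the structured request to the matching business action.
    if message.area == "flight" and message.intent == "show":
        return search_flights(message.departure_city, message.arrival_city, message.date)
    return "Sorry, I cannot help with that yet."

print(handle(NluMessage("flight", "show", "Lyon", "Paris", "2020-02-06")))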

Here is a brief diagram summarising these different steps.

Source: https://linto.ai/fr/


All voice assistants rely broadly on this mode of operation. However, we will see that, depending on the player, not everything is necessarily accessible or freely available. Although the basics are common, it is not always easy to find one’s way around, because the development methods are not the same.


What are the platforms for developing your assistant?


The two main players in the assistant world are Google and Amazon. Both offer low-cost connected speakers designed to assist you in your daily life.

Their offers are entirely cloud-based, i.e. objects with a voice interface must be connected to the Internet to function. Indeed, as seen above, the Speech to Text, NLU and business application parts are all offloaded to the cloud. This makes it possible to provide extremely powerful voice assistants: being entirely in the cloud allows them to query other servers and the Internet directly, so as to be as relevant as possible. Google and Amazon provide a number of APIs to interface easily with their assistants.

Consequently, the platforms appear as two black boxes capable of decoding the voice and handing developers the associated actions. To control an object, Google and Amazon simply ask developers to extend their assistants. At Amazon, this means developing a Skill, i.e. teaching Alexa a new ability to interact with the object; Google works the same way under the term Action.
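In practice, both approaches boil down to exposing a fulfilment endpoint that receives the decoded intent and returns a response. The sketch below uses Flask with a deliberately simplified, hypothetical JSON payload; the real Alexa and Google request formats are richer and are documented in their respective SDKs.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/assistant/webhook", methods=["POST"])
def webhook():
    payload = request.get_json(force=True)
    intent = payload.get("intent")                      # e.g. "LightOff"
    room = payload.get("entities", {}).get("room", "living room")

    if intent == "LightOff":
        # Here the business application would call the connected object's API.
        answer = f"Turning off the {room} light."
    else:
        answer = "Sorry, I don't know how to do that yet."

    return jsonify({"speech": answer})

if __name__ == "__main__":
    app.run(port=5000)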

The fact that everything is in the cloud gives these players great power, because more and more data is collected to analyse our behaviour. As a result, concerns about users’ privacy regularly surface (speakers that listen all the time, employees listening to recordings, etc.). However, their ease of use, their ubiquity and their marketing impact often make them an unavoidable choice.

However, there are some more privacy-friendly alternatives, sometimes open source, and in some cases they work without an Internet connection. Depending on your project constraints and the degree of privacy and security required, a customised assistant for your object may prove essential.


>> Read the article on IoT platform comparisons

In this case, it is best to get expert support. At Rtone, we have already implemented solutions of this type, in particular with the Snips platform recently acquired by Sonos, in order to keep as much distance as possible from the GAFA.


Here is a (non-exhaustive) list of interesting platforms in voice:


Whichever assistant you choose: why is it a good idea to get expert support?


While voice seems easy and natural to humans, it creates some difficulties when developing an application. Let’s take the basic example of turning off a light. In the physical world or in a mobile application, the user interface offered is a button. So, in their experience, the user either presses the button or does not: a simple binary choice that is very easy to interpret.

Now consider the same need to control this light, but by voice. Depending on the user, the way of asking may differ, and the assistant must be able to understand and interpret it. Thus, we could receive the following messages: “Turn off the light”, “Switch off the light”, “I want it to be dark”, “Put it in night mode”, etc. It soon becomes clear that even a simple case can make the experience complicated. Add to this context and position: if, for example, your switch is fixed and always corresponds to the kitchen switch, how can you make your assistant understand which light to turn off? Not always easy.
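As an illustration, the sketch below shows one way to handle both issues: mapping several phrasings to a single intent, and falling back to the assistant’s own room when none is named. The device registry and room assignment are purely hypothetical.

from typing import Optional

ASSISTANT_ROOM = "kitchen"                    # where this particular device is installed
LIGHTS = {"kitchen": "light-01", "bedroom": "light-02"}

SAMPLE_UTTERANCES = [                         # several phrasings, one and the same intent
    "turn off the light",
    "switch off the light",
    "i want it to be dark",
    "put it in night mode",
]

def resolve_light(requested_room: Optional[str]) -> str:
    # Pick the target light, using the assistant's own room as the default.
    room = requested_room or ASSISTANT_ROOM
    return LIGHTS.get(room, LIGHTS[ASSISTANT_ROOM])

print(resolve_light(None))        # -> 'light-01' (the kitchen light)
print(resolve_light("bedroom"))   # -> 'light-02'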

Today, voice assistants and their functionalities are developing rapidly. Many assistants are now capable of recognising individual voices, locating you in space (with several assistants in a multi-room setup) and making purchases for you. The user experience is becoming richer and more complex. This is why we believe it is essential to get expert support, whatever platform you choose.


Conclusion


Before embarking on the development of your own voice assistant, you need to ask yourself a number of questions:

  1. What will the ROI be?
  2. How can you deliver a ‘wow’ experience on voice?
  3. How do you know what is possible with voice for your company?
  4. Which technical solution should you choose from the plethora of platforms?


All these questions are legitimate. Voice, although seemingly accessible, can be very complex and can lead to bad experiences for the user.

Rtone, an expert in connected objects, quickly understood the value of voice and the interaction it now brings to objects.

This is why we support you through workshops and develop your voice experience of tomorrow with you.

Do you want to develop your own voice assistant? Contact our experts to receive tailor-made support.
