The Power of Speech


Amazon Dot – Inactive

I recently invested in an Amazon Dot, and therefore in the AI software that makes the Dot interesting – Alexa, Amazon’s virtual assistant. But I’m not going to write about the cool stuff this little gizmo can do, so much as about the thoughts it prompted on AI and conversation.

The ability to interact with a computer by voice – consistently, effectively, and across a wide range of topics – is seen by the major industry players as the next big milestone. Let’s briefly look back at the history that brought us here.

Punched card with Fortran programming – I started with that language, long ago… (Wiki)

Once upon a time, all you could use was a highly artificial, structured set of commands, passed in on punched cards or (some time later) via a keyboard. If a command was wrong, the machine would not do what you expected. There was no latitude for variation, which among other things meant that using a computer required special training.

Early IBM PC (Wiki)

The first breakthrough was to separate the command language from the user’s options. User interfaces were born: you could tell the machine what you wanted to do without needing to know how it did it. You could write documents or play games without knowing a word of any computer language, simply by typing a few letters or clicking with a mouse. Somewhere around this time it became possible to communicate easily with machines in other locations, and the Internet came into being.

Touchscreen on early model iPhone (Wiki)

The next change appeared first on phones – the touch screen. At first sight there’s not much difference between clicking with a mouse and tapping with a finger, but in fact they are worlds apart. You use your body directly to work with the content, rather than indirectly through a tool. And the same surface – the screen – is used to communicate both ways, rather than the machine sending output through the screen and receiving input via movements of a gadget on an entirely different surface. Touch screens have vastly widened access to technology and information: advanced computers are quite literally in anyone’s pocket. But touch interfaces have their problems. It’s not especially easy to compose passages of text. It’s not always obvious how to use the visual cues to achieve what you want. And it doesn’t work well if you’re making a cake and need to look up the next stage with wet and floury hands!

Which brings us to the next breakthrough – speech. Human beings are wired for speech, just as we are wired for touch. The human brain can recognise and interpret speech sounds much faster than other noises. We learn the ability in the womb. We respond differently to different speakers and different languages before birth, and master the act of communicating needs and desires at a very early age. We infer, and broadcast, all kinds of social information through speech – gender, age, educational level, occupation, emotional state, prejudice and so on. Speech allows us to explain what we really wanted when we are misunderstood, and has propelled us along our historical trajectory. Long before systematic writing was invented, and through all the places and times where writing has been an unknown skill to many, talking has still enabled us to make society.

Timing Kindle cover

Enter Alexa, and Alexa’s companions such as Siri, Cortana, or “OK Google”. The aim of all of them is to allow people to find things out, or cause things to happen, simply by talking. They’re all at an early stage still, but their ability to comprehend is seriously impressive compared to a few short years ago. None of them are anywhere near the level I assume for Slate and the other “personas” in my science fiction books, with whom one can have an open-ended dialogue complete with emotional content, plus a long-term relationship.

What’s good about Alexa? First, the speech recognition is excellent. There are times when the interpreted version of my words is wrong, sometimes laughably so, but that happens with other people too. The system is designed to be open-ended, so additional features and bug fixes are applied regularly. It also allows third parties to develop capabilities (“skills”) and publish them for anyone to use – watch this space over the next few months! So the technology has definitely reached a level where it is ready for public appraisal.
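As a taste of what writing a skill involves: the sketch below is a minimal AWS Lambda handler for a hypothetical custom skill, written against the raw Alexa request/response JSON rather than Amazon’s SDK. The intent name PuppyFactIntent and the facts themselves are invented for illustration.

```python
# A minimal AWS Lambda handler for a hypothetical custom Alexa skill,
# using the raw Alexa request/response JSON rather than the official
# SDK. "PuppyFactIntent" and the facts are invented examples.
import random

PUPPY_FACTS = [
    "Puppies are born deaf and blind.",
    "A puppy sleeps for around fourteen hours a day.",
]

def build_response(text):
    """Wrap plain text in the Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return build_response("Hello! Ask me for a puppy fact.")
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "PuppyFactIntent"):
        return build_response(random.choice(PUPPY_FACTS))
    return build_response("Sorry, I didn't catch that.")
```

Most of the effort actually goes into the interaction model on the Amazon side – the sample utterances that map spoken phrases onto intents like the one above.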

Hidden Markov model – an algorithm often used in speech recognition (Wiki)
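Since the caption above mentions hidden Markov models: the idea is that the audio is treated as a sequence of observations emitted by hidden states (phonemes, in a recogniser), and the forward algorithm computes how likely an observation sequence is under a given model. Here is a toy sketch, using the classic weather example with made-up probabilities:

```python
# Toy forward algorithm for a hidden Markov model. The two states and
# all probabilities are made up for illustration; a real recogniser
# works with far larger models over phoneme-like states.
states = ("rain", "sun")           # hypothetical hidden states
start = {"rain": 0.6, "sun": 0.4}  # initial state probabilities
trans = {"rain": {"rain": 0.7, "sun": 0.3},  # transition probabilities
         "sun":  {"rain": 0.4, "sun": 0.6}}
emit = {"rain": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # observation probs
        "sun":  {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def forward(observations):
    """Return P(observations), summed over all hidden-state paths."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

print(forward(["walk", "shop", "clean"]))  # likelihood of this sequence
```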

What’s not so good? Well, the conversation is highly structured. Depending on the particular skill in use, you are relying on either Amazon or a third-party developer to anticipate and code for a good range of requests. But even the best of these skills is necessarily quite constrained, and it doesn’t take long to reach the boundaries of what can be managed. There’s also very little sense of context or memory. Talking to a person, you often say “what we were talking about yesterday…” or “I chatted to Stuart today…”, and the context is clear from shared experience. Right now, Alexa has no memory of past verbal transactions, and very little sense of the context of a particular request.

But Alexa also has no sense of importance. A human conversation has all kinds of ways to communicate “this is really important to me” or “this is just fun”. Lots of conversations go something like “you know what we were talking about yesterday…”, at which the listener pauses and then says, “oh… that”. Alexa, however, cannot at present distinguish between the relative importance of “give me a random fact about puppies”, “tell me if there are delays on the Northern Line today”, and “where is the nearest doctor’s surgery?”

These are, I believe, problems that can be solved over time. The pool of data that Alexa and similar virtual assistants work with grows daily, and the algorithms that churn through that pool to extract meaning are becoming more sensitive and subtle. I suspect it’s only a matter of time before one of these software constructs is equipped with an understanding of context and transactional history, and with that, a sense of relative importance.

Amazon Dot – Active

Alexa is a long way removed from Slate and her associates, but the ability to use unstructured, free-form sentences to communicate is a big step forward. I like to think that subsequent generations of virtual assistants will make other strides, and that we’ll be tackling issues of AI rights and working partnerships before too long.

Meanwhile, back to writing my own Alexa skill…

