Category Archives: Alexa

More about AI and voice technology

A couple of weeks ago I went to a day event put on by Amazon showcasing their web technologies. My own main interests were – naturally – in the areas of AI and voice, but there was plenty there if instead you were into security, or databases, or the so-called “internet of things”.

Amazon Dot - Active

Readers of this blog will know of my enthusiasm for Alexa, and perhaps will also know about the range of Alexa skills I have been developing (if you’re interested, go to the UK or the US sites). So I thought I’d go a little bit more into both Alexa and the two building blocks which support Alexa – Lex for language comprehension, and Polly for text-to-speech generation.

Alexa does not in any substantial sense live inside your Amazon Echo or Dot – that simply provides the equivalent of your ears and mouth. Insofar as the phrase is appropriate, Alexa lives in the cloud, interacting with you by means of specific convenient devices. Indeed, Amazon are already moving the focus away from particular pieces of hardware, towards being able to access the technology from a very wide range of devices including web pages, phones, cars, your Kindle, and so on. When you interact with Alexa, the flow of information looks a bit like this (ignoring extra bits and pieces to do with security and such like).

Alexa information flows (simplified)

And if you tease that apart a little bit then this is roughly how Lex and Polly fit in.

Lex and Polly information flows (simplified)

So for today I want to look a bit more at the two “gateway” parts of the jigsaw – Lex and Polly. Lex is there to sort out what it is you want to happen – your intent – given what it is you said. Of course, given the newness of the system, every so often Lex gets it wrong. What entertains me is not so much those occasions when you get misunderstood, but the extremity of some people’s reaction to this. Human listeners make mistakes just like software ones do, but in some circles each and every failure case of Lex is paraded as showing that the technology is inherently flawed. In reality, it is simply under development. It will improve, but I don’t expect that it will ever get to 100% perfection, any more than people will.

Anyway, let’s suppose that Lex has correctly interpreted your intent. Then all kinds of things may happen behind the scenes, from simple list lookups through to complex analysis and decision-making. The details of that are up to the particular skill, and I’m not going to talk about that.

Instead, let’s see what happens on the way back to the user. The skill as a whole has decided on some spoken response. At the current state of the art, that response is almost certainly defined by the coder as a block of text, though one can imagine that in the future, a more intelligent and autonomous Alexa might decide for herself how to frame a reply. But however generated, that body of text has to be transformed into a stream of spoken words – and that is Polly’s job.
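As a rough illustration of that return journey, here is a minimal Python sketch of a skill picking a block of text for a recognised intent and wrapping it in the JSON envelope that Alexa skills return. The intent names and reply wording are hypothetical, invented purely for this example; only the general shape of the request and response is taken from the Alexa skill interface.

```python
# Hypothetical intent names and replies, invented for this illustration.
REPLIES = {
    "AboutTimingIntent": "Timing is book 2 of Far from the Spaceports.",
    "AboutShadowsIntent": "Half Sick of Shadows is a historical fantasy.",
}


def build_response(text):
    """Wrap a block of text in the JSON envelope a skill hands back."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }


def handle_request(event):
    """Look up the reply for the intent that was recognised upstream."""
    intent_name = event["request"]["intent"]["name"]
    text = REPLIES.get(intent_name, "Sorry, I did not understand that.")
    return build_response(text)
```

Everything after this point, turning the `text` field into audio, is the text-to-speech stage's job rather than the skill's.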

A standard Echo or Dot is set up to produce just one voice. There is a certain amount of configurability – pitch can be raised or lowered, the speed of speech altered, or the pronunciation of unusual words defined – but basically Alexa has a single voice when you use one of the dedicated gadgets to access her. Polly has a lot more: currently 48 voices (18 male and 30 female), in 23 languages. Moreover, you can require that the speaker language and the written language differ, and so mimic a French person speaking English. That is great if what you want to do is read out a section of a book, using different voices for the dialogue.
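The pitch and speed adjustments mentioned above are normally expressed in SSML, the markup that wraps the text sent for speech. A minimal sketch, assuming the standard SSML `prosody` tag with its named `rate` and `pitch` values; the helper function names are my own, and whether a given voice honours every setting is something to check rather than assume.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate="medium", pitch="medium"):
    """Wrap text in an SSML prosody tag to alter speed and pitch."""
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )


def speak(*fragments):
    """Join SSML fragments inside the top-level speak element."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = speak(prosody("Hello", rate="slow", pitch="low"))
```

Here `ssml` holds `<speak><prosody rate="slow" pitch="low">Hello</prosody></speak>`, which can be supplied wherever plain text would otherwise go.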

Timing Kindle cover

That’s just what I have been doing over the last couple of days, using Timing (Far from the Spaceports Book 2) as a test-bed. The results aren’t quite ready for this week, but hopefully by next week you can enjoy some snippets. Of course, I rapidly found that even 48 voices are not enough to do what you want. There is a shortage of some languages – in particular Middle Eastern and Asian voices are largely absent – but more will be added in time. One of the great things about Polly (speaking as a coder) is that switching between different voices is very easy, and adding in customised pronunciation is a breeze using a phonetic alphabet. Which is just as well. Polly does pretty well on “normal” words, but celestial bodies such as Phobos and Ceres are not, it seems, considered part of a normal vocabulary! Even the name Mitnash needed some coaxing to get it sounding how I wanted.
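To give a flavour of how voice switching might look in code, here is a sketch that turns a short script into one Polly request per line of dialogue. The voice assigned to each character is purely an assumption for illustration, as is the "visitor" entry, which shows one way a French voice might be asked to read English text via the SSML `lang` tag. The keys of each request match the parameters of boto3's `synthesize_speech` call, but nothing here contacts AWS.

```python
from xml.sax.saxutils import escape

# Hypothetical cast list: the voice choices are assumptions for this sketch,
# and "visitor" is an invented character used to show the lang tag.
CAST = {
    "narrator": {"voice": "Brian"},
    "Slate": {"voice": "Amy"},
    "visitor": {"voice": "Celine", "lang": "en-GB"},
}


def polly_requests(script, cast=CAST):
    """Turn (speaker, line) pairs into keyword arguments, one set per line."""
    requests = []
    for speaker, line in script:
        entry = cast[speaker]
        body = escape(line)
        if "lang" in entry:
            # Ask the voice to read text written in a different language.
            body = f'<lang xml:lang="{entry["lang"]}">{body}</lang>'
        requests.append({
            "Text": f"<speak>{body}</speak>",
            "TextType": "ssml",
            "VoiceId": entry["voice"],
            "OutputFormat": "mp3",
        })
    return requests
```

With AWS credentials in place, each dictionary could be passed along as `boto3.client("polly").synthesize_speech(**request)` and the resulting audio clips stitched together in order.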

The world of Far from the Spaceports and Timing (and the forthcoming Authentication Key) is one where the production of high quality and emotionally sensitive speech by artificial intelligences (personas in the books) is taken for granted. At present we are a very long way from that – Alexa is a very remote ancestor of Slate, if you like – but it’s nice to see the start of something emerging around us.

Friday June 30th was International Asteroid Day!

Artist's impression of asteroid (NASA/JPL)

And no, I hadn’t realised this myself until a couple of days before… but NASA and others around the world had a day’s focus on asteroids. Now, to be sure most of that focus was looking at the thorny question of Near Earth Objects, both asteroids and comets, and what we might be able to do if one was on a collision course.

Far from the Spaceports cover

But it seemed to me that this was as good a time as any to celebrate my fictional Scilly Isle asteroids, as described in Far from the Spaceports and Timing (and the work in progress provisionally called The Authentication Key). In those stories, human colonies have been established on some of the asteroids, and indeed on sundry planets and moons. These settlements have gone a little beyond mining stations and are now places that people call home. A scenario well worth remembering on International Asteroid Day!

Kindle Cover - Half Sick of Shadows

While on the subject of books, some lovely reviews for Half Sick of Shadows have been coming in.

Hoover Reviews said:
“The inner turmoil of The Lady, as she struggles with the Mirror to gain access to the people she comes in contact with, drives the tale as the Mirror cautions her time and again about the dangers involved. The conclusion of the tale, though a heart rending scene, is also one of hope as The Lady finally finds out who she is.”

The Review said:
“Half Sick of Shadows is in a genre all its own, a historical fantasy with some science fiction elements and healthy dose of mystery, it is absolutely unique and a literary sensation. Beautifully written, with an interesting storyline and wonderful imagery, it is in a realm of its own – just like the Lady of Shalott… It truly is mesmerising.”

Find out for yourself at Amazon.co.uk or Amazon.com.

Half Sick of Shadows Alexa skill icon

Or chat about the book with Alexa by enabling the skill at the UK or US stores.

Language and pronunciation

Half Sick of Shadows Alexa skill icon

I’ve been thinking these last few days, once again, about language and pronunciation. This was triggered by working on some more Alexa skills to do with my books. For those who don’t know, I have such things already in place for Half Sick of Shadows, Far from the Spaceports, and Timing. That leaves the Bronze Age series set in Kephrath, in the hill country of Canaan. And here I ran into a problem. Alexa does pretty well with contemporary names – I did have a bit of difficulty with getting her to pronounce “Mitnash” correctly, but solved that simply by changing the spelling of the text I supplied. If instead of Mitnash I wrote Mitt-nash, the text-to-speech engine had enough clues to work out what I meant.

So far so good, but you can only go part of the way down that road. You can’t keep fiddling around with weird spellings just to trick the code into doing what you want. Equally, it’s hardly reasonable to suppose that the Alexa coding team would have considered how to pronounce ancient Canaanite or Egyptian names. Sure enough the difficulties multiplied with the older books. Even “Kephrath” came out rather mangled, and things went downhill from there.
Amazon Dot - Inactive

So I took a step back, did some investigation, and found that you can define the pronunciation of unusual words by using symbols from the phonetic alphabet. Instead of trying to guess how Alexa might pronounce Giybon, or Makty-Rasut, or Ikaret, I can simply work out what symbols I need for the consonants and vowels, and provide these details in a specific format. Instead of Mitnash, I write mɪt.næʃ. Ikaret becomes ˈIk.æ.ˌɹɛt.
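For anyone curious about the "specific format", it is the SSML `phoneme` tag. A minimal sketch, using the Mitnash transcription given above; the helper name is my own invention.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ipa):
    """Wrap a word in an SSML phoneme tag carrying its IPA transcription."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


tag = phoneme("Mitnash", "mɪt.næʃ")
```

The written word stays inside the tag, so the text still reads normally on screen, while the `ph` attribute tells the speech engine what to say.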

So that solved the immediate problem, and over the next few days my Alexa skills for In a Milk and Honeyed Land, Scenes from a Life, and The Flame Before Us will be going live. Being slightly greedy about such things, of course I now want more! Ideally I want a pronunciation dictionary: a list of standard pronunciations that Alexa can tap into at need, rather like a custom word list for a spelling checker. Basically, I want to be able to teach Alexa how to pronounce new words that aren’t in the out-of-the-box setup. I suspect that such a thing is not too far away, since I can hardly be the only person to come across this. In just about every specialised area of interest there are words which aren’t part of everyday speech.
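In the meantime, something like a pronunciation dictionary can be approximated in the skill's own code. The sketch below is only one possible approach, not a built-in Alexa feature: it keeps a small table of names (using the two transcriptions quoted earlier) and substitutes a phoneme tag wherever one of them appears as a whole word. The table and function names are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Transcriptions as quoted in the text; extend the table as needed.
PRONUNCIATIONS = {
    "Mitnash": "mɪt.næʃ",
    "Ikaret": "ˈIk.æ.ˌɹɛt",
}


def apply_pronunciations(text, table=PRONUNCIATIONS):
    """Return SSML in which every listed word carries its phoneme tag."""
    if not table:
        return f"<speak>{escape(text)}</speak>"
    # Longest names first, so a longer entry is preferred over a shorter one.
    words = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        word = match.group(1)
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(table[word])}>'
            f"{escape(word)}</phoneme>"
        )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"
```

Calling `apply_pronunciations("Mitnash spoke.")` tags the name and leaves the rest of the sentence untouched, so the response text itself can be written with ordinary spellings.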

Amazon Dot - Active

But also, this brought me into contact with the perennial issue of UK and US pronunciation. Sure, a particular phonetic symbol means whatever it means, but the examples of typical words vary considerably. As a Brit, I just don’t pronounce some words the same as my American friends, so there has to be a bit of educated guesswork going into deciding what sound I’m hoping for. Of course it’s considerably more complicated than just two nations – within those two there are also large numbers of regional and cultural shifts. And of course there are plenty of countries which use English but sound quite different to either “standard British” or “standard American”.

That’s for some future, yet to be invented, dialect-aware Alexa! Right now it’s enough to code for two variations, and rely on the fact that the standard forms are recognisable enough to get by. But wouldn’t it be cool to be able to insert some extra tags into dialogue in order to get one character’s speech as – say – Cumbrian, and another as from Somerset?
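Coding for two variations can be as simple as keying each transcription by locale. A small sketch, assuming the skill can read a locale string such as "en-GB" or "en-US" from the incoming request; the table uses "tomato" as a familiar stand-in rather than a word from the books, and the names are hypothetical.

```python
# Familiar UK/US example used as a stand-in; not taken from the skills.
VARIANTS = {
    "tomato": {"en-GB": "təˈmɑːtəʊ", "en-US": "təˈmeɪtoʊ"},
}


def ipa_for(word, locale, default_locale="en-GB"):
    """Pick the transcription for a locale, falling back to the default."""
    by_locale = VARIANTS[word]
    return by_locale.get(locale, by_locale[default_locale])
```

Any locale without its own entry falls back to the default here, which is the "recognisable enough to get by" approach in miniature.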