A short blog today as I get back into blog writing after a very busy Easter. And it’s something a little bit different for me – a friend and former work colleague interviewed me for his podcast series over the weekend, and it has now gone live.
Now, I’ve never really got into podcasts, and Mark’s normal focus for his series is business (as you can tell from its title, Absolute Business Mindset), but we both managed to make something of the interaction.
Different people use different podcast software, but this site – https://gopod.me/1340548096 – lists the various options through which you can access the interview. Alternatively, search for Mark’s series by its title, Absolute Business Mindset.
In it, you can hear me talking with Mark about all kinds of stuff, largely focused around maths, artificial intelligence, Alexa and so on, ultimately touching on science fiction. The whole thing takes about an hour, and Alexa takes more of a central role in the second half. Enjoy!
My science fiction books – Far from the Spaceports and Timing, plus two more titles in preparation – are heavily built around exploring relationships between people and artificial intelligences, which I call personas. So as well as a bit of news about one of our present-day AIs – Alexa – I thought I’d talk today about how I see the trajectory leading from where we are today, to personas such as Slate.
Before that, though, some news about a couple of new Alexa skills I have published recently. The first is Martian Weather, providing a summary of recent weather from Elysium Planitia, Mars, courtesy of a public NASA data feed from the Mars InSight lander. You can listen to about a week of temperature, wind, and air pressure reports. At the moment the temperature varies through a Martian day between about -95°C and -15°C, so it’s not very hospitable. Martian Weather is free to enable on your Alexa device from numerous Alexa skill stores, including the UK, US, CA, AU, and IN. The second is Peak District Weather, a companion to my earlier Cumbria Weather skill but – rather obviously – focusing on mountain weather conditions in England’s Peak District rather than the Lake District. Find out about weather conditions that matter to walkers, climbers and cyclists. This one is (so far) only available on the UK store, but other international markets will be added in a few days.
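The skill’s own code isn’t reproduced here, but a minimal sketch of the idea – turning the feed’s JSON into a spoken summary – might look like the following. The field names (sol_keys, AT, HWS, PRE) follow NASA’s public InSight weather API; the numbers in the sample payload are invented for illustration.

```python
# Sketch of turning NASA's InSight weather JSON into a spoken summary,
# roughly as the Martian Weather skill does. The payload below is an
# invented sample in the shape of the real feed (AT = air temperature,
# HWS = horizontal wind speed, PRE = pressure; one object per sol).
sample = {
    "sol_keys": ["259", "260"],
    "259": {"AT": {"mn": -95.2, "mx": -14.8}, "HWS": {"av": 4.6}, "PRE": {"av": 722.5}},
    "260": {"AT": {"mn": -94.1, "mx": -16.3}, "HWS": {"av": 5.1}, "PRE": {"av": 721.9}},
}

def summarise(feed):
    lines = []
    for sol in feed["sol_keys"]:
        d = feed[sol]
        lines.append(
            "On sol {}, temperatures ranged from {:.0f} to {:.0f} degrees Celsius, "
            "with average winds of {:.1f} metres per second and pressure of "
            "{:.0f} pascals.".format(
                sol, d["AT"]["mn"], d["AT"]["mx"], d["HWS"]["av"], d["PRE"]["av"]
            )
        )
    return " ".join(lines)

print(summarise(sample))
```

The summary string is what a skill would hand back to Alexa as the spoken response.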
Current AI research tends to go in one of several directions. We have single-purpose devices which aim to do one thing really well, but have no pretensions outside that. They are basically algorithms rather than intelligences per se – they might be good or bad at their allotted task, but they aren’t going to do well at anything else. We have loads of these around these days – predictive text and autocorrect plugins, autopilots, weather forecasts, and so on. From a coding point of view, it is now comparatively easy to include some intelligence in your application, using modular components, and all you have to do is select some suitable training data to set the system up (actually, that little phrase “suitable training data” conceals a multitude of difficulties, but let’s not go into that today).
Then you get a whole bunch of robots intended to master particular physical tasks, such as car assembly or investigation of burning buildings. Some of these are pretty cute looking, some are seriously impressive in their capabilities, and some have been fashioned to look reasonably humanoid. These – especially the latter group – probably best fit people’s idea of what advanced AI ought to look like. They are also the ones closest to mankind’s long historical enthusiasm for mechanical assistants, dating back at least to Hephaestus, who had a number of automata helping him in his workshop. A contemporary equivalent is Boston Dynamics (originally a spin-off from MIT, later taken over by Google) which has designed and built a number of very impressive robots in this category, and has attracted interest from the US military, while also pursuing civilian programmes.
Then there’s another area entirely, which aims to provide two things: a generalised intelligence rather than one targeted on a specific task, and one which does not come attached to any particular physical trappings. This is the arena of the current crop of digital assistants such as Alexa, Siri, Cortana and so on. It’s also the area that I am both interested in and involved in coding for, and provides a direct ancestry for my fictional personas. Slate and the others are, basically, the offspring – several generations removed – of these digital assistants, but with far more autonomy and general cleverness. Right now, digital assistants are tied to cloud-based sources of information to carry out speech recognition. They give the semblance of being self-contained, but actually are not. So as things stand you couldn’t take an Alexa device out to the asteroid belt and hope to have a decent conversation – there would be a minimum of about half an hour between each line of chat, while communication signals made their way back to Earth, were processed, and then returned to Ceres. So quite apart from things like Alexa needing a much better understanding of human emotions and the subtleties of language, we need a whole lot of technical innovations to do with memory and processing.
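That half-hour figure is easy to check with a couple of lines of arithmetic – the round-trip light delay between Earth and Ceres at their closest approach, roughly 1.77 AU apart (Ceres orbits at about 2.77 AU from the Sun, Earth at 1 AU):

```python
# Back-of-envelope check of the "minimum of about half an hour between
# each line of chat": round-trip light delay from Earth to Ceres and back.
AU_KM = 149_597_870.7    # one astronomical unit in kilometres
C_KM_S = 299_792.458     # speed of light in kilometres per second

def round_trip_minutes(distance_au):
    one_way_seconds = distance_au * AU_KM / C_KM_S
    return 2 * one_way_seconds / 60

# Best case: Ceres at opposition, about 2.77 - 1 = 1.77 AU from Earth.
print(f"{round_trip_minutes(1.77):.0f} minutes")   # roughly half an hour
```

At the other extreme, with Ceres on the far side of the Sun, the same sum gives well over an hour per exchange.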
As ever, though, I am optimistic about these things. I’ve assumed that we will have personas or their equivalent within about 70 or 80 years from now – far enough away that I probably won’t get to chat with them, but my children might, and my grandchildren will. I don’t subscribe to the theory that says that advanced AIs will be inimical to humankind (in the way popularised by Skynet in the Terminator films, and picked up much more recently in the current Star Trek Discovery series). But that’s a whole big subject, and one to be tackled another day.
Meanwhile, you can enjoy my latest couple of Alexa skills and find out about the weather on Mars or England’s Peak District, while I finish some more skills that are in progress, and also continue to write about their future.
In my science fiction stories, I write about artificial intelligences called personas. They are not androids, nor robots in the sense that most people recognise – they have no specialised body hardware, are not able to move around by themselves, and don’t look like imitation humans. They are basically – in today’s terminology – computers, but with a level of artificial intelligence substantially beyond what we are used to. Our current crop of virtual assistants, such as Alexa, Cortana, Siri, Bixby, and so on, are a good analogy – it’s the software running on them that matters, not the particular hardware form. They have a certain amount of built-in capability, and can also have custom talents (like Alexa skills) added on to customise them in an individual way. “My” Alexa is broadly the same as “yours”, in that both tap into the same data store for understanding language, but differs in detail because of the particular combination of extra skills you and I have enabled (in my case, there’s also a lot of trial development code installed). So there is a level of individuality, albeit at a very basic level. They are a step towards personas, but are several generations away from them.
Now, one of the main features that distinguishes personas from today’s AI software is an ability to recognise and appropriately respond to emotion – to empathise. (There’s a whole different topic to do with feeling emotion, which I’ll get back to another day.) Machine understanding of emotion (often called Sentiment Analysis) is a subject of intense research at the moment, with possible applications ranging from monitoring drivers for emotional states that would compromise road safety, through to medical contexts, providing early warning of patients who are in discomfort or pain. Perhaps more disturbingly, it is coming into use during recruitment, and to assess employees’ mood – and in both cases this could be without the subject knowing about or consenting to the study. But correctly recognising emotion is a hard problem… and not just for machine learning.
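To give a flavour of what sentiment analysis involves at its very crudest, here is a toy lexicon-based scorer. Real systems train statistical models over far richer features; the word list here is purely illustrative.

```python
# Toy sentiment analysis: score text against a tiny hand-made lexicon.
# Real systems use trained models, but the principle - map words and
# phrases to an emotional polarity - is the same.
LEXICON = {"love": 2, "great": 2, "good": 1, "pain": -2, "bad": -1, "angry": -2}

def sentiment(text):
    words = (w.strip(".,!?") for w in text.lower().split())
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this, it's great!"))   # positive
print(sentiment("The patient is in pain."))    # negative
```

The obvious weakness is visible straight away: a bare word list has no grasp of negation, sarcasm, or context, which is exactly why the problem is hard.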
Humans also often have problems recognising emotional context. Some people – by nature or training – can get pretty good at it, most people are kind of average, and some people have enormous difficulty understanding and responding to emotions – their own, often, as well as those of other people. There are certain stereotypes we have of this – the cold scientist, the bullish sportsman, the loud bore who dominates a conversation – and we probably all know people whose facility for handling emotions is at best weak. The adjacent picture is taken from an excellent article questioning whether machines will ever be able to detect and respond to emotion – is this man, at the wheel of his car, experiencing road rage, or is he pumped because the sports team he supports has just scored? It’s almost impossible to tell from a still picture.
From a human perspective, we need context – the few seconds running up to that specific image, in which we can listen to the person’s words and observe the various bodily clues to do with posture and so on. If instead of a still picture I gave you a five second video, I suspect you could make a fairly accurate guess at what the person was experiencing. Machine learning is following the same route. One article concerning modern research reads in part, “Automatic emotion recognition is a challenging task… it’s natural to simultaneously utilize audio and visual information”. Basically, the inputs to their system consist of a digitised version of the speech being heard, and four different video feeds focusing on different parts of the person’s face. All five inputs are then combined, and tuned in proprietary ways to focus on details which are sensitive to emotional content. At present, this model is said to do well with “obvious” feelings such as anger or happiness, and struggles with more weakly signalled feelings such as surprise, disgust and so on. But then, much the same is true of many people…
A fascinating – and unresolved – problem is whether emotions, and especially the physical signs of emotions, are universal human constants, or can only be defined in a cultural and historical context. Back in the 1970s, psychological work had concluded that emotions were shared in common across the world, but since then this has been called into question: the range of subjects used for the study was, it has been argued, far too narrow. And when we look into the past or future, the questions become more difficult and less answerable. Can we ever know whether people in, say, the Late Bronze Age experienced the same range of emotions as us? And expressed them with the same bodily features and movements? We can see that they used words like love, anger, fear, and so on, but was their inward experience the same as ours today? Personally I lean towards the camp that emotions are indeed universal, but the counter-arguments are persuasive. And if human emotions are mutable over space and time, what does that say about machine recognition of emotions, or even machine experience of emotions?
One way of exploring these issues is via games, and as I was writing this I came across a very early version of such a game. It is called The Vault, and is being prepared by Queen Mary University of London. In its current form it is hard to get the full picture, but it clearly involves a series of scenes from past, present and future. Some of the descriptive blurb reads, “The Vault game is a journey into history, an immersion into the experiences and emotions of those whose lives were very different from our own. There, we discover unfamiliar feelings, uncanny characters who are like us and yet unlike.” There is a demo trailer at the above link, which looks interesting but unfinished… I tried giving a direct link to the Vimeo video, but the token appears to expire after a while and the link fails. You can still get to the video via the link above.
Meanwhile, my personas will continue to respond to – and experience – emotions, while I wait for software developments to catch up with them! And, of course, continue to develop my own Alexa skills as a kind of remote ancestor to personas.
This week has been busy, with tidying up one Alexa skill, and getting another two ready for release. Of which more later. But first, some space news I caught this week which links to my thoughts about looking for life in the upper atmosphere of Venus. It’s much easier – comparatively speaking – to look at the upper atmosphere of Earth, and that’s just what scientists have been doing.
When you fly on a long-haul flight, you’re at roughly 35,000′ (say 6 1/2 miles, rather higher than Mount Everest). If, like me, you keep an eye on the information readouts about speed, temperature, and so on, you’ll know that it is ferociously cold outside the little bubble of the cabin. In fact, it’s not only cold, but also at a tiny fraction of the air pressure at the surface, with hardly any water vapour, and subject to huge amounts of ultraviolet light. For humans, it is totally inhospitable.
But some microbes flourish here. We don’t exactly know how many, as the study of such things is in its infancy. Certainly life is less dense up there than it is down in our comfort zone. But the total number of organisms living up there in the stratosphere, added up across the whole planet, is truly prodigious.
It’s important for a few reasons. The first, and most relevant to this blog, is that the living conditions are not unlike those on Mars. If we’re able to understand how life works in our own upper atmosphere, we have a better chance of identifying it as and when we come across it elsewhere. Also, it helps us assess the risk of taking microbial life with us by accident, as our rockets leave Earth. If we take Earth-based life with us, we need to be sure we don’t then mistake it for an alien organism when we find it! And conversely, we can decide if there is any serious risk of bringing something home that we weren’t expecting. All very exciting.
Right, back to some quick notes on Alexa skills before finishing. My latest published skill is Jung North West, promoting an occasional experiential training course in Jungian thought. This takes place in Grasmere, Cumbria: the first in the series was earlier this year and was a great success.
The next course is in March 2019, looking at Dreaming and Dream States. But don’t ask me… ask Alexa… “Alexa, open Jung North West… tell me about the next course…” And if you wanted a regular web page version, you could look at jungnorthwest.uk.
After that, there are a couple more skills in the pipeline, including a game (something of a departure for me). And I am in the process of overhauling some of my existing skills to keep them up to date. Some of the Cumbria ones need to be brought in line with the latest hardware changes and opportunities. Coding life never stands still!
I thought that this week I would have a quick break from the Inklings, King Arthur, and such like, and report some space news which I came across a few days ago.
But first, an update on my latest Alexa skill – Polly Reads. This showcases the ability of Alexa’s “big sister”, Polly, to read text in multiple voices and accents. So this skill is a bit like a podcast, letting you step through a series of readings from my novels. Half Sick of Shadows is there, of course, plus some readings from Far from the Spaceports and Timing. So far the skill is available only on the UK Alexa Skills site, but it’s currently going through the approval process for other sites world-wide. Update on Wednesday morning: I just heard that it has gone live world-wide now! Here is the Amazon US link.
Now the space news, and specifically about the asteroid Ceres (or dwarf planet if you prefer). Quite apart from their general interest, this news affects how we write about the outer solar system, so is particularly relevant to my near future series.
Many readers will know that the NASA Dawn spacecraft has been orbiting Ceres for some time now – nearly three years. This has provided us with some fascinating insights into the asteroid, especially the mountains on its surface, and the bright salt deposits found here and there. But the sheer length of time accumulated to date – something like 1500 orbits, at different elevations – means that we can now follow changes as they happen on the surface.
Now the very fact of change is something of a surprise. Not all that long ago, it was assumed that such small objects, made of rock and ice, had long since ceased to evolve. Any internal energy would have leaked away millennia ago, and the only reason for anything to happen would be if there was a collision with some other external object like a meteorite. We knew that the gas giant planets were active, with turbulent storms and hugely powerful prevailing winds, but the swarms of small rocky moons, asteroids, and dwarf planets were considered static.
But what Dawn has shown us is that this is wrong. Repeated views of the same parts of the surface show how areas of exposed ice are constantly growing and shrinking, even over just a few months. This could be because new water vapour is oozing out of surface cracks and then freezing, or alternatively because some layer of dust is slowly settling, and so exposing ice which was previously hidden. At this stage, we can’t tell for sure which of those (or some third explanation) is true.
The evidence now suggests that Ceres once had a liquid water ocean – most of this has frozen into a thick crust of ice, with visible mineral deposits scattered here and there.
Certainly Ceres – and presumably many other asteroids – is more active than we had presumed. Such members of our solar system remain chemically and geologically active, rather than being just inert lumps drifting passively around our sun. As and when we get out there to take a look, we’re going to find a great many more surprises. Meanwhile, we can always read about them…
Well, a couple of weeks have passed and it’s time to get back to blogging. And for this week, here is the Alexa post that I mentioned a little while ago, back in December last year.
First, to anticipate a later part of this post, is the extract of Alexa reciting the first few lines of Wordsworth’s Daffodils…
It has been a busy time for Alexa generally – Amazon have extended sales of the various hardware gizmos to many other countries. That’s well and good for everyone: the bonus for us developers is that they have also extended the range of countries into which custom skills can be deployed. Sometimes with these expansions Amazon helpfully does a direct port to the new locale, and other times it’s up to the developer to do this by hand. So when skills appeared in India, everything I had done to that date was copied across automatically, without me having to do my own duplication of code. From Monday Jan 8th the process of generating default versions for Australia and New Zealand will begin. And Canada is also now in view. Of course, that still leaves plenty of future catch-up work, firstly making sure that the transfer process worked OK, and secondly filling in the gaps for combinations of locale and skill which didn’t get done. The full list of languages and countries to which skills can be deployed is now:
English (UK)
English (US)
English (India)
English (Australia / New Zealand)
German
Japanese
Based on progress so far, Amazon will simply continue extending this to other combinations over time. I suspect that French Canadian will be quite high on their list, and probably other European languages – for example Spanish would give a very good international reach into Latin America. Hindi would be a good choice, and Chinese too, presupposing that Amazon start to market Alexa devices there. Currently an existing Echo or Dot will work in China if hooked up to a network, but so far as I know the gadgets are not on sale there – instead several Chinese firms have begun producing their own equivalents. Of course, there’s nothing to stop someone in another country accessing the skill in one or other of the above languages – for example a Dutch person might consider using either the English (UK) or German option.
To date I have not attempted porting any skills in German or Japanese, essentially through lack of necessary language skills. But all of the various English variants are comparatively easy to adapt to, with an interesting twist that I’ll get to later.
So my latest skill out of the stable, so to speak, is Wordsworth Facts. It has two parts – a small list of facts about the life of William Wordsworth, his family, and some of his colleagues, and also some narrated portions from his poems. Both sections will increase over time as I add to them. It was interesting, and a measure of how text-to-speech technology is improving all the time, to see how few tweaks were necessary to get Alexa to read these extracts tolerably well. Reading poetry is harder than reading prose, and I was expecting difficulties. The choice of Wordsworth helped here, as his poetry is very like prose (indeed, he was criticised for this at the time). As things turned out, some additional punctuation was needed to get these sounding reasonably good, but that was all. Unlike some of the previous reading portions I have done, there was no need to tinker with phonetic alphabets to get words sounding right. It certainly helps not to have ancient Egyptian, Canaanite, or futuristic names in the mix!
And this brings me to one of the twists in the internationalisation of skills. The same letter can sound rather different in different versions of English when used in a word – you say tomehto and I say tomarto, and all that. And I necessarily have to dive into custom pronunciations of proper names of characters and such like – Damariel gets a bit messed up, and even Mitnash, which I had assumed would be easily interpreted, gets mangled. So part of the checking process will be to make sure that where I have used a custom phonetic version of someone’s name, it comes out right.
Wordsworth Facts is live across all of the English variants listed above – just search in your local Amazon store in the Alexa Skills section by name (or, to see all my skills to date, search for “DataScenes Development”, the identity I use for coding purposes). If you’re looking at the UK Alexa Skills store, this is the link.
The next skill I am planning to go live with, probably in the next couple of weeks, is Polly Reads. Those who read this blog regularly – or indeed the Before The Second Sleep blog (see this link, or this, or this) – may well think of Polly as Alexa’s big sister. Polly can use multiple different voices and languages rather than a fixed one, though Polly is focused on generating spoken speech rather than interpreting what a user might be saying (the module in Amazon’s suite that does the comprehension bit is called Lex). So Polly Reads is a compendium of all the various book readings I have set up using Polly, onto which I’ll add a few of my own author readings where I haven’t yet set Polly up with the necessary text and voice combinations. The skill is kind of like a playlist, or maybe a podcast, and naturally my plan is to extend the set of readings over time. More news of that will be posted before the end of the month, all being well.
The process exposed a couple of areas where I would really like Amazon to enhance the audio capabilities of Alexa. The first was when using the built-in ability to access music (i.e. not my own custom skill). Compared to a lot of Alexa interaction, this feels very clunky – there is no easy way to narrow in on a particular band, for example – “The band is Dutch and they play prog rock but I can’t remember the name” could credibly come up with Kayak, but doesn’t. There’s no search facility built in to the music service. And you have to get the track name pretty much dead on – “Alexa, play The Last Farewell by Billy Boyd” gets you nowhere except an “I can’t find that” message, since the track is actually called “The Last Goodbye”. A bit more contextual searching would be good. Basically, this boils down to a shortfall in what we technically call context, and what in a person would be short-term memory – the coder of a skill has to decide exactly what snippets of information to remember from the interaction so far; anything which is not explicitly remembered is discarded.
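For the curious, here is a sketch of what that explicit remembering looks like in a custom skill. Alexa passes a small session-attributes dictionary back and forth with each request, and the skill must stash anything it wants to recall there. This is plain Python with invented utterances rather than the real Alexa SDK plumbing, but the shape is faithful.

```python
# Sketch of a skill's "short-term memory": a session-attributes dict that
# travels with each request. Anything not stored in it is forgotten.
def handle_request(utterance, session_attrs):
    if utterance.startswith("play something by "):
        band = utterance[len("play something by "):]
        session_attrs["last_band"] = band          # explicitly remembered
        return f"Playing {band}.", session_attrs
    if utterance == "play more of them":
        band = session_attrs.get("last_band")      # recalled from context
        if band:
            return f"Playing more {band}.", session_attrs
        return "Sorry, more of whom?", session_attrs
    return "Sorry, I didn't catch that.", session_attrs

attrs = {}
reply, attrs = handle_request("play something by Kayak", attrs)
reply, attrs = handle_request("play more of them", attrs)
print(reply)   # Playing more Kayak.
```

If the coder forgets to store the band name, the follow-up question falls flat – which is exactly the kind of gap the built-in music service currently shows.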
That was a user-moan. The second is more of a developer-moan. Playing audio tracks of more than a few seconds – like a book extract, or a decent length piece of music – involves transferring control from your own skill to Alexa, who then manages the sequencing of tracks and all that. That’s all very well, and I understand the purpose behind it, but it also means that you have lost some control over the presentation of the skill as the various tracks play. For example, on the new Echo Show (the one with the screen) you cannot interleave the tracks with relevant pictures – like a book cover, for example. Basically the two bits of capability don’t work very well together. Of course all these things are very new, but it would be great to see some better integration between the different pieces of the jigsaw. Hopefully this will be improved with time…
This is the third and final part of Left Behind by Events, in which I take a look at my own futuristic writing and try to guess which bits I will have got utterly wrong when somebody looks back at it from a future perspective! But it’s also the first of a few blogs in which I will talk a bit about some of the impressions I got of technical near-future as seen at the annual Microsoft Future Decoded conference that I went to the other day.
So I am tolerably confident about the development of AI. We don’t yet have what I call “personas” with autonomy, emotion, and gender. I’m not counting the pseudo-gender produced by selecting a male or female voice, though actually even that simple choice persuades many people – how many people are pedantic enough to call Alexa “it” rather than “she”? But at the rate of advance of the relevant technologies, I’m confident that we will get there.
I’m equally confident, being an optimistic guy, that we’ll develop better, faster space travel, and have settlements of various sizes on asteroids and moons. The ion drive I posit is one definite possibility: the Dawn asteroid probe already uses this system, though at a hugely smaller rate of acceleration than what I’m looking for. The Hermes, which features in both the book and film The Martian, also employs this drive type. If some other technology becomes available, the stories would be unchanged – the crucial point is that intra-solar-system travel takes weeks rather than months.
I am totally convinced that financial crime will take place! One of the ways we try to tackle it on Earth is to share information faster, so that criminals cannot take advantage of lags in the system to insert falsehoods. But out in the solar system, there’s nothing we can do about time lags. Mars is between 4 and 24 minutes from Earth in terms of a radio or light signal, and there’s nothing we can do about that unless somebody invents a faster-than-light signal. And that’s not in range of my future vision. So the possibility of “information friction” will increase as we spread our occupancy wider. Anywhere that there are delays in the system, there is the possibility of fraud… as used to great effect in The Sting.
Something I have not factored in at all is biological advance. I don’t have cyborgs, or genetically enhanced people, or such things. But I suspect that the likelihood is that such developments will occur well within the time horizon of Far from the Spaceports. Biology isn’t my strong suit, so I haven’t written about this. There’s a background assumption that illness isn’t a serious problem in this future world, but I haven’t explored how that might happen, or what other kinds of medical change might go hand-in-hand with it. So this is almost certainly going to be a miss on my part.
Moving on to points of contact with the conference, there is the question of my personas’ autonomy. Right now, all of our current generation of intelligent assistants – Alexa, Siri, Cortana, Google Home and so on – rely utterly on a reliable internet connection and a whole raft of cloud-based software to function. No internet or no cloud connection = no Alexa.
This is clearly inadequate for a persona like Slate heading out to the asteroid belt! Mitnash is obviously not going to wait patiently for half an hour or so between utterances in a conversation. For this to work, the software infrastructure that imparts intelligence to a persona has to travel along with it. This need is already emerging – and being addressed – right now. I guess most of us are familiar with the idea of the Cloud. Your Gmail account, your Dropbox files, your iCloud pictures all exist somewhere out there… but you neither know nor care where exactly they live. All you care is that you can get to them when you want.
But with the emerging “internet of things” that is having to change. Let’s say that a wildlife programme puts a trail camera up in the mountains somewhere in order to get pictures of a snow leopard. They want to leave it there for maybe four months and then collect it again. It’s well out of wifi range. In those four months it will capture say 10,000 short videos, almost all of which will not be of snow leopards. There will be mountain goats, foxes, mice, leaves, moving splashes of sunshine, flurries of rain or snow… maybe the odd yeti. But the memory stick will only hold say 500 video clips. So what do you do? Throw away everything that arrives after it gets full? Overwrite the oldest clips when you need to make space? Arrange for a dangerous and disruptive resupply trip by your mountaineer crew?
Or… and this is the choice being pursued at the moment… put some intelligence in your camera to try to weed out non-snow-leopard pictures. Your camera is no longer a dumb picture-taking device, but has some intelligence. It also makes your life easier when you have recovered the camera and are trying to scan through the contents. Even going through my Grasmere badger-cam vids every couple of weeks involves a lot of deleting scenes of waving leaves!
So this idea is now being called the Cloud Edge. You put some processing power and cleverness out in your peripheral devices, and only move what you really need into the Cloud itself. Some of the time, your little remote widgets can make up their own minds what to do. You can, so I am told, buy a USB stick with a trainable neural network on it for sifting images (or other similar tasks) for well under £100. Now, this is a far cry from an independently autonomous persona able to zip off to the asteroid belt, but it shows that the necessary technologies are already being tackled.
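As a sketch of that idea applied to the trail camera: run an on-device classifier over each clip and only keep the ones that look promising, up to the storage limit. The classifier below is faked with pre-computed scores; on real hardware it would be a small neural network, and the threshold and limit are invented for illustration.

```python
# Cloud Edge sketch: the camera decides locally which clips are worth
# keeping, rather than shipping everything back for central processing.
STORAGE_LIMIT = 500   # clips the memory stick can hold
THRESHOLD = 0.8       # minimum "looks like a snow leopard" score

def filter_clips(clips, classify):
    kept = []
    for clip in clips:
        if classify(clip) >= THRESHOLD and len(kept) < STORAGE_LIMIT:
            kept.append(clip)
    return kept

# Faked classifier scores: mostly goats and waving leaves, the odd leopard.
scores = {"clip1": 0.10, "clip2": 0.95, "clip3": 0.40, "clip4": 0.85}
kept = filter_clips(scores, lambda c: scores[c])
print(kept)   # ['clip2', 'clip4']
```

The pay-off is that four months of footage shrinks to the handful of clips actually worth a human’s attention.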
I’ve been deliberately vague about how far into the future Far from the Spaceports, Timing, and the sequels in preparation are set. If I had to pick a time I’d say somewhere around the one or two century mark. Although science fact notoriously catches up with science fiction faster than authors imagine, I don’t expect to see much of this happening in my lifetime (which is a pity, really, as I’d love to converse with a real Slate). I’d like to think that humanity from one part of the globe or another would have settled bases on other planets, moons, or asteroids while I’m still here to see them, and as regular readers will know, I am very excited about where AI is going. But a century to reach the level of maturity of off-Earth habitats that I propose seems, if anything, over-optimistic.
That’s it for today – over the next few weeks I’ll be talking about other fun things I learned…
Today’s blog is primarily about the latest addition to book readings generated using Amazon’s Polly text-to-speech software, but before getting to that it’s worth saying goodbye to the Cassini space probe. This was launched nearly twenty years ago, has been orbiting Saturn and its moons since 2004, and is now almost out of fuel. By the end of the week, following a deliberate course change to avoid polluting any of the moons, Cassini will impact Saturn and break up in the atmosphere there.
So, Half Sick of Shadows and Polly. Readers of this blog, or the Before the Second Sleep blog (first post and second post) will know that I have been using Amazon’s Polly technology to generate book readings. The previous set was for Timing, the second of the Far from the Spaceports science fiction books. Today it is the turn of Half Sick of Shadows.
Without further ado, and before getting to some technical stuff, here is the result. It’s a short extract from late on in the book, and I selected it specifically because there are several speakers.
OK. Polly is a variation of the text-to-speech capability seen in Amazon Alexa, but with a couple of differences. First, it is geared purely to voice output, rather than the mix of input and output needed for Alexa to work.
Secondly, Polly allows a range of gender, voice and language, not just the fixed voice of Alexa. The original intention was to provide multi-language support in various computer or mobile apps, but it suits me very well for representing narrative and dialogue. For this particular reading I have used four different voices.
If you want to set up your own experiment, you can go to this link and start to play. You’ll need to set up some login credentials to get there, but you can extend your regular Amazon ones to do this. This demo page allows you to select which voice you want and enter any desired text. You can even download the result if you want.
But the real magic starts when you select the SSML tab and enter more complex examples. SSML is an industry-standard way of describing speech, and covers a whole wealth of variations. You can add what are effectively stage directions with it – pauses of different lengths, hints about parts of speech, emphasis, and (if necessary) a letter-by-letter phonetic description. You can speed up or slow down the reading, and raise or lower the pitch. Finally, and even more usefully for my purposes, you can choose the language of the text independently of the native language of the speaker. So you can have an Italian speaker pronouncing an English sentence, or vice versa. Since all my books are written in English, that means I can considerably extend the range of speakers. Some combinations don’t work very well, so you have to test what you have specified, but that’s fair enough.
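To make those "stage directions" concrete, here is a small sketch that builds an SSML document with a slower, lower-pitched narration, a pause, and then an emphasised aside. The tag names are standard SSML as Polly implements it; the sentences themselves are invented samples, not lines from the book.

```python
def build_ssml(narration, aside):
    """Wrap two pieces of text in SSML stage directions:
    a slower, lower-pitched narration, then a pause,
    then a strongly emphasised aside."""
    return ("<speak>"
            '<prosody rate="slow" pitch="low">' + narration + "</prosody>"
            '<break time="700ms"/>'
            '<emphasis level="strong">' + aside + "</emphasis>"
            "</speak>")

ssml = build_ssml("The lake lay grey under the morning mist.",
                  "Someone was already out on the water.")
print(ssml)
```

The resulting string can be pasted straight into the SSML tab of the demo page to hear the difference the directions make.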
If you’re comfortable with the coding effort required, you can call the Polly libraries with all the necessary settings and generate a whole batch of audio at once, rather than piecemeal. Back when I put together the Timing extracts, I wrote a program which was configurable enough that now I just have to specify the text concerned, plus the selection of voices and other sundry details. It still takes a little while to select the right passage and get everything organised, but it’s a lot easier than starting from scratch every time. Before too much longer, there’ll be dialogue extracts from Far from the Spaceports as well!
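My own program isn't published, but a minimal sketch of the same idea looks like this: pair each extract with a voice, build the Polly request for each, and then run them all through `synthesize_speech`. The voice names (Brian, Amy, Raveena) are real Polly voices, but the passage texts here are invented for illustration, and the actual synthesis step needs AWS credentials configured.

```python
# Each extract is (voice, text); the texts are placeholders, not book quotes.
extracts = [
    ("Brian",   "Mitnash looked out at the asteroid."),
    ("Amy",     "Slate considered the numbers for a moment."),
    ("Raveena", "The reply came back quicker than expected."),
]

def build_requests(extracts, output_format="mp3"):
    """Turn (voice, text) pairs into synthesize_speech keyword arguments."""
    return [
        {"VoiceId": voice, "Text": text,
         "OutputFormat": output_format, "TextType": "text"}
        for voice, text in extracts
    ]

requests = build_requests(extracts)

def synthesize_all(requests):
    """Run every request through Polly, writing one mp3 per extract.
    Requires AWS credentials to be configured locally."""
    import boto3
    polly = boto3.client("polly")
    for i, req in enumerate(requests):
        response = polly.synthesize_speech(**req)
        with open("extract_%02d.mp3" % i, "wb") as f:
            f.write(response["AudioStream"].read())
```

Once something like this is in place, adding a new reading really is just a matter of supplying the text and the cast of voices.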
Just a short post today to highlight a YouTube video based around one of the Polly conversations from Timing that I have been talking about recently. This one is of Mitnash, Slate, Parvati and Chandrika talking on board Parvati’s spaceship, The Parakeet, en route to Phobos. The subject of conversation is the recent wreck of Selif’s ship on Tean, one of the smaller asteroids in the Scilly Isles group…
The link is: https://youtu.be/Uv5L0yMKaT0
While we’re in YouTube, here is the link to the conversation with Alexa about Timing… https://youtu.be/zLHZSOF_9xo
It’s slow work, but gradually all these various conversations and readings will get added to YouTube and other video sharing sites.
A couple of weeks ago I went to a day event put on by Amazon showcasing their web technologies. My own main interests were – naturally – in the areas of AI and voice, but there was plenty there if instead you were into security, or databases, or the so-called “internet of things”.
Readers of this blog will know of my enthusiasm for Alexa, and perhaps will also know about the range of Alexa skills I have been developing (if you’re interested, go to the UK or the US sites). So I thought I’d go a little bit more into both Alexa and the two building blocks which support Alexa – Lex for language comprehension, and Polly for text-to-speech generation.
Alexa does not in any substantial sense live inside your Amazon Echo or Dot – that simply provides the equivalent of your ears and mouth. Insofar as the phrase is appropriate, Alexa lives in the cloud, interacting with you by means of specific convenient devices. Indeed, Amazon are already moving the focus away from particular pieces of hardware, towards being able to access the technology from a very wide range of devices including web pages, phones, cars, your Kindle, and so on. When you interact with Alexa, the flow of information looks a bit like this (ignoring extra bits and pieces to do with security and such like).
And if you tease that apart a little bit then this is roughly how Lex and Polly fit in.
So for today I want to look a bit more at the two “gateway” parts of the jigsaw – Lex and Polly. Lex is there to sort out what it is you want to happen – your intent – given what it is you said. Of course, given the newness of the system, every so often Lex gets it wrong. What entertains me is not so much those occasions when you get misunderstood, but the extremity of some people’s reaction to this. Human listeners make mistakes just like software ones do, but in some circles each and every failure case of Lex is paraded as showing that the technology is inherently flawed. In reality, it is simply under development. It will improve, but I don’t expect that it will ever get to 100% perfection, any more than people will.
Anyway, let’s suppose that Lex has correctly interpreted your intent. Then all kinds of things may happen behind the scenes, from simple list lookups through to complex analysis and decision-making. The details of that are up to the particular skill, and I’m not going to talk about that.
Instead, let’s see what happens on the way back to the user. The skill as a whole has decided on some spoken response. At the current state of the art, that response is almost certainly defined by the coder as a block of text, though one can imagine that in the future, a more intelligent and autonomous Alexa might decide for herself how to frame a reply. But however generated, that body of text has to be transformed into a stream of spoken words – and that is Polly’s job.
A standard Echo or Dot is set up to produce just one voice. There is a certain amount of configurability – pitch can be raised or lowered, the speed of speech altered, or the pronunciation of unusual words defined. But basically Alexa has a single voice when you use one of the dedicated gadgets to access her. But Polly has a lot more – currently 48 voices (18 male and 30 female), in 23 languages. Moreover, you can require that the speaker language and the written language differ, and so mimic a French person speaking English. Which is great if what you want to do is read out a section of a book, using different voices for the dialogue.
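Polly's catalogue of voices can be queried programmatically via `describe_voices`, and each entry carries an Id, gender, and language code, which makes casting a scene a matter of filtering. The sample list below is a tiny hand-typed illustration in the shape Polly returns, not a live query; the real call (which needs AWS credentials) is shown in the comment.

```python
def voices_for(voices, language_code=None, gender=None):
    """Filter a Polly describe_voices result by language and/or gender."""
    return [v["Id"] for v in voices
            if (language_code is None or v["LanguageCode"] == language_code)
            and (gender is None or v["Gender"] == gender)]

# A small illustrative sample; the live version would be:
#   voices = boto3.client("polly").describe_voices()["Voices"]
voices = [
    {"Id": "Amy",     "Gender": "Female", "LanguageCode": "en-GB"},
    {"Id": "Brian",   "Gender": "Male",   "LanguageCode": "en-GB"},
    {"Id": "Carla",   "Gender": "Female", "LanguageCode": "it-IT"},
    {"Id": "Giorgio", "Gender": "Male",   "LanguageCode": "it-IT"},
]

print(voices_for(voices, language_code="en-GB"))  # the British voices
print(voices_for(voices, gender="Male"))          # the male voices
```

Handing an Italian voice like Carla a block of English text is exactly the mimic-a-French-person trick described above, just with a different accent.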
That’s just what I have been doing over the last couple of days, using Timing (Far from the Spaceports Book 2) as a test-bed. The results aren’t quite ready for this week, but hopefully by next week you can enjoy some snippets. Of course, I rapidly found that even 48 voices are not enough to do what you want. There is a shortage of some languages – in particular Middle Eastern and Asian voices are largely absent – but more will be added in time. One of the great things about Polly (speaking as a coder) is that switching between different voices is very easy, and adding in customised pronunciation is a breeze using a phonetic alphabet. Which is just as well. Polly does pretty well on “normal” words, but celestial bodies such as Phobos and Ceres are not, it seems, considered part of a normal vocabulary! Even the name Mitnash needed some coaxing to get it sounding how I wanted.
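The customised pronunciation mentioned above is done with SSML's phoneme tag, which tells Polly exactly how to say a word instead of letting it guess from the spelling. The IPA strings below are my illustrative guesses at the pronunciations, not the ones the author actually settled on.

```python
def phoneme(word, ipa):
    """Wrap a word in an SSML phoneme tag giving its IPA pronunciation."""
    return '<phoneme alphabet="ipa" ph="%s">%s</phoneme>' % (ipa, word)

line = ("<speak>"
        + phoneme("Phobos", "ˈfoʊbɒs")
        + " hung low over the horizon, with "
        + phoneme("Ceres", "ˈsɪəriːz")
        + " far beyond."
        + "</speak>")
print(line)
```

A name like Mitnash would get the same treatment: one short tag per awkward word, and the voice stops mangling it.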
The world of Far from the Spaceports and Timing (and the in-preparation Authentication Key) is one where the production of high-quality and emotionally sensitive speech by artificial intelligences (personas in the books) is taken for granted. At present we are a very long way from that – Alexa is a very remote ancestor of Slate, if you like – but it’s nice to see the start of something emerging around us.