I thought that this week I would have a quick break from the Inklings, King Arthur, and such like, and report some space news which I came across a few days ago.
But first, an update on my latest Alexa skill – Polly Reads. This showcases the ability of Alexa’s “big sister”, Polly, to read text in multiple voices and accents. So this skill is a bit like a podcast, letting you step through a series of readings from my novels. Half Sick of Shadows is there, of course, plus some readings from Far from the Spaceports and Timing. So far the skill is available only on the UK Alexa Skills site, but it’s currently going through the approval process for other sites world-wide. **Update on Wednesday morning: I just heard that it has gone live world-wide! Here is the Amazon US link.**
Now the space news, and specifically about the asteroid Ceres (or dwarf planet if you prefer). Quite apart from its general interest, this news affects how we write about the outer solar system, so is particularly relevant to my near future series.
Many readers will know that the NASA Dawn spacecraft has been orbiting Ceres for some time now – nearly three years. This has provided us with some fascinating insights into the asteroid, especially the mountains on its surface, and the bright salt deposits found here and there. But the sheer length of time accumulated to date – something like 1500 orbits, at different altitudes – means that we can now follow changes as they happen on the surface.
Now the very fact of change is something of a surprise. Not all that long ago, it was assumed that such small objects, made of rock and ice, had long since ceased to evolve. Any internal energy would have leaked away millennia ago, and the only reason for anything to happen would be if there was a collision with some other external object like a meteorite. We knew that the gas giant planets were active, with turbulent storms and hugely powerful prevailing winds, but the swarms of small rocky moons, asteroids, and dwarf planets were considered static.
But what Dawn has shown us is that this is wrong. Repeated views of the same parts of the surface show how areas of exposed ice are constantly growing and shrinking, even over just a few months. This could be because new water vapour is oozing out of surface cracks and then freezing, or alternatively because some layer of dust is slowly settling, and so exposing ice which was previously hidden. At this stage, we can’t tell for sure which of those (or some third explanation) is true.
The evidence now suggests that Ceres once had a liquid water ocean – most of this has frozen into a thick crust of ice, with visible mineral deposits scattered here and there.
Certainly Ceres – and presumably many other asteroids – is more active than we had presumed. Such members of our solar system remain chemically and geologically active, rather than being just inert lumps drifting passively around our sun. As and when we get out there to take a look, we’re going to find a great many more surprises. Meanwhile, we can always read about them…
Well, a couple of weeks have passed and it’s time to get back to blogging. And for this week, here is the Alexa post that I mentioned a little while ago, back in December last year.
First, to anticipate a later part of this post, here is an extract of Alexa reciting the first few lines of Wordsworth’s Daffodils…
It has been a busy time for Alexa generally – Amazon have extended sales of various of the hardware gizmos to many other countries. That’s well and good for everyone: the bonus for us developers is that they have also extended the range of countries into which custom skills can be deployed. Sometimes with these expansions Amazon helpfully does a direct port to the new locale, and other times it’s up to the developer to do this by hand. So when skills appeared in India, everything I had done to that date was copied across automatically, without me having to do my own duplication of code. From Monday Jan 8th the process of generating default versions for Australia and New Zealand will begin. And Canada is also now in view. Of course, that still leaves plenty of future catch-up work, firstly making sure that their transfer process worked OK, and secondly filling in the gaps for combinations of locale and skill which didn’t get done. The full list of languages and countries to which skills can be deployed is now:
English (UK)
English (US)
English (India)
English (Australia / New Zealand)
German
Japanese
Based on progress so far, Amazon will simply continue extending this to other combinations over time. I suspect that French Canadian will be quite high on their list, and probably other European languages – for example Spanish would give a very good international reach into Latin America. Hindi would be a good choice, and Chinese too, presupposing that Amazon start to market Alexa devices there. Currently an existing Echo or Dot will work in China if hooked up to a network, but so far as I know the gadgets are not on sale there – instead several Chinese firms have begun producing their own equivalents. Of course, there’s nothing to stop someone in another country accessing the skill in one or other of the above languages – for example a Dutch person might consider using either the English (UK) or German option.
To date I have not attempted porting any skills in German or Japanese, essentially through lack of necessary language skills. But all of the various English variants are comparatively easy to adapt to, with an interesting twist that I’ll get to later.
So my latest skill out of the stable, so to speak, is Wordsworth Facts. It has two parts – a small list of facts about the life of William Wordsworth, his family, and some of his colleagues, and also some narrated portions from his poems. Both sections will increase over time as I add to them. It was interesting, and a measure of how text-to-speech technology is improving all the time, to see how few tweaks were necessary to get Alexa to read these extracts tolerably well. Reading poetry is harder than reading prose, and I was expecting difficulties. The choice of Wordsworth helped here, as his poetry is very like prose (indeed, he was criticised for this at the time). As things turned out, in this case some additional punctuation was needed to get these sounding reasonably good, but that was all. Unlike some of the previous reading portions I have done, there was no need to tinker with phonetic alphabets to get words sounding right. It certainly helps not to have ancient Egyptian, Canaanite, or futuristic names in the mix!
And this brings me to one of the twists in the internationalisation of skills. The same letter can sound rather different in different versions of English when used in a word – you say tomehto and I say tomarto, and all that. And I necessarily have to dive into custom pronunciations of proper names of characters and such like – Damariel gets a bit messed up, and even Mitnash, which I had assumed would be easily interpreted, gets mangled. So part of the checking process will be to make sure that where I have used a custom phonetic version of someone’s name, it comes out right.
Wordsworth Facts is live across all of the English variants listed above – just search in your local Amazon store in the Alexa Skills section by name (or to see all my skills to date, search for “DataScenes Development”, which is the identity I use for coding purposes). If you’re looking at the UK Alexa Skills store, this is the link.
The next skill I am planning to go live with, probably in the next couple of weeks, is Polly Reads. Those who read this blog regularly – or indeed the Before The Second Sleep blog (see this link, or this, or this) – may well think of Polly as Alexa’s big sister. Polly can use multiple different voices and languages rather than a fixed one, though Polly is focused on generating spoken speech rather than interpreting what a user might be saying (the module in Amazon’s suite that does the comprehension bit is called Lex). So Polly Reads is a compendium of all the various book readings I have set up using Polly, onto which I’ll add a few of my own author readings where I haven’t yet set Polly up with the necessary text and voice combinations. The skill is kind of like a playlist, or maybe a podcast, and naturally my plan is to extend the set of readings over time. More news of that will be posted before the end of the month, all being well.
The process exposed a couple of areas where I would really like Amazon to enhance the audio capabilities of Alexa. The first was when using the built-in ability to access music (i.e. not my own custom skill). Compared to a lot of Alexa interaction, this feels very clunky – there is no easy way to narrow in on a particular band, for example. “The band is Dutch and they play prog rock but I can’t remember the name” could credibly come up with Kayak, but doesn’t. There’s no search facility built in to the music service. And you have to get the track name pretty much dead on – “Alexa, Play The Last Farewell by Billy Boyd” gets you nowhere except an “I can’t find that” message, since it is called “The Last Goodbye”. A bit more contextual searching would be good. Basically, this boils down to a shortfall in what we technically call context – what in a person would be short-term memory. The coder of a skill has to decide exactly what snippets of information to remember from the interaction so far; anything which is not explicitly remembered will be discarded.
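To make that last point concrete, here is a minimal sketch of how a skill keeps its short-term memory: whatever the coder puts into the sessionAttributes part of a response comes back with the user’s next request, and everything else is simply gone. The handler and attribute names below are my own invention for illustration, not from any real skill.

```python
# Sketch of a skill's short-term "memory": only what is explicitly
# stored in sessionAttributes survives to the next user request.
def build_response(speech_text, remembered=None):
    """Build a minimal Alexa-style response, carrying forward chosen context."""
    return {
        "version": "1.0",
        "sessionAttributes": remembered or {},
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": False,
        },
    }

# Remember the band the user was narrowing in on...
reply = build_response("Did you mean Kayak?", {"candidateBand": "Kayak"})

# ...so the next request handler can consult it.
def handle_followup(session_attributes):
    band = session_attributes.get("candidateBand")
    return f"Playing {band}." if band else "Which band did you mean?"
```

Anything not put into that dictionary is discarded between turns – which is exactly the shortfall described above.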
That was a user-moan. The second is more of a developer-moan. Playing audio tracks of more than a few seconds – like a book extract, or a decent length piece of music – involves transferring control from your own skill to Alexa, who then manages the sequencing of tracks and all that. That’s all very well, and I understand the purpose behind it, but it also means that you have lost some control over the presentation of the skill as the various tracks play. For example, on the new Echo Show (the one with the screen) you cannot interleave the tracks with relevant pictures – like a book cover, for example. Basically the two bits of capability don’t work very well together. Of course all these things are very new, but it would be great to see some better integration between the different pieces of the jigsaw. Hopefully this will be improved with time…
This is the third and final part of Left Behind by Events, in which I take a look at my own futuristic writing and try to guess which bits I will have got utterly wrong when somebody looks back at it from a future perspective! But it’s also the first of a few blogs in which I will talk a bit about some of the impressions I got of technical near-future as seen at the annual Microsoft Future Decoded conference that I went to the other day.
So I am tolerably confident about the development of AI. We don’t yet have what I call “personas” with autonomy, emotion, and gender. I’m not counting the pseudo-gender produced by selecting a male or female voice, though actually even that simple choice persuades many people – how many people are pedantic enough to call Alexa “it” rather than “she”? But at the rate of advance of the relevant technologies, I’m confident that we will get there.
I’m equally confident, being an optimistic guy, that we’ll develop better, faster space travel, and have settlements of various sizes on asteroids and moons. The ion drive I posit is one definite possibility: the Dawn asteroid probe already uses this system, though with a far smaller acceleration than I’m looking for. The Hermes, which features in both the book and film The Martian, also employs this drive type. If some other technology becomes available, the stories would be unchanged – the crucial point is that intra-solar-system travel takes weeks rather than months.
I am totally convinced that financial crime will take place! One of the ways we try to tackle it on Earth is to share information faster, so that criminals cannot take advantage of lags in the system to insert falsehoods. But out in the solar system, there’s nothing we can do about time lags. Mars is between 4 and 24 minutes from Earth in terms of a radio or light signal, and there’s nothing we can do about that unless somebody invents a faster-than-light signal. And that’s not in range of my future vision. So the possibility of “information friction” will increase as we spread our occupancy wider. Anywhere that there are delays in the system, there is the possibility of fraud… as used to great effect in The Sting.
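For the curious, that quoted lag is easy to check with a line or two of arithmetic – light (and hence radio) takes about 499 seconds to cross one astronomical unit, and the Earth–Mars distance swings between roughly 0.52 AU at opposition and 2.52 AU or more near conjunction:

```python
# Back-of-envelope check of the Earth-Mars signal lag quoted above.
# Light takes about 499 seconds to cross one astronomical unit.
LIGHT_SECONDS_PER_AU = 499.005

def one_way_delay_minutes(distance_au):
    return distance_au * LIGHT_SECONDS_PER_AU / 60

# Mars orbits at about 1.52 AU from the sun, so the Earth-Mars gap
# ranges from roughly 0.52 AU (opposition) to about 2.52 AU (conjunction).
closest = one_way_delay_minutes(0.52)   # a little over 4 minutes
farthest = one_way_delay_minutes(2.52)  # around 21 minutes
```

The exact extremes depend on where both planets sit in their slightly eccentric orbits, which is where the commonly quoted 4-to-24-minute range comes from.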
Something I have not factored in at all is biological advance. I don’t have cyborgs, or genetically enhanced people, or such things. But I suspect that the likelihood is that such developments will occur well within the time horizon of Far from the Spaceports. Biology isn’t my strong suit, so I haven’t written about this. There’s a background assumption that illness isn’t a serious problem in this future world, but I haven’t explored how that might happen, or what other kinds of medical change might go hand-in-hand with it. So this is almost certainly going to be a miss on my part.
Moving on to points of contact with the conference, there is the question of my personas’ autonomy. Right now, all of our current generation of intelligent assistants – Alexa, Siri, Cortana, Google Home and so on – rely utterly on a reliable internet connection and a whole raft of cloud-based software to function. No internet or no cloud connection = no Alexa.
This is clearly inadequate for a persona like Slate heading out to the asteroid belt! Mitnash is obviously not going to wait patiently for half an hour or so between utterances in a conversation. For this to work, the software infrastructure that imparts intelligence to a persona has to travel along with it. This need is already emerging – and being addressed – right now. I guess most of us are familiar with the idea of the Cloud. Your Gmail account, your Dropbox files, your iCloud pictures all exist somewhere out there… but you neither know nor care where exactly they live. All you care about is that you can get to them when you want.
But with the emerging “internet of things” that is having to change. Let’s say that a wildlife programme puts a trail camera up in the mountains somewhere in order to get pictures of a snow leopard. They want to leave it there for maybe four months and then collect it again. It’s well out of wifi range. In those four months it will capture say 10,000 short videos, almost all of which will not be of snow leopards. There will be mountain goats, foxes, mice, leaves, moving splashes of sunshine, flurries of rain or snow… maybe the odd yeti. But the memory stick will only hold say 500 video clips. So what do you do? Throw away everything that arrives after it gets full? Overwrite the oldest clips when you need to make space? Arrange for a dangerous and disruptive resupply trip by your mountaineer crew?
Or… and this is the choice being pursued at the moment… put some intelligence in your camera to try to weed out non-snow-leopard pictures. Your camera is no longer a dumb picture-taking device, but has some intelligence. It also makes your life easier when you have recovered the camera and are trying to scan through the contents. Even going through my Grasmere badger-cam vids every couple of weeks involves a lot of deleting scenes of waving leaves!
So this idea is now being called the Cloud Edge. You put some processing power and cleverness out in your peripheral devices, and only move what you really need into the Cloud itself. Some of the time, your little remote widgets can make up their own minds what to do. You can, so I am told, buy a USB stick with a trainable neural network on it for sifting images (or other similar tasks) for well under £100. Now, this is a far cry from an independently autonomous persona able to zip off to the asteroid belt, but it shows that the necessary technologies are already being tackled.
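As a toy illustration of that edge-filtering idea, the logic is essentially “score each clip locally, keep only the confident ones, and never exceed the card’s capacity”. The classifier below is a stand-in – a real device would run its trained neural network at that point – and the scores and capacity are invented numbers.

```python
# Toy sketch of edge filtering on a trail camera: score each clip
# locally, keep only likely snow-leopard footage, and stay within a
# fixed storage budget. The "classifier" is a placeholder for a model.
def looks_like_snow_leopard(clip):
    # Placeholder: a real device would run a small neural network here.
    return clip["score"] >= 0.8

def filter_clips(clips, capacity):
    kept = []
    for clip in clips:
        if looks_like_snow_leopard(clip):
            kept.append(clip)
            if len(kept) > capacity:
                # Still full? Drop the least confident clip we hold.
                kept.sort(key=lambda c: c["score"])
                kept.pop(0)
    return kept

clips = [{"id": i, "score": s} for i, s in
         enumerate([0.1, 0.95, 0.4, 0.85, 0.99, 0.2])]
survivors = filter_clips(clips, capacity=2)
```

The waving leaves and mountain goats get discarded at the camera, and only the promising clips ever compete for the memory stick.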
I’ve been deliberately vague about how far into the future Far from the Spaceports, Timing, and the sequels in preparation are set. If I had to pick a time I’d say somewhere around the one or two century mark. Although science fact notoriously catches up with science fiction faster than authors imagine, I don’t expect to see much of this happening in my lifetime (which is a pity, really, as I’d love to converse with a real Slate). I’d like to think that humanity from one part of the globe or another would have settled bases on other planets, moons, or asteroids while I’m still here to see them, and as regular readers will know, I am very excited about where AI is going. But a century to reach the level of maturity of off-Earth habitats that I propose seems, if anything, over-optimistic.
That’s it for today – over the next few weeks I’ll be talking about other fun things I learned…
Today’s blog is primarily about the latest addition to book readings generated using Amazon’s Polly text-to-speech software, but before getting to that it’s worth saying goodbye to the Cassini space probe. This was launched nearly twenty years ago, has been orbiting Saturn and its moons since 2004, and is now almost out of fuel. By the end of the week, following a deliberate course change to avoid polluting any of the moons, Cassini will impact Saturn and break up in the atmosphere there.
So, Half Sick of Shadows and Polly. Readers of this blog, or the Before the Second Sleep blog (first post and second post) will know that I have been using Amazon’s Polly technology to generate book readings. The previous set were for the science fiction book Timing (Far from the Spaceports Book 2). Today it is the turn of Half Sick of Shadows.
Without further ado, and before getting to some technical stuff, here is the result. It’s a short extract from late on in the book, and I selected it specifically because there are several speakers.
OK. Polly is a variation of the text-to-speech capability seen in Amazon Alexa, but with a couple of differences. First, it is geared purely to voice output, rather than the mix of input and output needed for Alexa to work.
Secondly, Polly allows a range of gender, voice and language, not just the fixed voice of Alexa. The original intention was to provide multi-language support in various computer or mobile apps, but it suits me very well for representing narrative and dialogue. For this particular reading I have used four different voices.
If you want to set up your own experiment, you can go to this link and start to play. You’ll need to set up some login credentials to get there, but you can extend your regular Amazon ones to do this. This demo page allows you to select which voice you want and enter any desired text. You can even download the result if you want.
But the real magic starts when you select the SSML tab, and enter more complex examples. SSML is an industry standard way of describing speech, and covers a whole wealth of variations. You can add what are effectively stage directions with it – pauses of different lengths, directions about parts of speech, emphasis, and (if necessary) a phonetic letter by letter description. You can speed up or slow down the reading, and raise or lower the pitch. Finally, and even more usefully for my purposes, you can select the spoken language as well as the language of the speaker. So you can have an Italian speaker pronouncing an English sentence, or vice versa. Since all my books are written in English, that means I can considerably extend the range of speakers. Some combinations don’t work very well, so you have to test what you have specified, but that’s fair enough.
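To give a flavour of what that looks like, here is a small hand-written SSML fragment using a few of the features just mentioned – a pause, emphasis, a prosody change, and a language tag. The tags themselves are standard SSML; the particular timings and pitch choices are just for illustration.

```python
# A sample SSML fragment of the kind you can paste into the Polly demo
# page: a pause, some emphasis, a slower low-pitched phrase, and a
# language tag governing how the enclosed text is pronounced.
ssml = """<speak>
  I wandered lonely as a cloud
  <break time="500ms"/>
  that floats on high o'er <emphasis level="moderate">vales and hills</emphasis>,
  <prosody rate="slow" pitch="low">when all at once I saw a crowd,</prosody>
  <lang xml:lang="en-GB">a host, of golden daffodils.</lang>
</speak>"""
```

Paste something like this into the SSML tab of the demo page, pick a voice, and compare it with the plain-text rendering of the same lines.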
If you’re comfortable with the coding effort required, you can call the Polly libraries with all the necessary settings and generate a whole lot of text all at once, rather than piecemeal. Back when I put together the Timing extracts, I wrote a program which was configurable enough that now I just have to specify the text concerned, plus the selection of voices and other sundry details. It still takes a little while to select the right passage and get everything organised, but it’s a lot easier than starting from scratch every time. Before too much longer, there’ll be dialogue extracts from Far from the Spaceports as well!
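I won’t reproduce the actual program here, but the shape of such a batch generator is roughly as follows – each passage is paired with a voice and turned into a request that would, in real use, be handed to the AWS Polly client’s synthesize_speech call. Brian and Amy are genuine Polly voice names; the passages themselves are placeholders.

```python
# Rough shape of a batch reading generator: pair each passage with a
# voice and build the request that would be passed, in real use, to
# polly_client.synthesize_speech(**request). The AWS call itself is
# omitted here, since it needs live credentials.
def polly_request(text, voice, output_format="mp3"):
    return {
        "Text": f"<speak>{text}</speak>",
        "TextType": "ssml",
        "VoiceId": voice,
        "OutputFormat": output_format,
    }

passages = [
    ("Narration for the opening scene.", "Brian"),
    ("A line of dialogue.", "Amy"),
]
requests = [polly_request(text, voice) for text, voice in passages]
```

Once the voices and sundry details are configured this way, adding a new extract really is just a matter of supplying the text.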
Just a short post today to highlight a YouTube video based around one of the Polly conversations from Timing that I have been talking about recently. This one is of Mitnash, Slate, Parvati and Chandrika talking on board Parvati’s spaceship, The Parakeet, en route to Phobos. The subject of conversation is the recent wreck of Selif’s ship on Tean, one of the smaller asteroids in the Scilly Isles group…
The link is: https://youtu.be/Uv5L0yMKaT0
While we’re on YouTube, here is the link to the conversation with Alexa about Timing… https://youtu.be/zLHZSOF_9xo
It’s slow work, but gradually all these various conversations and readings will get added to YouTube and other video sharing sites.
A couple of weeks ago I went to a day event put on by Amazon showcasing their web technologies. My own main interests were – naturally – in the areas of AI and voice, but there was plenty there if instead you were into security, or databases, or the so-called “internet of things”.
Readers of this blog will know of my enthusiasm for Alexa, and perhaps will also know about the range of Alexa skills I have been developing (if you’re interested, go to the UK or the US sites). So I thought I’d go a little bit more into both Alexa and the two building blocks which support Alexa – Lex for language comprehension, and Polly for text-to-speech generation.
Alexa does not in any substantial sense live inside your Amazon Echo or Dot – that simply provides the equivalent of your ears and mouth. Insofar as the phrase is appropriate, Alexa lives in the cloud, interacting with you by means of specific convenient devices. Indeed, Amazon are already moving the focus away from particular pieces of hardware, towards being able to access the technology from a very wide range of devices including web pages, phones, cars, your Kindle, and so on. When you interact with Alexa, the flow of information looks a bit like this (ignoring extra bits and pieces to do with security and such like).
And if you tease that apart a little bit then this is roughly how Lex and Polly fit in.
So for today I want to look a bit more at the two “gateway” parts of the jigsaw – Lex and Polly. Lex is there to sort out what it is you want to happen – your intent – given what it is you said. Of course, given the newness of the system, every so often Lex gets it wrong. What entertains me is not so much those occasions when you get misunderstood, but the extremity of some people’s reaction to this. Human listeners make mistakes just like software ones do, but in some circles each and every failure case of Lex is paraded as showing that the technology is inherently flawed. In reality, it is simply under development. It will improve, but I don’t expect that it will ever get to 100% perfection, any more than people will.
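To make that concrete: by the time your own code is invoked, the utterance has been boiled down to a structured intent with named slots, something like the sketch below. The intent and slot names are invented for illustration, not from a real skill.

```python
# What the comprehension step hands your code: not audio, not even raw
# words, but a named intent with any recognised slots filled in.
# The intent and slot names here are purely illustrative.
interpreted = {
    "intent": {
        "name": "PlayReadingIntent",
        "slots": {
            "book": {"name": "book", "value": "Half Sick of Shadows"},
        },
    },
}

def route(request):
    """Dispatch on the intent name that Lex resolved."""
    name = request["intent"]["name"]
    if name == "PlayReadingIntent":
        return "Reading from " + request["intent"]["slots"]["book"]["value"]
    return "Sorry, I didn't catch that."
```

When the interpretation is wrong, it is this intent name or a slot value that comes out wrong – which is why the occasional failure is an interpretation error, not a flaw in the whole pipeline.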
Anyway, let’s suppose that Lex has correctly interpreted your intent. Then all kinds of things may happen behind the scenes, from simple list lookups through to complex analysis and decision-making. The details of that are up to the particular skill, and I’m not going to talk about that.
Instead, let’s see what happens on the way back to the user. The skill as a whole has decided on some spoken response. At the current state of the art, that response is almost certainly defined by the coder as a block of text, though one can imagine that in the future, a more intelligent and autonomous Alexa might decide for herself how to frame a reply. But however generated, that body of text has to be transformed into a stream of spoken words – and that is Polly’s job.
A standard Echo or Dot is set up to produce just one voice. There is a certain amount of configurability – pitch can be raised or lowered, the speed of speech altered, or the pronunciation of unusual words defined. But basically Alexa has a single voice when you use one of the dedicated gadgets to access her. But Polly has a lot more – currently 48 voices (18 male and 30 female), in 23 languages. Moreover, you can require that the speaker language and the written language differ, and so mimic a French person speaking English. Which is great if what you want to do is read out a section of a book, using different voices for the dialogue.
That’s just what I have been doing over the last couple of days, using Timing (Far from the Spaceports Book 2) as a test-bed. The results aren’t quite ready for this week, but hopefully by next week you can enjoy some snippets. Of course, I rapidly found that even 48 voices are not enough to do what you want. There is a shortage of some languages – in particular Middle Eastern and Asian voices are largely absent – but more will be added in time. One of the great things about Polly (speaking as a coder) is that switching between different voices is very easy, and adding in customised pronunciation is a breeze using a phonetic alphabet. Which is just as well. Polly does pretty well on “normal” words, but celestial bodies such as Phobos and Ceres are not, it seems, considered part of a normal vocabulary! Even the name Mitnash needed some coaxing to get it sounding how I wanted.
The world of Far from the Spaceports and Timing (and the in-preparation Authentication Key) is one where the production of high quality and emotionally sensitive speech by artificial intelligences (personas in the books) is taken for granted. At present we are a very long way from that – Alexa is a very remote ancestor of Slate, if you like – but it’s nice to see the start of something emerging around us.
And no, I hadn’t realised this myself until a couple of days before… but NASA and others around the world had a day’s focus on asteroids. Now, to be sure, most of that focus was on the thorny question of Near Earth Objects, both asteroids and comets, and what we might be able to do if one was on a collision course.
But it seemed to me that this was as good a time as any to celebrate my fictional Scilly Isle asteroids, as described in Far from the Spaceports and Timing (and the work in progress provisionally called The Authentication Key). In those stories, human colonies have been established on some of the asteroids, and indeed on sundry planets and moons. These settlements have gone a little beyond mining stations and are now places that people call home. A scenario well worth remembering on International Asteroid Day!
While on the subject of books, some lovely reviews for Half Sick of Shadows have been coming in.
Hoover Reviews said:
“The inner turmoil of The Lady, as she struggles with the Mirror to gain access to the people she comes in contact with, drives the tale as the Mirror cautions her time and again about the dangers involved. The conclusion of the tale, though a heart rending scene, is also one of hope as The Lady finally finds out who she is.”
The Review said: “Half Sick of Shadows is in a genre all its own, a historical fantasy with some science fiction elements and healthy dose of mystery, it is absolutely unique and a literary sensation. Beautifully written, with an interesting storyline and wonderful imagery, it is in a realm of its own – just like the Lady of Shalott… It truly is mesmerising.”
I’ve been thinking these last few days, once again, about language and pronunciation. This was triggered by working on some more Alexa skills to do with my books. For those who don’t know, I have such things already in place for Half Sick of Shadows, Far from the Spaceports, and Timing. That leaves the Bronze Age series set in Kephrath, in the hill country of Canaan. And here I ran into a problem. Alexa does pretty well with contemporary names – I did have a bit of difficulty with getting her to pronounce “Mitnash” correctly, but solved that simply by changing the spelling of the text I supplied. If instead of Mitnash I wrote Mitt-nash, the text-to-speech engine had enough clues to work out what I meant.
So far so good, but you can only go part of the way down that road. You can’t keep fiddling around with weird spellings just to trick the code into doing what you want. Equally, it’s hardly reasonable to suppose that the Alexa coding team would have considered how to pronounce ancient Canaanite or Egyptian names. Sure enough the difficulties multiplied with the older books. Even “Kephrath” came out rather mangled, and things went downhill from there.
So I took a step back, did some investigation, and found that you can define the pronunciation of unusual words by using symbols from the phonetic alphabet. Instead of trying to guess how Alexa might pronounce Giybon, or Makty-Rasut, or Ikaret, I can simply work out what symbols I need for the consonants and vowels, and provide these details in a specific format. Instead of Mitnash, I write mɪt.næʃ. Ikaret becomes ˈIk.æ.ˌɹɛt.
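In code terms, each awkward name just gets wrapped in an SSML phoneme tag – a one-line helper, using the IPA rendering quoted above, is all it takes:

```python
# Wrap a hard-to-pronounce name in an SSML phoneme tag, giving the
# text-to-speech engine an explicit IPA rendering to use.
def phoneme(word, ipa):
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'

tagged = phoneme("Mitnash", "mɪt.næʃ")
```

The engine then speaks the IPA symbols and ignores its own guess at the written word.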
So that solved the immediate problem, and over the next few days my Alexa skills for In a Milk and Honeyed Land, Scenes from a Life, and The Flame Before Us will be going live. Being slightly greedy about such things, of course I now want more! Ideally I want the ability to set up a pronunciation dictionary, so that I can just set up a list of standard pronunciations that Alexa can tap into at need – rather like having a custom list of words for a spelling checker. Basically, I want to be able to teach Alexa how to pronounce new words that aren’t in the out-of-the-box setup. I suspect that such a thing is not too far away, since I can hardly be the only person to come across this. In just about every specialised area of interest there are words which aren’t part of everyday speech.
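Until such a dictionary exists, one can fake it by hand: keep a lookup of names to IPA and substitute the phoneme markup before the text goes off to the speech engine. A rough sketch follows – the IPA shown for Kephrath is illustrative only, while the Mitnash rendering is the one quoted earlier.

```python
# A hand-rolled pronunciation dictionary: swap every known name for
# its SSML phoneme markup before sending text to the speech engine.
# The Kephrath IPA below is illustrative, not a considered rendering.
LEXICON = {
    "Kephrath": "ˈkɛf.ɹæθ",
    "Mitnash": "mɪt.næʃ",
}

def apply_lexicon(text, lexicon=LEXICON):
    for word, ipa in lexicon.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
        text = text.replace(word, tag)
    return text

marked = apply_lexicon("Mitnash travelled far from Kephrath.")
```

It is exactly the custom word list idea from spelling checkers, applied to sound rather than spelling.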
But also, this brought me into contact with the perennial issue of UK and US pronunciation. Sure, a particular phonetic symbol means whatever it means, but the examples of typical words vary considerably. As a Brit, I just don’t pronounce some words the same as my American friends, so there has to be a bit of educated guesswork going into deciding what sound I’m hoping for. Of course it’s considerably more complicated than just two nations – within those two there are also large numbers of regional and cultural shifts. And of course there are plenty of countries which use English but sound quite different to either “standard British” or “standard American”.
That’s for some future, yet to be invented, dialect-aware Alexa! Right now it’s enough to code for two variations, and rely on the fact that the standard forms are recognisable enough to get by. But wouldn’t it be cool to be able to insert some extra tags into dialogue in order to get one character’s speech as – say – Cumbrian, and another as from Somerset?