The convergence of artificial intelligence, smartphone adoption and the availability of huge amounts of consumer data is producing a new generation of virtual assistants. Wearables play a crucial role as well: speech recognition is now built into every major operating system, allowing users to speak directly to the machine.
Although an army of scientists has devoted decades to this challenge, if you consider the Siri app it seems we are still far from the dream of speaking conversationally with a machine. The good news is that the technology is improving fast, and future virtual assistants will be able to put your words into the proper context and reply accordingly.
The task is much more complex than you might think. In this post I'm going to explain why, and envision future developments. Machines that talk with people have a long history. In 2003 DARPA invested heavily in a five-year, 500-person project aimed at building a virtual assistant. The government wanted software to help military commanders optimize communication. The helper was named CALO, the Cognitive Assistant that Learns and Organizes. Siri is thus the progeny of the largest artificial intelligence project in U.S. history, brought to life by three scientists who launched a standalone iPhone app called Siri in early 2010. Several weeks after launch, they received a phone call which, I assume, sounded like this: "Hey, it's Steve. What are you doing tomorrow? Want to come over to my house?" It was Steve Jobs, and Apple acquired the technology for a reported $150 to $250 million in 2010. The sad irony is that Siri is also Steve's orphan: he died in October 2011, the day after Siri debuted on the iPhone.
So how does the Siri app work? Why is it so difficult to talk with a machine? And what is the potential for the future?
Phase 1: voice recognition
It's apparently the easy part, but it's where everything begins, so it can't be trivial. When you give Siri a command, your device captures your analog voice, converts it into an audio file (i.e., translates it into binary code) and sends it to Apple's servers. The nuances of your voice, the surrounding noise and local expressions make it difficult to get right. This is called the Human User Interface, as opposed to the standard Graphical User Interface we are used to. It matters here that, every day, Apple collects millions of queries from people speaking multiple languages, in many accents, on different continents. In other words, with their actions and mistakes, people are contributing to the largest crowd-sourced speech recognition experiment ever attempted. The Siri app today receives roughly a billion requests per week, and Apple states its speech recognition has just a 5 percent word error rate. To get to this point, last year Apple acquired the speech recognition company Novauris Technologies, a spinoff of Dragon Systems, and also hired several speech recognition experts.
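That "5 percent word error rate" is a standard metric: the word-level edit distance between what was said and what was recognized, divided by the length of the true sentence. Here is a minimal sketch of how it's computed (the example sentences are invented, not Apple's data):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deleted word
                           dp[i][j - 1] + 1,          # inserted word
                           dp[i - 1][j - 1] + cost)   # substituted word
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five: a 20% word error rate.
print(word_error_rate("book a table for two", "book a table for few"))  # 0.2
```

A 5 percent rate therefore means roughly one word in twenty comes back wrong — and as we'll see below, a single wrong keyword can ruin the whole query.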
Phase 2: send everything to Apple servers in the cloud
Siri does not process your speech input locally on your phone. This is clearly a problem if you’re not connected for any reason, but this way Apple gets two major benefits:
- offload much of the work to powerful computers rather than eating the limited resources of the mobile device
- use the data it collects to continuously improve the service
The algorithm identifies keywords and walks you down the flowchart branches related to those keywords to retrieve your answer. If a piece of the communication fails, it goes down the wrong flowchart branch — and a single such mistake ruins the whole query, which ends in the "Would you like to search the web for that?" result. Google Now and Cortana are no different.
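The keyword-driven routing described above can be sketched in a few lines. This is a deliberately toy model (the keywords and responses are invented, not Siri's actual tables), but it shows the brittleness: anything that misses every branch falls through to web search.

```python
# Toy flowchart: each keyword routes the query to one branch.
FLOWCHART = {
    "weather": lambda q: "Here is the forecast.",
    "timer":   lambda q: "Timer started.",
    "call":    lambda q: "Calling...",
}

def route(query):
    """Walk the branches; any query matching no keyword falls through."""
    for keyword, branch in FLOWCHART.items():
        if keyword in query.lower():
            return branch(query)
    return "Would you like to search the web for that?"

print(route("Set a timer for ten minutes"))  # Timer started.
print(route("What's a Margherita?"))         # Would you like to search the web for that?
```

Note that one misrecognized keyword — "timer" heard as "tiger" — sends the query straight to the fallback, which is exactly the failure mode described above.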
You can see this is far from the concept of a human conversation. The Siri app is still built on the logic of pre-programming all the possible sets of questions and answer rules. This was even more evident when, in October 2015, Apple honored "Back to the Future" day by updating the Siri app with at least ten humorous responses related to the popular movie. My favorite, "be careful who you date today, or you could start disappearing from photos…", is just one answer picked at random from the list.
Phase 3: understand the meaning
The process of understanding what the user is asking for relies on an area of science called natural language processing. People have dozens of ways of asking the same thing; we can express a concept using endless combinations of words. "I'm in the mood for a pizza", "Is there an Italian restaurant nearby?", "I'd love a Margherita today". Humans easily understand what I mean — it's obvious that Margherita is not a person but a pizza — yet an algorithm must be quite sophisticated to reach the same conclusion. Sometimes the problem is simply that words sound similar or are mispronounced: oyster and ostrich, school and skull, byte and bite, sheep and ship, and many others complicate the task.
To simplify its life, the Siri software models linguistic concepts. It analyzes how the subject keyword is connected to an object and a verb; in other words, it looks at the syntactic structure of the text. The decision to go down one branch of the flowchart or another depends on nouns, adjectives and verbs, as well as the general intonation of the sentence. On top of that, Siri can make sense of follow-up questions and commands. This is not exactly what a human would call "a conversation", but it means Siri understands context, and it's the starting point for future developments.
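One way to picture the "many phrasings, one meaning" problem is a concept lexicon: different surface words are mapped to the same underlying concept before any branch is chosen. The vocabulary below is invented for the sketch — real systems use far richer models — but the idea is the same.

```python
# Toy concept lexicon: several surface words collapse to one concept.
CONCEPTS = {
    "pizza": "italian_food",
    "margherita": "italian_food",
    "pasta": "italian_food",
    "sushi": "japanese_food",
}

def detect_intent(sentence):
    """Map any food word in the sentence to a restaurant-search intent."""
    words = sentence.lower().strip(".?!").split()
    for w in words:
        if w in CONCEPTS:
            return "find_restaurant:" + CONCEPTS[w]
    return None  # no known concept: fall back to web search

for s in ["I'm in the mood for a pizza", "I'd love a Margherita today"]:
    print(detect_intent(s))  # find_restaurant:italian_food (both)
```

Both of the example phrasings from the paragraph above land on the same intent, even though they share almost no words.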
Phase 4: transform the meaning into actionable instructions
We know that Siri is here to help us, not just to understand what we say. In "The story behind Siri", co-founder Adam Cheyer recalls: "I remember the first time we loaded these data sources into Siri, I typed 'start over' into the system, and Siri came back saying, 'Looking for businesses named "Over" in Start, Louisiana.' 'Oh, boy,' I thought."
Once the Siri app understands what you want, it has to dialogue with other apps to make it happen. And every app is different and has, in part, its own "language". The system must have what is called domain knowledge: it must know the subject area you're talking about. In human conversation this happens every time we talk with experts in a certain field and they use specialized words we barely understand — with a doctor, an architect or a finance person, for example. For the Siri app it's the same. When it has to give directions, book a flight or send a text, it has to dialogue with other apps and understand their context. This is crucial as well: if the protocol breaks down, Siri can instruct other apps to perform actions you didn't request or expect, which could even be dangerous for you.
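The "each domain speaks its own language" idea maps naturally onto a handler registry: each domain registers a function that knows its own vocabulary, and a dispatcher hands the understood request to the right one. This is a hypothetical sketch — the domain names and handlers are invented — not Apple's actual protocol.

```python
# Registry of domain handlers; each one speaks its own "language".
HANDLERS = {}

def register(domain):
    """Decorator that records a handler for one domain."""
    def wrap(fn):
        HANDLERS[domain] = fn
        return fn
    return wrap

@register("directions")
def directions(slots):
    return "Routing to " + slots["destination"]

@register("messaging")
def send_text(slots):
    return "Texting " + slots["to"] + ": " + slots["body"]

def dispatch(domain, slots):
    """Hand the parsed request to the right domain, or fall back."""
    handler = HANDLERS.get(domain)
    if handler is None:
        return "Would you like to search the web for that?"
    return handler(slots)

print(dispatch("messaging", {"to": "Anna", "body": "running late"}))
# Texting Anna: running late
```

The danger mentioned above lives exactly at this boundary: if the parse puts the wrong values in `slots`, a perfectly obedient handler performs an action you never asked for.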
Last but not least, once a request has been processed, Siri must convert the result back into text that can be spoken to the user. While not as hard as processing a user's command, this task, known as natural language generation, still presents challenges. Today Siri speaks with the American voice of "Samantha", recorded by Susan Bennett in July 2005 — the same person who voiced Tillie the All-Time Teller. But after Apple purchased Siri, it had to extend the capability to many more languages and voices; that's another reason why the Siri app has not grown as fast as originally expected.
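In its simplest form, natural language generation is template filling: the structured result of a query is rendered back into a sentence the voice can read aloud. The templates below are invented for illustration, and you can already see why localization is painful — every template must be rewritten, reordered and re-inflected for each language.

```python
# Minimal template-based natural language generation (toy templates).
TEMPLATES = {
    "weather": "It is {temp} degrees and {sky} in {city}.",
    "timer":   "Your {minutes}-minute timer has started.",
}

def generate(result_type, **slots):
    """Render a structured result into a speakable sentence."""
    return TEMPLATES[result_type].format(**slots)

print(generate("weather", temp=21, sky="sunny", city="Rome"))
# It is 21 degrees and sunny in Rome.
```

Real systems go well beyond this — handling plurals, grammatical gender and word order per language — which is where the localization cost mentioned above comes from.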
The future of Siri app: adaptive and predictive
If the goal of a virtual assistant is to enhance human capabilities, supplement the limitations of our minds and free us from mundane, tedious tasks… the Siri app is just not enough. The next generation of personal assistants, like Viv and VIQ, will learn the way a human child does: step by step. Today the machine essentially guesses the best outcome for the user; tomorrow it will candidly admit "I don't know what that is". After your explanation it learns, and correlates the new word with everything it already knows. So if you ask for an Italian restaurant it will pull a list from the web, fine. But if you ask for a Margherita, you will have to explain that it's a simple pizza with just tomato, mozzarella and basil. Then it will make the link (Margherita, pizza, Italian restaurant)… forever.
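The learn-once-link-forever loop described above can be sketched as a tiny stateful assistant: unknown terms trigger an admission of ignorance, an explanation adds a link to a small concept store, and from then on the term resolves. Everything here (class, phrasing, concepts) is invented for the illustration.

```python
# Sketch of an assistant that admits ignorance, then learns a link.
class LearningAssistant:
    def __init__(self):
        self.links = {}  # learned term -> known concept

    def ask(self, term):
        if term in self.links:
            return "Looking for " + self.links[term] + " places near you."
        return "I don't know what '" + term + "' is. Can you explain?"

    def teach(self, term, concept):
        """One explanation creates a permanent link."""
        self.links[term] = concept

bot = LearningAssistant()
print(bot.ask("margherita"))   # I don't know what 'margherita' is. Can you explain?
bot.teach("margherita", "italian restaurant")
print(bot.ask("margherita"))   # Looking for italian restaurant places near you.
```

The key difference from the flowchart model earlier in the post is the mutable `links` store: the branches are no longer fixed at programming time but grow with every explanation.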
This way your virtual assistant will begin to know you: facts, habits, routines, tastes, preferences and much more. Every command will be read in its historical and contextual perspective. It will adapt to your own behavior. You and I will have the same artificial intelligence, but our copies will develop and grow differently from one another, because my behavior is different from yours. By adapting to the user's individual language and preferences with continued use, it will return individualized results.
The following step is for the virtual assistant to become proactive rather than just reactive — doing useful things before you've prompted it to. For example, when you enter a meeting, it will silence your phone. That doesn't sound like a major improvement? Then what about chaining a sequence of actions to complete a whole task? If you put a meeting in London next week on your calendar, it will scan web services and propose a suitable combination of flights, hotels and limo services, based on your habits. Or would you rather spend time on endless, boring menus on a website?
I recently wrote the post "Artificial Intelligence, virtual assistants and giant screens" to give a flavor of what virtual assistants are going to do, if you're curious. The point here is simple: whoever controls the virtual assistants controls the purchasing habits of millions of people… because when you ask the Siri app a question… it stays there, potentially forever.
I suggest some additional interesting sources if you want to know more about the Siri app's origins.