
How Does Siri Work? The Science Behind Siri


Hey Siri, set an alarm for 7 AM tomorrow.
Done – your alarm is set for 7 AM tomorrow.

We are all quite familiar with Siri, Apple’s voice assistant. It simplifies navigating your iPhone by listening to your voice and performing the task you ask for. For instance, you can ask Siri to set an alarm or a reminder for a specific time on a specific day. You can ask Siri to call a friend. You can ask Siri to tell you a joke.

Siri is fun, and it is highly convenient to use. As end users, we rarely stop to think about the technology involved. But have you ever wondered what happens behind the scenes when you talk to Siri?

The Technology Behind Siri

Siri relies primarily on two technologies: Speech Recognition and Natural Language Processing. Speech recognition is the task of converting human speech into its corresponding textual form. For instance, when you trigger Siri by saying “Hey Siri,” a powerful speech recognition system built by Apple kicks off in the back end and converts your audio into the corresponding text, “Hey Siri.” This is an extremely challenging task simply because human voices span a highly diverse range of tones and accents. Accents vary not only across countries, but also across states and cities within a country. Some people speak quickly and some speak slowly, and the characteristics of male and female voices are also very different.
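Apple’s actual pipeline is proprietary, but the basic contract of speech recognition, audio in and text out, can be illustrated in a few lines of Python. The sketch below uses the open-source SpeechRecognition package and a hypothetical audio file name; it shows speech-to-text in general, not Apple’s system.

```python
# A minimal speech-to-text sketch using the open-source
# SpeechRecognition package (pip install SpeechRecognition).
# Purely illustrative: Apple's Siri pipeline is proprietary
# and far more sophisticated than this.
import speech_recognition as sr

recognizer = sr.Recognizer()

# "hey_siri.wav" is a hypothetical recording of the user speaking.
with sr.AudioFile("hey_siri.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # Send the audio to a free web recognition API and print the text.
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
```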

The engineers at Apple train Machine Learning models on large transcribed datasets in order to build accurate speech recognition models for Siri. These datasets are highly diverse, comprising voice samples from a large group of people, which is how Siri is able to cater to a wide variety of accents.

In recent years, deep learning has produced phenomenal results in speech recognition: the word error rate of modern speech recognition engines has dropped below 10%. This has been possible thanks to the availability not only of large datasets, but also of powerful hardware on which speech recognition models can be trained.
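“Word error rate” has a precise meaning: the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the recognizer’s output, divided by the number of words in the reference. Here is a short, self-contained implementation of that metric:

```python
# Word error rate (WER): the word-level edit distance between a
# reference transcript and a recognizer's hypothesis, divided by
# the number of words in the reference.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + substitution)
    return d[len(ref)][len(hyp)] / len(ref)

# One word misheard out of seven gives a WER of 1/7, about 14%.
print(word_error_rate("set an alarm for 7 AM tomorrow",
                      "set and alarm for 7 AM tomorrow"))
```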

Once Siri has recognized what you are saying, the converted text is sent to Apple’s servers for further processing. The servers run Natural Language Processing (NLP) algorithms on this text to understand the intent behind what the user said. For instance, the NLP engine can tell that a user saying “set an alarm for 7AM tomorrow” is asking to set an alarm, not to make a call. This is challenging because different users phrase the same request in different ways. For instance, all of the following ask for the same thing:

  • Hey Siri, can you set me an alarm for 7AM tomorrow?
  • Siri, can you wake me up tomorrow at 7AM?
  • Siri, please set an alarm for tomorrow at 7AM.
  • Siri, please wake me up tomorrow at 7AM.

These are just a few of the valid ways of telling Siri to set an alarm, and some people will use ungrammatical sentences such as “Siri alarm set me tomorrow at 7AM.” As a result, intent analysis becomes very challenging. Just like speech recognition, intent analysis requires a lot of data to train Natural Language Processing algorithms: only when the training dataset is huge can Siri generalize to variations of a sentence that it has never seen, as the toy classifier sketched below illustrates. This makes the whole process an extremely difficult task, and to accomplish such mammoth tasks, Apple hires top-notch software engineers with years of experience in Artificial Intelligence, Machine Learning, and Natural Language Processing.
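Apple’s intent models are proprietary, but the core idea can be shown with a toy text classifier. The sketch below uses scikit-learn and a handful of made-up training utterances; a real system would train on vastly more data and a far richer model.

```python
# A toy intent classifier, a stand-in illustration rather than
# Apple's actual approach. Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training utterances labeled with their intents.
utterances = [
    "set an alarm for 7AM tomorrow",
    "wake me up tomorrow at 7AM",
    "please set an alarm for tomorrow",
    "call my friend",
    "make a call to mom",
    "phone John please",
]
intents = ["set_alarm", "set_alarm", "set_alarm",
           "make_call", "make_call", "make_call"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(utterances, intents)

# Even a garbled phrasing maps to the right intent, because the
# classifier generalizes from word overlap with the training data.
print(model.predict(["Siri alarm set me tomorrow at 7AM"]))  # ['set_alarm']
```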

These are just two of the most fundamental challenges. Another important technology behind Siri that employs Machine Learning is contextual understanding. You can talk to Siri as if you were talking to a human:
You: Hey Siri, set an alarm.
Siri: What time do you want me to set an alarm?
You: 7 AM.

When you said “7 AM” in that last turn, Siri was able to understand that it was a continuation of the previous message in which you asked it to set an alarm.
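One simple way to picture this is a dialogue state that remembers a pending intent and fills in missing details (slots) from follow-up turns. The sketch below is a hypothetical toy model; production assistants use far richer dialogue-state tracking.

```python
# A minimal, hypothetical sketch of contextual understanding:
# the assistant remembers a pending intent and fills in missing
# slots from follow-up turns.
class DialogueState:
    def __init__(self):
        self.pending_intent = None
        self.slots = {}

    def handle(self, intent, slots):
        if intent is None and self.pending_intent:
            # No new intent: treat this turn as an answer to the
            # assistant's last question and reuse the pending intent.
            intent = self.pending_intent
        self.slots.update(slots)
        if intent == "set_alarm" and "time" not in self.slots:
            self.pending_intent = intent
            return "What time do you want me to set an alarm?"
        self.pending_intent = None
        return f"Done, your alarm is set for {self.slots['time']}."

state = DialogueState()
print(state.handle("set_alarm", {}))         # asks for the time
print(state.handle(None, {"time": "7 AM"}))  # alarm set for 7 AM
```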

One final technology that Siri employs in this whole process is entity extraction. When you ask Siri to set an alarm for tomorrow at 7AM, Siri not only understands the meaning of your sentence but also automatically picks out the entities in it: “7AM” and “tomorrow.”
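To make the idea concrete, here is a tiny rule-based extractor for the two entity types in this example, times and relative dates. Real systems use learned sequence-tagging models rather than regular expressions; the function name and patterns here are purely illustrative.

```python
# A tiny, rule-based sketch of entity extraction. Production
# systems use learned sequence taggers, not regexes; this only
# illustrates what "picking out entities" means.
import re

def extract_entities(text):
    entities = {}
    # Times such as "7AM", "7 AM", or "7:30 pm".
    time_match = re.search(r"\b\d{1,2}(?::\d{2})?\s?[AP]M\b", text, re.I)
    if time_match:
        entities["time"] = time_match.group()
    # Relative dates.
    date_match = re.search(r"\b(today|tomorrow|tonight)\b", text, re.I)
    if date_match:
        entities["date"] = date_match.group()
    return entities

print(extract_entities("set an alarm for tomorrow at 7AM"))
# {'time': '7AM', 'date': 'tomorrow'}
```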

Final Words

Overall, Siri is built on large-scale Machine Learning systems that combine two main areas of data science: Speech Recognition and Natural Language Processing.

Amazing, isn’t it? In our technology-heavy world, we tend to take the things around us for granted. When we try to uncover how they work, however, we discover that there is a lot of technological magic happening behind the scenes.
