Voice Recognition Attack Vector

An attack vector design inspired by a viral marketing campaign by Burger King

The Theory
Voice Recognition software continues to improve, and so does the list of applications: commercial answering systems with help menus, mobile phones, and even banks now use various Voice Recognition programs. From a security standpoint, I was very interested in the ways mobile phones might be compromised using this technology, since it is installed on most common mobile devices (smartphones).

A recent television advertisement by Burger King backfired badly when the actor asked Google, “What is the Whopper burger?” Google’s voice service read out the first result, which happened to be from Wikipedia and which suggested the burger was made from 100% pure children.

This made me think even more about Voice Recognition vulnerabilities. From here on I shall reference Google a lot, because I predominantly use an Android device, but my results are relevant to the broad spectrum of Voice Recognition software.

In case you missed the original fifteen-second advertisement, here it is with subtitles:

The Software
Presently the major Voice Recognition products on the market are Google Voice and Siri. Both systems are always listening, but Siri currently offers more functionality and control features.
On Android, the software is normally activated from the home screen by saying “Ok Google” followed by a command such as Call, Search, Find, or Tell me.

For example, say “Ok Google, should I bring an umbrella today?” and the response, either as text or in a computer-generated voice, tells you whether or not you should bring an umbrella with you.

Apple's software works very similarly, but it tends to favor pressing a button before issuing a command. However, depending on the phone’s functionality and the user's configuration, it is possible for Siri to respond directly to voice commands.

Cortana on Windows phones offers similar functionality and, like Google and Siri, is always listening for its opening phrase, so in theory it could also be attacked using the same proof of concept.

The Software Used in this Attack
The only tool really needed for this attack was Audacity, for recording the Google voice command and combining the audio with the right portion of the song. To get the best result I used the Windows text-to-speech function, but this is not required.

The Hardware Used in this Attack
This attack required an MP3 player with a repeat function and a small portable speaker.

The Theory behind the Attack
The general idea of my attack was to turn the technology against itself. The potential for abuse is huge. It is estimated that 500,000 people pass through Dublin City Centre each day, and many of them will have a smartphone. Statistics tell us that Android accounts for 49.58% of the market share and Apple for the majority of what remains. Based on that information, we can roughly estimate that 247,900 Android devices pass through Dublin City Centre every day. If even 200,000 of those devices fell victim to this attack, at a premium-rate charge of €1 per call, it could net a potential attacker €200,000 for a single day's attack. Since this attack is unique and the method is not known, it could take 3 weeks or more of surveillance for a law enforcement agency to get close to catching the attacker. By the above estimate, working 7 days a week for 3 weeks would net a possible €4,200,000.
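The arithmetic behind these figures can be checked with a short sketch. The numbers are the ones quoted above; the function and constant names are illustrative, and the €1 charge per call is the premium-rate assumption used in this article, not a measured value.

```python
# Back-of-the-envelope check of the estimate above.
# All inputs are the article's own figures; names are illustrative only.
DAILY_FOOTFALL = 500_000           # people passing through Dublin City Centre per day
ANDROID_SHARE = 0.4958             # quoted Android market share
ASSUMED_VICTIMS_PER_DAY = 200_000  # the article's assumed number of affected devices
CHARGE_PER_CALL_EUR = 1            # assumed premium-rate charge per call
DAYS = 3 * 7                       # 7 days a week for 3 weeks


def android_devices(footfall, share):
    """Estimated Android devices, assuming one smartphone per person."""
    return round(footfall * share)


def revenue(victims_per_day, charge_eur, days):
    """Total charge collected if every affected device places one call per day."""
    return victims_per_day * charge_eur * days


devices = android_devices(DAILY_FOOTFALL, ANDROID_SHARE)
hit_rate = ASSUMED_VICTIMS_PER_DAY / devices

print(f"Android devices per day: {devices:,}")
print(f"Implied hit rate: {hit_rate:.1%}")
print(f"One day: EUR {revenue(ASSUMED_VICTIMS_PER_DAY, CHARGE_PER_CALL_EUR, 1):,}")
print(f"Three weeks: EUR {revenue(ASSUMED_VICTIMS_PER_DAY, CHARGE_PER_CALL_EUR, DAYS):,}")
```

Note that 200,000 victims out of 247,900 devices implies that roughly four in five Android phones respond, so these totals are best read as an upper bound rather than a forecast.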

The Attack Idea

So let’s assume you want to call a phone number from an Android mobile device without pressing any keys. This has been made simple and user friendly: you only have to say “Ok Google, call 1234567890”, and the phone will dial the given number and connect you.

Hypothetically, if you set up a premium phone number on which each incoming call was charged at €1 per minute, you could then ask your own Google device to call this number. Of course, it would be neither practicable nor advisable to walk up to someone and shout loudly, “OK Google, call ### ### ####”.

Attack Method
What is required to make this attack viable is a way to covertly and simultaneously instruct multiple devices to call the premium number that you have set up. The simple solution would be to create an audio track using text to speech and play it through a speaker. This works when the speaker is within hearing distance of listening devices, but again the device owners are aware of your actions.

My solution was to embed the instruction within a song. To stop the music from confusing the Voice Recognition, the instruction needs to be placed at a null point with little bass or beat and no audible vocals.

After further research, I found the perfect song, “I’m an Albatraoz” by AronChupa, because there is a null in the song between 1:37 and 1:44. Furthermore, psychology studies have shown that after 1 minute of listening to a common song, most people tend to tune out the lyrics and lose focus on what is being said, focusing only on the beat. So I decided to add the text-to-speech clip to the song at the low-beat section where you would expect lyrics. It looked like this:

For the purposes of this attack the inserted speech said “OK Google, What’s the Weather”.
I was now ready to test my theory. Sitting on a crowded morning bus, I used a portable Bluetooth speaker and the modified track on my iPod. When I played the tune, to my surprise I heard multiple replies of “It is currently 9 degrees in Dublin”. There were several confused faces, including my own, as my phone also reacted. At this point that was not 100% expected, as the only prior testing had been in a quiet environment with the modified audio track.

After further testing, I found that even if I sped up the voice commands, as long as they remained intelligible they would be picked up by the Google voice assistant. I was able to get a command down to 2 seconds, meaning more commands could be fitted between the normal drops of a song. The song would then not need to be repeated, and the commands could be added to multiple other tracks. This only worked with a computer-generated voice, however, as sped-up accents and dialects tended to confuse the Google device.

Now, if I were deviously inclined, I could have gone into the center of Dublin, let’s say Grafton Street, where loud music is very common, with a custom EDM track containing a bunch of random drops with sped-up commands in between. To anybody not paying attention, this would sound like random words spoken during the drops of a normal EDM song. Additionally, it is technically feasible to embed multiple commands so that Android, Apple, and Windows phones are all targeted in a single music track.

Upon even further testing, I was able to shift the pitch of the track into the ultrasonic range, around 20 to 23 kHz, through a mix of plugins in Audacity. I got this idea after seeing a passer-by blow on a dog whistle. I had a large amount of success with this method, as for some reason the device was able to recognize sound at around 20 to 23 kHz. This range is inaudible to the human ear, so by mixing the random drops with the ultrasonic-converted 2-second track, the attack is next to undetectable.

Other Platforms
This method would also be effective against the upcoming Google Home devices, Alexa devices, and so on, as their specifications state that certain features are activated directly by voice. This could possibly be used to manipulate the device into calling a number or adding a tampered item to the shopping basket to be purchased, as per the feature listings. Unfortunately, I am unable to test this attack vector at this time as I do not own a Google Home device. I hope to test this method when I get one on release.

The Fix
The only fixes I know of are to require manual input from the user prior to each Voice Recognition command or, alternatively, to turn the Voice Recognition feature off until it is required. Another fix would be to change the microphone sensitivity so that only audio in the human-audible range is picked up by the device.
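As a rough illustration of the last fix, a device could refuse to act on audio whose energy sits mostly above the audible range. The following is a minimal sketch under several assumptions, not how any vendor implements it: the names (`goertzel_power`, `ultrasonic_ratio`, `should_reject`, `make_tone`), the band limits, and the 0.5 threshold are all hypothetical, and the capture is assumed to run at 48 kHz so that content up to 24 kHz can be represented in the samples at all.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Signal power at the DFT bin nearest freq_hz (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2
    return max(0.0, power)  # guard against tiny negative rounding errors


def band_power(samples, sample_rate, low_hz, high_hz, step_hz=100):
    """Sum Goertzel power over a band, probing every step_hz."""
    return sum(
        goertzel_power(samples, sample_rate, f)
        for f in range(low_hz, high_hz + 1, step_hz)
    )


def ultrasonic_ratio(samples, sample_rate,
                     speech_band=(300, 3400), ultrasonic_band=(20_000, 23_000)):
    """Fraction of probed energy lying in the ultrasonic band (0.0 to 1.0)."""
    speech = band_power(samples, sample_rate, *speech_band)
    ultra = band_power(samples, sample_rate, *ultrasonic_band)
    total = speech + ultra
    return ultra / total if total > 0 else 0.0


def should_reject(samples, sample_rate, threshold=0.5):
    """Hypothetical policy: ignore a command dominated by inaudible energy."""
    return ultrasonic_ratio(samples, sample_rate) > threshold


def make_tone(freq_hz, sample_rate, n_samples):
    """Pure sine tone, used here only as a test signal."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


RATE = 48_000  # assumed capture rate; 44.1 kHz cannot represent 23 kHz
audible_tone = make_tone(1_000, RATE, 4_800)
inaudible_tone = make_tone(21_000, RATE, 4_800)

print(should_reject(audible_tone, RATE))
print(should_reject(inaudible_tone, RATE))
```

A check like this only covers the inaudible variant of the attack. Commands hidden in the quiet sections of a song are fully audible, so they would still need one of the first two fixes.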