Voice Recognition POC Attack

A proof of concept attack design based on the idea behind a viral marketing campaign by Burger King

April 2017

The Theory
Voice Recognition software continues to improve, as does its list of applications: commercial answering machines with help menus, mobile phones, and now even banks use various Voice Recognition programmes.  From a security standpoint I was very interested in possible ways that mobile phones might be compromised.
A recent television advertisement by "Burger King" failed horribly when the actor asked Google, "What is the Whopper burger?"  Google's voice service read out the first result, which happened to be from "Wikipedia" and suggested the burger was made from 100% pure children.
This made me think even more about Voice Recognition vulnerabilities. Hereafter I shall reference Google a lot because I predominantly use an Android device, but even though I am an Android mobile user my results are relevant to the broad spectrum of Voice Recognition software.
In case you missed the original fifteen-second advertisement, here it is with subtitles.

The Software
Presently the major Voice Recognition products on the market are Google Voice and Siri.  Both systems are always listening, but Siri currently offers more functionality and control features.
On Android, the software is normally activated by a voice command from the home screen: saying "Ok Google" followed by a command such as Call, Search, Find, or Tell me.
For example, "Ok Google, should I bring an umbrella today?"  The response, telling you whether or not to bring an umbrella with you, arrives either as text or as a computer-generated voice.
Apple's software works very similarly, but its operational methodology tends to favour a button being pressed before a command is issued.  However, depending on the phone's functionality and user configuration, it is possible for Siri to respond directly to voice commands.
Windows Phone's Cortana uses a software platform with similar functionality, and like Google and Siri it is always listening for its opening phrase to become active, so in theory it could also be attacked using the same proof of concept.

The Software Used in this Attack
The only tool really needed for this attack was Audacity, for recording the Google voice command and combining the audio with the right portion of the song. To get the best result I used the Windows text-to-speech function, but this is not required.

The Hardware Used in this Attack
This attack required an MP3 player with a repeat function and a small portable speaker.

The Theory behind the Attack
The general idea of my attack was to echo the phone's own technology back against itself.  The potential market of opportunity is huge.  An estimated 500,000 people pass through Dublin City Centre each day, and many of them will have a smartphone.  Statistics tell us that Android accounts for 49.58% of the market share, with Apple holding the majority of what remains. Based on that information, we can roughly estimate that 247,900 Android mobile devices pass through Dublin City Centre every day, and if even 200,000 devices fall victim to this attack it could net a potential attacker €200,000 for a single day's attack, assuming €1 per victim. Since this attack is unique and the method is not known, it could take three weeks or more of surveillance for a law enforcement agency to get close to catching the attacker; on the above estimate that is a possible €4,200,000 for working seven days a week over three weeks.
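The estimate above can be checked with some quick arithmetic. The figures below are the article's own assumptions (footfall, market share, conversion rate), not measured data:

```python
# Back-of-envelope revenue estimate for the premium-number attack.
# Every figure here is an assumption taken from the text above.
daily_footfall = 500_000            # people through Dublin City Centre per day
android_share = 0.4958              # Android market share cited above
android_devices = round(daily_footfall * android_share)

victims_per_day = 200_000           # optimistic conversion assumed in the text
rate_per_call = 1                   # EUR per premium-rate minute, one minute per victim

daily_take = victims_per_day * rate_per_call
campaign_days = 7 * 3               # seven days a week for three weeks
total_take = daily_take * campaign_days

print(android_devices)              # 247900
print(daily_take)                   # 200000
print(total_take)                   # 4200000
```

Even with a far less generous conversion rate, the same arithmetic shows why a premium-rate number makes this attack financially attractive.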

The Attack Idea

So let’s assume you want to call a phone number from an Android mobile device without pressing any keys. Google made this simple and user friendly: you only have to say "Ok Google, call 1234567890" and the phone will dial the given number and connect you. Very simple.
Hypothetically, if you set up a premium phone number where each incoming call is charged €1 per minute at the premium rate, you could then ask your Google device to call this number.  Of course, it would not be practicable nor advisable to walk up to someone and shout loudly, “OK Google, call ### ### ####”.

Attack Method
What is required to make this attack viable is a way to covertly and simultaneously instruct multiple devices to call the premium number you have set up.  The simple solution is to create an audio track using text to speech and play it via a speaker.  This works when you are within hearing distance of listening devices, but again the device owners are aware of your actions.
My solution was to embed the instruction within a song. To stop the music from confusing the Voice Recognition, the instruction needs to be placed at a null point: low bass, few beats, and no audible vocal tones.
After further research I found the perfect song, “I'm an Albatraoz” by AronChupa, because between 1:37 and 1:44 there is a null in the song.  Furthermore, psychology studies have shown that after a minute of listening to a familiar song most people tend to drown out the lyrics and lose focus on what is being said, focusing only on the beat.  So I decided to add the text-to-speech clip to the song at the low-beat section where you would expect lyrics.  It looked like this:
For the purposes of this attack the inserted speech said, "OK Google, What’s the Weather".
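The embedding step itself is just an additive mix of the command clip into the song at the null point. The article does this in Audacity; as a rough sketch of the same operation in code, the arrays below are synthetic stand-ins for decoded audio (real files would be loaded with a library such as pydub or soundfile):

```python
import numpy as np

SAMPLE_RATE = 44_100  # CD-quality sample rate

# Stand-ins for real decoded audio (silence for the song, a constant
# signal for the command clip) -- purely illustrative placeholders.
song = np.zeros(SAMPLE_RATE * 120, dtype=np.float32)        # 2-minute "song"
command = 0.5 * np.ones(SAMPLE_RATE * 2, dtype=np.float32)  # 2-second "OK Google..." clip

def overlay(track, clip, at_seconds, rate=SAMPLE_RATE):
    """Mix `clip` into `track` starting at `at_seconds` (simple additive mix)."""
    start = int(at_seconds * rate)
    out = track.copy()
    out[start:start + len(clip)] += clip
    return out

# Drop the command into the null section identified above at 1:37 (97 s).
mixed = overlay(song, command, at_seconds=97)
```

In a real track you would also lower the surrounding music or duck it slightly, exactly as you would when mixing the clip in by hand in Audacity.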
I was now ready to test my theory.  Sitting on a crowded morning bus, I used a portable Bluetooth speaker and the modified track on my iPod.  When I played the tune, to my surprise I heard multiple replies of "It is currently 9 degrees in Dublin".  There were several confused faces, including my own as my phone also reacted; at this point success was not 100% expected, as the only prior testing had been in a quiet environment with the modified audio track.

After further testing I found that even if I sped up the voice commands, as long as they remained audibly legible they would be picked up by the Google voice assistant.  I was able to get a command down to 2 seconds, meaning more commands could be fitted between the normal drops of the song, so the song would not need to be repeated and the commands could be added to multiple other tracks.  However, this was only successful with a computer-generated voice, as human accents and dialects tended to confuse the Google device when sped up.
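The crudest way to compress a clip like this is simple decimation: keeping every n-th sample shortens the clip by a factor of n (and raises its pitch, which is part of why only a clean computer voice survives the treatment intelligibly). A minimal sketch, using a silent placeholder clip:

```python
import numpy as np

def speed_up(samples, factor):
    """Naively speed up audio by keeping every `factor`-th sample.
    This shortens the clip and raises its pitch; tools like Audacity
    offer tempo-only stretching, but decimation shows the idea."""
    return samples[::factor]

rate = 44_100
clip = np.zeros(rate * 4)      # a 4-second command clip (placeholder audio)
fast = speed_up(clip, 2)       # compressed to 2 seconds
print(len(fast) / rate)        # 2.0
```

Audacity's "Change Tempo" effect does this more gracefully by stretching without the pitch shift, which is what you would use on a real recording.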

Now, if I were deviously inclined, I could have gone into the centre of Dublin, let's say Grafton Street where loud music is very common, with a custom EDM track containing a series of random drops with sped-up commands in between.  To anybody not paying attention this would seem like random snippets being said during the drops of a normal EDM song.  Additionally, it is technically feasible, with multiple embedded commands, to target Android, Apple and Windows phones in a single music track.

Upon even further testing I was able to shift the pitch of the track up to ultrasonic frequencies, around 20 to 23 kHz, through a mix of plugins in Audacity.  I got this idea after seeing a passer-by blow a dog whistle.  I had a large amount of success with this method: for some reason the device was still able to recognise the sound at around 20 to 23 kHz, a range inaudible to the human ear.  So, mixing the random drops with the ultrasonically converted 2-second track, the attack becomes next to undetectable.
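The article achieved the shift with Audacity plugins; one standard signal-processing way to move audio content up the spectrum is heterodyning, i.e. multiplying the signal by a high-frequency carrier. This is a sketch of that idea (not necessarily the exact method the plugins use), using a 1 kHz tone as a stand-in for speech:

```python
import numpy as np

rate = 96_000                   # sample rate must exceed 2 x 23 kHz (Nyquist)
t = np.arange(rate) / rate      # one second of audio
speech = np.sin(2 * np.pi * 1_000 * t)   # stand-in for a 1 kHz speech component

# Multiplying by a 21 kHz carrier shifts the 1 kHz content to 20 kHz
# and 22 kHz -- the band the article found the phone still responds to.
carrier = np.cos(2 * np.pi * 21_000 * t)
ultrasonic = speech * carrier

# Confirm where the energy ended up: with a 1-second window,
# FFT bin index equals frequency in Hz.
spectrum = np.abs(np.fft.rfft(ultrasonic))
peaks_hz = sorted(int(k) for k in np.argsort(spectrum)[-2:])
print(peaks_hz)                 # [20000, 22000]
```

Note that real speech occupies a band of frequencies rather than a single tone, so the shifted copy spreads around the carrier in the same way, and playback hardware must actually reproduce those frequencies for the attack to work.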

Other Platforms
This method would also be effective against the upcoming Google Home devices, Alexa devices and so on, as their specifications state they have direct voice activation for certain features.  This could possibly be used to manipulate the device into calling a number, or into adding a tampered item to the shopping basket to be purchased, as per the feature listings. Unfortunately, at this time I am unable to test this attack vector as I do not currently own a Google Home device; I hope to test this method when I get one on release.

The Fix
The only fixes I know of are to require a manual input from the user prior to each Voice Recognition command, or alternatively to turn the Voice Recognition feature off until required.  Another fix would be to reduce the microphone sensitivity, or filter its input, so that only audio within the normal human-audible range is processed by the device.