The most merciful thing in the world is the inability of the human mind to correlate all its contents.
H. P. Lovecraft
Defeating audio (voice) captchas

This article is old and kept here as a historical reference. For more up-to-date information about breaking audio captchas, see for example Elie Bursztein, who builds on my previous work but in a much more academic fashion.

Introduction

For some years now, semi-Turing tests under the name of "captchas" have been used on the web to prevent bots from filling in forms. When I first saw the visual variant I thought recognizing the characters with a computer algorithm should be easy. A bit of surfing and searching on the internet taught me that I was right: most were broken already. Reinventing the wheel is not very useful, so I left the topic alone.

Later I found a post about voice captchas. Since there was not much information about this on the net and I was bored (ill at home), I decided to give it a shot. I started easy, willing to upgrade the algorithms to those used in speech recognition (like HMM, Viterbi, Baum-Welch, entropy coding, etc.) when needed. This proved not to be necessary: the first feature-complete (segmentation and matching) code worked relatively well on Microsoft's captchas. Later I tweaked it a bit to also work on Google's captchas.

On this page you can find proof of concept code to break voice captchas. Do not expect advanced software (pattern recognition science is so much further along) or code that can be used in other projects; I quit the project when it worked. Initially (February 2006) I kept the code on my hard disk, but later (May 2006) I published it (see disclosure motivation).

How does it work

This is not a complete guide, just some pointers to the source (read it, Luke). As a starting point, consider the configtype struct:

typedef struct {
    int samplerate;
    int byterate;
    int winsize;
    int band_cnt;
    int word_length;
    int word_overlap;
    int threshold_energy;
    int file_offset;
    char trainfile[255];
} configtype;

The program starts by reading the audio file (it could read the samplerate and byterate from the header, but I am lazy). The first file_offset bytes of the file are skipped, because Google starts with a bell. The first step is that all samples are treated with a Hamming window (an arbitrary choice, most window types should do). The winsize is in samples (e.g. 512 samples at 8000 Hz gives a 64 ms window). Now the blocks are transformed into the frequency domain with a DFT. After that the frequencies are put in band_cnt bins. These bins are not equally wide: the higher the frequency, the wider the band (this has to do with human hearing (mel/bark scale), but I doubt it is actually useful in the current incarnation of the program).

Now the program looks at the highest frequency bin. Every window that has more energy than threshold_energy in that bin is considered a peak, and these peaks are used to segment the input file into the different spoken words. The word_length tells the program how many windows long a word is (so all words are considered to be the same length, which is a current weakness of devoicecaptcha). word_overlap helps in localizing the peaks. When the locations of the words are known, all frequency bins are written out for word_length windows around the peaks. This is called the profile of the word.
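The segmentation step could look roughly like the sketch below. Again this is an illustration and not the original source: find_words and extract_profile are hypothetical names, the choice to centre a word on its strongest window is an assumption, and word_overlap is left out because its exact role is not spelled out above.

```c
#include <assert.h>
#include <stddef.h>

/* Find word centres in the per-window energy of a single band.
 * A window above threshold starts a word; the strongest window among the
 * next word_length windows is taken as the centre (an assumption), and
 * scanning resumes after that span.
 * Returns the number of centres stored in centres[] (at most max_words). */
size_t find_words(const double *energy, size_t nwin, double threshold,
                  size_t word_length, size_t *centres, size_t max_words)
{
    assert(word_length > 0);
    size_t found = 0;
    size_t i = 0;
    while (i < nwin && found < max_words) {
        if (energy[i] <= threshold) {
            i++;
            continue;
        }
        size_t end = (i + word_length < nwin) ? i + word_length : nwin;
        size_t best = i;
        for (size_t j = i + 1; j < end; j++)
            if (energy[j] > energy[best])
                best = j;
        centres[found++] = best;
        i = end;
    }
    return found;
}

/* Copy all bands for word_length windows around centre into profile.
 * bands is row-major: bands[window * band_cnt + band].
 * Windows that fall outside the recording are filled with zeros. */
void extract_profile(const double *bands, size_t nwin, size_t band_cnt,
                     size_t centre, size_t word_length, double *profile)
{
    for (size_t w = 0; w < word_length; w++) {
        long src = (long)centre - (long)(word_length / 2) + (long)w;
        int inside = (src >= 0 && (size_t)src < nwin);
        for (size_t b = 0; b < band_cnt; b++)
            profile[w * band_cnt + b] =
                inside ? bands[(size_t)src * band_cnt + b] : 0.0;
    }
}
```

Because every word gets the same number of windows, all profiles have the same size (word_length times band_cnt values), which keeps the comparison step trivial.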

The profiles for known words are put in trainfile. When a guess has to be made, the profiles for the words in the file are subtracted from those in the trainfile, and the trained word with the smallest deviation is chosen as the proper word. That is all.
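The matching step amounts to a nearest-neighbour search. The sketch below shows one way to write it; the trained_word struct and function names are invented for this example, and the sum of absolute differences is an assumed measure of "deviation" (a squared difference would work along the same lines).

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* One entry of the training set: a known word and its profile.
 * (Hypothetical layout; the real trainfile format is not shown here.) */
typedef struct {
    int label;             /* the digit that was spoken */
    const double *profile; /* word_length * band_cnt values */
} trained_word;

/* Deviation between two profiles: sum of absolute differences. */
double profile_distance(const double *a, const double *b, size_t len)
{
    double d = 0.0;
    for (size_t i = 0; i < len; i++)
        d += fabs(a[i] - b[i]);
    return d;
}

/* Return the label of the trained profile closest to the given one. */
int classify(const double *profile, size_t len,
             const trained_word *train, size_t ntrain)
{
    assert(ntrain > 0);
    size_t best = 0;
    double best_d = profile_distance(profile, train[0].profile, len);
    for (size_t i = 1; i < ntrain; i++) {
        double d = profile_distance(profile, train[i].profile, len);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return train[best].label;
}
```

Calling classify once per segmented word and printing the labels in order gives the guess for the whole captcha.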

The algorithms in devoicecaptcha are at this moment really naive. There are a lot of possible improvements. Perhaps in the future I will enhance the program a bit; for now I think the 33% (as on Google's captchas) is good enough (and I am too lazy to reimplement HTK, which should also do the trick (I guess)).

Proof of concept

The code is rather messy, but since this applies to most p0f code consider that 1337 ;-). Download devoicecaptcha.c and compile it with:

gcc -std=c99 devoicecaptcha.c -lfftw3 -lm

As you can see you need fftw, an all-round Fourier transform library, which is packaged for most distributions, so you can be lazy (apt-get install fftw3-dev or similar).

When started with ./a.out captcha.wav you also need a data set (an MSN one and a Google one are available). If you have downloaded the same captchas (see links) as I have, it will print a guess on stdout.

As said before, devoicecaptcha works by comparison to a trained set. To build up a training set and test the effectiveness of various parameters you can start devoicecaptcha with an extra bogus argument, e.g. ./a.out captcha.wav --print.

What I did was download a large set of captchas with lwp and transcribe them with the proper words with something like:

for i in google/*.wav; do aplay "$i" &> /dev/null & read j; mv "$i" "google/$j.wav"; done

I ended up with a directory with filenames like "123456.wav", where 123456 are the digits spoken in the captcha. On this directory I unleashed a small Ruby script, which splits the files into a training and a testing set, builds the training data and tests the rest. This script can be found under train.rb.
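Since the transcription lives in the filename, any training tool first has to get the digits back out of it. train.rb itself is not reproduced here; the C sketch below only illustrates that one step, with a hypothetical helper name and an assumed ".wav" suffix check.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Extract the spoken digits from a path such as "google/123456.wav".
 * Stores the digits in digits[] and returns how many were found.
 * Returns -1 if the base name is not "<digits>.wav" or if it holds
 * more than max digits. */
int digits_from_filename(const char *path, int *digits, size_t max)
{
    assert(path != NULL && digits != NULL);
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;

    size_t n = 0;
    while (isdigit((unsigned char)base[n])) {
        if (n >= max)
            return -1;
        digits[n] = base[n] - '0';
        n++;
    }
    if (n == 0 || strcmp(base + n, ".wav") != 0)
        return -1;
    return (int)n;
}
```

Each digit can then be paired with the corresponding segmented word. A file where the number of digits differs from the number of words found is probably best skipped during training.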

If you have broken other voice captchas with my code (or with an addition to my code), please let me know, so I can update this page.

MSN

MSN (Passport) audio captchas are really weak. Only digits are used, there are always ten digits, and the noise is weak and constant. The distance between the words is relatively constant. Devoicecaptcha guesses all ten digits correctly in around 75% of all cases, with a training set of about 40 files.

A data set which can be used for the English MSN (aka Passport, aka MSN Live) voice captchas (which I got from passport.net) can be downloaded under the name msn.txt. It is also possible to create your own training data (see above).

Google

Google's voice captchas are more difficult to break than the captchas by Microsoft. Google employs different speakers, uses better noise artifacts and a random number of words. The dictionary is (like Microsoft's) limited to digits only. The devoicecaptcha program scores around 33% on these voice captchas with a training set of 60 files. This is high enough for use in a bot.

A data set which can be used for the Google captchas (in English; Google also provides captchas in multiple languages) can be found under google.txt. The files were found at Google new account.

These captchas are also in use by Blogger and Blogspot for comments.

Others

If you know other voice captcha systems, let me know. Perhaps I will have some time to look into them (and perhaps not). I will at least add them to the links section on this page, so together with the provided source other people can try to beat them.

Disclosure motivation

I did not release the source code on this page without hesitation, because it might help spammers in their goals. And I hate spam. However there are some reasons I released the code anyway:

Some people might ask what kind of solutions I suggest for solving the spam problem. SpamAssassin catches thousands of spam mails for me; it is expensive in CPU cycles (so putting spammers in jail is preferred), but the multi-tiered approach (neural network detection together with several lists of wrongdoers) works relatively well and can be applied to other forms of spam.

Playing the cat-and-mouse game with more difficult captchas whenever the previous challenge is broken will work, but is not satisfactory in the end. I encounter more captchas that humans cannot solve every day. I do understand why corporations play this game, however; in the real world thresholds do help.

Links

Information

Broken voice captchas

Working on voice captchas

Not working on voice captchas

Do you know different implementations of audio captchas? Please contact me.


Posted by jochem on 2006-02-20, last update on 2012-06-05