The most merciful thing in the world is the inability of the human mind to correlate all its contents.
H. P. Lovecraft
S/G541 (A flat major)

Back from my holiday in Belgium and Luxembourg. I had a good time, turned 30; now I am trying to get back to business.

-- Filed under:

Posted by jochem on 24th August 2006, last update on 24th August 2006

Faraway voice

Hacked a bit more on audio captchas lately, but the source is not in releasable form right now. Anyway, after tweaking the segmentation I now recognize the audio captchas from Microsoft correctly 95% of the time and those from Google (also used by Blogger/Blogspot) more than 60% of the time. captchas.net (35%) and paypal.com (10%) are also doable, but some improvements are still needed.

Time to add some neural network learning.

Posted by jochem on 14th June 2006, last update on 14th June 2006

Danza Ritual del Fuego

Back from a long week in Andalusia. It took me a while to get used to the weather again (dropping from 27 C to -2 C!) and to update this page. The atmosphere (and the trip) were great. Nature was beautiful (where can I donate for bringing mountains to Holland ;-) and the (medieval) buildings special (especially the mix between Christian and Moorish architecture). The Romans also discovered the good weather and fertile ground early and left lots of traces (some villages still have a Roman ground-map and small white 3-level 'concrete flats').

Just before I left, I bought a new Canon 350D, one of the most mainstream (SLR) cameras. I had some positive experience with the analog version and reviews of the digital one were ok. I am very happy with it. It works fine with gphoto2. (Sorry to the analog camera lovers.)

These two items combined give some new photo albums on this site. Enjoy.

Posted by jochem on 19th April 2006, last update on 19th April 2006

Forever may not be long enough

Some rights reserved by nailbender (http://www.flickr.com/photos/nailbender)

For years I have struggled to find a proper music player. Most players are too playlist- and metatag-oriented for my taste. My files are all stored in a nice directory hierarchy with proper filenames. The tags are very incomplete and buggy (a lot of my music files predate the ID3 tag standards, and i18n is a disaster in most tags).

XMMS and Winamp were pretty usable and fast because they did not read metadata in advance, but newer versions tried to build the same kind of database as other new programs (and failed). One feature was particularly missing from these players: the ability to play random albums. This should (imho) be the standard setting of any music player: you are working and want to listen to an album in the background. It would also be nice to be able to give a subset of your collection (for example all jazz music) and then have the software pick an album from that subset. XMMS and Winamp lack these features.

Even newer music players (Banshee, Amarok, Rhythmbox, XMMS2, Beep, Windows Media Player, mpd, etc.) cannot do this, although they come closer nowadays. However, I have had lots of trouble trying them. For years they crashed on loading my large (100 GB+) collection. Now they usually do not crash anymore, but they start crunching for hours (sometimes days), and when they have finished loading the library and you restart the application, it crunches on its index again for minutes. More often than not, they also load the whole index (sometimes 100 MB+) into memory. This is intolerable if you just want to listen to one song.

I realized pretty quickly that just whining was not going to help, so I wrote my own player in a few lines of Perl. It is far from perfect: I really want to:

Despite these drawbacks I have been using it for months and it works flawlessly. Therefore I am releasing the script to the world. It sets your album cover as the background and just starts playing random directories (= albums in my library).

download randomalbum.
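The directory-equals-album idea can be sketched in a few lines. This is not the Perl script itself, only a minimal Python illustration: the function names and the list of audio extensions are assumptions, and the actual playing and cover handling are left out. Passing a subdirectory as the root gives the "subset of your collection" behaviour described above.

```python
import os
import random

def album_dirs(root, audio_ext=(".mp3", ".ogg", ".flac")):
    """Return every directory under root that directly contains audio files.

    The extension list is an assumption; adjust it to your own collection.
    """
    albums = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.lower().endswith(audio_ext) for name in filenames):
            albums.append(dirpath)
    return albums

def pick_album(root, rng=random):
    """Pick one album directory at random, or None if nothing was found."""
    albums = album_dirs(root)
    return rng.choice(albums) if albums else None
```

Nothing is indexed or cached here: the tree is walked on every call, which avoids the large in-memory library described above at the cost of one directory scan per pick.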

Posted by jochem on 21st February 2006, last update on 21st February 2006

Defeating audio (voice) captchas

Introduction

For some years, semi-Turing tests under the name of "captchas" have been used on the web to prevent bots from filling in forms. When I first saw the visual variant I thought recognizing the characters with a computer algorithm should be easy. A bit of surfing and searching on the internet taught me that I was right: most were broken already. Reinventing the wheel is not very useful, so I left the topic alone.

Later I found a post about voice captchas. Since there was not too much information about this on the net and I was bored (ill at home), I decided to give it a shot. I started easy, willing to upgrade the algorithms to those used in speech recognition (like HMMs, Viterbi, Baum-Welch, entropy coding, etc.) when needed. This proved not to be necessary: the first feature-complete (segmentation and matching) code worked relatively well on Microsoft's captchas. Later I tweaked it a bit to also work on Google's captchas.

On this page you can find proof of concept code to break voice captchas. Do not expect advanced software (pattern recognition science is so much further) or code that can be used in other projects; I quit the project when it worked. Initially (February 2006) I kept the code on my hard disk, but later (May 2006) I published it (see disclosure motivation).

How does it work

This is not a complete guide, but some pointers to the source (read it luke). As a starting point, consider the configtype struct:

typedef struct {
    int samplerate;
    int byterate;
    int winsize;
    int band_cnt;
    int word_length;
    int word_overlap;
    int threshold_energy;
    int file_offset;
    char trainfile[255];
} configtype;

The program starts by reading the audio file (it could read the samplerate and byterate from the header, but I am lazy). file_offset bytes are skipped at the beginning of the file, because Google starts with a bell. The first step is that all samples are treated with a Hamming window (arbitrary choice, most window types should do). The winsize is in samples (eg 512 samples at 8000 Hz gives a 64 ms window). Now the blocks are transformed into the frequency domain with a DFT. After that the frequencies are put in band_cnt bins. These bins are not equally wide: the higher the frequency, the wider the band (this has to do with human hearing (mel/bark scale), but I doubt this is actually useful in the current incarnation of the program).
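The window/transform/bin step can be sketched as follows. This is a plain Python illustration and not the code from devoicecaptcha.c: it uses a naive DFT instead of fftw, and the geometric band layout in band_edges is an assumption, since the exact band boundaries are not given here.

```python
import cmath
import math

def hamming(n):
    """Hamming window coefficients for a block of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(block):
    """Naive O(n^2) DFT; magnitudes of the first n/2 frequency bins."""
    n = len(block)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(block)))
        for k in range(n // 2)
    ]

def band_edges(n_bins, band_cnt):
    """Assumed layout: band widths grow geometrically with frequency.

    With few bins and many bands some low bands can come out empty.
    """
    edges = [round(n_bins ** (b / band_cnt)) for b in range(band_cnt + 1)]
    edges[0] = 0
    return edges

def band_energies(block, band_cnt):
    """Window one block, transform it, and sum the energy per band."""
    windowed = [x * w for x, w in zip(block, hamming(len(block)))]
    mags = dft_magnitudes(windowed)
    edges = band_edges(len(mags), band_cnt)
    return [sum(m * m for m in mags[lo:hi]) for lo, hi in zip(edges, edges[1:])]

# Window duration for the example in the text: 512 samples at 8000 Hz.
samplerate, winsize = 8000, 512
window_ms = 1000 * winsize / samplerate
```

Calling band_energies once per block of winsize samples gives one row of band_cnt values per window, which is the representation the later steps work on.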

Now the program looks at the highest frequency bin. Every window that has more energy in this bin than threshold_energy is considered a peak, and these peaks are used to segment the input file into the different spoken words. The word_length tells the program how many windows long a word is (so all words are considered to have the same length, which is a current weakness of devoicecaptcha). word_overlap helps in localizing the peaks. When the locations of the words are known, all frequency bins are written for word_length windows around the peaks. This is called the profile of the word.
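A rough Python sketch of this segmentation is given below. It is an interpretation, not the original code: taking the strongest window within word_length windows of the first threshold crossing as the peak is an assumption, and word_overlap is left out because its exact use is not spelled out here.

```python
def find_peaks(high_band, threshold_energy, word_length):
    """Locate one peak per word in the per-window energy of the highest band.

    Assumption: after the first window above the threshold, the strongest
    window among the next word_length windows is taken as the peak.
    """
    peaks = []
    i = 0
    while i < len(high_band):
        if high_band[i] > threshold_energy:
            stop = min(i + word_length, len(high_band))
            peaks.append(max(range(i, stop), key=lambda w: high_band[w]))
            i = stop  # continue scanning after this word
        else:
            i += 1
    return peaks

def word_profiles(frames, peaks, word_length):
    """Cut word_length windows around each peak out of the band energies.

    frames holds one list of band energies per window.
    """
    half = word_length // 2
    profiles = []
    for peak in peaks:
        start = max(0, min(peak - half, len(frames) - word_length))
        profiles.append(frames[start:start + word_length])
    return profiles
```

Because every profile has the same fixed length, two profiles can be compared value by value, which is what makes the simple matching step possible.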

The profiles for known words are put in trainfile. When a guess has to be made, the profiles for the words in the audio file are subtracted from those in the trainfile, and the trained word with the smallest deviation is chosen as the proper word. That is all.
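The matching step amounts to a nearest-neighbour lookup. A minimal Python sketch, assuming the deviation is the sum of absolute differences (the exact measure is not specified here) and representing the training data as hypothetical (label, profile) pairs rather than the real trainfile format:

```python
def deviation(profile_a, profile_b):
    """Sum of absolute differences between two equally sized profiles."""
    return sum(
        abs(x - y)
        for row_a, row_b in zip(profile_a, profile_b)
        for x, y in zip(row_a, row_b)
    )

def guess_word(profile, trained):
    """Return the label of the trained profile with the smallest deviation.

    trained is a list of (label, profile) pairs.
    """
    label, _best = min(trained, key=lambda item: deviation(profile, item[1]))
    return label
```

Running guess_word on each segmented profile in order and joining the labels gives the guess for the whole captcha.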

The algorithms in devoicecaptcha are at this moment really naive. There are a lot of possible improvements. Perhaps in the future I will enhance the program a bit; for now I think the 33% (as on Google's captchas) is good enough (and I am too lazy to reimplement htk, which should also do the trick (I guess)).

Proof of concept

The code is rather messy, but since this applies to most p0f code consider that 1337 ;-). Download devoicecaptcha.c and compile it with:

gcc -std=c99 devoicecaptcha.c -lfftw3 -lm

As you can see you need FFTW, an all-round Fourier transform library, which is packaged for most distributions, so you can be lazy (apt-get install fftw3-dev or similar).

When started with ./a.out captcha.wav you also need a data set (an MSN one and a Google one are available). If you have downloaded the same captchas (see links) as I have, it will print a guess on stdout.

As said before, devoicecaptcha works with a comparison to a trained set. To build up a training set and test the effectiveness of various parameters you can start devoicecaptcha with a third bogus argument, eg ./a.out captcha.wav --print.

What I did was download a large set of captchas with lwp and transcribed them with the proper words with something like:

for i in google/*.wav; do aplay "$i" &> /dev/null & read j; mv "$i" "google/$j.wav"; done

I ended up with a directory with filenames like "123456.wav" where 123456 are the digits spoken in the captcha. On this directory I unleashed a small Ruby script, which splits the files into a training and a testing set, builds the training data from the first and tests on the rest. This script can be found under train.rb.
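The split-and-score idea can be sketched as below. This is not train.rb, only a Python illustration: the function names, the 50/50 split and the fixed seed are assumptions, and the calls to devoicecaptcha itself are left out.

```python
import random
from pathlib import Path

def label_from_filename(path):
    """'google/123456.wav' -> '123456' (the digits spoken in the captcha)."""
    return Path(path).stem

def split_files(paths, train_fraction=0.5, seed=0):
    """Shuffle the files reproducibly and split them into train and test."""
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(guesses, test_paths):
    """Fraction of test files for which the whole guess matches the filename."""
    correct = sum(g == label_from_filename(p) for g, p in zip(guesses, test_paths))
    return correct / len(test_paths)
```

Scoring on whole captchas rather than on single digits matches how the success rates on this page are stated: a guess only counts when every digit is right.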

If you have broken other voice captchas with my code (or with an addition to my code), please let me know, so I can update this page.

MSN

MSN (passport) audio captchas are really weak. Only digits are used, there are always ten digits and the noise is weak and constant. The distance between the words is relatively constant. Devoicecaptcha guesses all ten digits correctly in around 75% of all cases, with a training set of about 40 files.

A data set which can be used for the english MSN (aka passport, aka msn live) voice captchas (which I got from passport.net) can be downloaded under the name msn.txt. It is also possible to create your own training data (see above).

Google

Google's voice captchas are more difficult to break than the captchas by Microsoft. Google employs different speakers, uses better noise artifacts and a random number of words. The dictionary is (like Microsoft's) limited to digits only. The devoicecaptcha program scores around 33% on these voice captchas with a training set of 60 files. This is high enough for use in a bot.

A data set which can be used for the google captchas (in english, google also provides captchas in multiple languages) can be found under google.txt. The files were found at Google new account.

These captchas are also in use by Blogger and Blogspot for comments.

Others

If you know other voice captcha systems, let me know. Perhaps I will have some time to look into them (and perhaps not). I will at least add them to the links section on this page, so together with the provided source other people can try to beat them.

Disclosure motivation

I did not release the source code on this page without hesitation, because it might help spammers in their goals. And I hate spam. However there are some reasons I released the code anyway:

Some people might ask what kind of solutions I do suggest for solving the spam problem. SpamAssassin catches thousands of spam mails for me; it is expensive in CPU cycles (so putting spammers in jail is preferred), but the multi-tiered approach (neural network detection together with several lists of wrong-doers) works relatively well and can be applied to other forms of spam.

Playing the cat-and-mouse game with more difficult captchas each time the previous challenge is broken will work, but is not satisfactory in the end. I encounter more captchas that humans cannot solve every day. I do understand that corporations play this game, however; in the real world thresholds do help.

Links

Information

Broken voice captchas

Working on voice captchas

Not working on voice captchas

Do you know different implementations of audio captchas? Please contact me.


Posted by jochem on 20th February 2006, last update on 20th February 2006
