Identifying unknown composers, automatically recognizing languages, finding the origin of new strains of viruses: these are just a few of the many applications of the CompLearn Toolkit, a compression-based pattern recognition program developed by CWI researcher Rudi Cilibrasi and available since October.
CompLearn is based on research on compression-based learning by Cilibrasi, Paul Vitányi and Ronald de Wolf (see also the Latest News article below, "New Scientist: software to unzip identity unknown composers"). Using a standard zip tool, the CWI researchers can uncover similarities between data sets without any prior knowledge of their type. Zip programs compress data by looking for patterns: the more regularities the data contains, the higher the compression rate. If two files share many patterns, compressing them together takes little more space than compressing one of them alone. By calculating the compressed size of the two files combined and comparing it to the compressed sizes of the individual files, a simple formula can therefore determine how similar the files are.
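The comparison described above can be illustrated with a short Python sketch. The article does not spell out the formula. The sketch assumes it is the normalized compression distance (NCD) from the published work of Cilibrasi and Vitányi: the compressed size of the two files combined, minus the smaller of the two individual compressed sizes, divided by the larger. Python's zlib module stands in for the "standard zip tool", and the function names compressed_size and ncd are illustrative only. This is not the CompLearn Toolkit's own interface.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the number of bytes zlib needs to store `data` (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is x followed by y.
    Values near 0 suggest very similar data, values near 1 unrelated data.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english_1 = b"the quick brown fox jumps over the lazy dog " * 50
    english_2 = b"the quick brown cat jumps over the lazy dog " * 50
    numbers = b" ".join(str(i * i).encode() for i in range(500))

    # Smaller values mean the compressor found more shared patterns.
    print("english_1 vs english_2:", ncd(english_1, english_2))
    print("english_1 vs numbers:  ", ncd(english_1, numbers))
```

With a real compressor the result is only approximate: comparing a file with itself gives a small value rather than exactly 0, and unrelated files can score slightly above 1.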
CompLearn is distributed as open source software and is available at complearn.sourceforge.net.
More information can be found at homepages.cwi.nl/~cilibrar or homepages.cwi.nl/~paulv. The work of Cilibrasi and colleagues has attracted much attention in the popular press. Several articles are available on the above websites.