CWI researcher Rudi Cilibrasi has studied and extended a new method for statistical inference based on data compression. The method determines the similarity between two arbitrary files using the concept of a universal information metric. Cilibrasi will defend his PhD thesis, 'Statistical Inference Through Data Compression', on 23 February at the Universiteit van Amsterdam.
The more two files look alike, the smaller their 'distance'. Treating files with a small mutual distance as a group yields a whole new kind of cluster analysis. The method can be applied to, for example, literary texts, DNA sequences and music files. Unlike earlier methods, it requires no a priori knowledge of the domain. All kinds of files are analysed with the same computer program, based on simple data compression methods like gzip or bzip. The results are often surprising, and their quality is often comparable to that obtained by specialised, much more complicated software.
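The idea of a compression-based distance can be illustrated with a minimal sketch. It assumes the normalized compression distance (NCD) formula, which is not spelled out in this text, and uses Python's zlib module (the DEFLATE algorithm also used by gzip) as a stand-in compressor. The function names are illustrative and are not taken from Cilibrasi's own software.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    Values near 0 suggest similar inputs; values near 1 suggest unrelated ones.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical sample inputs; in practice these would be file contents.
    text_a = b"the quick brown fox jumps over the lazy dog. " * 40
    text_b = b"the quick brown fox leaps over the lazy dog. " * 40
    text_c = bytes((i * 97 + 13) % 251 for i in range(1800))
    print("a vs b:", ncd(text_a, text_b))
    print("a vs c:", ncd(text_a, text_c))
```

Computing this distance for every pair of files gives a distance matrix, which a standard clustering routine could then turn into groups. A real compressor only approximates the ideal metric, so results for very short or very large files should be treated with caution.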
In his dissertation Cilibrasi further shows that a variant of the universal information metric can be based on the World Wide Web. Concepts such as 'food' or 'love' can be clustered and classified automatically by means of their context on the web. This can lead to intriguing results. For four years Cilibrasi explored the connection between information theory, artificial intelligence, pattern recognition, and machine learning. He performed his research at the Centrum voor Wiskunde en Informatica (CWI) in Amsterdam.
More information can be found on INS4's website, the webpages of Rudi Cilibrasi, the webpages of advisor Paul Vitányi, or the webpages of co-advisor Peter Grünwald.