Statistical methods improved by data compression techniques

Publication date
3 Dec 2010


Computer programs like WinZip are able to reduce the size of electronic files by applying data compression techniques from computer science. In his thesis, "When Data Compression and Statistics Disagree: Two Frequentist Challenges for the Minimum Description Length Principle", Tim van Erven from the Centrum Wiskunde & Informatica in Amsterdam investigated how the same techniques can be applied in the field of statistics. His research shows that commonly used statistical methods can be improved by examining their data compression properties. Van Erven obtained his PhD from Leiden University on November 23. His results are of interest to, amongst others, medical science, astronomy and physics.

Data compression is based on a search for patterns. When a program like WinZip encounters a computer file that contains the letter 'A' 1000 times, "AAAAAAAAAAAAAA... AAA", it turns it into a smaller file that only contains "1000 times A". That is, it uses the pattern that is present in the original file to build up a shorter description of its contents. More complicated patterns, like the letter 'E' being more common than the letter 'X', can also be used to shrink computer files. State-of-the-art compression programs even use advanced probability models from statistics to describe patterns.
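The "1000 times A" idea can be sketched in a few lines of Python. This is only an illustration of describing a repeated pattern more briefly, using a simple scheme known as run-length encoding; it is not a description of how WinZip works internally. The function names run_length_encode and run_length_decode are made up for this example, and zlib is a general-purpose compressor from Python's standard library, included for comparison.

```python
import zlib
from itertools import groupby


def run_length_encode(text):
    """Collapse each run of identical characters into a (count, char) pair."""
    return [(len(list(group)), char) for char, group in groupby(text)]


def run_length_decode(pairs):
    """Rebuild the original text from (count, char) pairs."""
    return "".join(char * count for count, char in pairs)


# A file that contains the letter 'A' 1000 times.
original = "A" * 1000

# The short description: a single pair saying "1000 times A".
encoded = run_length_encode(original)

# A general-purpose compressor also exploits the repetition.
compressed = zlib.compress(original.encode("ascii"))

print(encoded)
print(len(original), "characters shrink to", len(compressed), "bytes with zlib")
```

Decoding the pairs gives back exactly the original text, so no information is lost: the shorter description is possible only because the file contains a pattern.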

The patterns found with data compression turn out to be very useful in statistics as well. In his thesis Tim van Erven describes a practical way to improve so-called Bayesian methods, which are standard in statistics. Through mathematical proofs it is established that this improvement leads to better predictions on the basis of less data. This allows us to draw more reliable conclusions. Bayesian methods are widely used in the life sciences, for example in genetics, but also in studies of the brain in neuroscience, in astronomy and in physics. It is expected that over the next few years the results in Van Erven's thesis will find their way into applied research.

Van Erven's investigations also touch upon philosophical questions about the foundations of statistics and even science in general. Using the connection with data compression, it has been mathematically established that the simplest explanation is often best. This tenet, which is called ‘Occam's razor’, is commonly applied throughout all of science. In this context Van Erven shows, for example, that applying Occam's razor yields a simple solution to the famous Grue paradox from philosophy.