Computer programs like WinZip are able to reduce the size of electronic files by
applying data compression techniques from computer science. In his
thesis "When Data Compression and Statistics Disagree: Two Frequentist
Challenges for the Minimum Description Length Principle", Tim van Erven
from the Centrum Wiskunde & Informatica in Amsterdam investigated
how the same techniques can be applied in the field of statistics. His
research shows that commonly used statistical methods can be improved by
examining their data compression properties. Van Erven obtained his
PhD from Leiden University on November 23. His results are of interest
to, amongst others, medical science, astronomy and physics.
Data compression is based on a search for patterns. When a program like
WinZip encounters a computer file that contains the letter 'A' 1000 times,
"AAAAAAAAAAAAAA... AAA", it turns it into a smaller file that only
contains "1000 times A". That is, it uses the pattern that is present in
the original file to build up a shorter description of its contents.
Even more complicated patterns, like the letter 'E' being more common
than the letter 'X', can be used to shrink computer files.
State-of-the-art compression programs even use advanced probability
models from statistics to describe patterns.
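
The two kinds of pattern described above can be illustrated with a small Python sketch. It is a toy illustration only, not how WinZip or any real compression program is implemented, and the function names `run_length_encode` and `ideal_code_length_bits` are made up for this example. The first function collapses repeated letters into "count times letter" pairs; the second estimates how many bits a text would need if common letters were given short codes and rare letters long ones.

```python
import math
from collections import Counter
from itertools import groupby


def run_length_encode(text):
    """Collapse each run of identical characters into a (count, char) pair."""
    return [(len(list(group)), char) for char, group in groupby(text)]


def ideal_code_length_bits(text):
    """Total bits if each character costs -log2(its relative frequency).

    Assumption for this sketch: frequencies are measured on the text itself,
    and the cost of describing the frequency table is ignored.
    """
    counts = Counter(text)
    total = len(text)
    return sum(-n * math.log2(n / total) for n in counts.values())


# A file with 1000 times the letter 'A' shrinks to a single pair.
print(run_length_encode("A" * 1000))

# A text in which 'E' is much more common than 'X' needs fewer bits
# than a text of the same length in which both are equally common.
skewed = "E" * 90 + "X" * 10
uniform = "E" * 50 + "X" * 50
print(ideal_code_length_bits(skewed))
print(ideal_code_length_bits(uniform))
```

In this sketch the text with unequal letter frequencies comes out at fewer bits than the evenly mixed one, which is the sense in which a probability model of the data also serves as a way to compress it.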
The patterns found with data compression turn out to be very useful in statistics as well.
In his thesis Tim van Erven describes a practical way to improve
so-called Bayesian methods, which are standard in statistics. Through
mathematical proofs it is established that this improvement leads to
better predictions on the basis of less data. This allows us to draw
more reliable conclusions. Bayesian methods are widely used in the life
sciences, for example in genetics, but also in studies of the brain in
neuroscience, in astronomy and in physics. It is expected that over the
next few years the results in Van Erven's thesis will find their way
into applied research.
Van Erven's investigations also touch upon philosophical questions about the
foundations of statistics and even
science in general. Using the connection with data compression, it has
been mathematically established that the simplest explanation is often
best. This tenet, which is called ‘Occam's razor’, is commonly applied
throughout all of science. In this context Van Erven shows, for example,
that applying Occam's razor gives the famous Grue paradox from philosophy
a simple solution.
Statistical methods improved by data compression techniques
Publication date: 3 Dec 2010