The Netherlands Organization for Scientific Research (NWO) has granted a budget to the Centrum Wiskunde & Informatica (CWI) in Amsterdam to develop a database machine that can analyze more than 60 gigabytes of raw data per second as economically as possible. The research project, called SciLens, aims to reveal hidden knowledge stored in extensive scientific databases. The hardware and software required for this purpose are not yet available. An important part of the project is therefore building the SciLens machine, which will be operational at the end of this year.
The leading-edge system is specially configured for database management tasks, such as plowing quickly through large amounts of data, and is composed mainly of energy-efficient components. The machine differs from supercomputers in its heavy emphasis on a good balance between I/O bandwidth and the CPU power required for database tasks, a design known as Amdahl blades.
Applications lie in research areas such as seismology, astronomy, remote sensing, data mining, and fraud detection in social networks. During the earthquake in Chile in early 2010, seismologists collected two terabytes of data. With a normal computer it is almost impossible to search and analyze this data quickly. A complete scan takes the SciLens machine only 30 seconds.
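The scan-time claim can be checked with simple arithmetic. The sketch below assumes decimal units (1 terabyte = 1,000 gigabytes) and a scan limited purely by bandwidth; the function names are illustrative and not part of SciLens.

```python
def required_bandwidth_gb_per_s(data_gb: float, seconds: float) -> float:
    """Bandwidth needed to read data_gb gigabytes in the given time."""
    return data_gb / seconds


def scan_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time for one full pass over data_gb at a sustained bandwidth."""
    return data_gb / bandwidth_gb_per_s


# Assumption: 2 terabytes taken as 2,000 gigabytes (decimal units).
CHILE_DATA_GB = 2_000

if __name__ == "__main__":
    needed = required_bandwidth_gb_per_s(CHILE_DATA_GB, 30)
    print(f"Bandwidth for a 30-second scan: {needed:.1f} GB/s")
    print(f"Scan time at 60 GB/s: {scan_seconds(CHILE_DATA_GB, 60):.1f} s")
```

Under these assumptions, a 30-second scan of two terabytes needs roughly 67 gigabytes per second, which is consistent with the stated rate of more than 60 gigabytes per second.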
“A main difference with an internet search engine is that SciLens can literally find a needle in a haystack without having indexed it first,” says Martin Kersten of CWI, the initiator of the project. “Using Google, the haystack is divided into hay bales beforehand and each bale has a sign, telling what is inside. The SciLens machine can process data rapidly without having prior information of what is searched for.”
The SciLens machine is built like a pyramid with four tiers. Each tier has a different type of computer, ranging from 256 energy-efficient Intel Atoms to sixteen high-end servers. Every tier has a total of one terabyte of memory and 128 terabytes of hard disk space. A superfast InfiniBand network enables the database system to use this distributed memory as a ring buffer, making it possible to process more than 256 gigabytes per second. The top consists of a single system with a terabyte of memory. It will take about two years before such a system is available on the market.
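The idea of treating distributed memory as a ring buffer can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the actual SciLens or MonetDB mechanism: each node is assumed to hold one chunk of data, chunks move one position around the ring per step, and each node filters whatever chunk is passing by. The function `circulate` and its parameters are hypothetical.

```python
from collections import deque
from typing import Callable, List


def circulate(chunks: List[List[int]],
              predicate: Callable[[int], bool]) -> List[List[int]]:
    """Rotate chunks around a ring of nodes, one position per step.

    Node i inspects the chunk currently in slot i. After len(chunks)
    steps every node has seen every chunk exactly once, without any
    node holding the full data set or consulting an index.
    """
    ring = deque(chunks)          # slot i belongs to node i
    n = len(ring)
    hits: List[List[int]] = [[] for _ in range(n)]
    for _ in range(n):
        for node, chunk in enumerate(ring):
            hits[node].extend(v for v in chunk if predicate(v))
        ring.rotate(1)            # every chunk moves to the next node
    return hits


if __name__ == "__main__":
    # Hypothetical data: three nodes, each starting with one small chunk.
    data = [[1, 8], [3, 12], [7, 20]]
    for node, found in enumerate(circulate(data, lambda v: v >= 8)):
        print(f"node {node} found {sorted(found)}")
```

In this toy model every node ends up with the same matches, only discovered in a different order, because each node simply waits for the data to come past rather than fetching it.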
According to Kersten, the biggest challenge in the construction is finding the right balance among the components in the context of the intended database software, MonetDB. His ideal is that elements from the bottom tier will serve as a model for a MonetDB database machine the size of a shoebox, with a capacity of ten terabytes, that every scientist can afford to use when searching through the abundance of observations. Once the SciLens machine is operational, it will become available for research by CWI and its partners.