Welcome To The World Machine!
At CERN, physicists are looking for nothing less than the fundamental principles of the universe, generating hundreds of petabytes of data every year – information that can only be processed with data analytics built on a network of trust.
Some call it the "cradle of the World Wide Web," as it was here that Tim Berners-Lee proposed a hypertext-based project to his employer, CERN. Others name it the "birthplace of the touch screen," as it was here that Frank Beck and Bent Stumpe invented the first capacitive touchscreen, a display with on-screen buttons the operator could activate by touch. Still others call it the "world machine," because that is exactly what they are building at the Conseil Européen pour la Recherche Nucléaire (CERN): a tool for fundamental questions such as understanding the very first moments of our universe, tracing the origin of the world by recreating the conditions just after the Big Bang. They are doing it with the Large Hadron Collider (LHC), the world's largest and most powerful particle accelerator. First started up in September 2008, the LHC consists of a 27-kilometer ring of superconducting magnets and a number of accelerating structures that boost the energy of the particles along the way.
Running Complex Algorithms to Achieve Structured Data
The experiments at CERN create an enormous flow of data: up to nearly 1 billion particle collisions per second, generating up to 1 petabyte of data per second. This data is filtered in real time by the so-called "trigger," which selects potentially interesting events. The physicists at CERN's data center in Meyrin, the heart of the lab's infrastructure, sift through these petabytes – mostly in real time – running complex algorithms to extract structured data. Still, even after filtering out almost 99 percent of it, CERN expected to gather around 50 to 70 petabytes of data in 2018.
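The idea of a trigger can be pictured as a staged filter: a fast first cut discards most events immediately, and a slower second cut refines the selection. The sketch below is a loose illustration of that principle, not CERN's actual trigger logic; the event fields and thresholds are invented for the example.

```python
import random

# Hypothetical two-stage event filter, loosely modeled on the idea of a
# "trigger": a fast first-stage cut followed by a slower software selection.
# The fields ("energy", "tracks") and cut values are illustrative only.

def first_stage(event, energy_cut=50.0):
    """Fast first cut: keep only high-energy events."""
    return event["energy"] > energy_cut

def second_stage(event, min_tracks=3):
    """Slower second cut: require a minimum number of particle tracks."""
    return event["tracks"] >= min_tracks

def filter_events(events):
    """Apply both stages; everything else is discarded immediately."""
    return [e for e in events if first_stage(e) and second_stage(e)]

random.seed(7)
events = [{"energy": random.uniform(0, 100), "tracks": random.randint(0, 10)}
          for _ in range(100_000)]
kept = filter_events(events)
print(f"kept {len(kept)} of {len(events)} events "
      f"({100 * (1 - len(kept) / len(events)):.1f}% discarded)")
```

The point of the staged design is that the cheap test runs on every event, so the expensive one only ever sees the small fraction that survives.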
And the data volume is not going to shrink: "The main driver for us until today was the so-called 'standard model of particle physics,' a very successful way to classify interacting particles," says Alberto Di Meglio, head of CERN openlab. The scientists' key mission was to complete this model, and that was, in the main, accomplished with the discovery of the Higgs boson. "That was the last missing piece," Di Meglio continues. "Compared to the challenges ahead, it was not difficult to 'listen' to the data and gain the relevant insights from it." Not difficult, because the physicists could describe exactly what they were looking for, says Di Meglio.
"We are relying on new ways of working, as well as on new ways of analyzing information, from machine learning to deep learning."
Today, this standard model is fairly complete; everything it can describe has been validated in experiments. However, with only 5 percent of the universe explained, it represents just a small part of what exists. "We know there is more that we cannot describe properly. We don't have a model for the remaining 95 percent." That makes it difficult to determine which data to discard and which to keep. Since there is always a risk of throwing away useful information, it is necessary to really listen to the data and what it contains, instead of hunting only for predefined patterns.
Already, the scale and complexity of data from the LHC is unprecedented and will continue to set new standards in the future. Data that needs to be stored, easily retrieved, and analyzed by physicists all over the world requires massive storage facilities, global networking, and immense computing power. But because CERN does not have the computing or financial resources to crunch all of the data on site, it turned to grid computing in 2002, in order to share the burden with computer centers around the world.
A Network Of Trust
The computing grid establishes the identity of the user, checks credentials, and searches for available sites that can provide the resources requested. Users do not have to worry about where the computing resources are coming from – they can tap into the grid's computing power and access storage on demand, says Ian Bird, head of the Worldwide LHC Computing Grid (WLCG). "One of the underlying principles of WLCG is that it's a federated infrastructure, in the sense that each user has a single identity that is recognized everywhere," Bird says. Basically, all scientists register and are given credentials that allow them to submit their work to the grid. This forms a network of trust between all of these computer centers, built on the certificates issued by trusted authorities.
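The federation principle – one identity, issued by a trusted authority and accepted at every site – can be sketched in a few lines. The real WLCG uses X.509 certificates and a hierarchy of certification authorities; the toy below substitutes HMAC signatures with shared secrets purely for illustration, and all names and secrets are invented.

```python
import hashlib
import hmac

# Toy sketch of a federated network of trust. Unlike the real X.509-based
# WLCG setup, this uses shared-secret HMAC signatures for simplicity:
# a small set of trusted authorities signs user identities, and every
# site accepts a credential if it verifies against a trusted issuer.

TRUSTED_AUTHORITIES = {            # authority name -> signing secret
    "ca-europe": b"secret-eu",     # illustrative secrets, not real keys
    "ca-americas": b"secret-am",
}

def issue_credential(authority, user_id):
    """An authority signs the user's identity (after in-person vetting)."""
    secret = TRUSTED_AUTHORITIES[authority]
    sig = hmac.new(secret, user_id.encode(), hashlib.sha256).hexdigest()
    return {"user": user_id, "issuer": authority, "signature": sig}

def site_accepts(credential):
    """Any site in the federation checks the issuer and the signature."""
    secret = TRUSTED_AUTHORITIES.get(credential["issuer"])
    if secret is None:
        return False               # issuer is not a trusted authority
    expected = hmac.new(secret, credential["user"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, credential["signature"])

cred = issue_credential("ca-europe", "alice")
print(site_accepts(cred))                  # valid credential -> True
forged = dict(cred, user="mallory")
print(site_accepts(forged))                # tampered identity -> False
```

No site needs to know the user personally: it only needs to trust the issuing authority, which is exactly the property that lets one identity work across the whole federation.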
Bird: "There is a whole set of rules and conditions under which these certificates are issued. For instance, the identity of the applicant has to be verified in person." That validation process is the fundamental part of CERN's network of trust, Bird continues, adding, "It puts us in a much better situation than we were in when we were a bunch of individual computer centers."
Security, Done By People
Security is done by people rather than by technology, he says. "Of course, there is a layer of technology, but ultimately the only reason we can run that federated network is because we trust each other to issue the credentials in a way everyone agrees upon – a network of contacts that every business out there should have."

CERN is pushing boundaries, technology-wise, data-wise, and business-wise. "We are relying on new ways of working, as well as on new ways of analyzing information, from machine learning to deep learning," says Alberto Di Meglio. Increasing the speed at which an analysis is made, so that it can be done in real time, is what all of the experiments are about.
The goal is to shrink the gap between the online and the offline world. "Today it is all about collecting the data, reducing it as much as possible, and analyzing what's left offline. Future experiments will have to process much more data much faster in real time." That is why the scientists at CERN are exploring how advanced applications can close that gap: the trend toward collecting information from sensors, wearable devices, and machines is growing, and so is the need to analyze that massive amount of information in real time.
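Processing data "online" rather than storing it first often comes down to streaming algorithms that keep a running summary in constant memory. As one minimal example (the sensor readings here are made up, but the algorithm, Welford's method, is standard), a stream's mean and variance can be updated one value at a time without ever storing the raw data:

```python
# Minimal sketch of online analysis: instead of storing raw readings and
# analyzing them later, a running (Welford) estimator maintains mean and
# variance in constant memory as each value streams past.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0              # sum of squared deviations from the mean

    def update(self, x):
        """Fold one new reading into the estimate, then discard it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(reading)        # constant memory, however long the stream

print(stats.mean, stats.variance)  # 5.0 4.0
```

The same pattern – update a compact summary, discard the raw value – is what makes real-time processing of sensor and machine data feasible at scales where "collect everything, analyze later" breaks down.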