Photon speedway puts big data in the fast lane

IMAGE: Junko Yano and Vittal Yachandra's group and collaborators are using femtosecond X-rays at LCLS and supercomputers at NERSC to study photosystem II. Left to right: Junko Yano, Jan Kern, Sheraz...

Credit: Roy Kaltschmidt, Lawrence Berkeley National Laboratory

A series of experiments conducted by Lawrence Berkeley National Laboratory (Berkeley Lab) and SLAC National Accelerator Laboratory (SLAC) researchers is shedding new light on the photosynthetic process. The work also illustrates how light sources and supercomputing facilities can be linked via a "photon science speedway" as a solution to emerging challenges in massive data analysis.

Last year, Berkeley Lab and SLAC researchers led a protein crystallography experiment at SLAC's Linac Coherent Light Source (LCLS) to look at the different photoexcited states of photosystem II, an assembly of large protein molecules that play a crucial role in photosynthesis. Subsequent analysis of the data on supercomputers at the Department of Energy's (DOE's) National Energy Research Scientific Computing Center (NERSC) helped explain how nature splits a water molecule during photosynthesis, a finding that could advance the development of artificial photosynthesis for clean, green and renewable energy.

"An effective method of solar-based water-splitting is essential for artificial photosynthesis to succeed, but developing such a method has proven elusive," said Vittal Yachandra, a chemist with Berkeley Lab's Physical Biosciences Division and one of the study leaders. "This study represents a major advance towards the real time characterization of the formation of the oxygen molecule in photosystem II and has yielded information that should prove useful for designing artificial solar energy-based devices to split water."

The findings were published July 9, 2014, in Nature Communications.

Real-Time Data Collection

The experiment was also notable as an early demonstration of a "photon science speedway" connecting LCLS and NERSC. Historically, light source users had to travel to facilities like the LCLS and Berkeley Lab's Advanced Light Source (ALS) to run experiments. They would then download the raw data to an external hard drive and take this information home for processing and analysis on their personal computers. In this case, however, the researchers saw their data arrive at NERSC in real time using the DOE's Energy Sciences Network (ESnet).

A total of 114 terabytes of data was collected over a five-day period (February 28-March 3, 2013); during each day's 12-hour experiment run, the data was transferred over ESnet at 7.5 gigabits per second, which enabled the data processing team at NERSC to determine each night how good the data was from the previous day. The scientists at SLAC could then modify the next day's experiments if need be, explained Nicholas Sauter, a computational scientist in Berkeley Lab's Physical Biosciences Division and data processing lead on the study.
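
As a rough sanity check on those figures, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not the team's actual accounting. It assumes decimal units (1 terabyte = 10^12 bytes, 1 gigabit = 10^9 bits), a sustained 7.5 gigabit-per-second rate, and data spread evenly across the five days; none of these assumptions come from the study, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the quoted transfer figures.
# Assumptions (not from the article): decimal units, a sustained link rate,
# and an even split of the total data volume across the days.

def capacity_tb(rate_gbps: float, hours: float) -> float:
    """Terabytes a link sustaining rate_gbps could move in the given hours."""
    bits = rate_gbps * 1e9 * hours * 3600
    return bits / 8 / 1e12


def transfer_hours(volume_tb: float, rate_gbps: float) -> float:
    """Hours needed to move volume_tb terabytes at a sustained rate_gbps."""
    bits = volume_tb * 1e12 * 8
    return bits / (rate_gbps * 1e9) / 3600


RATE_GBPS = 7.5   # quoted ESnet transfer rate
RUN_HOURS = 12    # quoted length of each day's experiment run
TOTAL_TB = 114    # quoted total data volume
DAYS = 5          # quoted length of the collection period

if __name__ == "__main__":
    per_day_tb = TOTAL_TB / DAYS  # assumed even split across days
    print(f"Link capacity per run: {capacity_tb(RATE_GBPS, RUN_HOURS):.1f} TB")
    print(f"Average data per day:  {per_day_tb:.1f} TB")
    print(f"Hours to move one day: {transfer_hours(per_day_tb, RATE_GBPS):.1f}")
```

Under these assumptions, the link could carry roughly 40 terabytes during a 12-hour run, while an average day produced about 23 terabytes, which would take a little under seven hours to move. That is consistent with each day's data reaching NERSC in time for overnight analysis.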

"Out of the gate, the speedway approach greatly enhances researcher efficiency. More importantly it builds new research agendas in beamline science built on advanced computing," said David Skinner, NERSC's strategic partnerships lead and co-author on the Nature Communications study. "We see a bright future in connecting these capabilities through ESnet."

The speed of the data collection and analysis enabled the research team to study a time progression of the photosynthetic process, Sauter said--a vital step in gaining new insights into how photosynthesis works and how it can be replicated using manmade materials.

"We are trying to learn about a chemical reaction that is responsible for producing most of the oxygen in our atmosphere," he said. "Down the line, if we understand the way the green leaf converts sunlight to energy, it can help us make better materials that would do artificial photosynthesis. The chemical mechanism can only be understood by studying how the reaction happens over time. We had to use NERSC to rapidly process all of this to answer the question of which samples were good."

New Software Tools

Because of the unique nature of the data and the speed at which it was collected, Sauter and Paul Adams, also with the Lab's Physical Biosciences Division, had to develop new software--cctbx.xfel, a subset of the Lab's Computational Crystallography Toolbox--to process the data.

"The unique thing about cctbx.xfel is that we applied several special procedures to correct for systematic errors that result from the way we measure X-ray diffraction data at X-ray free-electron sources like the LCLS," Sauter said. "Therefore we get a better signal-to-noise ratio than with other software that doesn't include these procedures."

Also critical to the success of the experiment was the additional capacity ESnet added to the pipeline at SLAC to support the workflow, said Eli Dart, a network engineer in the ESnet Science Engagement Group.

"In terms of direct experiment coupling, this was one of the largest real-time data transfers ESnet has facilitated," Dart said. "The duty cycle for the LCLS experiment was pretty tight; they would take the data, push it, analyze it, then use the results of the analysis in the next day's setup."

In a follow-on experiment that ran at LCLS in July 2014, the researchers added another twist: SPOT Suite, a set of software tools developed at Berkeley Lab that give scientists access to automated data management, data analysis and simulation tools when working with instruments such as the ALS and LCLS.

"The scientists at these beamlines are doing experiments that involve a vast array of aspects, components, detectors, expertise and types of science," said Craig Tull, who leads Berkeley Lab's Software Systems Group and the SPOT Suite Laboratory Directed Research and Development (LDRD) project. "What SPOT Suite does is take the things scientists or their post-docs would normally have to do over and over manually and automate much of the work, allowing them to focus on the science and getting their experiments up and running."

Using SPOT Suite during the most recent LCLS experiment did just that, Tull said.

"The first time they saw crystallography data from this particular run was from SPOT Suite," he said. "As they were setting up their experiments it was in the background plugging away, so every time there was a data transfer and the data would come in, a picture would show up. As soon as they got everything working they could just turn to the computer screen and lo and behold there was the result that they otherwise would have had to spend another hour or two analyzing data to get."

Expanding the Speedway

All of this bodes well for the creation of a photon science speedway that connects beamlines like LCLS and ALS to supercomputing centers like NERSC--a scenario that will become increasingly critical as the datasets generated by experiments at these facilities grow ever larger.

"This is part of a vanguard of a new class of experimental workflows that benefit from attaching an HPC facility to a light source beamline," Dart said. "What this gives us is a world where the people at those light and neutron sources don't have to deploy a supercomputer to do the analysis necessary for the next generation of experiments. And beamline scientists don't need to become HPC experts in order to be able to use the light source to its full potential."

Building a photon science speedway could take these early examples forward toward new operational modes for light sources. "Some very exciting detectors are pushing light source data rates through the roof," Skinner said. "SLAC and NERSC recognize that the value of these instruments to scientific discovery grows when we raise the speed limit for scientists in extracting knowledge from their data."
