Multi-facility Workflow Case Study

In one experiment at the Linac Coherent Light Source (LCLS) at SLAC National Accelerator Laboratory, scientists needed to process raw data to analyze catalytic reactions. The experiment required more than 100 terabytes (TB) of data to be processed in semi-real-time so that instruments could be adjusted between 12-hour shifts. This called for a reliable, high-speed network (ESnet) and a special 150 TB allocation at NERSC to process the data.

The detector for this crystallography experiment captured 120 images per second, each 10 MB in size. This yielded 1.2 GB/s, or potentially 4.3 TB/hr at full capacity. As the data was acquired, it was sent over ESnet from SLAC to NERSC for processing. The computational engine behind the analysis was NERSC's Mendel cluster, which uses modern software adaptations of standard High Performance Computing (HPC) batch queuing to enable High Throughput Computing (HTC). Because the data was processed in semi-real-time, the scientists could adjust the experimental equipment as needed and make the most of their time at the facility.
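The detector figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming decimal (SI) units (1 GB = 1000 MB, 1 TB = 1000 GB); the function names are illustrative and not part of any facility software.

```python
# Detector figures quoted in the text.
IMAGES_PER_SECOND = 120
IMAGE_SIZE_MB = 10

# Assumption: decimal (SI) unit conversions.
MB_PER_GB = 1000
GB_PER_TB = 1000
SECONDS_PER_HOUR = 3600
SHIFT_HOURS = 12  # shift length quoted in the text


def detector_rate_gb_per_s(images_per_second, image_size_mb):
    """Raw detector output in gigabytes per second."""
    return images_per_second * image_size_mb / MB_PER_GB


def hourly_volume_tb(rate_gb_per_s):
    """Data volume in terabytes produced in one hour at a sustained rate."""
    return rate_gb_per_s * SECONDS_PER_HOUR / GB_PER_TB


rate = detector_rate_gb_per_s(IMAGES_PER_SECOND, IMAGE_SIZE_MB)
per_hour = hourly_volume_tb(rate)
per_shift = per_hour * SHIFT_HOURS

print(f"Detector rate:        {rate:.2f} GB/s")
print(f"Volume per hour:      {per_hour:.2f} TB")
print(f"Volume per 12h shift: {per_shift:.2f} TB")
```

At full capacity this gives 1.2 GB/s and 4.32 TB/hr, which the text rounds to 4.3 TB/hr. A full 12-hour shift at that sustained rate would produce about 51.8 TB, so the figures in the text describe an upper bound rather than the volume actually recorded.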

As part of a multi-facility collaboration among LCLS, ESnet, and NERSC, ESnet was responsible for providing reliable, high-speed connectivity between NERSC and LCLS. ESnet monitored and tracked the experiment-specific data while it was being transferred (see Figure 1), with the network running at very high utilization: 96% of the total 10G circuit capacity. In all, the scientists sent 113.6 TB across ESnet over five days.
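To put the network figures in perspective, the sketch below converts the 96% utilization into a byte rate and estimates the average throughput behind the 113.6 TB total. It assumes that "10G" means 10 gigabits per second and that units are decimal. The text does not say how the five days were divided between transfer and idle time, so both averaging windows (five full days, and five 12-hour shifts) are assumptions for illustration.

```python
# Figures quoted in the text.
LINK_CAPACITY_GBPS = 10.0      # assumption: "10G" = 10 gigabits per second
PEAK_UTILIZATION = 0.96
TOTAL_TRANSFERRED_TB = 113.6


def gbps_to_gb_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def average_gbps(total_tb, hours):
    """Average rate in gigabits per second for total_tb moved in the given hours."""
    total_bits = total_tb * 1e12 * 8
    return total_bits / (hours * 3600) / 1e9


peak_gbps = LINK_CAPACITY_GBPS * PEAK_UTILIZATION
peak_gb_per_s = gbps_to_gb_per_s(peak_gbps)

# Assumption: two possible averaging windows, neither stated in the text.
avg_wall_clock = average_gbps(TOTAL_TRANSFERRED_TB, hours=5 * 24)
avg_shifts_only = average_gbps(TOTAL_TRANSFERRED_TB, hours=5 * 12)

print(f"Utilized rate:          {peak_gbps:.1f} Gb/s ({peak_gb_per_s:.1f} GB/s)")
print(f"Average over 5 days:    {avg_wall_clock:.2f} Gb/s")
print(f"Average over 5 shifts:  {avg_shifts_only:.2f} Gb/s")
```

Under these assumptions, 96% of the circuit is 9.6 Gb/s, or 1.2 GB/s, which matches the detector's full output rate. The average comes to roughly 2.1 Gb/s over five full days, or roughly 4.2 Gb/s if transfers ran only during the five shifts, so the 96% figure is best read as a utilization reached during transfers rather than a five-day average.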

Figure 1: Network traffic from LCLS to NERSC, as shown in the MyESnet Portal. The portal displays the day, the time, the amount of data traveling across the network, and the direction of the traffic.

Because the scientists could perform semi-real-time analysis, they made more effective use of valuable LCLS beam time, thereby enhancing the value of the facilities’ resources. The experiment did encounter some points of congestion (see Figure 2), which gave the facilities valuable operational information for optimizing the infrastructure to support future experiments. Overall, however, the multi-facility data relay was a success.

Figure 2: Data transferred, tracked in terabytes over time. Each plateau represents a 12-hour experimental shift, from shift 1 (day 1) to shift 5 (day 5). There were some points of congestion during the run. For example, during shift 3 a competing data transfer job reduced the available capacity of one data transfer host by 33%, lowering the data throughput for that shift. These results provide valuable input to resource scheduling and provisioning models.
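The effect of a single degraded host on overall throughput depends on how many hosts share the load, which the text does not state. The sketch below is a simple model under two loud assumptions: the transfer is spread evenly across identical data transfer hosts, and the host counts used (2, 4, and 8) are hypothetical values chosen only for illustration.

```python
# Capacity reduction on the affected host, as quoted in the caption.
HOST_CAPACITY_REDUCTION = 0.33


def aggregate_capacity_fraction(num_hosts, degraded_hosts=1,
                                reduction=HOST_CAPACITY_REDUCTION):
    """Fraction of total capacity remaining when some hosts are degraded.

    Assumes all hosts are identical and carry an equal share of the load.
    """
    if num_hosts < 1 or not 0 <= degraded_hosts <= num_hosts:
        raise ValueError("degraded_hosts must be between 0 and num_hosts")
    lost = degraded_hosts * reduction
    return (num_hosts - lost) / num_hosts


# Hypothetical host counts; the real number is not given in the text.
for hosts in (2, 4, 8):
    remaining = aggregate_capacity_fraction(hosts)
    print(f"{hosts} hosts: {remaining:.1%} of aggregate capacity remains")
```

In this model the aggregate loss shrinks as the host count grows: a 33% reduction on one of four hypothetical hosts removes about 8% of total capacity. This is one reason such measurements are useful as input to provisioning models.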