Accelerating genomics research discovery to feed the world and fight disease

Genomics research is rapidly becoming one of the leading generators of Big Data for science, with the potential to equal if not surpass the data output of the high energy physics community. Like physicists, university-based life science researchers must collaborate with counterparts and access data repositories around the globe.


For example, in the United States, the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine in Maryland, is the largest repository, providing access to massive genomic datasets. NCBI hosts almost 25 petabytes of research data and makes it available to the global community of scientists, including researchers who are collaborating to find better ways to diagnose, treat and prevent cancer, as well as those working to develop new plant varieties resistant to pests, floods and drought.

Optimizing data transfer rates

NCBI provides genomic sequence data to life scientists like the genomics researchers at Clemson University (CU) in South Carolina. Using their advanced Internet2 Network connections and Internet2 services specialized to support research collaborations, NCBI and Clemson teamed up to transform the research workflow for transferring genomic Big Data. The work was supported in part by a National Science Foundation Campus Cyberinfrastructure - Network Infrastructure and Engineering Program award to CU.

The ability to access and analyze massive amounts of available data faster is key to accelerating the timeline from experimentation to discovery, and the researchers involved in this project were eager to push boundaries. Transferring data as fast as possible into DNA analysis workflows through optimized network facilities and parallel transfer mechanisms would enable researchers to mine data and perform experiments at an unprecedented scale. The ability to create and use dedicated high-speed connections as needed between data sources and analytical computing power is a powerful tool for increasing the throughput of genomics analyses.

Using a dedicated 10 gigabit per second (Gbps) Internet2 Advanced Layer 2 Service (AL2S) circuit between CU and NCBI, and by varying file transfer protocols, storage system characteristics and network software tuning, the experimenters achieved network transfer rates of 7.5 Gbps with the Aspera transfer client, compared with 0.5 Gbps before the optimizations were applied. After optimization, the team transferred datasets totaling 12 terabytes across the Internet2 AL2S circuit in 11.6 hours instead of eight days.
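The transfer figures above can be related with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes decimal units (1 terabyte = 10^12 bytes, 1 Gbps = 10^9 bits per second), and the function names are hypothetical, not part of any tool used in the project. It computes the average end-to-end throughput implied by the elapsed times, which is not the same quantity as the network transfer rates quoted above.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s).

def average_gbps(terabytes: float, hours: float) -> float:
    """Average end-to-end throughput (Gbps) for a transfer of this size and duration."""
    bits = terabytes * 1e12 * 8
    seconds = hours * 3600
    return bits / seconds / 1e9

def hours_at_rate(terabytes: float, gbps: float) -> float:
    """Ideal transfer time (hours) if a constant rate were sustained throughout."""
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9) / 3600

DATASET_TB = 12          # total dataset size from the article
HOURS_AFTER = 11.6       # elapsed time after optimization
HOURS_BEFORE = 8 * 24    # "eight days" before optimization

avg_after = average_gbps(DATASET_TB, HOURS_AFTER)
avg_before = average_gbps(DATASET_TB, HOURS_BEFORE)
speedup = HOURS_BEFORE / HOURS_AFTER
ideal_hours_at_peak = hours_at_rate(DATASET_TB, 7.5)

print(f"average after optimization:  {avg_after:.2f} Gbps")
print(f"average before optimization: {avg_before:.2f} Gbps")
print(f"reduction in elapsed time:   {speedup:.1f}x")
print(f"ideal time at 7.5 Gbps:      {ideal_hours_at_peak:.1f} hours")
```

Under these assumptions, the elapsed times correspond to an average of about 2.3 Gbps after optimization and a roughly 16.5-fold reduction in transfer time. The averages come out below the quoted network rates, which would be consistent with those rates being peaks while the elapsed times cover the whole transfer, including storage and protocol overhead. The article does not give that breakdown, so treat this as an interpretation.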

Analyzing data faster

Not only did the genomics scientists obtain their data far more quickly, but this high-speed transfer now makes it reasonable for a genomic researcher to download datasets, process them, re-design the experiment—and repeat.

This collaboration between Clemson and NCBI researchers and technologists illustrates how new network services can be integrated into existing workflows to improve research productivity and reduce time to discovery.

Although advanced technology such as high performance computing and high-speed networks played an essential role in the success of this project, the human factor was a key component. The right people from both organizations—technologists and scientists—were linked.

One of the biggest lessons learned was the power of community to remove significant roadblocks to scientific research—what the collaborators called “the social level.”

Alex Feltus, Associate Professor, Genetics and Biochemistry at Clemson University, said, “This project was invaluable in learning how to link two organizations via Internet2’s AL2S, and we intend to take these lessons to heart as we optimize our research computing connections with other institutions.”

Published: 10/2015
