As more patient genomes become available to researchers, the chances of finding genetic mutations associated with disease, and thereby improving treatment, rise. However, the demands for computational speed and capacity grow as well. In a joint project, the Cancer Science Institute of Singapore (CSI Singapore), the National Supercomputing Centre (NSCC), and SingAREN, the national research and education network (NREN), have taken genomic cancer research to a new level.

“The network speeds of SingAREN, combined with the computational power of NSCC (..) has allowed us to greatly enhance local scientific efforts by enabling us to effectively utilize petabytes of sequencing data generated globally,” says Dr. Jason Pitt, CSI Singapore.

In the project, SingAREN provided Dr. Pitt’s team with a high-speed link for downloading petabytes of cancer genomics data from US repositories to the NSCC supercomputer ASPIRE 2A. The project consumed more than 2 petabytes of data from the National Cancer Institute’s Genomic Data Commons, which is administered by the University of Chicago.

Avoids data harmonization errors

A major challenge for researchers who want to include data from many patients in their studies is that countries and institutions differ in how they store and handle data. When disparate computational tools and algorithms are applied to distinct subsets of sequencing data, it can become difficult to distinguish real mutational patterns from systematic errors. Such systematic errors are known as batch effects.

In the project, a workflow known as Scalable Workflows for the Analysis of Genomes (SWAG) was optimized for ASPIRE 2A to clean and harmonize data from local and external sources to eliminate batch effects.

SWAG was designed to extract valuable mutation calls from DNA sequencing data. In addition to fueling the Pitt Lab’s data-intensive exploration of genome instability patterns in cancer, the recapitulated, batch-effect-free output is distributed to National University of Singapore research centers to bolster collaborative scientific projects.

Tenfold increase in download speed

Because the Pitt Lab harmonizes petabytes of whole-genome sequencing data, downloads must be extremely fast, stable, and efficient. Once downloaded, the data is analyzed and reprocessed, which in turn requires high computing power.

“We have observed a significant speed improvement in our data transfers over the internet using SingAREN’s infrastructure compared to other resources. Specifically, we have seen up to a tenfold increase in download speeds, greatly accelerating the throughput of our genomic analysis workflows,” says Akila Perera, HPC/Cloud Engineer & SWAG development lead in the Pitt Lab.

Dr. Pitt’s team has started applying AI and representation learning tools downstream to understand and predict clinically relevant genome instability phenotypes. As Singapore evolves into a repository of research data, SingAREN will facilitate the high-speed distribution of that data to researchers across the island.

The text is inspired by the article “Unlocking the future of cancer genomics with high-speed data transfer: a collaboration with Cancer Science Institute of Singapore, SingAREN and NSCC” at the SingAREN website.
