Delivering the data for groundbreaking cancer research

When the Sheba Cancer Research Center (SCRC) wanted to transfer 300 terabytes of data from the US National Cancer Institute Center for Cancer Genomics (CCG) Genomic Data Commons (GDC) in Chicago to local storage to advance their research, they thought the process would be relatively straightforward.

They acquired the resources needed in Sheba’s data center, opened accounts on the GDC Data Portal (a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine), received authorization for access to the databases, installed the client tools, and pressed ENTER to begin the transfer of the harmonized datasets. The next message, “Your download will be complete in 43,800 hours,” took them by surprise. The exact language may have been a bit different, but the message was clear: the existing networking infrastructure was not up to the job.

The Sheba Cancer Research Center, affiliated with the Tel Aviv University Sackler School of Medicine, was founded and is led by Professor Gideon (Gidi) Rechavi. It is a facility known for its pioneering, top-notch core research, and not one likely to concede in the face of technology challenges.

Prof. Rechavi is a world-renowned researcher in his area of expertise: identifying the role of transposable genetic elements in the activation of cancer-causing genes and deciphering global modifications in the human transcriptome by RNA editing and RNA methylation. He is also known for staying two steps ahead when it comes to the technology that enables advanced cancer research. It is this vision that keeps SCRC equipped with the most advanced, and necessary, technologies that make it possible to decipher novel genetic and epigenetic mechanisms affecting global gene expression and their implications in cancer and neuronal disorders. Under his direction, SCRC was the first research facility in Israel to acquire DNA microarray and next-generation sequencing (NGS) technologies. Professor Rechavi also spearheaded efforts to acquire sufficient storage to meet his ambitious goal of downloading the GDC datasets from the NIH.

Committed to getting the data, Dr. Eran Eyal, SCRC’s Head of Bioinformatics, and his team started to think outside the box. Way outside the box. “We asked the Sheba IT department and commercial ISPs for advice,” says Eran. “But neither had a workable solution. We even explored the option of traveling from Israel to the Chicago NIH facility with digital storage devices in our suitcases to physically bring the data back to Israel. But the cost of airfare, accommodations, and the time was too high. And the NIH team had also never come across a lab that wanted to do that.”

Israel’s R&E Network: the enabling solution

It was the GDC portal team that led them to seek advice from the Inter-University Computation Center (IUCC), Israel’s National Research & Education Network (NREN).

Eran contacted Hank Nussbacher, IUCC’s Director of Network & Computing Infrastructure, for assistance. IUCC had already done some work with another Sheba Medical Center unit, using IUCC’s ILAN network for high-speed tele-surgery applications. But even that level of connectivity and speed was nowhere near what was needed to transfer 300 terabytes effectively and securely.

Hank suggested a 1 Gb/sec dedicated link between IUCC and SCRC using existing carrier infrastructure. IUCC contacted the NIH GDC portal staff and benchmarked the application to make sure that it worked and that the connectivity from Israel could handle a sustained 1 Gb/sec load.

In October 2017, the line passed the tests and was put to work. “Initial estimates indicated that what started out to be a very unfeasible multi-year run, was now a very workable three-month project,” Eran recalls. “Now that we see how well it is working, we will likely extend it a bit and try to obtain more data.”

According to Dr. Nitzan Kol and Omri Nayshool, the SCRC researchers who handle the transfer, the overall system configuration of Sheba and the NIH doesn’t allow them to reach the full 1 Gb/sec speed, but it still more than satisfies the needs of the task at hand. “The NIH’s TCP infrastructure definitely isn’t ideal. We see peaks of 800 Mb/sec and some valleys of 600 Mb/sec. But in the end, this translates into between 4 to 6 terabytes per day which will let us transfer the 300 terabytes we aimed for in the planned time frame.”
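The figures quoted in this story can be checked with some simple arithmetic. The sketch below is only an illustration, not a tool used by SCRC or IUCC: the function names are hypothetical, a terabyte is assumed to be 10^12 bytes, and protocol overhead, retries, and idle periods are ignored.

```python
# Back-of-the-envelope transfer-time arithmetic for the figures in the article.
# Assumptions (not from the article): 1 TB = 10**12 bytes, 1 Mb = 10**6 bits,
# and an idealized link with no protocol overhead or idle time.

BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 86_400


def days_at_rate(total_tb: float, mbit_per_sec: float) -> float:
    """Days needed to move total_tb terabytes at a sustained rate in Mb/sec."""
    total_bits = total_tb * BYTES_PER_TB * 8
    seconds = total_bits / (mbit_per_sec * 10**6)
    return seconds / SECONDS_PER_DAY


def implied_mbit_per_sec(total_tb: float, hours: float) -> float:
    """Average rate in Mb/sec implied by moving total_tb terabytes in the given hours."""
    total_bits = total_tb * BYTES_PER_TB * 8
    return total_bits / (hours * 3600) / 10**6


def days_at_daily_volume(total_tb: float, tb_per_day: float) -> float:
    """Days needed to move total_tb terabytes at a given daily volume."""
    return total_tb / tb_per_day


if __name__ == "__main__":
    print(f"Rate implied by 43,800 hours: {implied_mbit_per_sec(300, 43_800):.1f} Mb/sec")
    print(f"300 TB at a full 1 Gb/sec:    {days_at_rate(300, 1000):.1f} days")
    print(f"300 TB at 4 TB/day:           {days_at_daily_volume(300, 4):.0f} days")
    print(f"300 TB at 6 TB/day:           {days_at_daily_volume(300, 6):.0f} days")
```

Under these assumptions, the original 43,800-hour estimate corresponds to an average of about 15 Mb/sec, a fully used 1 Gb/sec link would move 300 terabytes in about 28 days, and the 4 to 6 terabytes per day that the researchers report works out to 50 to 75 days, in line with the three-month estimate. The reported daily volume is lower than the 600 to 800 Mb/sec figures alone would suggest, presumably because the transfer does not run at those rates around the clock.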

Seminal research to restore hope

The Sheba Cancer Research Center’s work currently focuses on three main areas: RNA modification and its role in the regulation of gene expression and cell fate; the study of transposable genetic elements (TEs) in cancer, also known as “jumping genes,” which are DNA sequences that move from one location on the genome to another; and sequencing and genomic studies for personalized medicine, relevant to cancer and rare genetic diseases. The GDC data is essential for the Center’s transposon research and whole genome sequencing of specific subsets of cancer patients, since both involve enormous datasets and analysis.

Due to the nature of the transferred GDC datasets, there is no question of sustainability over time. “This type of data never becomes ‘obsolete’,” says Professor Rechavi. “The only issue, now that we have a way in place to transfer it thanks to IUCC, becomes one of capacity. The more data we wish to obtain the more storage we need. It’s a matter of balancing resources.” The relevance and usefulness of the data extend beyond the Sheba Cancer Research Center and have the potential to be an important catalyst in cancer research taking place throughout Israel. “In principle,” says Professor Rechavi, “SCRC will be happy to share the publicly available data items among the genomic datasets which were downloaded in this scope.”

Published: 01/2018
