Jun 2016

Mining a genetic goldmine

‘I’ve died in the meantime.’ In the last couple of years, confrontational campaign texts such as these have been drawing attention to ALS, a neurodegenerative disorder which affects mainly middle-aged people.

“It often starts with weakness in the arms, legs or speech muscles,” Jan Veldink explains. “That weakness becomes progressively worse; there’s no treatment. After three to five years, the respiratory muscles have weakened to the extent that people die from it.”

Veldink is a physician and professor affiliated to the ALS Center at UMC Utrecht (University Medical Center), where approximately 80 percent of all Dutch ALS patients are treated. Two of those patients, both businessmen, noticed a freezer full of DNA samples at the Center.

When they heard that no funds were available for more detailed research into the genetic causes of ALS, they offered to help. Project MinE was born.

Crowdfunding

“We know that ALS is hereditary,” Veldink explains, “yet not hereditary to the extent that looking at family DNA suffices. MinE, therefore, involves a large-scale DNA study, in which patients are compared to healthy people.”

How do you source that DNA? The past few years have seen some large-scale population screening projects with DNA sampling. However, for a rare disease such as ALS, the entire genome needs to be mapped out in detail, as no one knows which parts of the DNA are important. The project needs material from at least 15,000 patients and 7,500 reference subjects.

That large control group is vital, as rare genetic variation can differ strongly from country to country and even between regions within a country. Project MinE was set up with the promptness you might expect from entrepreneurs. For the sequencing (deciphering) of the DNA samples, a favorable contract was concluded with Illumina, an American company.

Crowdfunding campaigns have been set up in every participating country. In addition, funds were received from the Dutch ALS foundation, which received a lot of donations thanks to initiatives such as the Ice Bucket Challenge and the Amsterdam City Swim.

“The countries that are fully participating in both fundraising and supplying samples are the Netherlands, Belgium, England, Ireland and the United States. Countries such as Italy, Australia and France are also making good progress. Finally, Germany and Sweden do their own sequencing at home, but they do share their data with the MinE project,” says Veldink.

2 million gigabytes

Raising funds and recruiting participants was just the first of the challenges. Sequencing just one DNA sample leaves you with a file of 75 to 100 gigabytes. Multiply that by 22,500 and you are talking about roughly 2 million gigabytes: think of a stack of hard disks twice the height of Niagara Falls.
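For readers who want to check the arithmetic, the estimate can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: the sample counts and the file size range come from the figures above, while the variable and function names are purely illustrative.

```python
# Back-of-the-envelope check of the storage estimate.
# Sample counts and per-sample file sizes are the figures quoted in the article;
# all names below are illustrative.

SAMPLES = 15_000 + 7_500        # patients + healthy reference subjects
GB_PER_SAMPLE = (75, 100)       # size range of one sequenced sample, in gigabytes


def total_gigabytes(samples: int, gb_per_sample: int) -> int:
    """Total storage needed if every sample produces a file of the given size."""
    return samples * gb_per_sample


low, high = (total_gigabytes(SAMPLES, gb) for gb in GB_PER_SAMPLE)
print(f"{SAMPLES:,} samples -> {low:,} to {high:,} GB")
```

The result is a range of about 1.7 to 2.25 million gigabytes, which is where the figure of roughly 2 million comes from.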

How do you store all that data? And more importantly: where do you find the computing power to track down the vital information?

“Many of these DNA projects run aground when it comes to the computing,” Veldink explains. “Fortunately, it didn’t take us long to find SURFsara.”

Maarten Kooyman, consultant at SURFsara, confirms that this is indeed no small task.

“It’s an enormous amount of data and, on top of that, split into tens of thousands of separate files. If you were to analyze the total data set on a PC, it would take you between 600 and 700 years to complete. Yet at the Life Science Grid we can do this in a few weeks, thanks to a few nifty tricks,” he says.

Direct connection

In addition, SURFnet is working on a direct network connection to Illumina, allowing data to be forwarded to the Netherlands immediately, i.e. while it is being deciphered.

“We prefer to keep the genetic data on European soil, but so far, Illumina has been sending this type of data on hard disks through the post. This is not effective and the security is far from ideal,” says Veldink.

A network connection is an interesting challenge, says Kooyman. “Due to the sheer volumes involved, there’s always a risk of a few bumps in the road. And when you work with such large packages of data, the smallest bump turns into a mountain.

“Together with SURFnet, we look at how to optimize the network, possibly with international partners, to set up a fast direct network connection.”

Soon, when the data set is complete (the progress bar is currently at 23 percent), MinE will have a DNA database that is unique in its combined size and quality. The data from the 7,500 healthy reference subjects are of interest for a number of research projects.

“Currently, the number of sequencing studies is enormous,” Veldink explains. “This often involves research into common diseases. Examples include dementia, autism and schizophrenia.

“They’re all illnesses for which reference groups are important, as you need to look further than family relationships alone. The Project MinE data will also be available for this type of research.”