How grid computing helps scientists to discover new diseases
Respiratory infections are the main reason why children under five end up in hospital. However, in up to 40% of cases it is not possible to identify the exact cause of the disease, which suggests that some of the viruses responsible are still unknown to science.
Identifying as many viruses as possible improves the chances of a correct diagnosis and helps to determine the best treatment for patients. Knowing which virus is responsible for which disease is also very important for detecting potential epidemics and for assessing the seriousness of viral infections.
The question is: how do we find new species of virus?
Looking for viruses
Lia van der Hoek and colleagues from the Virus Discovery Unit at the Academic Medical Centre of the University of Amsterdam (AMC) have been working on VIDISCA – a method to spot new viruses from previously unidentified genetic sequences.
The hunt for new viruses starts at the hospitals, with the collection of nose and throat swabs from affected patients. Back at the lab, the first step of the VIDISCA method is to remove residual cells and other biological material, to enrich the sample’s viral genetic material. The genetic sequences in the sample are then amplified with standard laboratory techniques.
Success first came in 2004, when the team reported the discovery of the coronavirus NL63, which is implicated in croup – a disease that causes throat swelling and coughing in children under six years old. The virus was spotted in samples taken from a 7-month-old baby admitted to a Dutch hospital with symptoms of acute respiratory infection.
The genome of the coronavirus NL63 was sequenced and the analysis showed that the virus was a new species with distinctive features. This research was published in the journal Nature Medicine.
The VIDISCA haystack
Over the past six years Lia’s work on the VIDISCA method has benefitted from developments in so-called next-generation sequencing techniques. The improved version, dubbed VIDISCA-454, was introduced in 2009.
The result of a VIDISCA-454 analysis is a haystack of information that includes – somewhere – the genetic sequence of the unknown virus. Picking up the needle is difficult. One way of solving the conundrum is to compare the mystery sequences to known viruses catalogued in massive reference databases, such as GenBank.
The National Center for Biotechnology Information (NCBI) has created the BLAST tool to compare given sequences against databases via the web. But uploading data from VIDISCA-454 to this portal proved to be virtually impossible, given that the average experiment produces approximately 400,000 sequences.
Grid computing to the rescue
Looking for a solution, Lia contacted Antoine van Kampen, head of the AMC bioinformatics department, who assigned Barbera van Schaik to the problem. Barbera developed a workflow – the sequence of computational steps required to perform an analysis – to allow BLAST to run on grid computing resources of the Dutch e-science grid.
The workflows and databases were made available to the Virus Discovery Unit via the e-BioInfra platform developed and operated by the e-bioscience group of Silvia Olabarriaga.
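The general idea behind spreading a large sequence comparison over many machines can be illustrated with a minimal Python sketch. It is not the AMC workflow itself, whose details are not given here. It assumes the sequences are stored in FASTA format and that each chunk can be searched independently; the function names `parse_fasta` and `chunk_records` and the chunk size are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is a (header, sequence) pair; this layout is an assumption.
Record = Tuple[str, str]


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header = line[1:]
            parts = []
        elif header is not None:
            parts.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller, chunk


# Tiny made-up example: five sequences split into chunks of two.
example = [">s1", "ACGT", "AC", ">s2", "GG", ">s3", "TTT", ">s4", "A", ">s5", "C"]
chunks = list(chunk_records(parse_fasta(example), chunk_size=2))
for number, chunk in enumerate(chunks, start=1):
    print(f"job {number}: {[header for header, _ in chunk]}")
```

Each chunk could then be submitted as a separate job, so that many comparisons run at the same time instead of one after another.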
With these tools at hand, Lia’s team is able to analyse a VIDISCA-454 experiment within 24 hours, compared to weeks of intensive manual work. A test with 1,444 samples produced 4,783,684 sequences and showed that the analysis can be repeated within 14 hours, compared to 17 days if it were run sequentially on a local server.
Research is now ongoing on the meaning of the 4,000,000+ sequences identified by VIDISCA-454. Lia now hopes to find new viruses linked to respiratory infections, diarrhoea, meningitis, encephalitis and other serious diseases.
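To put the test figures in perspective, the short Python calculation below works out the speedup they imply. The inputs (17 days, 14 hours, 4,783,684 sequences) come from the test described above; the helper name `speedup` is a hypothetical choice, and the result is plain arithmetic rather than a reported measurement.

```python
HOURS_PER_DAY = 24


def speedup(sequential_hours: float, parallel_hours: float) -> float:
    """Return how many times faster the parallel run is."""
    if sequential_hours <= 0 or parallel_hours <= 0:
        raise ValueError("durations must be positive")
    return sequential_hours / parallel_hours


sequential_hours = 17 * HOURS_PER_DAY  # 17 days on a local server
grid_hours = 14                        # same analysis on the grid
total_sequences = 4_783_684            # sequences in the test

factor = speedup(sequential_hours, grid_hours)
print(f"Speedup: roughly {factor:.0f}x")
print(f"Grid throughput: about {total_sequences / grid_hours:,.0f} sequences per hour")
```

In other words, the grid run finished roughly 29 times faster than the estimated sequential run.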
L van der Hoek et al. (2004) Identification of a new human coronavirus. Nat Med 10: 368–373.
M de Vries et al. (2011) A sensitive assay for virus discovery in respiratory clinical samples. PLoS One 6(1): e16118.
AC Luyf et al. (2010) Initial steps towards a production platform for DNA sequence analysis on the grid. BMC Bioinformatics 11: 598.