Making Big Data Manageable

Four billion years of evolution have given humans an electrochemical system of daunting complexity. The roughly 3 billion base pairs of the human genome contain an estimated 20,000 to 25,000 protein-coding genes, each of which specifies a protein the body needs to function. A minute alteration in the sequence can jeopardize our well-being.

Researchers have made enormous strides in the last decade in their ability to determine the sequence of DNA base pairs efficiently and economically. “The challenge now is finding ways to manage the enormous amount of data that sequencing generates,” says Stephen Rich, Harrison Scholar Professor of Public Health Sciences and director of U.Va.’s Center for Public Health Genomics. “This is an essential step if we are to use this data to better understand disease, to provide better diagnoses, and to design more effective and potentially personalized treatments.”

“Each sequence can generate as much as 150 gigabytes of data,” says Ira Hall, a member of the center and an assistant professor of biochemistry and molecular genetics. “While that size is manageable, the number of sequences required by a typical study could produce more data than can be readily stored or analyzed given a university’s typical computing environment.” To complicate matters further, this data must be interpreted in the context of data collected in other huge databases, with different organizational schemes, formats, and biases.

Center researchers are part of a global effort to identify the genetic variants that cause disease, and they are also developing computational tools to manipulate and compare massively complex data sets. For instance, Aaron Quinlan, a resident member of the center and an assistant professor of public health sciences, is studying the relationship between cancer and the structural arrangement of genetic material. He is also developing software to accelerate the discovery, annotation, and interpretation of genetic variations.

“Overcoming the bioinformatics bottleneck will require new algorithms for data storage, analysis, and integration as well as international agreements on data sharing and ethics,” says Hall. “Inventing and implementing these methods is a major focus of genomics research at U.Va.”