Last week, Amazon welcomed the 1000 Genomes Project data to Amazon S3 as part of the Obama administration's big data initiative (PDF). The initiative is far-reaching and likely to have an impact on a number of businesses. Focusing just on the 1000 Genomes Project, though, I wonder if this might be an opportunity for startups to provide tools or services around the data.

The 1000 Genomes Project is attempting to build "a comprehensive resource on human genetic variation" to "find most of the genetic variations that exist in people" (PDF) by studying DNA collected "from many people whose ancestors were from various parts of the world, and then putting all of this information in scientific databases on the Internet."

Data Challenges

The 1000 Genomes Project may not be the biggest of big data projects, but it certainly fits the bill as big data. Despite the name, the project has actually collected the full genomic sequence from more than 1,700 people, and continues to add more samples. For anyone concerned about data privacy, the donors are "mostly anonymous," and each donor has consented to participate. Right now, the 1000 Genomes data clocks in at 200TB, which has made it a challenge to distribute and a challenge for companies that work with the data to obtain.
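To put the 200TB figure in perspective, here is a rough back-of-the-envelope sketch of how long a full copy would take over a network link. The link speeds and the 80% efficiency factor are illustrative assumptions, not figures from the project or from Amazon, and `transfer_days` is a hypothetical helper written for this example.

```python
def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate days needed to move size_tb terabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually usable for the transfer (protocol overhead, contention, etc.).
    """
    size_bits = size_tb * 1e12 * 8                 # decimal terabytes -> bits
    effective_bps = link_mbps * 1e6 * efficiency   # megabits/s -> usable bits/s
    seconds = size_bits / effective_bps
    return seconds / 86_400                        # seconds per day


DATASET_TB = 200  # size quoted in the article

# Hypothetical link speeds, chosen only for illustration.
for label, mbps in [("100 Mbps", 100), ("1 Gbps", 1_000), ("10 Gbps", 10_000)]:
    print(f"{label:>8}: {transfer_days(DATASET_TB, mbps):6.1f} days")
```

Under these assumptions, a dedicated 1 Gbps link would need a little over three weeks to pull down the whole data set, which suggests why working with the data where it already lives is attractive.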

Dr. Brandon Colby, CEO and Medical Director of Existence Genetics, says his company has been working with the data for eight or nine months already. The size of the data, says Dr. Colby, has been a roadblock.

Dr. Colby says that the company uses the 1000 Genomes data as a baseline data set to test its tools for analyzing genomes. The company's clients are those who are "healthy, and looking to stay healthy." They submit their DNA to find out whether they have markers that indicate risk factors for heart disease, prostate cancer, and so on. Existence Genetics provides a report to the client and their health care professional, which is used to help take steps to prevent disease.

The standard for genetic research for the past 30 years, says Dr. Colby, has been DNA chips that hold about 5MB of data and "tens of thousands of data points." That's because older methods of genetic testing looked at only specific parts of the genome and ignored everything else. The data from the 1000 Genomes Project - and from other testing being done currently - covers the complete genome. That yields a file for each individual that can be between 5GB and 1TB, which presents "a lot of technical issues to get beyond," says Dr. Colby.

Startup Opportunity?

Herein may lie the opportunity for some enterprising data scientists, but Dr. Colby says it would be "very difficult to capitalize on." The problem, he says, is that the 1000 Genomes Project is providing data in a format that is very new to geneticists. "Old-school geneticists" - which he describes as those who studied in the '90s or earlier - are used to DNA chips that contain a small subset of the information contained in the 1000 Genomes data.

There's also the fact that this is a very niche market, Dr. Colby says. But he says that if you could find the right team to produce the right tools for understanding the 1000 Genomes data, and other data like it, it could be a good opportunity.

Donnie Berkholz, an analyst with RedMonk who focuses on big data, agrees that there's an opportunity here. "I think the biggest opportunity lies in integrating this data with the other public datasets on AWS, as well as private, in-house data. Once you're at this scale of data, simply moving it around becomes infeasible, so this announcement is a big deal because it puts all this data in a place where so much other data and computational power already exists."

He also notes that while some of the academic research community is "cautious" about public clouds, "bioinformatics is bucking that trend. My expectation is that most of the researchers working with this scale of data are already using public clouds, because you simply can't work effectively with this scale of data in most environments."

So it could be that the 1000 Genomes data is the right data set, in the right place, just waiting for the right team.