Amazon is turning to the public for help, asking for public data sets in an attempt to create a cloud data service that provides what it describes as a "convenient way to share, access, and use public data."
Called AWS Hosted Public Data Sets, the service will enable you to use public data within your Amazon EC2 environment. Select public data sets will be hosted on AWS for free as Amazon EBS snapshots.
While there are publicly available data sets, accessing them can be expensive and tedious. For instance, Project Gutenberg offers its eBook files as a download, but to get a copy you can expect to wait 48 hours for the download to complete (based on a 1 Mbit/s DSL connection and a 14.5 GB zip file). If you want the MP3 version, you'll have a nine-day wait to download the 91.5 GB file.
However, as there is no indication that Project Gutenberg will be added to AWS, we've calculated how long it would take to download and upload the 80 GB UGI Virtual Conformer Library, one of the listed data sets AWS plans to host.
Using a residential cable provider in California, it would take 22 hours 36 minutes to download, and 3 days 36 minutes to upload to a server in the same state. However, if the server were in New York and we accessed it from California, it would take 3 days 42 minutes to download, and 7 days 14 hours to upload. Clearly inefficient.
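Estimates like these come down to dividing file size by link speed. As a rough sketch only (the function names are hypothetical, and it assumes decimal units and an ideal link with no protocol overhead or congestion), the 1 Mbit/s DSL figures for the Project Gutenberg files can be approximated like this:

```python
def transfer_time_seconds(size_gb: float, rate_mbit_per_s: float) -> float:
    """Ideal transfer time in seconds.

    Assumes decimal units (1 GB = 8,000 Mbit) and a link that sustains
    its nominal rate with no protocol overhead. Both are simplifications.
    """
    if rate_mbit_per_s <= 0:
        raise ValueError("rate must be positive")
    size_mbit = size_gb * 8 * 1000
    return size_mbit / rate_mbit_per_s


def format_duration(seconds: float) -> str:
    """Render a duration as days, hours, and whole minutes (rounded down)."""
    total_minutes = int(seconds // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


if __name__ == "__main__":
    # File sizes and the 1 Mbit/s DSL rate are the ones quoted in the article.
    for label, size_gb in [("eBooks zip", 14.5), ("MP3 archive", 91.5)]:
        seconds = transfer_time_seconds(size_gb, 1.0)
        print(f"{label}: {format_duration(seconds)}")
```

Under these idealized assumptions the results come out somewhat lower than the estimates quoted above, which is what you would expect if real-world throughput falls short of the nominal line rate.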
People have been searching for better ways to access public data sets for some time, and AWS Hosted Public Data Sets may just be the answer they've been looking for, allowing anyone to do the type of computing that in the past has been limited to large organizations with lots of money.
Data sets that Amazon is currently working on include annotated Human Genome data, the PubChem and UGI Virtual Conformer libraries, the U.S. Census, various labor statistics, and various economic and transportation databases.
AWS will continue to add to the collection over time, and this is where you come in.
If you have a public data set and hold the rights to distribute it, you can submit a request on the AWS Hosted Public Data Sets site to have it included.
This is huge.