Dead Simple Data Mining with Data Science Toolkit

The Data Science Toolkit is a collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn the political leanings of a particular neighborhood, find the names of all the people mentioned in a text and more. Pete unveiled it today at GigaOM Structure Big Data in New York City.

It’s available as a Web service, or you can download a virtual machine and host it on your own server.

The tools included at this time are:

  • Street Address to Coordinates – Calculates the latitude/longitude coordinates for a postal address.
  • File to Text – Converts PDFs, Word Documents, Excel Spreadsheets to text. Recovers text from JPEG, PNG or TIFF images of scanned documents.
  • Coordinates to Political Areas – Returns the country, region, state, county, constituencies and neighborhood a point is inside.
  • Geodict – Geodict pulls country, city and region names from unstructured English text, and returns their coordinates.
  • IP Address to Coordinates – Calculates the country, state, city and latitude/longitude coordinates for an IP address.
  • Text to Sentences – Removes any parts of the text that look like boilerplate instead of real sentences.
  • HTML to Text – Returns the full text that would actually be displayed in the browser when an HTML document is rendered.
  • HTML to Story – Takes an HTML document representing a news article or similar page, and extracts just the story text.
  • Text to People – Spots text fragments that look like people’s names or titles, and guesses their gender where possible.
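To give a feel for what calling one of these tools over HTTP might look like, here is a minimal Python sketch. Everything specific in it is an assumption rather than something stated in this post: the base URL, the endpoint names (such as ip2coordinates), the idea that the input goes in the URL path, and the JSON response. The helper names build_url and call_tool are made up for illustration. Check the toolkit's own documentation before relying on any of it.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: address of the hosted service. Swap in your own server's
# address if you run the virtual machine yourself.
BASE_URL = "http://www.datasciencetoolkit.org"


def build_url(endpoint, value, base_url=BASE_URL):
    """Build a GET URL of the form <base>/<endpoint>/<percent-encoded value>.

    The endpoint name and URL layout are assumptions for illustration.
    """
    return "{}/{}/{}".format(base_url.rstrip("/"), endpoint, quote(value, safe=""))


def call_tool(endpoint, value, base_url=BASE_URL, timeout=10):
    """Fetch the URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(build_url(endpoint, value, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical call: look up coordinates for an IP address.
    print(call_tool("ip2coordinates", "8.8.8.8"))
```

If you host the toolkit yourself, the same sketch should work by passing your server's address as base_url, for example call_tool("ip2coordinates", "8.8.8.8", base_url="http://localhost").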

You can learn about the sources of these tools here.

According to Pete, “It’s essentially a specialized Linux distribution, with a lot of useful data software pre-installed and exposing a simple interface.”

If you want to do intensive data mining, you’ll probably want your own server. The Data Science Toolkit is available as either a VMware machine or an Amazon EC2 image. You can find out more about this here. Alternatively, you can find the source on GitHub.

Photo by Katherine Tompkins
