Dead Simple Data Mining with Data Science Toolkit

The Data Science Toolkit is a collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn the political leanings of a particular neighborhood, find all the names of people mentioned in a text, and more. He unveiled it today at GigaOM Structure Big Data in New York City.

It’s available as a Web service, or you can download a virtual machine and host it on your own server.

The tools included at this time are:

  • Street Address to Coordinates – Calculates the latitude/longitude coordinates for a postal address.
  • File to Text – Converts PDFs, Word documents and Excel spreadsheets to text. Recovers text from JPEG, PNG or TIFF images of scanned documents.
  • Coordinates to Political Areas – Returns the country, region, state, county, constituencies and neighborhood a point is inside.
  • Geodict – Geodict pulls country, city and region names from unstructured English text, and returns their coordinates.
  • IP Address to Coordinates – Calculates country, state, city and latitude/longitude coordinates for IP addresses.
  • Text to Sentences – Removes any parts of the text that look like boilerplate instead of real sentences.
  • HTML to Text – Returns the full text that would actually be displayed in the browser when an HTML document is rendered.
  • HTML to Story – Takes an HTML document representing a news article or similar page, and extracts just the story text.
  • Text to People – Spots text fragments that look like people’s names or titles, and guesses their gender where possible.
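
To give a rough idea of what calling tools like these from code might look like, here is a minimal Python sketch. It is not taken from the toolkit's documentation: the base URL, the endpoint paths (guessed from the tool names above), the idea of passing the input in the URL path, and the helper names `build_url` and `call_tool` are all assumptions made for illustration, so check them against your own instance before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: address of a toolkit instance you host yourself.
BASE_URL = "http://localhost:8080"

# Assumption: endpoint paths guessed from the tool names in the list above.
ENDPOINTS = {
    "Street Address to Coordinates": "street2coordinates",
    "IP Address to Coordinates": "ip2coordinates",
    "Text to People": "text2people",
}


def build_url(tool, payload, base_url=BASE_URL):
    """Build a GET URL for a tool, percent-encoding the input text."""
    path = ENDPOINTS[tool]
    return f"{base_url.rstrip('/')}/{path}/{quote(payload, safe='')}"


def call_tool(tool, payload, base_url=BASE_URL, timeout=10):
    """Fetch the URL and decode the body, assuming the server returns JSON."""
    with urlopen(build_url(tool, payload, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; no network request is made here.
    print(build_url("Text to People", "Pete Warden"))
```

Keeping URL construction separate from the network call makes it easy to point the same code at either the hosted service or a virtual machine on your own server by changing `base_url`.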

You can learn about the sources of these tools here.

According to Pete, “It’s essentially a specialized Linux distribution, with a lot of useful data software pre-installed and exposing a simple interface.”

If you want to do intensive data mining, you’ll probably want your own server. The Data Science Toolkit is available as either a VMware machine or an Amazon EC2 image. You can find out more about this here. Alternatively, you can find the source on GitHub.

Photo by Katherine Tompkins
