Head to Head Comparison of Text Extraction Algorithms

A few months ago we linked to Tomaž Kovačič’s overview of text extraction algorithms. Now Kovačič has posted an evaluation of several text extraction algorithms and services, including Boilerpipe, NCleaner, the Python and Node.js versions of Readability, and the Extractiv API.

To conduct his evaluations, Kovačič used the CleanEval data set, which includes 681 documents, and a Google News data set with 621 documents harvested by the authors of Boilerpipe.
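For readers who want a feel for how an extractor can be scored against a hand-cleaned reference document, here is a minimal sketch. It assumes token-level precision, recall and F1 computed over bags of words, which is a common choice for this kind of comparison; this article does not say which metric Kovačič actually used, and the function names `tokenize` and `extraction_scores` are hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. A real evaluation
    # may tokenize differently (punctuation handling, stemming, etc.).
    return text.lower().split()


def extraction_scores(extracted: str, gold: str) -> dict[str, float]:
    """Compare extractor output with a hand-cleaned gold text.

    Uses bag-of-words overlap: a token counts as correct as many times
    as it appears in both texts.
    """
    extracted_counts = Counter(tokenize(extracted))
    gold_counts = Counter(tokenize(gold))

    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((extracted_counts & gold_counts).values())
    n_extracted = sum(extracted_counts.values())
    n_gold = sum(gold_counts.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up strings for illustration only, not taken from either data set.
    extracted = "Home News the quick brown fox"
    gold = "the quick brown fox jumps"
    print(extraction_scores(extracted, gold))
```

In this made-up example the extractor kept two tokens of navigation text ("Home", "News") and dropped one word of the article, so precision is 4/6, recall is 4/5 and F1 is 8/11. Averaging such scores over every document in a data set gives one number per extractor.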


Metric for the Google News data set

A few notes:

  • NCleaner did better on its own CleanEval data set than it did on the Google News data set, but Boilerpipe did well on both sets.
  • Kovačič was surprised by Readability’s poor performance and notes the discrepancy between the two ports. He thinks the original JavaScript version may do better.
  • The commercial APIs had the most consistent results.

Image by Andrew Mason
