If you are looking for large content repositories, you probably can’t get much larger than the article archive of the Associated Press. Today they announced the launch of a content analysis tool that searches the millions of articles in their archives to create custom archive products for their customers. Users can query for particular keywords, and the AP can use the search query traffic to spot trending topics and deliver article collections to particular B2B customers. For example, they could create reference collections on a particular subject or moment in time. The project makes use of a solution from MarkLogic, a major Big Data vendor whose software is used for this type of purpose by many different kinds of publishers, such as LexisNexis.
We have written about prior efforts by the AP to help modernize their archives, such as this project to provide non-profits with free information feeds.
The AP didn’t start out with the MarkLogic solution. They first tried to implement a more traditional relational database structure, only to run into problems. Their archives are in XML, which made it difficult to design the right kind of data structures. Plus, they didn’t have a consistent metadata collection across the archives. The MarkLogic implementation took 16 weeks from start to finish and was the first time that the AP had made use of MarkLogic’s services. “With this new tool, we are able to run complex, Boolean searches across millions of articles in our content archive and get back precise returns in seconds or minutes instead of days or weeks,” said Amy Sweigert, AP’s vice president of information management. This much quicker response time is already transforming their B2B product offerings. For example, they are able to tackle other Big Data issues, bring their content into the 21st century, and further enrich their news content by managing both structured and unstructured data in real time.
MarkLogic has a free license that can be used for testing and development. Deployment can be expensive, in the tens or hundreds of thousands of dollars.