The Powerhouse Museum of Science and Design in Sydney, Australia, has begun using the Reuters Open Calais API (our coverage) to tag its collection. The museum’s online collection database houses some 66,303 objects, so tagging them all by hand would be quite a task. By using the Open Calais web service, the museum is able to automate much of the process.

That the museum has so much of its collection online is impressive in its own right. About 70% of the museum’s electronically documented collection is online in the database, which went live in June 2006. Museum objects are searchable, taggable (by humans) and painstakingly described.

However, there are so many objects that, even with users helping to tag them, many remain untagged. Sebastian Chan, the Manager of Web Services at the museum, told us that Open Calais is being used to complement the people-powered tagging they’ve had running for two years. “What Open Calais lets us do now is connect people, places and companies across our collection and has already revealed many new pathways through our dataset (navigating by designer or inventor is now much easier for example),” he said.
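To make the idea of “pathways” concrete, here is a minimal sketch of how typed entity tags could link objects to one another. It does not call Open Calais and is not the museum’s implementation; the object IDs, the type labels and the name “Designer A” are invented for illustration, and the only assumption is that each object ends up with a list of (type, name) tags.

```python
from collections import defaultdict

# Hypothetical records: object ID -> list of (entity_type, entity_name) tags.
# The IDs, type labels and "Designer A" are made up for this sketch.
extracted = {
    "object-1": [("Company", "Speedo"), ("City", "Perth"), ("Person", "Designer A")],
    "object-2": [("Company", "Speedo"), ("City", "Sydney")],
    "object-3": [("Person", "Designer A"), ("City", "Sydney")],
}


def build_entity_index(records):
    """Map each (type, name) entity to the set of object IDs tagged with it."""
    index = defaultdict(set)
    for object_id, entities in records.items():
        for entity in entities:
            index[entity].add(object_id)
    return index


def related_objects(index, records, object_id):
    """Return the other objects that share at least one entity with object_id."""
    related = set()
    for entity in records[object_id]:
        related |= index[entity]
    related.discard(object_id)
    return related


index = build_entity_index(extracted)

# Navigate by designer: every object tagged with the same person.
print(sorted(index[("Person", "Designer A")]))

# Follow shared entities outward from a single object.
print(sorted(related_objects(index, extracted, "object-2")))
```

Because the index is keyed on both type and name, a person and a company that happen to share a name would stay separate, which matters when the tags drive navigation.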

The automatically generated tags at right were created by the API for some swim wear designed by Speedo for the 1991 Australian swimming team that competed at the World Swimming Championships in Perth. Open Calais correctly identified the important locations in the document (Perth, where the competition took place, and Sydney, where Speedo is based) as well as an important corporation (Speedo). It also picked up the name of the designer and the name of the person who owned the suits before the museum.

However, as you can see, the API made some mistakes too: it classified “World Championships” as a company, and mistook the general text “international swimming organisation” for an actual organization. It missed the actual organization (FINA) and probably should have picked up the MacRae Knitting Mills company, which was a predecessor to Speedo. Further, because Open Calais is built around people, places, and companies, general information about items may be lost on it. Tags that would be obvious to humans, such as swimming, swim wear, Olympics, or the year 1991, are beyond the scope of Open Calais.

“These errors and others like them reveal Open Calais’ history as Clearforest in the business world,” said Chan. “The rules it applies when parsing text as well as the entities that it is ‘aware’ of are rooted in the language of enterprise, finance and commerce.” On the other hand, according to Chan, the technology has already revealed “many new connections between objects,” even though it has so far been deployed only very sparingly across the collection.

Powerhouse’s use of Open Calais may be the first large-scale deployment of the technology across a public data set. It will be interesting to see the results as they evolve. “It is important to remember that there is no way that this structured data could be generated manually – the volume of legacy data is too great and the burden on curatorial and cataloguing staff would be too great,” Chan noted.