Of the approximately 6,000 languages alive in the world today, 60 percent or more are said to be dying out. The majority of the world's languages are, in fact, "minority" languages, used in the shadow of a more politically powerful tongue.
On St. Patrick's Day, Prof. Kevin Scannell of St. Louis University launched a project called Indigenous Tweets. Using statistical web-crawling software he wrote, called An Crúbadán, Scannell identifies which minority languages are being tweeted, by whom and how.
Michael Schade, one of Scannell's students, explains the need.
"Twitter describes itself as 'the best way to discover what's new in your world,' but there is a fundamental issue with this: 'world' is presently limited by the inclusion of only a handful of languages. Although people can tweet in any language on Twitter, finding users who speak the same language is a difficult, or even a seemingly impossible, task. This is especially true for...minority languages."
Scannell's web-crawling software, An Crúbadán, first seeds Twitter searches with common but distinctive words from each of the 500 languages it covers, then crawls Twitter. It finds users who speak these languages and ranks them. Scannell then recalculates trending topics for each language specifically, and the results are imported into the Indigenous Tweets site.
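The seed-and-rank idea can be illustrated with a minimal sketch. This is not Scannell's code: the seed words, the `rank_users` function, the `min_hits` threshold and the sample tweets are all hypothetical, and the ranking here (counting each user's matching tweets) is only one plausible choice, since the article does not say how users are ranked.

```python
from collections import Counter

# Hypothetical seed list: a few common Welsh words that are rare in English.
SEED_WORDS = {"cy": {"mae", "yn", "wedi", "hefyd"}}


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [word.strip(".,!?\"'").lower() for word in text.split()]


def rank_users(tweets, seed_words, min_hits=2):
    """Rank users by how many of their tweets contain at least
    `min_hits` seed words (an assumed, simplified ranking rule)."""
    counts = Counter()
    for user, text in tweets:
        hits = sum(1 for word in tokenize(text) if word in seed_words)
        if hits >= min_hits:
            counts[user] += 1
    return counts.most_common()


# Made-up sample data standing in for crawled tweets.
sample_tweets = [
    ("alice", "Mae hi'n bwrw glaw yn Aberystwyth heddiw"),
    ("bob", "Mae'r tywydd yn braf hefyd"),
    ("alice", "Dw i wedi blino hefyd"),
    ("carol", "Going to the shop later"),
]

ranking = rank_users(sample_tweets, SEED_WORDS["cy"])
print(ranking)
```

Requiring more than one seed word per tweet is a guard against false matches, since a single short word can easily appear in another language.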
The top minority languages on Twitter are currently Haitian Creole, Basque and Welsh.
On the project blog, also called Indigenous Tweets, Scannell says he hopes the project will help minority-language speakers keep their languages vital by making "it easier for speakers of indigenous and minority languages to find each other in the vast sea of English, French, Spanish, and other global languages that dominate Twitter."
As Schade points out, Twitter attempts to classify the language of all tweets, but it sometimes does a poor job of it, especially with minority languages. Because Scannell has amassed large corpora and developed a technology geared toward identifying languages that might have little readily available data to start with, he has seen much higher accuracy in language identification and analysis.
This research could help Twitter differentiate its users with greater specificity and allow individual users' communities to grow based on, among other things, language use.