Spidering the “Dark Web”

For some, the term “dark web” simply means all the online data that search engine spiders can’t reach, crawl, or index. For the University of Arizona’s AI Lab, however, the “Dark Web” is a research project that studies the social phenomenon of terrorism using techniques including social network analysis, content analysis, link analysis, web metrics, video analysis, data and text mining, sentiment and affect analysis, and authorship analysis. Using sophisticated mathematical tools, the project aims to collect all web content generated by international terrorist groups, including content found on web sites, forums, chat rooms, blogs, social networking sites, videos, virtual worlds, and more.

The Dark Web Project

Federally funded through the National Science Foundation, the Dark Web’s spiders have been crawling the web for the past five years. As of 2007, the researchers estimated there were about 50,000 sites with extremist/terrorist content once they looked beyond traditional web pages. That is a sharp increase over the estimate by Dr. Gabriel Weimann of the University of Haifa that there were only 5,000 terrorist web sites in 2006. From 2006 to 2007, the lab found that the greatest increase in terrorist activity was on new “web 2.0” sites (a term they use for any new-generation web site, including video sites, blogs, virtual worlds, etc.).

Currently, the Dark Web collection consists of the complete contents of only 1,000 web sites in Arabic, Spanish, and English and the partial contents of 10,000 other sites. The collection is 2 TB in size, making it the largest open-source extremist/terrorist collection in the academic world. Researchers who would like to use this data in their own studies can contact the research center for access.

Where the Bad Guys Are

So far, the Dark Web project has found the following:

  • Forums: 300 terrorist forums found, some with more than 30,000 members; nearly 1,000,000 messages posted.
  • Blogs, social networking sites, and virtual worlds: Many transient sites have been identified before they disappear; more than 30 (self-proclaimed) terrorist or extremist groups have been found in virtual world sites, though the researchers have not yet been able to determine who is just “playing terrorist” and who is for real.
  • Videos and multimedia content: 1,000,000 images and 15,000 videos from web sites and specialty multimedia file-hosting third-party servers; more than 50% of the videos are related to Improvised Explosive Devices (IEDs).




How They Find the Data

The Dark Web project uses various tools for collection, analysis, and visualization:

  • Web site spidering: Their focused spiders can access password-protected sites and perform randomized (human-like) fetching. The spiders are trained to fetch all HTML, PDF, and Word files, links, PHP, CGI, and ASP files, images, audio, and video in a web site. Selected web sites are spidered every 2 to 3 months.
  • Forum spidering: The specialized forum spidering tool recognizes 15+ forum hosting software types and their formats. The spiders collect the following information from the forums: authors, headings, postings, threads, time tags, etc., all of which allow them to reconstruct participant interactions. They have collected and processed forum contents in Arabic, English, Spanish, French, and Chinese using selected computational linguistics techniques.
  • Multimedia (image, audio, & video) spidering: They use specialized techniques for spidering and collecting multimedia files and attachments from web sites and forums. They also perform steganography research to identify encrypted images in the collection, and multimedia analysis (video segmentation, image recognition, voice/speech recognition) to identify unique terrorist-generated video contents and styles.
  • Social network analysis (SNA): They use topological metrics (betweenness, degree, etc.) and properties (preferential attachment, growth, etc.) to model terrorist and terrorist site interactions. Techniques involving clustering and projection are used to visualize the data. The focus here is on “Dark Networks” and their unique properties.
  • Content analysis: Several coding schemes have been created to analyze the contents of terrorist and extremist web sites including content involving recruiting, training, sharing ideology, communication, propaganda, etc.
  • Web metrics analysis: They examine technical features and capabilities (e.g., their ability to use forms, tables, CGI programs, multimedia files, etc.) of such sites to determine their level of “web-savvy-ness.”
  • Sentiment and affect analysis: Sentiment (polarity: positive/negative) and affect (emotion: violence, racism, anger, etc.) analysis allows them to identify radical and violent sites that warrant further study. They also examine how radical ideas become “infectious” based on their contents, their senders, and the senders’ interactions. Recent advances in Opinion Mining (analyzing opinions in short web-based texts) have aided their work.
  • Authorship analysis and Writeprint: They have developed a technique called (cyber) Writeprint to uniquely identify anonymous senders based on the signatures associated with their forum messages. They have expanded the lexical and syntactic features of traditional authorship analysis to include system (e.g., font size, color, web links) and semantic (e.g., violence, racism) features of relevance to online texts. Inkblob and Writeprint visualizations help analysts identify web signatures visually. Writeprint can achieve an accuracy level of 95%.
  • Video analysis: A unique coding scheme has been created to analyze terrorist-generated videos based on the contents, production characteristics, and metadata associated with the videos. A semi-automated tool allows human analysts to quickly and accurately analyze and code these videos.
  • IEDs in Dark Web analysis: A smaller number of sites are responsible for distributing a large percentage of IED-related web pages, forum postings, training materials, explosive videos, etc. They have developed unique signatures for those IED sites based on their contents, linkages, and multimedia file characteristics.
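
To make the social network analysis item above more concrete, here is a minimal sketch of how degree and betweenness could be computed once forum spidering has recovered who replied to whom. This is not the lab’s code: the function names, the (author, replied_to) input format, and the sample reply pairs are all assumptions made for illustration, and betweenness is computed with Brandes’ algorithm, a standard method that the project is not confirmed to use.

```python
from collections import defaultdict, deque


def build_reply_graph(replies):
    """Build an undirected graph from hypothetical (author, replied_to) pairs."""
    graph = defaultdict(set)
    for author, target in replies:
        if author != target:  # ignore self-replies
            graph[author].add(target)
            graph[target].add(author)
    return graph


def degree_centrality(graph):
    """Fraction of the other participants each participant is linked to."""
    n = len(graph)
    if n <= 1:
        return {v: 0.0 for v in graph}
    return {v: len(neighbors) / (n - 1) for v, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness (Brandes' algorithm, unweighted, undirected)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


if __name__ == "__main__":
    # Invented sample: b, c, and d reply to a; e replies to d.
    sample = [("b", "a"), ("c", "a"), ("d", "a"), ("e", "d")]
    graph = build_reply_graph(sample)
    print(degree_centrality(graph)["a"])       # 0.75
    print(betweenness_centrality(graph)["a"])  # 5.0
    print(betweenness_centrality(graph)["d"])  # 3.0
```

In this toy thread, participant “a” has the highest degree and also sits on the most shortest paths between other participants, which is the kind of hub or broker role these metrics are meant to surface.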


Privacy Concerns

The researchers want you to know that you’re not a target of their research (unless you are, of course, a terrorist).

On their web site, they state the following: “This is not a secretive government project conducted by spooks. We perform scientific, longitudinal hypothesis-guided terrorism research like other terrorism researchers…our contents are open source in nature (similar to Google’s contents) and our major research targets are international, Jihadist groups, not regular citizens…our research goal is to study and understand the international extremism and terrorism phenomena. Some people may refer to this as understanding the root cause of terrorism.”
