Data Mining and Taco Bell Programming

Programmer Ted Dziuba suggests an alternative to traditional programming that he calls “Taco Bell Programming.” The Taco Bell chain creates multiple menu items from about eight different ingredients. Dziuba wants to be able to create many applications from combinations of about eight different shell commands.

Here’s an example from Dziuba:

Here’s a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.

The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A “distributed crawler” is really only like 10 lines of shell script.
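As a rough sketch of what the xargs and wget approach might look like, the following reads one URL per line and keeps several downloads running at once. This is not Dziuba's script: the function name fetch_all, the file urls.txt and the directory pages are made up for illustration.

```shell
#!/bin/sh
# fetch_all URL_FILE OUT_DIR [JOBS]
# Hypothetical helper (not from Dziuba's post): download every URL listed
# in URL_FILE (one per line) into OUT_DIR, running up to JOBS wgets at once.
fetch_all() {
    url_file=$1
    out_dir=$2
    njobs=${3:-8}          # default of 8 parallel downloads is an assumption

    mkdir -p "$out_dir"

    # xargs -n 1 : pass one URL to each wget invocation
    # xargs -P N : keep up to N invocations running in parallel
    # wget -q    : quiet output
    # wget -P D  : save downloaded files under directory D
    xargs -n 1 -P "$njobs" wget -q -P "$out_dir" < "$url_file"
}
```

Called as fetch_all urls.txt pages 16, this would keep up to 16 wget processes busy until the list is exhausted. Note that plain xargs treats quotes and backslashes in its input specially, so URLs containing those characters would need escaping first. To spread the work across machines, the same list could be cut into chunks with split and each chunk handed to a different host, with rsync collecting the results.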

Dziuba gives another example. Instead of using Hadoop to process that data once you have it, you can use:

find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process
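In that one-liner, -print0 and -0 pass file names separated by NUL bytes so that names containing spaces survive, -n1 hands one file to each invocation, and -P32 keeps up to 32 invocations running in parallel. Dziuba does not say what ./process does, so the sketch below swaps in a stand-in step that prints each file's word count and path. The function name process_tree and the word-count step are assumptions for illustration only.

```shell
#!/bin/sh
# process_tree DIR [JOBS]
# Hypothetical wrapper around the find | xargs pattern: run a per-file step
# on every regular file under DIR, up to JOBS files at a time.
process_tree() {
    dir=$1
    njobs=${2:-32}         # 32 matches the -P32 in the original one-liner

    # The inline sh -c script stands in for ./process (an assumption):
    # it prints "<word count><TAB><path>" for the single file it receives.
    # tr strips the padding some wc implementations add around the count.
    find "$dir" -type f -print0 |
        xargs -0 -n 1 -P "$njobs" sh -c \
            'printf "%s\t%s\n" "$(wc -w < "$1" | tr -d " ")" "$1"' sh
}
```

Because the files are processed in parallel, the output lines arrive in no particular order. Piping the result through sort -n is one simple way to get a stable, ranked list.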

“It is a viable way to deal with massive data problems, at least for one-off jobs,” big data expert and ReadWriteWeb contributor Pete Warden says of Dziuba’s Taco Bell programming concept. “You’re trading off the ability to manage and tightly control the process against development speed.”

Do you have any favorite hacks like this?