Data At Work profiles data scientists working at the cutting edge of Big Data.
What if data scientists could track how people move through cities and towns as easily as e-commerce sites track them online?
Don’t answer that. It’s already happening—thanks, at least in part, to a startup called StreetLight Data.
StreetLight founder and CEO Laura Schewel was working on a doctorate in energy engineering at UC Berkeley when she had the “ah-ha” idea of using data from cell towers, traffic-data aggregators and GPS satellites to track people’s movement patterns in cities and states across the country.
Initially, Schewel figured the information might help traffic engineers plan new highways and parking. But the data her system aggregates turns out to be useful for much, much more.
How To Track Without Tracking
Like it or not, it’s ridiculously easy to see how people behave online. Cookies and more sophisticated techniques let advertisers track individuals across websites, in part because the online environment is controlled (in just about every sense of the term).
This type of tracking, and the related task of gathering insights into people’s behavior, is much more complicated in real life.
The basic problem is one of putting together all the pieces of a massive puzzle that don’t exactly fit. How do you get and make sense of the data generated by ordinary people in order to say with any degree of certainty—in generalized and anonymous but still analytically useful ways—where they shop, which highways they take or even whether they’re more likely to take the train on Fridays when the Giants are playing and traffic is lousy in San Francisco?
For StreetLight, it all starts with the cellphone. It probably won’t surprise you to know that major carriers collect detailed location data as your phone registers with different cellular broadcast towers (thus providing a detailed record of your movements).
But you might not have known that carriers sell access to that data in a format that basically provides a movement record for large chunks of the population. It’s all anonymized: The data consists of map plot coordinates and the ID numbers that identify particular phones, with the latter run through a one-way hashing function designed to yield unique numbers that can’t be matched to the original IDs.
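To make the hashing step concrete, here is a minimal sketch of how a device ID might be replaced with a one-way digest before location records are shared. The article does not say which function carriers use. The keyed HMAC-SHA256 below, the `SECRET_KEY` value, the `pseudonymize` helper and the sample records are all assumptions for illustration. A keyed hash is used here because a plain hash of a small ID space, such as phone numbers, can be reversed by simply hashing every possible value.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data provider (not from the article).
SECRET_KEY = b"example-key-held-by-carrier"


def pseudonymize(device_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of a device ID as a 64-character hex string."""
    return hmac.new(key, device_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up records: (device ID, latitude, longitude)
raw_records = [
    ("device-001", 37.8044, -122.2712),
    ("device-002", 37.7749, -122.4194),
    ("device-001", 37.8715, -122.2730),
]

# The same device always maps to the same digest, so movement patterns
# survive, but the original ID does not appear in the output.
anonymized = [(pseudonymize(dev), lat, lon) for dev, lat, lon in raw_records]
```

Because the digest is stable, the first and third records still belong to the same (unnamed) device, which is what allows a movement record to be built at all.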
StreetLight’s proprietary pattern-recognition algorithms can infer the “favorite” places of the people covered in the carrier geodata, such as their home and work neighborhoods. Then StreetLight cross-references this info with census and other demographic information such as household income, educational status and race. [Corrected: see below]
What it ends up with are richly detailed databases that can be used, say, to generate the average profile of someone who might be shopping at Whole Foods at 5pm, dropping a child off at school on a Monday morning, or commuting from San Francisco to the East Bay to work.
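StreetLight’s algorithms are proprietary, so the following is only a rough sketch of the general idea, not the company’s method. It assumes that a device’s most frequent nighttime location approximates “home,” that its most frequent location during working hours approximates “work,” and that each location has already been mapped to a census block. The hour ranges, block IDs, income figures and the `favorite_places` helper are all made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up pings: (hashed device ID, hour of day 0-23, census block ID)
pings = [
    ("a1", 2, "block_A"), ("a1", 23, "block_A"), ("a1", 3, "block_B"),
    ("a1", 10, "block_C"), ("a1", 14, "block_C"), ("a1", 11, "block_A"),
    ("b2", 1, "block_B"), ("b2", 22, "block_B"),
    ("b2", 9, "block_C"), ("b2", 15, "block_C"),
]

# Made-up census figures: median household income per block, in dollars
census_income = {"block_A": 95000, "block_B": 48000, "block_C": 70000}


def is_night(hour):
    return hour >= 21 or hour < 6   # assumed "at home" hours


def is_working_hours(hour):
    return 9 <= hour < 17           # assumed "at work" hours


def favorite_places(pings):
    """Guess a home and work block per device from ping frequency."""
    night = defaultdict(Counter)
    day = defaultdict(Counter)
    for device, hour, block in pings:
        if is_night(hour):
            night[device][block] += 1
        elif is_working_hours(hour):
            day[device][block] += 1
    places = {}
    for device in set(night) | set(day):
        home = night[device].most_common(1)
        work = day[device].most_common(1)
        places[device] = {
            "home": home[0][0] if home else None,
            "work": work[0][0] if work else None,
        }
    return places


places = favorite_places(pings)

# Aggregate profile: average home-block income of devices that work in block_C
workers = [d for d, p in places.items() if p["work"] == "block_C"]
avg_income = sum(census_income[places[d]["home"]] for d in workers) / len(workers)
```

Note that the demographic figure attaches to a neighborhood, not to a person: the result describes the average home block of a group of devices, which is the kind of aggregate profile the article describes.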
Put that way, StreetLight sounds sort of creepy, and maybe it is. Schewel and her team, of course, stress that safeguards such as those one-way hashes make it impossible to tie aggregated data about groups back to individual users. “There is no way for us to actually map anything back to individuals. All that data is stripped out long before we get it,” Schewel told me.
At the same time, de-anonymizing such information tends to become easier over time, in part because individuals are generating increasing quantities of data about themselves that can serve as a cross-reference to pinpoint actual identities.
Whether or not such privacy concerns have merit, this type of data can provide valuable information in a variety of situations. Think companies deciding whether and where to expand; city and transit planners projecting the need for new zoning, transit or roadways; and perhaps developing nations planning new infrastructure and even entire cities.
Turning Data Into Information
The process by which StreetLight maps together these very different types of data into a coherent dataset turns out to be fairly straightforward. Every month, Schewel’s team receives a messy glob of about 400GB worth of geospatial data from mobile carriers and other data providers.
That doesn’t sound like much—even given that the load is expected to reach 800GB a month next year—considering that StreetLight’s movement patterns cover much of the continental U.S. (The company occasionally also scrapes up Canadian data by accident, and has to discard it.) But geospatial data is fairly lean and has a small footprint, Schewel says. The data is added to StreetLight’s existing multi-terabyte data store.
StreetLight then pushes the data through a custom extract, transform and load process run through Talend, a popular Big Data integration tool. This trims out unnecessary information and reformats different types of data into a uniform schema.
Along the way, this process matches up different types of data—cellular-tower location, traffic reports, census patterns, other data sources—at different geographic scales ranging from census block to town or city to region, and along expressways or other transit corridors. All that data gets referenced to particular geospatial locations and, in many cases, to specific time periods as well (“all the time,” “weekdays,” “rush hour,” etc.).
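StreetLight runs this step in Talend, whose job definitions are not shown here. As a language-neutral illustration of what “reformatting into a uniform schema” and tagging by time period can look like, here is a small Python sketch. The two input formats, the `Observation` record, the field names and the rush-hour definition are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """Uniform record that every source is converted into (hypothetical schema)."""
    source: str
    lat: float
    lon: float
    timestamp: datetime
    periods: tuple


def time_periods(ts):
    """Label a timestamp with the coarse periods it belongs to."""
    labels = ["all the time"]
    if ts.weekday() < 5:  # Monday=0 ... Friday=4
        labels.append("weekdays")
        # Assumed rush-hour windows: 7-9am and 4-7pm
        if 7 <= ts.hour < 9 or 16 <= ts.hour < 19:
            labels.append("rush hour")
    return tuple(labels)


def from_cell_row(row):
    # Hypothetical carrier format: string coordinates, ISO 8601 timestamp
    ts = datetime.fromisoformat(row["ts"])
    return Observation("cell", float(row["lat"]), float(row["lng"]), ts, time_periods(ts))


def from_traffic_row(row):
    # Hypothetical traffic format: [lon, lat] pair, separate US-style date and time
    lon, lat = row["position"]
    ts = datetime.strptime(row["date"] + " " + row["time"], "%m/%d/%Y %H:%M")
    return Observation("traffic", float(lat), float(lon), ts, time_periods(ts))


unified = [
    from_cell_row({"lat": "37.8044", "lng": "-122.2712", "ts": "2014-03-07T08:15:00"}),
    from_traffic_row({"position": [-122.4194, 37.7749], "date": "03/08/2014", "time": "14:30"}),
]
```

Once every source lands in the same record shape, matching records to census blocks, corridors or time windows becomes a matter of grouping on shared fields rather than handling each feed separately.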
What StreetLight Knows About Us
All that work links together disparate types of data in a meaningful way, making it possible to get a good sense of where people who fit a particular demographic profile spend their time—and when.
Say, for instance, you wanted to know more about people who shop at the Stanford Mall. The StreetLight database might tell you that people over 50 with graduate degrees who live in high-end neighborhoods shop there all the time; families with children from middle-class and high-end neighborhoods shop there on weekends (especially in August and December); and people without college degrees only visit the mall on Monday evenings in the spring.
Now that’s data transparency.
StreetLight can, for example, give a retail chain that’s thinking about opening a new store better information about its prospective customers: whether the average shopper in a proposed mall location earns closer to $50,000 or $100,000, has one child or three, or is a 50-year-old female or a 21-year-old male. As you can imagine, such data is incredibly valuable, and not just to companies.
As Schewel explained it to me:
We can actually show what might happen if, say, a new freeway off-ramp would be built or a road is changed or even if a big snowstorm hits. We can do this by finding days in the past when an event creating similar conditions occurred. It’s much better than running simulations because it’s real behavior.
For a decent degree of confidence, StreetLight needs a sample size equal to at least 1% of the population of any location. Schewel prefers 5% to 6% for better signal fidelity, though.
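Those thresholds translate into a simple coverage check. The sketch below assumes coverage is measured as distinct devices observed divided by the resident population of a location; the article gives the percentages but not how they are computed, and the `sample_quality` helper and its labels are made up for illustration.

```python
MIN_FRACTION = 0.01        # minimum sample size cited in the article (1%)
PREFERRED_FRACTION = 0.05  # lower end of the 5% to 6% range Schewel prefers


def sample_quality(devices_seen: int, population: int) -> str:
    """Classify a location's sample coverage against the article's thresholds."""
    if population <= 0:
        raise ValueError("population must be positive")
    fraction = devices_seen / population
    if fraction < MIN_FRACTION:
        return "insufficient"
    if fraction < PREFERRED_FRACTION:
        return "usable"
    return "preferred"


# A town of 100,000 people with 2,000 observed devices has 2% coverage.
town_quality = sample_quality(2_000, 100_000)
```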
X-Raying The Average Shopper
Already, StreetLight is proving its worth in some unexpected ways. In 2013, the Oakland Business Development Corporation (OBDC) wanted to increase economic activity in downtown neighborhoods where hundreds of commercial properties lay vacant. Oakland locals, too, were spending up to three-quarters of their retail dollars elsewhere, in part for lack of options.
Foodies in the East Bay knew the downtown Oakland dining scene was on fire; OBDC, a nonprofit urban-development agency and business lending organization, tried to capitalize on the boom by courting retailers and developers. But it struck out when its prospects looked at demographic data on nearby neighborhoods, many of which are low-income areas, and backed away.
OBDC turned to StreetLight for a clearer picture of downtown Oakland’s commercial prospects. StreetLight’s data revealed that the area regularly draws a healthy mix of wealthy, middle-class and lower-income people.
OBDC used those findings to convince skeptical store owners to consider locating downtown. But the organization, which also makes loans to retailers, put the data to broader use—primarily to confirm that the area’s shopping demographics could support a variety of store types.
“That data helped us fill dozens of vacant storefronts over the next year,” says Jacob Singer, OBDC’s president and CEO.
Singer is now considering purchasing StreetLight data as part of retail and urban planning efforts around an upcoming bus-based rapid-transit project slated for downtown Oakland in the next few years. “There really are no comparable alternatives that provide data this detailed and accurate for urban planning and project assessment,” he says.
Reading The StreetLight X-Ray
VeggieGrill, a rapidly growing vegetarian fast-food chain, signed up with StreetLight to learn where people who most closely matched the vegetarian demographic tended to shop and spend their time.
Other retailers are using StreetLight data in reverse. Men’s Wearhouse, for instance, uses StreetLight not just to spot new store locations, but to identify underperforming stores based on traffic patterns and shopper demographics.
StreetLight’s data often reveals unexpected patterns—or their absence. Sometimes it shows big differences in the types of shoppers that frequent two adjacent shopping centers, or surprising discrepancies between stores and their neighborhoods.
“We can also tell a store chain that the wealthy people who live around a location rarely go to that store,” says Schewel. “For some customers, we have seen surprising dead zones where you would think a ton of people would shop but in fact few venture in.”
Schewel has big plans beyond helping merchants optimize their store locations, such as using detailed data to improve public planning in developing countries.
“Many of these countries don’t have a census and don’t really know how people are moving around, so our information would be the first real data,” she says. And since countries that never had widespread land lines often have denser cellphone networks than the U.S., Schewel thinks StreetLight could provide even more detailed user data.
Ultimately, StreetLight’s data could also help answer more difficult questions about whole-day transport patterns. These patterns reflect complex human decisions that result in behaviors and traffic patterns that are hard to analyze in isolation.
As Schewel told me:
We can capture the entire traveling day of citizens. Rather than just seeing what happens when someone is going from home to work, we can see that people have not taken public transit because they have to pick up their child from school or that they are more likely to go to a supermarket to buy groceries on Friday night. This type of detail lets everyone that needs to know how people move see cause-and-effect far better than before.
Correction, 11:19pm PT: An earlier version of this article incorrectly described the information StreetLight purchases from carriers. It acquires only geolocation data from carriers, not anonymized demographic and user information.