Why I'm Uneasy about "Big Data" and Government Programs

December 18, 2012 Dennis D. McDonald

Don’t get me wrong, I love data of all kinds – statistical, financial, bibliographic, replacement parts for washing machines, images, opinions – I love them all.

For many years I made a living helping people build, convert, and sell databases on local area networks, optical media, or online. I can still imagine the “feel” of picking up data in my bare hands and letting the numbers and letters run through my fingers.

These days, though, I admit to some concern when I read articles extolling the wonders and benefits of “big data,” especially when the context consists of data generated by government agencies during the course of their provision of services to citizens. Terms like “innovation,” “discoverability,” “accountability,” and “transparency” are bandied about quite loosely, perhaps too loosely.

I get it. I appreciate that scaling, discoverability, and innovation are all potentially enhanced when the data sets surrounding a particular process or function are aggregated and exposed, particularly as their size, variety, quality, and number grow. Jewels can become visible. Inconsistencies can be identified and resolved. Impacts can be tracked. Especially attractive are evolving capabilities to automate pattern and trend recognition and to extract and transform data into easier-to-analyze standardized packages via readily available APIs.

I understand all that. Deep down, I really want to believe it. That’s my data management heritage kicking in, I guess, along with the fact that I’m a “glass half full” kind of guy.

Still, a lot of these benefits (at least the way I’ve listed them here) are potential benefits. A lot of work (e.g., data cleanup, standardization, and conversion) might be needed before such benefits can be realized — assuming they can be realized.

Why would I, data lover that I am, harbor such doubts, especially when we see the benefits that can occur when large data sets such as Federal stimulus expenditures are available to agencies and the organizations that administer the programs? Here are some possible reasons:

Costs are easier to understand and quantify than benefits, especially when we are considering government programs that, unlike private-sector ventures, are not generally intended to make a profit.
Can we be sure that the original target constituency for the services generating the data will benefit from the “big data” efforts?
Given the importance of state and local governments in the actual delivery chain for federally sourced services, how likely will it be that we can get everyone “on the same page” for gathering the data?
In these days of fiscal austerity, what’s the likelihood that data aggregation efforts will be sufficiently funded?
Given the potential complexity of data projects and the number of different interests that need to be orchestrated, where will the leadership come from?
What happens if new data need to be gathered? What kinds of burdens will be placed on organizations and contractors?

I’m not simply creating “strawman” scenarios to be pessimistic about. As I said, I’m a friend of data and data’s value. But I have enough experience working with data and managing data projects to know that the train can easily go “off track” for a variety of reasons related to policy, cost, management, and/or leadership.

Where to start? Here are some suggestions to follow when planning data projects designed to aggregate and expose data from multiple sources:

Don’t spend money on converting or standardizing data that are old or out of date.
Start small so that the initial “humps” of data conversion costs and having to operate dual or overlapping systems won’t sink the program.
Align with the programs responsible for supplying and using the data.
Make sure that leadership, operations, and budget are adequate and reliable.
Target critical use cases that generate real benefits to users.
Avoid an “if we create it they will come” strategy.
Make sure all stakeholders are on the same page — agency staff, procurement staff, legislators, and most importantly, users.
Focus program metrics on outcomes, not just internal transactions, costs, and increased program efficiency.
Target mobile devices first for exposing data to different constituencies through creative methods of data visualization and manipulation.
Maximize reliance on open source, non-proprietary, and off the shelf tools and techniques.

If there is one key ingredient that we need to make “big data” efforts successful, it’s effective project management that incorporates the interests of stakeholders, manages resources effectively, stays flexible, and manages to realistic and trackable schedules. I’ll be discussing this in future posts.