Data Cleanup, Big Data, Standards, and Program Transparency
Mary Shacklett’s 10 Roadblocks to Implementing Big Data Analytics will be familiar to anyone experienced with large data conversion or database construction projects:
- Budget
- IT know-how
- Business know-how
- Data cleanup
- The storage bulge
- New data center workloads
- Data retention
- Vendor role clarification
- Business and IT alignment
- Developing new talent
Obviously “budget” is number one. My personal favorite, though, is number 4, “Data cleanup.” I’ve been on the vendor side of some large data-intensive projects and had to face the question, “When is the best time to tell the client how much data prep is actually going to cost?”
I remember thinking that if you tell them too soon you scare them away, but if you tell them too late everyone suffers. I do agree with what Shacklett recommends: raise the issue sooner rather than later.
This assumes you know in advance what it will cost to clean and prep the data. Making such estimates might require a separate project just to analyze the source data and understand what must be done to prepare it for analysis and maintenance. If the conversion job is large or complex, you may have to do this anyway. The downside of doing too much analysis up front, of course, is that you might delay delivering useful benefits, hence the not-unusual recommendation that an “agile” approach might be more appropriate.
If the data comes from multiple programs, each of which might have different definitions and data models driving its data efforts, the impact on both front-end and ongoing maintenance costs could be substantial. So too could the impact on program transparency and on the hoped-for “big data” analysis of the data coming out of the programs, especially if the software and hardware requirements for analysis differ significantly from what is already available.
As I suggested in What Makes a Government Program “Transparent”?, creating a single database from data generated by multiple programs can be an expensive undertaking. How much cleaning and restructuring of the data is required should be driven by the type of analysis that is anticipated and by the type of tools, and tool users, that will be doing the analysis.
If the advocates of a unified approach to big data analysis are in a position to require the standardized submission of data to the central repository, as was the case with ARRA in the U.S., so much the better. Standards are good. They make life easier for everyone, once they are implemented. If the standards are detailed enough, they can also provide a baseline measure for estimating data cleanup costs. Knowing such costs is helpful in case, for example, costs of data preparation have to be offset by cost reductions in other program areas.
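To make the “baseline measure” idea concrete, here is a minimal sketch in Python of how a detailed standard might be turned into a rough cleanup estimate. Everything in it is an assumption made for illustration: the field names, the validity rules, the sample records, and the minutes-per-fix figures are hypothetical and are not taken from ARRA or any real submission standard.

```python
# Hypothetical submission standard: field name -> validity check.
# These rules are illustrative assumptions, not a real specification.
STANDARD = {
    "award_id": lambda v: isinstance(v, str) and v.strip() != "",
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
    # Very rough shape check for dates written as YYYY-MM-DD.
    "report_date": lambda v: (
        isinstance(v, str) and len(v) == 10 and v[4] == "-" and v[7] == "-"
    ),
}


def count_violations(records, standard=STANDARD):
    """Count, per field, the records that lack the field or fail its check."""
    violations = {field: 0 for field in standard}
    for record in records:
        for field, is_valid in standard.items():
            if field not in record or not is_valid(record[field]):
                violations[field] += 1
    return violations


def estimate_cleanup_minutes(violations, minutes_per_fix):
    """Multiply each field's violation count by an assumed per-fix effort."""
    return sum(count * minutes_per_fix[field] for field, count in violations.items())


# Made-up sample of records submitted by a program.
sample = [
    {"award_id": "A-100", "amount": 2500.0, "report_date": "2012-03-31"},
    {"award_id": "", "amount": 100, "report_date": "03/31/2012"},
    {"award_id": "A-102", "amount": -5},
]

# Assumed effort, in minutes, to correct one bad value in each field.
minutes_per_fix = {"award_id": 30, "amount": 15, "report_date": 5}

violations = count_violations(sample)
total_minutes = estimate_cleanup_minutes(violations, minutes_per_fix)
print(violations)
print(total_minutes)
```

Run against a sample of each program's data, a tally like this gives a first, rough number to put in front of the client. The per-fix effort figures are the weak point and would have to come from actual experience with the data.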
Promoting a standards-based data preparation process can also simplify program transparency efforts, because similar tools and processes can be implemented across programs. One example is the use of similar web-based data management, analysis, and reporting tools for multiple programs.
I’ve personally known of instances where tools like Drupal and SharePoint were recommended precisely for such reasons. Following data structure or toolset standards for data prep can also stimulate the development of a pool of qualified staff, which is a key challenge in Shacklett’s list of “big data roadblocks.” Handling transactional data isn’t the same as handling data modeling, visualization, or statistical analysis and may require different skill sets.
Imposing data preparation standards supports but does not guarantee program transparency. Standards sometimes take on a life of their own, as when “standards compliant” starts showing up on acquisition requirements lists without an understanding of what that standard is supposed to promote.
Despite the potential benefits of “big data” and the move to standard data exchange formats, we also should keep our eyes on whether or not making a program more “transparent” and visible and accessible to its target users does, in fact, benefit those users. If it doesn’t, maybe we should be rethinking why we want to make programs more transparent in the first place.
Copyright (c) 2012 by Dennis D. McDonald