Feb 20

Transparently Speaking, Are Bad Data Better than No Data At All?

ARRA, Big Data, Data, Data Conversion, ETL, Legislation, Standards, Transparency, eGovernment

I’ve been researching government program transparency and the hype surrounding “big data.” Given OMB’s recent statement of support for improving access to accurate Federal spending data, I’ve also been thinking about what improved access might actually mean, based on my own experience with data conversion and consolidation projects.

It’s difficult to find out what the Federal government actually spends. The reasons are many:

There are many different financial management and reporting systems.
Different rules exist for how data have to be reported.
Some reports are generated automatically and some manually.
Date ranges may differ on different reports tracking the same topic.
The same entities and transactions in different reports have inconsistent definitions and tagging.
Money appropriated is not the same as money spent.

And so on. One conclusion: we need to be careful about using the word “error” to describe an apparent discrepancy between one report and another, since the difference may have a perfectly legitimate explanation. Just as “there’s many a slip ‘twixt the cup and the lip,” the reasons for disparities may range from an actual error (e.g., a keyboarding error by a contractor entering expense report data for subcontractors) to a difference in reporting requirements (e.g., only expense items in excess of a certain amount might legally rise to a level requiring public reporting).
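To make the reporting-threshold case concrete, here is a minimal Python sketch. The expense items, the vendor labels, and the $25,000 threshold are all invented for illustration; they do not come from any actual Federal reporting rule. The sketch shows how two reports built from the same records can show different totals without anyone having made a mistake.

```python
from decimal import Decimal

# Hypothetical subcontractor expense items (dollar amounts invented for illustration).
expenses = [
    {"vendor": "A", "amount": Decimal("12000.00")},
    {"vendor": "B", "amount": Decimal("30000.00")},
    {"vendor": "C", "amount": Decimal("4500.00")},
]

# Assumed rule: only items above this amount must appear in the public report.
PUBLIC_REPORTING_THRESHOLD = Decimal("25000.00")


def internal_total(items):
    """Total of every expense item, as an internal system might report it."""
    return sum((item["amount"] for item in items), Decimal("0"))


def public_total(items, threshold=PUBLIC_REPORTING_THRESHOLD):
    """Total of only the items that exceed the public reporting threshold."""
    return sum(
        (item["amount"] for item in items if item["amount"] > threshold),
        Decimal("0"),
    )


internal = internal_total(expenses)
public = public_total(expenses)
print(f"Internal report total: {internal}")
print(f"Public report total:   {public}")
print(f"Difference explained by the threshold rule: {internal - public}")
```

Both totals are "correct" under their own rules. The gap between them only looks like an error if the reader does not know that the two reports were built under different requirements.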

Is it appropriate to release data to the public that contain errors or that, because of a lack of standardization, differ in key categories or field definitions from similar data reported by other agencies? Consider the following four categories that describe hypothetical expenditure data sets being considered for release to the public via an official Federal reporting program:

1. Data Not Standardized and Not Correct
2. Data Standardized and Not Correct
3. Data Not Standardized and Correct
4. Data Standardized and Correct

While Category 4 is the ideal state for data that you want to be accessible to the public and the program’s stakeholders, shouldn’t Category 1 data also be available to the public, assuming the public is made aware of possible issues?
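One way to picture "making the public aware of possible issues" is to attach plain-language caveats to a data set based on which of the four categories it falls into. The Python sketch below is only an illustration: the function names and the caveat wording are my own assumptions, not part of any existing Federal reporting program.

```python
# The four categories, numbered in the order listed above and keyed by
# (standardized, correct).
CATEGORIES = {
    (False, False): 1,  # Not Standardized and Not Correct
    (True, False): 2,   # Standardized and Not Correct
    (False, True): 3,   # Not Standardized and Correct
    (True, True): 4,    # Standardized and Correct
}


def release_category(standardized: bool, correct: bool) -> int:
    """Return the category number (1-4) for a data set."""
    return CATEGORIES[(standardized, correct)]


def release_caveats(standardized: bool, correct: bool) -> list:
    """Return hypothetical caveats to publish alongside a data set."""
    caveats = []
    if not standardized:
        caveats.append(
            "Categories and field definitions may differ from those "
            "used by other agencies."
        )
    if not correct:
        caveats.append("These data are known or suspected to contain errors.")
    return caveats


# A Category 1 data set is released with both caveats attached.
print(release_category(standardized=False, correct=False))
for caveat in release_caveats(standardized=False, correct=False):
    print("-", caveat)
```

The point of the sketch is that release and warning are separate decisions: a data set does not have to reach Category 4 before it is published, as long as its known limitations travel with it.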

In this type of situation we need to be clear about the relationships among transparency requirements, data accuracy, and differences due to a lack of standards. We also need to make people aware of possible issues with the data. For example, are the data issues serious enough, because of possible health or safety implications, to keep the data from being released to the public?

Such questions are tough to address at a theoretical level. We need to be specific, since what we really care about is not just the spending data but the impact the spending has on program objectives.

Government needs to do more than just “throw data over the transom.” Even if requirements for rapid public access are incorporated into legislation requiring standardization and transparency, we still need to address the program cost, governance, and process change issues associated with improving data access.