Management Needs Data Literacy To Run Open Data Programs
Sunlight Foundation’s Júlia Keserű’s When open data isn’t enough is an insightful piece. I especially like how the relationship between “reactive information disclosure” and “proactive access to information” is discussed:
Sunlight has already written a great deal about how real change requires a healthy transparency ecosystem in which a number of different contributing parts play an important role. The same is true for the necessary coexistence of reactive information disclosure, as exemplified by traditional freedom of information (FOI) laws, and proactive access to information, which is increasingly the product of open data laws. While one guarantees citizens the right to ask public officials for information about what they are doing and for any documents they hold, the other describes the positive obligation of public bodies to provide information about their main activities, budgets and policies.
The distinction the author then makes between (a) what’s available in documents and (b) what’s now available in (faster moving) data sets is, in my opinion, a bit too simplistic. An underlying issue, not mentioned by the author, is that whether useful information is stored in documents or in data sets, at some point you also need the ability and skill to manage, or at least understand, the processes involved in gathering and analyzing the data.
Let’s call this “data management literacy.” Open data advocates might say that’s one reason why you need to make data sets available to the public, so that other (non-governmental) eyeballs and brains can analyze and disclose what’s meaningful.
While that’s certainly true, it also helps to have data skills available within government as a core competency, so that the processes associated with inventorying, organizing, standardizing, and using data are better managed, including the processes involved in making data more open and accessible to the public.
Needed: a minimum level of “data management literacy”
What is needed is some level of “data literacy” on the part of those managing open data and transparency programs. I’m not saying that everyone needs to be a “data scientist,” or whatever the term of art is these days, in order to be considered data literate. But people certainly need more than an understanding of spreadsheets and Excel in order to plan and manage open data and transparency programs effectively, and this starts with how such programs are planned, designed, and costed.
Also needed is some understanding that classical research designs, hypothesis testing, and sample selection and survey methodologies have been supplemented with newer tools that accommodate large and constantly changing floods of data. As discussed in Planning for Big Data: Lessons Learned from Large Energy Utility Projects, the sheer volume of data can now have a profound impact on the ability to manage it wisely.
A baseline of shared knowledge
As a taxpayer I’m an advocate of government transparency but I’m writing this primarily from the perspective of a project manager. I’ve learned that project communication is a key element in the success of any technology dependent project including those that focus on creating, managing, or accessing data. For effective communication to take place among project management, staff, and stakeholders, a baseline of shared knowledge and language needs to exist. When a project focuses on developing and implementing a government agency’s open data programs, then, how do we define what these core elements of “data literacy” are? What do managers need to know?
2. AN OPEN DATA MANAGEMENT PROCESS MODEL
Figure 1 is the first draft of an “open data process model.” I’m using it here to outline the types of knowledge people need to effectively manage an open data program.
Figure 1. Open Data Management Process Model
A. Management and governance
My approach to defining “data management literacy” involves more than software tools and analytical skills.
Responsibility for managing and overseeing an open data program needs to be considered right from the start. You need to understand the roles of the stakeholders and how they will interact in controlling the program over time. Failing to do this, by overemphasizing tools or technology at the expense of strategic alignment, sustainability, or performance measurement, will limit the program’s impact. Basic data literacy on the part of management thus means attending to organizational and administrative issues from the outset.
B. Standards, quality-control, and performance metrics
Decisions will have to be made about data standards, how quality is maintained, and how program performance is measured. Part of what is needed is an understanding of what one’s peers are already doing in terms of the data and metadata in their own publishing; this can enhance communication and generalization of findings when done correctly.
Attention also needs to be given to the implications of publishing data from multiple internal systems that themselves may not have standard data definitions or formats. Management needs to be aware that an internal lack of data standards has both cost and political implications as changes may be needed in how data are generated from multiple internal sources.
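To make the cost of missing internal standards concrete, here is a minimal sketch in Python of what harmonizing two internal sources can involve. Everything in it is an assumption made for illustration: the system names, the field names, the date formats, and the shared schema are hypothetical and are not taken from any real agency.

```python
from datetime import datetime

# Hypothetical field mappings: each internal system's column names
# mapped to one shared, published schema.
FIELD_MAPS = {
    "permits_system": {"PermitNo": "record_id", "IssueDt": "date", "Amt": "amount_usd"},
    "finance_system": {"id": "record_id", "posted_on": "date", "total": "amount_usd"},
}

# Hypothetical date formats used by each source system.
DATE_FORMATS = {
    "permits_system": "%m/%d/%Y",
    "finance_system": "%Y-%m-%d",
}


def standardize(record, source):
    """Rename fields to the shared schema and normalize dates to ISO 8601."""
    mapping = FIELD_MAPS[source]
    # Keep only the fields that have a place in the shared schema.
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    parsed = datetime.strptime(out["date"], DATE_FORMATS[source])
    out["date"] = parsed.date().isoformat()
    out["amount_usd"] = float(out["amount_usd"])
    out["source"] = source  # record where each row came from
    return out


permit = standardize(
    {"PermitNo": "P-17", "IssueDt": "03/09/2014", "Amt": "250"}, "permits_system"
)
invoice = standardize(
    {"id": "F-88", "posted_on": "2014-03-10", "total": "99.5"}, "finance_system"
)
```

Even in this toy case, someone has to decide on the shared field names and formats and keep the mappings current as source systems change. That ongoing work is the management issue, more than the code itself.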
C. Extract and capture
At some point data intended for release to the public via an open data program need to be captured and/or extracted from different systems that, likely as not, are not under the direct control of the open data program’s governance process, at least initially. At minimum, discovery and research will be necessary to identify source data for inclusion in the open data program. While in theory any data of a non-personal or non-sensitive nature might be a candidate, in reality priorities need to be set.
How priorities are set for sequencing which data are released to the public and how that is done is a basic component of data management literacy and includes being able to understand the connection between different types of data and how they might be used by the public in ways that are aligned with the organization’s goals and objectives. Once a mapping does take place between data types and intended data usage, any additional steps required for getting the data to the public need to be considered.
D. Data prep
Once targeted data are captured they need to be prepared for public access. Sometimes this preparation process is straightforward and involves copying and moving data from one system to another. At other times some type of conversion or transformation of the data may be involved including cleaning up errors, adding metadata, changing measurement units, obfuscation or removal of personal identifying information, or the application of external or community defined standards.
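As a rough illustration of the kinds of steps just listed, the following Python sketch drops incomplete rows, removes personal identifying fields, converts a measurement unit, and attaches simple metadata. The column names, the list of personal fields, and the metadata layout are hypothetical choices made for this example only; a real program would define them through its own inventory and standards work.

```python
# Hypothetical column names; a real program would take these
# from its own data inventory.
PII_FIELDS = {"owner_name", "phone"}
MILES_TO_KM = 1.609344


def prepare(rows, dataset_title):
    """Return a publishable copy of rows plus a small metadata record."""
    cleaned = []
    for row in rows:
        raw = row.get("distance_miles")
        if raw is None or not str(raw).strip():
            continue  # skip rows missing the required measurement
        # Remove personal identifying fields before publication.
        public_row = {k: v for k, v in row.items() if k not in PII_FIELDS}
        # Convert the measurement unit and rename the column to match.
        del public_row["distance_miles"]
        public_row["distance_km"] = round(float(str(raw).strip()) * MILES_TO_KM, 2)
        cleaned.append(public_row)
    metadata = {
        "title": dataset_title,
        "rows_received": len(rows),
        "rows_published": len(cleaned),
        "units": {"distance_km": "kilometers"},
    }
    return cleaned, metadata


raw_rows = [
    {"route": "A", "owner_name": "J. Doe", "phone": "555-0100", "distance_miles": " 10 "},
    {"route": "B", "owner_name": "R. Roe", "phone": "555-0101", "distance_miles": ""},
]
published, meta = prepare(raw_rows, "Route lengths")
```

Note that the metadata records how many rows were received versus published. Keeping that kind of count is one simple way to make the effects of data prep visible to managers and to the public.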
A major component of data management literacy is understanding the work involved during this data prep process. That doesn’t necessarily mean an understanding of the inner workings of all the tools that might come into play at this stage but it does require an understanding and appreciation of data prep’s management, schedule, and resource requirements, especially if the data have to come from multiple sources and will need to be updated and maintained over time.
While Data Prep focuses on the data — getting, converting, cleaning, standardizing — in Staging we focus on making the open data program sustainable. This may involve setting up a new set of processes (for example, implementing organization wide metadata and then applying tags and index terms consistently via a combination of manual and automated processes) or modifying old ones (for example, changing how legacy systems that provide input data are operated in order to simplify or streamline data prep processes).
At this stage it must be understood by management that the data files, apps, and APIs we want to make available to the public in the open data program are just the “tip of the iceberg.” Work behind the scenes, like managing the content presented by a complex and constantly changing website, can be substantial.
It also becomes clear that the governance process involves several different parties working together, parties whose “ownership” of the source data intended for the open data program might initially feel challenged when changes are sought in the processes by which the source data are generated or extracted.
Data management literacy is important at this stage since resistance to change can come from both technical and political directions. Management needs to understand the realities of the two and must be able to balance organizational realities with the practical work of getting the job done. The ability to negotiate and collaborate at this stage will be at least as important as the ability to manage and use technology and software tools.
In “presentation” data are exposed for use by users. Both internal and external users may need access to the data via online access, downloads, search, analytics, visualizations, modeling, and any variety of web-based or local tools. Depending on the types of uses to be made of the data, the functionality presented to the user for interacting with the data may differ between internal and external users, but the underlying data may remain the same even if legitimate access restrictions must be imposed (e.g., to protect sensitive personal data).
The ability to manage and sustain a web based “portal” function that presents data to users in a variety of ways requires careful planning as well as technical, human, and financial resources. In many cases these resources may be spread throughout the organization and not controlled by a single entity.
Management needs to understand who is using open data, why open data are being used, what impacts use of the open data are having, and whether or not the open data are helping to accomplish the organization’s goals and objectives.
Understanding usage may not be easy. Some usage tracking may come from tools embedded within open data delivery mechanisms (e.g., web site based analytics), some may come from social media and social network tracking that captures mentions or communications associated with the data, and some may come from [potentially expensive] purpose-built survey and usage measurement.
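One way to picture the problem is as a small roll-up of usage signals from several channels. The Python sketch below assumes hypothetical dataset names, channel names, and counts. It is not a recommended measurement method, only an illustration of combining sources that were collected in different ways.

```python
from collections import Counter

# Hypothetical usage events gathered from three kinds of sources.
events = [
    {"dataset": "budget-2014", "channel": "web_analytics", "count": 120},
    {"dataset": "budget-2014", "channel": "social_mentions", "count": 8},
    {"dataset": "permits", "channel": "web_analytics", "count": 45},
    {"dataset": "permits", "channel": "survey_responses", "count": 3},
]


def summarize(events):
    """Total usage per dataset and list which channels contributed."""
    totals = Counter()
    channels = {}
    for event in events:
        totals[event["dataset"]] += event["count"]
        channels.setdefault(event["dataset"], set()).add(event["channel"])
    return {
        name: {"total": totals[name], "channels": sorted(channels[name])}
        for name in totals
    }


summary = summarize(events)
```

A single total like this mixes unlike things (a page view is not a survey response), so in practice the per-channel breakdown is likely to be more informative than the sum.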
Open data programs can make keeping track of usage challenging, and they require management to appreciate the multiplicity of ways that open data can be made useful. While the source organization may itself provide a variety of online search, analysis, and visualization tools that are aligned with its goals and objectives, open data may also be downloaded or extracted for use in novel or creative ways, sometimes in combination with data from other sources by entities independent of the source agency.
If you follow the “flow” from left to right in Figure 1 too literally you might overlook what is characteristic of the open data “movement” and its frequent involvement with hackathons, datathons, and related community oriented mechanisms. These can generate ideas, tools, and products, some of which may have only a remote connection to why source data were initially generated. Even though I suggested in How To Make Datathon Efforts Sustainable that there are things that management can do before, during, and after a datathon to increase sustainability, the fact is that open data programs are, by design, not as “buttoned down” as traditional information system efforts. They are intended to push data out for use and re-use in intended as well as novel and creative ways.
This does not absolve management from assessing usage and effectiveness of open data programs. It does suggest that usage of open data needs to be defined — and measured — in potentially novel and creative ways, starting with an awareness by management of the variety of groups and communities that might play a role in generating creative uses and applications.
Copyright © 2014 by Dennis D. McDonald