Dennis D. McDonald's Web Site


On the Importance of Data Janitors

By Dennis D. McDonald

Data janitors

Many who read the news release "Nearly 40% of data professionals spend half of their time prepping data rather than analyzing it" will nod vigorously in agreement. To quote from the article:

Most notably, nearly 40% of data professionals (37.5%) spend more than 20 hours per week accessing, blending and preparing data rather than performing actual analysis.

What data scientist worth his or her salt doesn't cringe at the thought of prepping data rather than doing fun and sexy analytical, predictive, and visualization stuff -- the stuff they're really getting paid to do?

Too much drudgery

I was a bit disappointed when I read this study's data. That people dislike the drudgery of data prep is understandable. That people still have to spend so much time preparing data for analysis in these days of constantly improving data analysis tools is downright depressing. After all, aren't we all headed to the day when "data prep" is part and parcel of any serious AI application, as envisioned by IBM?

No change?

Looking at the numbers, you would think that things haven't changed since the days when I crunched numbers for a living and felt really lucky if I could spend 20% of a statistical or survey research project on analysis. Or since the days when I built digital database products composed of text, numeric, and image data extracted from dozens of different systems and platforms.

In those days, data cleanup and standardization were always a major cost and time component of any client project, be the client an appliance retailer, a truck manufacturer, or an international insurance company. And heaven help you if you had to move a decade of customer payments data from one mainframe system to another and the systems were based on radically different -- and inconsistently applied -- financial or customer route models. (I know, modern network-based digital businesses that have grown up with the Web may not have such concerns, but I'm talking about the messy real world here.)

Distributed systems

Obviously today's tools are much better and, perhaps even more important, there does seem to be a recognition that better data governance is one way to improve the ratio of data analysis time to data cleanup time. Also, the distributed ledgers in blockchain systems require synchronized and compatible data (see "Linking Up Blockchain and Data Integration" by David Linthicum).

Better data governance

From the same article quoted at the top of this post:

Nearly a third (32%) of respondents' organizations are planning or researching a formalized data governance program, and nearly 20% (19.4%) are in the early stages of rolling out their governance programs, primarily with the goal of ensuring that everyone is working with consistent data.

Down and dirty

As important as I think better data governance is, I also think there's no real substitute for getting "down and dirty" with the data. However sophisticated the planned analysis, you still need to "run your fingers" through the data and get a feel for it, hence the "data janitor" reference above.
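To make "running your fingers through the data" a bit more concrete, here is a minimal Python sketch of the kind of first look I have in mind. Everything in it is an assumption for illustration: the `profile_column` helper and the sample `states` column are hypothetical, not taken from the study or from any particular tool.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column of raw string values before any analysis."""
    # Drop None entries and trim stray whitespace from the rest.
    cleaned = [v.strip() for v in values if v is not None]
    non_blank = [v for v in cleaned if v != ""]
    # Missing = None entries plus entries that are empty after trimming.
    missing = len(values) - len(non_blank)
    # Group values that differ only by letter case; more than one spelling
    # in a group hints at inconsistent data entry across source systems.
    variants = {}
    for v in non_blank:
        variants.setdefault(v.lower(), set()).add(v)
    inconsistent = {k: sorted(s) for k, s in variants.items() if len(s) > 1}
    return {
        "rows": len(values),
        "missing": missing,
        "distinct": len(set(non_blank)),
        "top": Counter(non_blank).most_common(3),
        "inconsistent": inconsistent,
    }


# Hypothetical sample: a "state" column pulled from several source systems.
states = ["VA", "va", " VA", "MD", None, "", "Md", "DC"]
summary = profile_column(states)
print(summary)
```

Even on eight rows, a summary like this surfaces missing entries and competing spellings of the same value, which is the sort of thing a governance policy can prevent going forward but which someone still has to find in the data that already exists.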

Copyright (c) 2017 by Dennis D. McDonald. An edited version of this blog post has been published by aNewDomain.