Open source data wrangling and audit #data-science
A few months ago I wrote about a method I was developing to restructure messy spreadsheets into a single schema using a drag 'n' drop interface. The thing that struck people most was that, once data wrangling becomes easy, you can create schemas on the fly without needing buy-in across the organisation. That is very useful for anyone wanting to merge multiple departmental budget reports while keeping an audit trail of how the integration was achieved.
A few days ago I released an open source version of the core software: https://github.com/whythawk/whyqd
I’m not sure how many people in the group work directly with Python/pandas, but you may work with people who do. I’d appreciate your thoughts as I continue to develop this resource. There’s also a full guided tutorial for anyone wanting to learn from scratch: https://whyqd.readthedocs.io/en/latest/tutorial.html
What is it?
whyqd provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, normalise any messy spreadsheets using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.
It aims to get you to the point where you can perform automated data munging prior to committing your data into a database, and no further. It is built on pandas, and plays well with existing Python-based data analysis tools. Each raw source file produces a JSON schema, a JSON method file that defines the set of actions to be performed to produce refined data, and a destination file validated against that schema.
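To make the idea concrete, here is a minimal sketch of the schema / method / destination pattern described above. It is not whyqd's actual API: the action names (`RENAME`, `TO_NUMBER`), the schema layout, and the functions `apply_method` and `validate` are all hypothetical, and plain Python dictionaries stand in for pandas DataFrames so the example is self-contained.

```python
import json

# Hypothetical schema: the destination field names and their expected types.
SCHEMA = {
    "fields": [
        {"name": "department", "type": "string"},
        {"name": "budget", "type": "number"},
    ]
}

# Hypothetical method file content: an ordered list of actions stored as JSON.
METHOD_JSON = json.dumps({
    "actions": [
        {"action": "RENAME", "source": "Dept.", "target": "department"},
        {"action": "RENAME", "source": "Total (GBP)", "target": "budget"},
        {"action": "TO_NUMBER", "target": "budget"},
    ]
})


def apply_method(rows, method):
    """Apply each recorded action, in order, to a copy of the raw rows."""
    out = [dict(row) for row in rows]
    for step in method["actions"]:
        for row in out:
            if step["action"] == "RENAME":
                row[step["target"]] = row.pop(step["source"])
            elif step["action"] == "TO_NUMBER":
                text = str(row[step["target"]]).replace(",", "")
                row[step["target"]] = float(text)
            else:
                raise ValueError(f"Unknown action: {step['action']}")
    return out


def validate(rows, schema):
    """Check that every row has exactly the schema's fields, with matching types."""
    expected = {field["name"]: field["type"] for field in schema["fields"]}
    python_types = {"string": str, "number": float}
    for row in rows:
        if set(row) != set(expected):
            return False
        for name, type_name in expected.items():
            if not isinstance(row[name], python_types[type_name]):
                return False
    return True


# Messy source rows, as they might come out of a departmental spreadsheet.
raw_rows = [
    {"Dept.": "Health", "Total (GBP)": "1,200.50"},
    {"Dept.": "Transport", "Total (GBP)": "980"},
]

refined = apply_method(raw_rows, json.loads(METHOD_JSON))
```

Because the method is just data, the same JSON can be re-applied to next month's spreadsheet, and `validate(refined, SCHEMA)` confirms the result conforms before anything is committed to a database.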
whyqd supports full audit transparency by saving all actions performed to restructure your input data to a separate JSON-defined method file. This permits others to scrutinise your approach, validate your methodology, or even use your methods to import data in production.
Once complete, a method file can be shared, along with your input data, and anyone can import whyqd and validate your method to verify that your output data is the product of these inputs.
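One way such a check could work is sketched below. This is an illustration under assumptions, not whyqd's own validation routine: the use of SHA-256 checksums, the record layout, and the functions `checksum`, `run_method`, `build_audit_record` and `verify` are all hypothetical, and only a single rename action is supported for brevity.

```python
import hashlib
import json


def checksum(rows):
    """Hash a canonical JSON rendering of the rows (keys sorted for stability)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_method(rows, method):
    """Re-run the recorded actions; only a hypothetical RENAME is handled here."""
    out = [dict(row) for row in rows]
    for step in method["actions"]:
        for row in out:
            row[step["target"]] = row.pop(step["source"])
    return out


def build_audit_record(rows, method):
    """Bundle the method with checksums of its input and output data."""
    return {
        "method": method,
        "input_checksum": checksum(rows),
        "output_checksum": checksum(run_method(rows, method)),
    }


def verify(rows, record):
    """Confirm the input is unchanged and that re-running the method reproduces the output."""
    if checksum(rows) != record["input_checksum"]:
        return False
    return checksum(run_method(rows, record["method"])) == record["output_checksum"]


source_rows = [{"Dept.": "Health", "Total": 1200}]
method = {
    "actions": [{"action": "RENAME", "source": "Dept.", "target": "department"}]
}

# Round-trip through JSON to mimic sharing the record as a file.
record = json.loads(json.dumps(build_audit_record(source_rows, method)))
```

Anyone holding the same input data and the shared record can call `verify(source_rows, record)`; if a single value in the input has been altered, the input checksum no longer matches and the check fails.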
Gavin Chait is a data scientist and development economist at Whythawk.