Hi all, got a question and something to show …
One of the great challenges I face when implementing open data programs is the work involved in preparing data for public release. Often this becomes a blocker when persuading data owners to commit to their own project, and results in limited data release.
Every quarter since 2016, I have collated about 300 different datasets from local authorities across the UK as part of my open data project, https://sqwyre.com, a service mapping the business history of every commercial property across the UK. This is not scraping. It is tedious and repetitive data wrangling: converting multiple files, in multiple formats, into a single schema to permit analysis, comparison and further enrichment.
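For anyone unfamiliar with this kind of wrangling, here is a minimal sketch in Python of mapping differently shaped files onto one schema. The council names, column names and mappings are all hypothetical, made up for illustration, and are not taken from Sqwyre or from my tools.

```python
import csv
import io

# Hypothetical target schema; these column names are illustrative only.
TARGET_FIELDS = ["property_ref", "occupier", "rateable_value"]

# Hypothetical per-source header mappings (source column -> target column).
SOURCE_MAPPINGS = {
    "council_a": {
        "PropRef": "property_ref",
        "Account Holder": "occupier",
        "RV": "rateable_value",
    },
    "council_b": {
        "reference": "property_ref",
        "ratepayer": "occupier",
        "rateable value": "rateable_value",
    },
}


def normalise(source_name, csv_text):
    """Rename one source's columns to the target schema, dropping unmapped columns."""
    mapping = SOURCE_MAPPINGS[source_name]
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Start every row with the full target schema so all rows share one shape.
        row = {target: None for target in TARGET_FIELDS}
        for source_col, target_col in mapping.items():
            value = record.get(source_col)
            row[target_col] = value.strip() if value is not None else None
        rows.append(row)
    return rows


# Two made-up extracts with different headers and an extra column.
council_a_csv = "PropRef,Account Holder,RV\n1001,Acme Ltd,12500\n"
council_b_csv = "reference,ratepayer,rateable value,ward\n2002, Bolt & Co ,8000,Central\n"

combined = normalise("council_a", council_a_csv) + normalise("council_b", council_b_csv)
```

The point is that each mapping is written down once per source, so the same steps can be rerun when next quarter's file arrives.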
I have built a collection of tools that offers a simple, collaborative drag 'n drop interface to support our data wranglers and creates a JSON file. That file permits the schema to be validated according to the https://frictionlessdata.io/ standard, lets wranglers collaborate on workflows, and allows the methodology to be repeated and tracked. My objectives are wrangling simplicity, complete audit transparency, and speed.
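To give a rough idea of what validating against a JSON schema means here, the sketch below uses a small schema in the general shape of a Frictionless Table Schema (a `fields` list with `name`, `type` and `constraints`). The field names are hypothetical, and the hand-rolled `validate_row` check is a stand-in covering only two rules; it is not the Frictionless tooling and not the code in my system.

```python
import json

# Hypothetical schema in the general shape of a Frictionless Table Schema.
SCHEMA_JSON = """
{
  "fields": [
    {"name": "property_ref", "type": "string", "constraints": {"required": true}},
    {"name": "occupier", "type": "string"},
    {"name": "rateable_value", "type": "integer"}
  ]
}
"""


def validate_row(row, schema):
    """Return a list of error messages for one row (illustrative subset of checks)."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        value = row.get(name)
        required = field.get("constraints", {}).get("required", False)
        if value in (None, ""):
            if required:
                errors.append(f"{name}: required value is missing")
            continue
        if field.get("type") == "integer":
            try:
                int(value)
            except (TypeError, ValueError):
                errors.append(f"{name}: {value!r} is not an integer")
    return errors


schema = json.loads(SCHEMA_JSON)

# Made-up rows: one clean, one with a missing reference and a badly formatted number.
good = {"property_ref": "1001", "occupier": "Acme Ltd", "rateable_value": "12500"}
bad = {"property_ref": "", "occupier": "Bolt & Co", "rateable_value": "8,000"}

good_errors = validate_row(good, schema)
bad_errors = validate_row(bad, schema)
```

Because the schema lives in a JSON file rather than in code, it can be shared, reviewed and versioned alongside the wrangling steps.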
Here’s a brief video overview showing what that looks like in action: https://www.youtube.com/watch?v=HQw8IBLUnL4
I would like to extract this from my application and create a stand-alone SaaS drag 'n drop data wrangling tool useful for data owners and managers, journalists, and researchers looking to rapidly, and continuously, normalise and validate any messy spreadsheets using a simple series of steps, and without any coding knowledge.
Is this something that might be of interest to the SIKM community? What do you currently use, or recommend, for simple, easy-to-use data wrangling? If you are interested in my system, would anyone like to discuss a potential collaboration?
Gavin Chait is a data engineer and development economist at Whythawk.