Re: Data-wrangling as a service #metadata #data-science

Stephen Bounds

Hi Gavin,

Thanks for putting your solution out there. I'd be curious to hear how the features of your product compare to something like OpenRefine.

I helped a client with something a bit similar a couple of years back. We were aggregating and cleaning metadata from a range of highly diverse article catalogues as part of a semantic auto-classifier engine. It was only a proof of concept, so it was much less user-friendly: the wrangling was done through custom JSON configuration files and XPath selectors. On my project there was always the intent to hand the ongoing repository management responsibilities over to a local staffer, but there just wasn't the expertise or commitment within the organisation to do so.
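For anyone curious what that config-driven style looks like, here is a minimal sketch, assuming a JSON mapping of target schema fields to XPath selectors applied to each source catalogue record. The field names and paths are invented for illustration, not taken from the actual project:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical config: target field -> XPath selector.
# ElementTree supports only a limited XPath subset, which
# was enough for simple catalogue records.
CONFIG = json.loads("""
{
  "title":  ".//title",
  "author": ".//creator",
  "year":   ".//published/year"
}
""")

def extract(record_xml: str, config: dict) -> dict:
    """Apply each XPath selector to one source record."""
    root = ET.fromstring(record_xml)
    out = {}
    for field, xpath in config.items():
        node = root.find(xpath)
        out[field] = node.text.strip() if node is not None and node.text else None
    return out

record = """
<record>
  <title>On Data Wrangling</title>
  <creator>A. Author</creator>
  <published><year>2017</year></published>
</record>
"""
print(extract(record, CONFIG))
# {'title': 'On Data Wrangling', 'author': 'A. Author', 'year': '2017'}
```

The appeal of this approach is that adding a new catalogue is just another JSON file, but as noted above, it still assumes someone comfortable writing selectors.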

To be honest, my feeling is that the vast majority of people would lack the data analysis experience to effectively configure any wrangling tool, with a "no coding" interface or otherwise.

The tool does seem like a nice productivity booster for a data analyst/wrangler though, and you could hand off the data upload and validation process to an administrative person once the mappings were in place. To commercialise what you've got, I'd pitch it as more of a consultancy/maintenance engagement with SaaS provisioning of instances, rather than being a pure SaaS solution.

Happy to take the discussion offline if you'd like to talk further.


PS. Curious to hear about your past Australian project since that's where I'm based!

Stephen Bounds
Executive, Information Management
E: stephen.bounds@...
M: 0401 829 096
On 31/10/2019 4:26 am, Gavin Chait wrote:

Hi all, got a question and something to show …


One of the great challenges I face when implementing open data programs is the work involved in preparing data for public release. Often this becomes a blocker when persuading data owners to commit to their own project, and results in limited data release.


Every quarter since 2016, I have collated about 300 different datasets from local authorities across the UK as part of my open data project, a service mapping business history in every commercial property across the UK. This is not scraping. It is tedious and repetitive data wrangling: converting multiple files, in multiple formats, into a single schema to permit analysis, comparison and further enrichment.


I have built a collection of tools that offers a simple collaborative drag 'n drop interface to support our data wranglers, and creates a JSON file that permits the schema to be validated against the standard, supports collaboration on wrangling workflows, and allows the methodology to be repeated and tracked. My objective is wrangling simplicity, complete audit transparency, and speed.
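To make the repeatability idea concrete, here is a minimal sketch of what such a declarative "method" file might look like. This is my own illustration under assumed field names and a single invented `rename` action, not Gavin's actual format: an ordered list of wrangling steps that can be replayed against fresh data each quarter and diffed or audited like any text file:

```python
import json

# Hypothetical "method" file: target schema plus ordered wrangling steps.
METHOD = json.loads("""
{
  "schema": ["name", "postcode", "rateable_value"],
  "steps": [
    {"action": "rename", "from": "Business Name", "to": "name"},
    {"action": "rename", "from": "PC",            "to": "postcode"},
    {"action": "rename", "from": "RV",            "to": "rateable_value"}
  ]
}
""")

def apply_method(rows, method):
    """Replay the recorded steps over raw rows, then validate the schema."""
    for step in method["steps"]:
        if step["action"] == "rename":
            for row in rows:
                if step["from"] in row:
                    row[step["to"]] = row.pop(step["from"])
    for row in rows:
        missing = [f for f in method["schema"] if f not in row]
        if missing:
            raise ValueError(f"schema validation failed, missing: {missing}")
    return rows

raw = [{"Business Name": "Acme Ltd", "PC": "AB1 2CD", "RV": "12500"}]
print(apply_method(raw, METHOD))
```

Because the steps live in a plain JSON document rather than in someone's head, the same transformation can be rerun on next quarter's files and the method itself version-controlled for audit.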


Here’s a brief video overview showing what that looks like in action:


I would like to extract this from my application and create a stand-alone SaaS drag 'n drop data wrangling tool, useful for data owners and managers, journalists, and researchers who want to rapidly and continuously normalise and validate messy spreadsheets through a simple series of steps, without any coding knowledge.


Is this something that might be of interest to the SIKM community? What do you currently use or recommend for simple, easy-to-use data wrangling? And if my system interests you, is there anyone who might want to explore a potential collaboration?


Gavin Chait is a data engineer and development economist at Whythawk.
