Re: Data-wrangling as a service #metadata #data-science


Hi Stephen,


I’ll pick this up offline as well, but you raise the key challenge of semantic classification for any form of knowledge / data management. Below is my “teacher’s hat” response ;)


OpenRefine (and the proprietary equivalent, Trifacta) are “opinionated”. They have their own programming language, and they take you from messy data to final product. Bringing in a particular dataset can demand a huge investment of time and expertise which, depending on your needs, may be essential. Or completely over-the-top.


From a data curation perspective, they go too far, since what we want is easy-to-find, easy-to-use data, but we don’t define the use-case. It’s a bit like the functionality in Excel: most of it is unnecessary.


Until about two years ago, I regularly taught OpenRefine as part of my data manager courses. Each is a week-long, full-time course teaching data owners how to prepare and release data into a knowledge management platform (syllabus here:


Text-based documents can – at a pinch – be analysed by a natural-language processor and some sort of sense extracted to permit search. Spreadsheets cannot, and so it is critical to layer on the importance of metadata and data release management in the course.


Towards the end of this rather gruelling week, I’d introduce OpenRefine. After about an hour of demonstrating what’s involved in cleaning up even a modest spreadsheet, I’d watch the will-to-live evaporate from the majority of the people in the room, and much of the momentum and goodwill would vanish. I stopped teaching it in favour of simpler validation techniques, but they’re still clumsy.


Introducing any KM program from scratch usually involves importing thousands, sometimes hundreds of thousands, of documents into the system. Data owners want something that takes, at most, a few minutes per document. Anything that takes a long time, especially for what could be considered “dead data”, is going to lead to huge resistance and potentially kill the project.


OpenRefine is great, taking you from messy data all the way through to the finished product ready to go into a database, but that is way more than is required from KM. We just want people to be able to find data and use it.


My objective is to get the data into some sort of readable format structured around a defined set of fields. At that point it’s fairly easy for any competent analyst to automatically restructure and revalidate the data.
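
To make “structured around a defined set of fields” concrete, here is a minimal Python sketch of the kind of revalidation an analyst could automate once the data is in that shape. The field names and the `validate_row` helper are purely hypothetical, invented for illustration; this is not how my tool works internally.

```python
from datetime import date

# Hypothetical target schema (an assumption for illustration):
# field name -> parser that raises ValueError on bad input.
SCHEMA = {
    "property_id": str,
    "occupied_from": date.fromisoformat,
    "rateable_value": float,
}


def validate_row(row):
    """Return (clean_row, errors) for one dict-shaped spreadsheet row."""
    clean, errors = {}, []
    for field, parse in SCHEMA.items():
        raw = row.get(field)
        # Treat absent or blank cells as missing rather than guessing a value.
        if raw is None or str(raw).strip() == "":
            errors.append(f"{field}: missing")
            continue
        try:
            clean[field] = parse(str(raw).strip())
        except ValueError:
            errors.append(f"{field}: cannot parse {raw!r}")
    return clean, errors


rows = [
    {"property_id": "A1", "occupied_from": "2019-04-01", "rateable_value": "12500"},
    {"property_id": "A2", "occupied_from": "01/04/2019", "rateable_value": ""},
]
results = [validate_row(r) for r in rows]
for clean, errors in results:
    print(clean, errors)
```

The point is only that, once the fields are agreed, checks like these are cheap to write and rerun.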


So, yeah, you’re right, my approach is intended as a productivity enhancer, not an end-to-end solution to data use.





Gavin Chait is a data engineer and development economist at Whythawk.


From: <> On Behalf Of Stephen Bounds
Sent: 03 November 2019 12:58
Subject: Re: [SIKM] Data-wrangling as a service


Hi Gavin,

Thanks for putting your solution out there. I'd be curious to hear how you compare the features of your product to something like OpenRefine?

I helped a client with something a bit similar a couple of years back. We were aggregating and cleaning metadata from a range of highly diverse article catalogues as part of a semantic auto-classifier engine. This was only a proof of concept, so it was much less user-friendly: the wrangling was done through custom JSON configuration files and XPath selectors. With my project, there was always the intent to hand over the ongoing repository management responsibilities to a local staffer, but there just wasn't the expertise or commitment within the organisation to do so.

To be honest, my feeling is that the vast majority of people would lack the data analysis experience to effectively configure any wrangling tool, with a "no coding" interface or otherwise.

The tool does seem like a nice productivity booster for a data analyst/wrangler though, and you could hand off the data upload and validation process to an administrative person once the mappings were in place. To commercialise what you've got, I'd pitch it as more of a consultancy/maintenance engagement with SaaS provisioning of instances, rather than being a pure SaaS solution.

Happy to take the discussion offline if you'd like to talk further.


PS. Curious to hear about your past Australian project since that's where I'm based!

Stephen Bounds
Executive, Information Management
E: stephen.bounds@...
M: 0401 829 096

On 31/10/2019 4:26 am, Gavin Chait wrote:

Hi all, got a question and something to show …


One of the great challenges I face when implementing open data programs is the work involved in preparing data for public release. Often this becomes a blocker when persuading data owners to commit to their own project, and results in limited data release.


Every quarter, ongoing since 2016, I have collated about 300 different datasets from local authorities across the UK as part of my open data project. This is a service mapping business history in every commercial property across the UK. This is not scraping. It is tedious and repetitive data wrangling, converting multiple files, in multiple formats, into a single schema to permit analysis, comparisons and further enrichment.


I have built a collection of tools that offers a simple, collaborative drag 'n drop interface to support our data wranglers. It creates a JSON file that permits the schema to be validated according to the standard, supports collaboration on wrangling workflows, and allows the methodology to be repeated and tracked. My objective is wrangling simplicity, complete audit transparency, and speed.
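
As a rough illustration of the idea of a repeatable method stored as JSON, here is a minimal Python sketch. The file layout, the `rename` action, the field names and the `apply_method` helper are all assumptions made up for this example, not the actual format my tools produce.

```python
import json

# Hypothetical method file: the layout and action names are assumptions.
METHOD_JSON = """
{
  "schema_fields": ["la_code", "address", "rateable_value"],
  "steps": [
    {"action": "rename", "source": "Authority", "target": "la_code"},
    {"action": "rename", "source": "Property Address", "target": "address"},
    {"action": "rename", "source": "RV", "target": "rateable_value"}
  ]
}
"""


def apply_method(rows, method):
    """Replay the recorded steps on each row, then keep only the schema fields."""
    out = []
    for row in rows:
        row = dict(row)  # copy, so the source data is left untouched
        for step in method["steps"]:
            if step["action"] == "rename":
                row[step["target"]] = row.pop(step["source"], None)
            else:
                raise ValueError(f"unknown action: {step['action']}")
        out.append({field: row.get(field) for field in method["schema_fields"]})
    return out


method = json.loads(METHOD_JSON)
messy = [
    {"Authority": "E06000001", "Property Address": "1 High St", "RV": "12500", "Notes": "n/a"}
]
tidy = apply_method(messy, method)
print(tidy)
```

Because the steps live in a file rather than in someone's head, the same method can be rerun on next quarter's file and reviewed by a colleague.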


Here’s a brief video overview showing what that looks like in action:


I would like to extract this from my application and create a stand-alone SaaS drag 'n drop data wrangling tool useful for data owners and managers, journalists, and researchers looking to rapidly, and continuously, normalise and validate any messy spreadsheets using a simple series of steps, and without any coding knowledge.


Is this something that might be of interest to the SIKM community? What do you currently use, or recommend for simple, easy-to-use data wrangling? If you are interested in my system, is there anyone who may wish to see about potential collaboration?








Gavin Chait is a data engineer and development economist at Whythawk.
