Re: Data-wrangling as a service #metadata #data-science
I’ll pick this up offline as well, but you raise the key challenge of semantic classification for any form of knowledge / data management. Below is my “teacher’s hat” response ;)
OpenRefine (and its proprietary equivalent, Trifacta) are “opinionated” tools. They have their own expression language (GREL, in OpenRefine’s case), and they take you from messy data all the way to a finished product. Bringing in a particular dataset can demand a huge investment of time and expertise which, depending on your needs, is either crucial or completely over-the-top.
From a data curation perspective, they go too far: what we want is easy-to-find, easy-to-use data, but we don’t define the use-case. It’s a bit like the functionality in Excel: most of it is unnecessary.
Until about two years ago, I regularly taught OpenRefine as part of my data manager courses. This is a week-long, full-time course teaching data owners how to prepare and release data into a knowledge management platform (syllabus here: https://docs.google.com/document/d/1g_aJrN91xHXGFi6wUaeQNQSRf5AndBqRexIgmF8mjUk/edit).
Text-based documents can – at a pinch – be analysed by a natural-language processor and some sort of sense extracted to permit search. Spreadsheets cannot, so the course has to emphasise metadata and data release management all the more.
Towards the end of this rather gruelling week, I’d introduce OpenRefine. After about an hour of demonstrating what’s involved in cleaning up even a modest spreadsheet, I’d watch the will-to-live evaporate from the majority of the people in the room, and much of the momentum and goodwill would vanish. I stopped teaching it in favour of simpler validation techniques, though those are still clumsy.
Introducing any KM program from scratch usually involves importing thousands, sometimes hundreds of thousands, of documents into the system. Data owners want something that takes, at most, a few minutes per document. Anything that takes a long time, especially for what could be considered “dead data”, is going to lead to huge resistance and potentially kill the project.
OpenRefine is great, taking you from messy data all the way through to a finished product ready to go into a database, but that is far more than KM requires. We just want people to be able to find data and use it.
My objective is to get the data into some sort of readable format structured around a defined set of fields. At that point it’s fairly easy for any competent analyst to automatically restructure and revalidate the data.
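A minimal sketch of what that looks like in practice, assuming hypothetical field names and header mappings (none of these appear in the original discussion): messy source headers are mapped onto a defined set of target fields, leaving gaps blank for later validation.

```python
import csv  # typical source for this kind of spreadsheet export

# Hypothetical target schema: the defined set of fields every release must carry.
TARGET_FIELDS = ["title", "owner", "date_released", "description"]

# Hypothetical mapping from messy source headers to target fields.
HEADER_MAP = {
    "Doc Title": "title",
    "Responsible Officer": "owner",
    "Released": "date_released",
    "Notes": "description",
}

def restructure(rows):
    """Map each source row onto the target fields, leaving missing values blank."""
    for row in rows:
        mapped = {HEADER_MAP[k]: v for k, v in row.items() if k in HEADER_MAP}
        yield {field: mapped.get(field, "") for field in TARGET_FIELDS}

rows = [{"Doc Title": "Budget 2019", "Responsible Officer": "J. Smith",
         "Released": "2019-06-30"}]
print(list(restructure(rows)))
```

Once every record conforms to the same field set, downstream restructuring and revalidation become routine scripting tasks rather than bespoke clean-ups.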
So, yeah, you’re right, my approach is intended as a productivity enhancer, not an end-to-end solution to data use.
Gavin Chait is a data engineer and development economist at Whythawk.
uk.linkedin.com/in/gavinchait | twitter.com/GavinChait | gavinchait.com
From: SIKM@groups.io <SIKM@groups.io> On Behalf Of Stephen Bounds
Sent: 03 November 2019 12:58
Subject: Re: [SIKM] Data-wrangling as a service
Thanks for putting your solution out there. I'd be curious to hear how you compare the features of your product to something like OpenRefine?
I helped a client with something a bit similar a couple of years back. We were aggregating and cleaning metadata from a range of highly diverse article catalogues as part of a semantic auto-classifier engine. This was only a proof of concept, so it was much less user-friendly: the wrangling was done through custom JSON configuration files and XPath selectors. With my project, there was always the intent to hand over the ongoing repository management responsibilities to a local staffer, but there just wasn't the expertise or commitment within the organisation to do so.
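A rough illustration of the config-driven approach described above (not the actual implementation; the field names, selectors, and sample record are all invented for the example): a JSON configuration maps each metadata field to an XPath-style selector, which is then applied to every catalogue record.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical config file contents: one selector per metadata field.
CONFIG = json.loads("""
{
    "title": ".//title",
    "author": ".//author/name"
}
""")

def extract_metadata(xml_text, config):
    """Apply each configured selector and collect the first matching text."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, xpath in config.items():
        node = root.find(xpath)
        record[field] = node.text if node is not None else None
    return record

sample = ("<article><title>Messy data</title>"
          "<author><name>G. Chait</name></author></article>")
print(extract_metadata(sample, CONFIG))
```

The appeal of this design is that adding a new catalogue means writing a new JSON config rather than new code; the drawback, as noted, is that writing those selectors still demands real data analysis expertise.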
To be honest, my feeling is that the vast majority of people would lack the data analysis experience to effectively configure any wrangling tool, with a "no coding" interface or otherwise.
The tool does seem like a nice productivity booster for a data analyst/wrangler though, and you could hand off the data upload and validation process to an administrative person once the mappings were in place. To commercialise what you've got, I'd pitch it as more of a consultancy/maintenance engagement with SaaS provisioning of instances, rather than being a pure SaaS solution.
Happy to take the discussion offline if you'd like to talk further.
PS. Curious to hear about your past Australian project since that's where I'm based!
Executive, Information Management
M: 0401 829 096
On 31/10/2019 4:26 am, Gavin Chait wrote: