Principles of data probity in open research sharing - UK commercial location data #data-science


 

A few months ago, I shared our approach to ensuring data probity in data science during one of the SIKM Leaders Community monthly calls. A few days ago, I released a new open research data explorer that puts these principles into practice: https://openlocal.uk. It took nine months of development and about 54,000 lines of code.

 

openLocal is a quarterly-updated commercial location database. It aggregates open data on vacancies, rental valuations, rates and ratepayers into an integrated time-series database of individual retail, industrial, office and leisure business units. For the last two years, government has used these data as reference data as part of its £4.8 billion economic recovery fund, Levelling Up. This redeveloped data explorer was funded by the London Mayor’s Resilience Fund to support London’s post-COVID economic recovery. Except for source data downloads, the service is free.

 

The specific data may not be of interest to you, but our way of organising and ensuring trust and confidence in these data may be.

 

 

Trust in these data is critical for such fundamental use. These are our ethical principles:

 

  1. Identifiable sources: Our publisher source history, along with links to the publishers’ source data, is listed publicly.
  2. Transparent methods: Our data and software are openly licensed. Anyone who wants to review our methods, source code, or research processes need only ask.
  3. Publication before analysis: People are fantastic pattern-makers, even when no patterns exist. Our role is to curate our source data impartially and without bias, implicit or otherwise. We continually review our data and systems to ensure we do not inadvertently introduce artifacts which could distort analysis.
  4. Point data before aggregation: While our online data explorer presents aggregations, all reports are derived from point data and not from summaries.
  5. Repeatable, auditable trail: The openLocal app, including the publisher history and data explorer, exists to ensure a public view of our work, helping others to scrutinise us.
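
For readers who think in code, principles 1 and 4 can be illustrated with a minimal Python sketch. This is not openLocal's actual code: the record fields, the function name `vacancy_report`, the area names and the example.org URLs are all hypothetical, and the records are made up. The idea it shows is that an aggregate is always computed from individual unit records, and that each aggregate carries the list of sources it was built from.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UnitRecord:
    """One observation of one business unit (hypothetical schema)."""
    unit_id: str
    area: str
    period: date
    vacant: bool
    source_url: str  # where the publisher released this observation


def vacancy_report(records, period):
    """Aggregate vacancy per area for one period, directly from point data.

    Each result row keeps the set of source URLs behind it, so the
    aggregate can be traced back to what was published.
    """
    totals = defaultdict(lambda: {"units": 0, "vacant": 0, "sources": set()})
    for rec in records:
        if rec.period != period:
            continue  # only records observed in the requested period
        row = totals[rec.area]
        row["units"] += 1
        row["vacant"] += int(rec.vacant)
        row["sources"].add(rec.source_url)
    return {
        area: {
            "vacancy_rate": row["vacant"] / row["units"],
            "units": row["units"],
            "sources": sorted(row["sources"]),
        }
        for area, row in totals.items()
    }


# Made-up records, for illustration only.
Q1 = date(2023, 3, 31)
records = [
    UnitRecord("u1", "Area A", Q1, True, "https://example.org/council-a/q1.csv"),
    UnitRecord("u2", "Area A", Q1, False, "https://example.org/council-a/q1.csv"),
    UnitRecord("u3", "Area B", Q1, False, "https://example.org/council-b/q1.csv"),
    UnitRecord("u1", "Area A", date(2022, 12, 31), False,
               "https://example.org/council-a/q4.csv"),
]

report = vacancy_report(records, Q1)
for area, row in sorted(report.items()):
    print(area, row["vacancy_rate"], row["units"], row["sources"])
```

Because the report is rebuilt from the unit records each time, anyone with the same point data can repeat the calculation and check the result, which is the spirit of principle 5.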

 

Each report has been designed to support our users’ workflows, which usually means copying and pasting charts into PowerPoint presentations. You can screenshot an entire report and know that all relevant information will fit on a slide, like this:

 

 

On the top right of each report is a link to the historical source data, drawn directly from the database, as well as a visual indicator of the quality of the data informing the report, with links to our sources’ reference data.

 

 

The objective is to ensure that those using the reports don’t fall into the usual overconfidence of assuming that nicely presented charts and data are true without caveats.

 

Maps are treated as area charts, permitting cluster analysis of data points. Again, this is to get away from the idea that this is somehow a Google Map. It is not for random exploration, but for presenting point data in analysis.

 

 

I realise the actual subject is probably not relevant to you, but that’s what makes your opinion even better. For a subject where you have little context, how much do all these features and design approaches help you navigate and trust the data?

 

Have a look, and please let me know your thoughts.

 

Thanks and regards

 

Gavin

 

 

>--------------------<

Gavin Chait is a data scientist and development economist at Whythawk.

uk.linkedin.com/in/gavinchait | twitter.com/GavinChait | gavinchait.com

 
