docs/index.md

title: "Companies House Data Quality"
author: "Rob Eva and Lauren Calow"
date: "2022-10-17"
site: bookdown::bookdown_site
output: bookdown::bs4_book
documentclass: book
bibliography: [book.bib, packages.bib]
url: "https://companieshouse.github.io/DARr/"
cover-image: "../images/star2.png"
description: |
  This is an introduction to Companies House data quality address matching.
  The HTML output format for this example is bookdown::bs4_book, set in the _output.yml file.
  The book is aimed at data analysts and data engineers in the Data, Analytics and Research department.
biblio-style: apalike
csl: chicago-fullnote-bibliography.csl

Introduction

Companies House data quality isn't great, largely because legal restrictions require us to accept data 'as is', with little capacity to correct or amend it. That said, many of our data quality issues are self-inflicted, and address quality is one of our own goals. This document attempts to measure the extent of the issue and to propose some possible remedies.

What our Data Strategy says

In our Data Strategy we say ^[https://www.gov.uk/government/publications/companies-house-strategy-2020-to-2025/companies-house-strategy-2020-to-2025]:

"A clear strategic CH goal is to create, maintain and publish a companies’ register built upon relevant and accurate information that supports the UK’s global reputation as a trusted place to do business and a leading exponent of greater corporate transparency. Our data needs to inspire trust and confidence so that we can maximise its value and tackle economic crime through analysis and intelligence."

The address problem

To build upon relevant and accurate information, we need relevant and accurate reference data. One of our reference data sources is the Postcode Address File (PAF), which we store in a MongoDB database. For various reasons this database isn't updated regularly (as of 14 Oct 2022 it had not been updated for over a year), so our copy falls further behind the live PAF with every month that passes. According to the Royal Mail website, almost 20,000 delivery points are added to the PAF every month, as seen in Figure \@ref(fig:apistats) below.

(\#fig:apistats)Royal Mail September 2022 PAF statistics
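
As a rough back-of-envelope estimate, the two figures quoted above can be combined to gauge how far our copy of the PAF may have drifted. This assumes a steady rate of 20,000 new delivery points per month and a gap of exactly twelve months. Both are simplifications: the quoted rate is "almost" 20,000, the true gap is over a year, and the estimate ignores delivery points that were deleted or amended.

$$
\text{missing delivery points} \approx 20{,}000 \ \text{per month} \times 12 \ \text{months} = 240{,}000
$$

On those assumptions, our copy could be missing something in the order of a quarter of a million delivery points. This is an indicative figure, not a measured one.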

Royal Mail Public Sector Licences

According to the Royal Mail, Companies House has held a Public Sector Licence since 2014 (see Figure \@ref(fig:publicsector) below). More information about Public Sector Licences can be found on the Royal Mail website ^[https://www.poweredbypaf.com/licence-our-products/licence-agreement-for-the-public-sector/about-public-sector-licences/].

(\#fig:publicsector)Royal Mail Public Sector Licence Search

How we assess and score addresses

Our first attempt at scoring address quality was made by Lauren Calow in 2022. The following two chapters record that work; they not only provide an excellent piece of analysis but also set the standard for documenting data quality work within Companies House.



companieshouse/DARr documentation built on Oct. 22, 2022, 8:26 p.m.