As part of Companies House Data Strategy, Data Governance and Data Quality is being reviewed. From this, one of the issues identified with Companies House data, is that around 1% of companies do not supply an accurate postcode, or do not provide a postcode at all.
The current legislation is that Companies have to provide a physical address, however it is not a legal requirement to provide a postcode.
Companies House also states the below regarding the information available for public use: “We carry out basic checks on documents received to make sure that they have been fully completed and signed, but we do not have the statutory power or capability to verify the accuracy of the information that companies send to us.”
This means some data can be incorrectly input and isn’t verified, leading to a lower data quality, particularly when some fields on the registering form are not mandatory. In order to gain further insight on the postcode inaccuracies in Companies House data, a scoring matrix for the quality of address and postcode data given by a company. Companies address details were compared to the Postcode Address File (PAF) and scores given when the address details met a certain criteria.
Table: (#tab:unnamed-chunk-1)Postcode Scoring Matrix | Version 1
| Score|Definition | |-----:|:----------------------------------------------------------| | 0|No postcode given | | 1|Postcode given but doesn’t exist in PAF | | 2|Partial postcode given | | 3|Postcode matches PAF but given in the incorrect field | | 4|Full correct postcode given, address doesn't match PAF | | 5|Full correct postcode given, address partially matches PAF | | 6|Address and postcode details match PAF |
The data from Companies House was from the Companies House Free Company Data Product containing basic details of live companies on the register.
This is saved on the Companies House website. The file used for this data set was named ‘BasicCompanyDataAsOneFile-2022-03-01.zip (415Mb)’ and is publicly available at:
http://download.companieshouse.gov.uk/en_output.html
The PAF is a database containing all known postcodes in the UK, containing the Royal Mail postal addressed. This was downloaded from MongoDB.
The data used in this analysis containing the Glasgow addresses is saved in the below Companies House sharepoint page (access may be restricted to DAR).
The Software/programs used for analysis were Microsoft Excel and Microsoft Power BI.
This data was the current position of Companies House and the PAF on 11/03/2022 therefore any subsequent updates in the Companies House data or Postcode Address File will not be reflected in this analysis.
The filters used within the data were:
Not all address data was complete so the filters were applied to fit one or more of the above criteria.
Import the data files onto Power BI (desktop).
Open the PAF and highlight the ‘postcode.stripped’ column. Replace the spaces with nothing to remove any whitespace.
Open the CH data file and highlight the ‘RegAddress.PostCode’ column. Replace the spaces with nothing to remove any whitespace.
Open the Companies House Data in the Query editor
Add a conditional column and insert the below conditions to populate the condition with the phrase “GLASGOW”, else “Other”.
Filter any remaining rows out (in this data, there were a few locations on ‘Glasgow Road’ with a different postcode that were not filtered out using the above conditions and some postcodes beginning with be such as B80 which are not in the Glasgow PAF).
See below section for partial matching methods tested. The method for Score 5 is yet to be completed but Python coding should be attempted next in order to score the data. This can be integrated into Power BI or completed before importing the data to Power BI.
Partial matching methods tested
Test 1 - Fuzzy merges on Power BI
The below snip of the merged table shows an example of an address in the left column that was matched to the addresses using the fuzzy matching in Power BI. Some of the building numbers in those that were matched were incorrect with only one character difference. If the similarity threshold was increased, it would mean the partial match would not pick up on any mis-typed addresses.
(\#fig:unnamed-chunk-2)Merged Table Example
Test 2 – Fuzzy merges on Power BI
A second test on the partial matching addresses was done with the numbers removed from the PAF merged addresses column.
The same method was used, but by merging the columns in the PAF, not including the building number. The similarity threshold and join type were also kept the same (0.7, Left Outer Join). The test was completed on addresses in the Glasgow file RegAddress.AddressLine1 and RegAddress.AddressLine2.
The tables below show the matches from the Glasgow file and PAF with only a few characters difference that were incorrect but this method was showing as a match. As with Test 1, increasing the threshold would reduce the likelihood of matching those with a mistyped address that is still partially correct.
(\#fig:unnamed-chunk-3)Merged Table Example - RegAddress.AddressLine1
(\#fig:unnamed-chunk-4)Merged Table Example- RegAddress.AddressLine2
Test 3 – Using a range of similarity thresholds on Power BI
After the above method had been tried, using ranges as below for matching the addresses was considered:
0.0 – 0.3 – No match 0.4 – 0.6 – Partial match 0.7 – 1 – Full match
Power BI does not allow a range of numbers to be used within the similarity threshold so this method was not used, but documented for awareness.
Test 4 – List.Containsany function on Power BI
Test 5 – Python coding
For integrating Python into Power BI:
Proposed method for using Python for partial matching is outlined in the link below.
thefuzz/README.rst at master · seatgeek/thefuzz · GitHub
The columns to work with are Address line 1 & 2 in the CH data and the merged column in PAF.
Comments & Update - 31/03/2022
(\#fig:unnamed-chunk-5)Fuzzy Matching for Score 5
The Companies House Address Line 1 column was then split by number, building name and street name and repeated on address line 2.
Comments & Update – April 2022
These were scored while scoring postcodes with a ‘4’. Any that did not score 4, were given the score of 5. These were The same conditions in the address line details, but with the PAF Match column filtered to those that were ‘True’ and matched the postcode field in Companies House Data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.