Glasgow Address Matching

Introduction and Purpose

As part of Companies House Data Strategy, Data Governance and Data Quality is being reviewed. From this, one of the issues identified with Companies House data, is that around 1% of companies do not supply an accurate postcode, or do not provide a postcode at all.

The current legislation is that Companies have to provide a physical address, however it is not a legal requirement to provide a postcode.

Companies House also states the below regarding the information available for public use: “We carry out basic checks on documents received to make sure that they have been fully completed and signed, but we do not have the statutory power or capability to verify the accuracy of the information that companies send to us.”

This means some data can be incorrectly input and isn’t verified, leading to a lower data quality, particularly when some fields on the registering form are not mandatory. In order to gain further insight on the postcode inaccuracies in Companies House data, a scoring matrix for the quality of address and postcode data given by a company. Companies address details were compared to the Postcode Address File (PAF) and scores given when the address details met a certain criteria.

Scores used

Score <- c(0, 1, 2, 3, 4, 5, 6)
Definition <- c("No postcode given", 
         "Postcode given but doesn’t exist in PAF", 
         "Partial postcode given", 
         "Postcode matches PAF but given in the incorrect field", 
         "Full correct postcode given, address doesn't match PAF", 
         "Full correct postcode given, address partially matches PAF", 
         "Address and postcode details match PAF")

df <- data.frame(Score, Definition)
knitr::kable(
  df, booktabs = TRUE,
  caption = 'Postcode Scoring Matrix | Version 1'
)

Data

Companies House Data

The data from Companies House was from the Companies House Free Company Data Product containing basic details of live companies on the register.

This is saved on the Companies House website. The file used for this data set was named ‘BasicCompanyDataAsOneFile-2022-03-01.zip (415Mb)’ and is publicly available at:

http://download.companieshouse.gov.uk/en_output.html

Postcode Address File

The PAF is a database containing all known postcodes in the UK, containing the Royal Mail postal addressed. This was downloaded from MongoDB.

The data used in this analysis containing the Glasgow addresses is saved in the below Companies House sharepoint page (access may be restricted to DAR).

https://companieshouse.sharepoint.com/sites/team-aci/Shared%20Documents/Forms/AllItems.aspx?id=%2Fsites%2Fteam%2Daci%2FShared%20Documents%2FData%20Quality%2FPostcode%20Files%2FGLASGOW&p=true

Software

The Software/programs used for analysis were Microsoft Excel and Microsoft Power BI.

Caveats and Limitations

This data was the current position of Companies House and the PAF on 11/03/2022 therefore any subsequent updates in the Companies House data or Postcode Address File will not be reflected in this analysis.

The filters used within the data were:

Not all address data was complete so the filters were applied to fit one or more of the above criteria.

Method

Import the data files onto Power BI (desktop).

Cleaning the data

  1. Open the PAF and highlight the ‘postcode.stripped’ column. Replace the spaces with nothing to remove any whitespace.

  2. Open the CH data file and highlight the ‘RegAddress.PostCode’ column. Replace the spaces with nothing to remove any whitespace.

Separating the Glasgow data

  1. Open the Companies House Data in the Query editor

  2. Add a conditional column and insert the below conditions to populate the condition with the phrase “GLASGOW”, else “Other”.

    • ‘RegAddress.PostCode’ beings with ‘G1’,’G2’,’G3’,’G4’,’G5’,’G6’,’G7’ OR
    • RegAddress.AddressLine1, RegAddress. AddressLine2, RegAddress.PostTown, RegAddress.County contains ‘Glasgow’.
  3. Filter any remaining rows out (in this data, there were a few locations on ‘Glasgow Road’ with a different postcode that were not filtered out using the above conditions and some postcodes beginning with be such as B80 which are not in the Glasgow PAF).

  4. Use a filter to only select the values in this column that say ‘Glasgow’.
  5. Close and apply the query.

Scoring the data

Score 0 - No Postcode Given

  1. Add a new column on the CH File. This can be done in the query editor, or ‘Data’ view. For this report, the column was added in the ‘Data’ view.
  2. Input the formula No Postcode = if('BasicCompanyDataAsOneFile-2022-03-01'[RegAddress.PostCode]="",0,1)
  3. into the new column. It will return a 0 for the postcodes in the postcode field that are blank, and a 1 for the rows in the postcode field which contain a value.
  4. Go to the reports tab and into the ‘Score 0’ sheet.
  5. Click onto a visualisation to insert. A pie chart was used for the Glasgow file as there were only two distinct values. The legend and details are the column that contains the “0” or “1”. Values are the count of this field. A table summarizing these values was also added to the sheet.

Score 1 - Postcode given but doesn’t exist in PAF

  1. Add a new column on the CH File. This can be done in the query editor, or ‘Data’ view. For this report, the column was added in the ‘Data’ view.
  2. Input the formula PAF Match = CONTAINS('G-2022-03','G-2022-03'[postcode.stripped],'BasicCompanyDataAsOneFile-2022-03-01'[RegAddress.PostCode]) into the new column. It will return a True for the postcodes in the postcode field that match the PAF and a False for the postcodes that do not match the PAF. The column was labelled ‘PAF Match’.
  3. Insert another new column and input the formula Length = LEN('BasicCompanyDataAsOneFile-2022-03-01'[RegAddress.PostCode])
  4. This will count the number of characters in the postcode supplied in the CH file. The column was labelled ‘Postcode Length’.
  5. Go to the reports tab and into the ‘Score 1’ sheet.
  6. Click onto a visualisation to insert. A pie chart was used for the Glasgow file as there were only two distinct values. The legend and details are the column that contains the “True” or “False”. Values are the count of this field.
  7. Go to the filter pane and add a filter for the column ‘Postcode Length’. Tick ‘5’ and ‘6’. This will only bring back the score for the full postcodes given. The blank postcodes have already been scored and the incomplete postcodes are scored separately in the next tab.

Score 2 - Partial postcode given

  1. Go to the reports tab and into the ‘Score 2’ sheet.
  2. Click onto a visualisation to insert. A pie chart was used for the Glasgow file. The legend and details field is the Postcode Length Column and Values are the Count of the postcode length field.
  3. Go to the filter pane and add a filter for the column ‘Postcode Length’. Untick ‘0’, ‘5’ and ‘6’. This will bring back the score for the postcodes that were incomplete and were not blank. Ful and blank postcodes have been scored elsewhere.

Score 3 - Postcode matches PAF but given in the incorrect field

  1. Go to the Query Editor (click transform data) and load the Companies House data file.
  2. Insert a conditional column.
  3. Insert a condition that RegAddress.AddressLine1 contains ‘G1’,’G2’,’G3’,’G4’,’G5’,’G6’,’G7’
  4. This formula will bring back the 0 if there isn’t a postcode in the address line 1 field, or ‘Postcode in address line 1’ if the postcode is in this column.
  5. Repeat steps 2-4 for RegAddress.AddressLine2 and RegAddress.PostTown
  6. Filter the ‘PAF Match’ column to ’False’ – the ones which show ‘True’ are scored otherwise.
  7. Check the remaining rows for any data that doesn’t match the conditions.
  8. Close and apply the query.
  9. Go to the reports tab and into the ‘Score 3’ sheet.
  10. Insert a visualisation – a clustered bar chart was used for this data.
  11. The Axis values were the PAF Match column and values were a count of the column with the postcodes contained in address line 1.
  12. Go to the filter pane and add a filter for the column containing the count of the postcodes in line 1, and untick the ‘0’ option. The visual will now only be looking at postcodes contained in address line 1.
  13. Repeat steps 11&12 for the postcodes in address line 2 and post town.
  14. The bar charts now show how many postcodes were in the address lines.
  15. A second filter can be added on the ‘PAF Match’ column with only ‘False’ ticked to show the postcode values in the postcode column that did not match the PAF. For this report, both ‘True’ and ‘False´ values were kept in the visual.
  16. Only the values where a full postcode in an address field and ‘False’ in the PAF match field were counted.
    In this data set, there were no full, correct postcodes given in the address fields. NOTE If a full postcode matching the PAF file was entered into the address line 1 and 2 or post town, these would need filtering out of the Score 1 postcodes, otherwise some postcodes would be scored both 1 and 4. In this data set, none of the postcodes given in the address fields matched the PAF, so this step was left out.

Score 4 - Full correct postcode given, address doesn't match PAF

  1. Go to the Query Editor and open the PAF data
  2. Highlight sub_building_name, building_number, thoroughfare.name and thoroughfare.descriptor – NA on Glasgow data columns in that order and merge. The delimiter is a space.
  3. Once the column has been created, right click the column, click Transform then Trim. This removes spaces from the beginning and end of the text.
  4. Close & Apply the Query.
  5. Open data view and open the Companies House data file
  6. Add a new column and type in the formula Address Match = CONTAINS('B-2022-03','B-2022-03'[Merged],'BasicCompanyDataAsOneFile-2022-03-01'[Address Line 1])
  7. The companies where address line 1 matches the PAF will come back as ‘True’ and the ones that do not match will be ‘False’
  8. Go to the ‘Score3&5’ report tab and click onto a visualisation to insert.
  9. The Details is the column that contains the ‘True’ or ‘False’ that was just created and the values are a count of this column.
  10. Go to the Filter column and add a Filter for the ‘PAF Match’ column and untick ‘False’.
  11. The values in this report that are ‘True’ score 5, and those that are ‘False’ score 3.

Score 5 - Full correct postcode given, address partially matches PAF

See below section for partial matching methods tested. The method for Score 5 is yet to be completed but Python coding should be attempted next in order to score the data. This can be integrated into Power BI or completed before importing the data to Power BI.

Partial matching methods tested

Test 1 - Fuzzy merges on Power BI

  1. Go to Transform data and open the PAF
  2. Highlight the columns ‘sub_building_name’, ‘building_number’, ‘thoroughfare.name’, ‘thoroughfare.descriptor’. Right click and merge the columns.
  3. On the Home tab, click Merge Queries > Merge Queries as New.
  4. Select the PAF as the first able and highlight the Merged Addresses column.
  5. For the second table, select the Glasgow file and highlight the RegAddress.AddressLine1
  6. The join type is Left Outer
  7. The threshold was tested at 0.8, then at 0.7

The below snip of the merged table shows an example of an address in the left column that was matched to the addresses using the fuzzy matching in Power BI. Some of the building numbers in those that were matched were incorrect with only one character difference. If the similarity threshold was increased, it would mean the partial match would not pick up on any mis-typed addresses.

knitr::include_graphics("./images/snip.png")

Test 2 – Fuzzy merges on Power BI

A second test on the partial matching addresses was done with the numbers removed from the PAF merged addresses column.

The same method was used, but by merging the columns in the PAF, not including the building number. The similarity threshold and join type were also kept the same (0.7, Left Outer Join). The test was completed on addresses in the Glasgow file RegAddress.AddressLine1 and RegAddress.AddressLine2.

The tables below show the matches from the Glasgow file and PAF with only a few characters difference that were incorrect but this method was showing as a match. As with Test 1, increasing the threshold would reduce the likelihood of matching those with a mistyped address that is still partially correct.

knitr::include_graphics("./images/snip2.png")
knitr::include_graphics("./images/snip3.png")

Test 3 – Using a range of similarity thresholds on Power BI

After the above method had been tried, using ranges as below for matching the addresses was considered:

0.0 – 0.3 – No match 0.4 – 0.6 – Partial match 0.7 – 1 – Full match

Power BI does not allow a range of numbers to be used within the similarity threshold so this method was not used, but documented for awareness.

Test 4 – List.Containsany function on Power BI

  1. Create new query
  2. Type = #shared in the formula bar to show a list of all available queries
  3. find list.containsany
  4. For the first table use CH Glasgow address line 1 and for the second, select the merged addresses column from the PAF.
  5. This will bring back any values that are matching.
  6. For this, there were no matching addresses between the files.

Test 5 – Python coding

For integrating Python into Power BI:

  1. Open transformation table
  2. Click run Python script
  3. Enter script into jupyter notebook
  4. Copy and paste in box in power BI. This should run the script and the output can be used in BI.

Proposed method for using Python for partial matching is outlined in the link below.

thefuzz/README.rst at master · seatgeek/thefuzz · GitHub

The columns to work with are Address line 1 & 2 in the CH data and the merged column in PAF.

Comments & Update - 31/03/2022

  1. The ‘fuzzy matching’ was attempted for Score 5, however, addresses with one digit difference in the building number were brought back as a matched address from the PAF File.
knitr::include_graphics("./images/snip4.png")

The Companies House Address Line 1 column was then split by number, building name and street name and repeated on address line 2.

Comments & Update – April 2022

  1. Fuzzy matching on Power BI doesn’t work for what we need it to. It was decided we would use Python integrated into Power BI to clean the data and Python coding may work for the partial matching (score 5), and Power BI should be used for the dashboards/visualisations.

Score 6 - Address and postcode details match PAF

These were scored while scoring postcodes with a ‘4’. Any that did not score 4, were given the score of 5. These were The same conditions in the address line details, but with the PAF Match column filtered to those that were ‘True’ and matched the postcode field in Companies House Data.

Power BI Report Link

Power BI Dashboard Link

Recommendations for next steps – June 2022

  1. Define what a ‘good address’ would look like in Companies House data. One of the issues with current data is that some address details are not in the appropriate field, such as the town being input into ‘Address Line 2’ column instead of the Post Town column. Examples of an acceptable address should be documented to compare with those that have been submitted by companies.
  2. If the postcode field in CH data is blank, consider augmenting data by adding a new column to lookup the company address to bring back a postcode that exists in the PAF. Currently CH cannot correct any submitted data such as postcodes but a reference column for a matched address could be added, as outlined in Corporate Transparency and Register Reform White Paper (publishing.service.gov.uk), “If an individual fails to verify, the public register will be annotated to show this. This will enable anyone viewing the register to make their own assessment of the integrity and risk profile of those they are researching”.
  3. Conduct further analysis on the correlation between a Company not supplying a postcode, or not supplying a correct address (one that doesn’t match the PAF) and filing accounts and confirmation statements on time. Preliminary analysis has been completed on this and in the future, the quality of addresses submitted could be used as an indicator for companies that are fraudulent, or would not file accounts and confirmation statements on time. The analysis already completed can be found in the following BI Dashboard: https://app.powerbi.com/groups/me/dashboards/2921b718-7552-40f2-b759-6ae77120bad5?ctid=e6acfe5e-7f2a-455a-9d85-ecf93371f601&pbi_source=linkShare
  4. Alter the method used to clean Companies House data, such as removing unnecessary columns and whitespace in the postcode fields. Instead of using Power BI, Python could be a better option. This will also save time when using other data files that will need to be cleaned in the same way because the Python code can be copied and reused for each file.
  5. Use Python Fuzzy String matching for the partial matching scores. This can be done through Power BI or before import and the code used for other CH data files for Score 5.
  6. Continue to find alternative software for analysis, such as AWS, so files don’t need to be saved and imported into software, or split because they are too large to analyse as a full set with our current software. Currently, the PAF is split into sections (Birmingham and Glasgow in this example), so the Companies House data requires filtering to find addresses in those cities, however, not all postcodes with the filters applied are in the correct cities, as it is impossible to capture them all, particularly when some data is missing or in the wrong format. Addresses are also input into several different fields, or customers do not state the city, but smaller towns within the city as the main post town. This makes it unfeasible to filter the CH accurately. This is a temporary issue until we can use software to analyse the full PAF against Companies House data.
  7. Consider collaborative work with engineers or data scientists within CH who may already have completed similar analysis


companieshouse/DARr documentation built on Oct. 22, 2022, 8:26 p.m.