Why does deduplication matter specifically for enrichment?
Most CRM hygiene guides treat deduplication as a general housekeeping task. Enrichment adds a hard financial reason to act: you are charged per record that matches in the bureau's reference file. Submit 10,000 records with 12% duplication and you are paying for roughly 1,200 redundant records, since each duplicate that matches is charged again. At typical UK enrichment rates of £40–£80 per 1,000 records, that is £48–£96 wasted before you have even looked at the output.
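The arithmetic can be checked, and re-run for your own file size, with a few lines of Python. This is a minimal sketch: the function name `wasted_spend` is illustrative, and it assumes every duplicate record matches the bureau's reference file and is therefore charged.

```python
def wasted_spend(records: int, duplicate_rate: float, rate_per_thousand: float) -> float:
    """Estimate enrichment spend wasted on duplicates, in pounds.

    Assumption: every duplicate matches the reference file and is charged.
    """
    redundant_records = records * duplicate_rate
    return redundant_records * rate_per_thousand / 1000


low = wasted_spend(10_000, 0.12, 40)
high = wasted_spend(10_000, 0.12, 80)
print(f"£{low:.0f} to £{high:.0f}")  # prints: £48 to £96
```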
The cost argument is obvious once stated. The data quality argument is subtler and, in the long run, more damaging. When you enrich duplicate A and duplicate B separately, the bureau may return slightly different results for each (different telephone numbers matched, different job title found, different company revenue band). You now hold two records for the same person with contradictory appended data. Your "mobile phone penetration" figure is overstated because it counts the same individual twice. Your suppression logic may suppress one record but not the other, exposing you to a compliance risk on GDPR's data minimisation principle.
The cleaner sequence is: deduplicate, select one survivor record per individual, enrich that record, then propagate the appended fields to any archived non-survivor records if your CRM requires the history. See our guide to what data enrichment involves for a broader overview of the process before turning to the deduplication mechanics below.
How much duplication should you expect in a UK CRM?
Duplication rates vary by how the CRM was built and how long it has run without active cleansing. The following figures come from bureau-side experience across UK B2B and B2C files submitted for enrichment:
- UK B2B CRMs: 5%–15% duplicate rate. The lower end applies to recently built, single-source databases. The upper end is common in CRMs that were migrated from a legacy system, merged with an acquired company's database, or had manual data entry as the primary input channel.
- UK B2C databases: 8%–20%. Consumer records are created across multiple touchpoints (a purchase, a competition entry, a content download), and there is no reliable unique identifier equivalent to the Companies House number in B2B. A single consumer named Sarah Jones at a Leeds postcode may appear four times across a multi-channel retailer's CRM: once under her full name and again under slight variants (Sara Jones, S Jones, Sarah B Jones).
- Post-merger databases: 20%–35%. When two companies combine their CRMs into one instance, the overlap between the two existing customer bases is often higher than either team expects.
Databases older than three years without a dedicated cleansing programme sit at the upper end of these ranges. Running a gap analysis on your CRM before enrichment is a useful way to quantify the problem rather than guessing.
The four match strategies for UK CRM deduplication
No single matching rule catches all duplicates. In practice, you apply several strategies in sequence, each with a different precision-recall trade-off. The table below summarises the four most commonly used in UK CRM deduplication projects.
| Strategy | How it works | Best for | False positive risk | Typical recall |
|---|---|---|---|---|
| Exact email match | Normalise email to lowercase, strip whitespace, compare character-for-character | B2B and B2C records where email is reliably populated | Very low (shared mailboxes are the main edge case) | 30%–55% of all duplicates |
| Fuzzy name plus postcode | Levenshtein distance <3 on surname, combined with exact Royal Mail full postcode | B2C consumer records; catches spelling variants and data-entry errors | Low to medium (common surnames in dense postcodes require a secondary check) | 20%–40% of remaining duplicates |
| Name plus normalised company | Strip legal suffixes (Ltd, Limited, PLC, LLP, in any letter case) and common abbreviations; compare surname and normalised company name | B2B records; catches the same contact entered once as "Acme Ltd" and once as "Acme" | Medium (large companies with many contacts of the same surname) | 15%–25% of remaining duplicates |
| Composite field score | Weighted score across first name initial, surname, postcode sector (outward code plus the first digit of the inward code, e.g. "LS1 4"), telephone last four digits, and optionally date of birth; auto-merge threshold typically 85%+ | Records missing email, or where multiple weak signals together create a strong match | Medium to high; scores just below the auto-merge threshold require manual review | 10%–20% of remaining duplicates |
Run exact email first, as it is the fastest and most reliable pass. Then apply fuzzy name plus postcode. Company-normalised matching comes third, limited to B2B records. Finally, composite scoring sweeps up residual candidates, with anything scoring between 70% and 84% going to a manual review queue rather than auto-merging.
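The composite pass and its routing thresholds can be sketched as follows. Treat this as an illustration under stated assumptions: the weights, field names (`first_name`, `surname`, `postcode`, `phone`) and function names are hypothetical, each signal counts only when it is populated and agrees exactly on both records, and date of birth is left out for brevity. Only the 85% and 70% thresholds come from the text.

```python
# Illustrative weights only; they are assumptions, not bureau-standard values.
WEIGHTS = {
    "first_initial": 0.15,
    "surname": 0.35,
    "postcode_sector": 0.25,
    "phone_last4": 0.25,
}


def postcode_sector(postcode: str) -> str:
    """Outward code plus first inward digit, e.g. 'LS1 4AB' -> 'LS1 4'."""
    compact = postcode.replace(" ", "").upper()
    if len(compact) < 5:
        return ""  # outward code only, or blank: no sector available
    return f"{compact[:-3]} {compact[-3]}"


def signals(rec: dict) -> dict:
    """Derive the comparable signals from one raw record."""
    digits = "".join(ch for ch in (rec.get("phone") or "") if ch.isdigit())
    return {
        "first_initial": (rec.get("first_name") or "").strip()[:1].upper(),
        "surname": (rec.get("surname") or "").strip().upper(),
        "postcode_sector": postcode_sector(rec.get("postcode") or ""),
        "phone_last4": digits[-4:],
    }


def composite_score(a: dict, b: dict) -> float:
    """Weighted share of signals populated on both records and in agreement (0-100)."""
    sa, sb = signals(a), signals(b)
    total = sum(w for field, w in WEIGHTS.items() if sa[field] and sa[field] == sb[field])
    return round(total * 100, 1)


def route(score: float) -> str:
    """Thresholds from the text: 85%+ auto-merge, 70%-84% manual review."""
    if score >= 85:
        return "auto_merge"
    if score >= 70:
        return "manual_review"
    return "no_match"
```

For example, two records that agree on initial, surname and postcode sector but carry different telephone numbers score 75 under these weights and go to the manual review queue rather than merging automatically.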
Preparing data before matching
Match quality is only as good as the data going in. Standardise before you score. For postal data this means parsing Royal Mail addresses through a postal cleansing service to PAF (Postcode Address File) standard, which resolves common variants like "St." vs "Street" and "Rd" vs "Road". For company names, build a normalisation function that removes punctuation, converts to uppercase, and strips the legal suffixes listed above. The strings "ACME LOGISTICS LIMITED" and "Acme Logistics Ltd." will not match on a raw string comparison; they will match once normalised.
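A company-name normaliser along these lines might look like the sketch below. The function name `normalise_company` and the suffix set are illustrative; the set covers only the suffixes named in the table and would need extending for your own data (for example, "&" versus "AND").

```python
import re

# Suffixes named in the text; extend for your own data.
LEGAL_SUFFIXES = {"LTD", "LIMITED", "PLC", "LLP"}


def normalise_company(name: str) -> str:
    """Uppercase, drop punctuation, and strip trailing legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", name.upper())
    tokens = cleaned.split()
    # Remove suffixes from the end only, so a word like "Limited" inside
    # the trading name itself is not touched.
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(normalise_company("ACME LOGISTICS LIMITED"))  # prints: ACME LOGISTICS
print(normalise_company("Acme Logistics Ltd."))     # prints: ACME LOGISTICS
```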
Survivor record selection: most recent versus most complete?
Once you have identified duplicate pairs, you need to pick which record survives. Two philosophies dominate: "most recent wins" and "most complete wins". Neither is correct in all situations.
Most recent wins is the right default for contact-level fields: telephone number, email address, job title, company name. These fields decay. A telephone number recorded in 2019 is less likely to be current than one recorded in 2024. The most recently updated record is therefore the better source of truth for reaching the individual.
Most complete wins is better for demographic or firmographic fields that do not change often: date of birth, gender, SIC 2007 sector code, property ownership flag, household income band. If one duplicate has a populated date-of-birth field and the other does not, the populated record should survive (or contribute that field to a merged survivor), regardless of which record was updated more recently.
In practice, the best outcome is a field-level merge: create the survivor record by taking the most recent version of each contact field and the most complete version of each demographic field. Most CRM platforms support this logic through their duplicate merge tools, or you can implement it in SQL with a CASE WHEN structure that scores and ranks field values before selecting the winner per field.
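Outside SQL, the same field-level logic can be sketched in plain Python. The field lists, the `updated_at` key and the function names below are assumptions for illustration. "Most complete" is interpreted here as "take the value from the record with the most populated fields", which is one reasonable reading rather than the only one.

```python
from datetime import date

CONTACT_FIELDS = ("email", "phone", "job_title", "company")  # most recent wins
STABLE_FIELDS = ("date_of_birth", "sic_code")                # most complete wins


def completeness(rec: dict) -> int:
    """Count populated fields on a record."""
    return sum(1 for value in rec.values() if value not in (None, ""))


def first_populated(records: list, field: str):
    """First non-blank value for `field`, walking the records in order."""
    return next((r[field] for r in records if r.get(field) not in (None, "")), None)


def merge_group(records: list) -> dict:
    """Build one survivor from a group of duplicates, field by field."""
    by_recency = sorted(records, key=lambda r: r["updated_at"], reverse=True)
    by_completeness = sorted(records, key=completeness, reverse=True)
    survivor = {"updated_at": by_recency[0]["updated_at"]}
    for field in CONTACT_FIELDS:
        survivor[field] = first_populated(by_recency, field)
    for field in STABLE_FIELDS:
        survivor[field] = first_populated(by_completeness, field)
    return survivor
```

Note that a blank contact field on the newest record falls back to the next most recent populated value, so a 2024 record with no telephone number does not erase a number captured in 2019.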
Merge versus flag-only: which approach is right for your CRM?
Choosing between a merge strategy and a flag-only strategy depends on your CRM's data architecture and the confidence you have in your match logic.
Merge physically consolidates duplicates. The survivor record absorbs the best field values and the non-survivors are either deleted or archived as inactive. This is the cleanest outcome for enrichment: you submit one record per individual and receive one enriched record back. The risk is irreversibility. If your match logic produces false positives and you merge two genuinely different people into one record, recovering the original state requires a backup restore.
Flag-only leaves all records in place but adds a deduplication group ID (e.g. dedup_group_id = 7841) and a survivor flag (e.g. is_survivor = TRUE/FALSE) to each record. Downstream queries filter to is_survivor = TRUE for analysis and for building the enrichment submission file. The non-survivors remain in the database with their history intact, which matters if other systems hold foreign-key references to those record IDs, for example a sales activity log that points to a contact ID you cannot safely delete.
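A flag-only pass can be sketched as below. The column names `dedup_group_id` and `is_survivor` follow the text; everything else (the function names, the shape of `records` and `groups`, and the choice of "most recently updated" as the survivor rule) is an assumption for illustration.

```python
from itertools import count


def flag_duplicates(records: dict, groups, start_id: int = 1) -> dict:
    """Annotate records in place with dedup_group_id and is_survivor.

    `records` maps record ID -> record dict containing `updated_at`.
    `groups` is an iterable of sets of record IDs judged to be duplicates.
    Records outside any group are their own survivors with no group ID.
    """
    for rec in records.values():
        rec["dedup_group_id"] = None
        rec["is_survivor"] = True
    group_ids = count(start_id)
    for group in groups:
        gid = next(group_ids)
        # Survivor rule here is "most recently updated"; swap in your own.
        survivor_id = max(sorted(group), key=lambda rid: records[rid]["updated_at"])
        for rid in group:
            records[rid]["dedup_group_id"] = gid
            records[rid]["is_survivor"] = rid == survivor_id
    return records


def enrichment_submission(records: dict) -> list:
    """IDs to send for enrichment: survivors only. Nothing is deleted."""
    return [rid for rid, rec in records.items() if rec["is_survivor"]]
```

Because no row is removed, rolling back a bad match is a matter of resetting two columns on the affected group, and foreign-key references to non-survivor IDs remain valid.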
In our experience, flag-only is the safer starting point for most UK businesses enriching a CRM for the first time. It provides a full audit trail and makes it simple to roll back if a match is later found to be incorrect. Once the match logic has been validated over one or two enrichment cycles, a controlled merge of high-confidence groups (those identified by exact email match) is a sensible second step.
GDPR note on merged records
Under UK GDPR, if the two duplicate records have different consent or opt-out histories, the merged survivor must inherit the most restrictive preference. If record A is opted in to email and record B has an email opt-out, the survivor must carry the opt-out. Merging without preserving the strictest preference is a data protection failure. Build this logic into your merge procedure before running it.
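The strictest-preference rule can be sketched as follows. The `consent` structure, the channel names and the function name are hypothetical. The sketch also assumes that a missing preference means "unknown" rather than "opted out"; a stricter policy may prefer to treat unknown as an opt-out.

```python
def merge_consent(records: list, channels=("email", "phone", "post")) -> dict:
    """Return the most restrictive preference per channel across duplicates.

    Each record carries a `consent` dict mapping channel -> True (opted in)
    or False (opted out); a missing channel means no recorded preference.
    An opt-out on any record always wins.
    """
    merged = {}
    for channel in channels:
        values = [rec["consent"].get(channel) for rec in records]
        if any(value is False for value in values):
            merged[channel] = False   # any opt-out overrides every opt-in
        elif any(value is True for value in values):
            merged[channel] = True    # opted in somewhere, opted out nowhere
        else:
            merged[channel] = None    # assumption: unknown stays unknown
    return merged


record_a = {"consent": {"email": True, "phone": True}}
record_b = {"consent": {"email": False}}
print(merge_consent([record_a, record_b]))
# prints: {'email': False, 'phone': True, 'post': None}
```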
Tools for CRM deduplication: SQL, Python, and specialist bureaus
SQL for exact matching
Exact-match deduplication requires nothing more than a SQL window function. A ROW_NUMBER() OVER (PARTITION BY LOWER(TRIM(email)) ORDER BY updated_at DESC) AS row_num query ranks records within each email group by recency, and any record with row_num > 1 is a duplicate candidate. Exclude NULL and blank emails first, otherwise they fall into a single partition and are all flagged as duplicates of one another. This handles the 30%–55% of duplicates that share a common email address without any external dependency.
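To try the window-function approach without a database server, the query can be run against an in-memory SQLite database from Python. This is a sketch under assumptions: the `contacts` table, its `id` column and the sample rows are invented for illustration, and SQLite 3.25 or later is needed for window-function support.

```python
import sqlite3

DEDUP_QUERY = """
SELECT id, email, updated_at,
       ROW_NUMBER() OVER (
           PARTITION BY LOWER(TRIM(email))
           ORDER BY updated_at DESC
       ) AS row_num
FROM contacts
WHERE email IS NOT NULL AND TRIM(email) <> ''
"""


def email_duplicate_candidates(conn: sqlite3.Connection) -> list:
    """Return IDs of records that are not the most recent in their email group."""
    rows = conn.execute(DEDUP_QUERY).fetchall()
    return sorted(row[0] for row in rows if row[3] > 1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        (1, "Sarah.Jones@example.com ", "2019-03-01"),  # older, mixed case, trailing space
        (2, "sarah.jones@example.com", "2024-05-09"),   # most recent: survives
        (3, "s.patel@example.com", "2023-11-20"),       # unique email
        (4, None, "2022-01-15"),                        # no email: excluded from this pass
    ],
)
print(email_duplicate_candidates(conn))  # prints: [1]
```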
Python for fuzzy matching
The rapidfuzz library (a faster implementation of the classic fuzzywuzzy approach) calculates Levenshtein distances quickly, but speed alone does not remove the need to limit comparisons. Comparing every record against every other record becomes computationally prohibitive above about 50,000 records, so pair the library with blocking. The dedupe library goes further: it uses active learning to train a probabilistic model on a small set of hand-labelled pairs, then predicts match probabilities across the full dataset. Both are well-documented and open-source.
For a CRM of under 100,000 records, a Python script combining rapidfuzz for name similarity and pandas for postcode blocking (running comparisons only within the same postcode sector) is adequate. Above that scale, blocking on multiple fields and parallel processing become necessary to keep runtimes under an hour.
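The blocking idea can be sketched with the standard library alone. To keep the example dependency-free, it uses a plain-Python Levenshtein function as a stand-in for rapidfuzz and a dictionary of blocks in place of pandas; in production, `rapidfuzz.distance.Levenshtein.distance` computes the same metric far faster. The record fields (`id`, `surname`, `postcode`) and function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def sector(postcode: str) -> str:
    """Outward code plus first inward digit, or '' if the postcode is incomplete."""
    compact = postcode.replace(" ", "").upper()
    return f"{compact[:-3]} {compact[-3]}" if len(compact) >= 5 else ""


def candidate_pairs(records: list, max_distance: int = 2) -> list:
    """Compare surnames only within the same postcode sector (the block)."""
    blocks = defaultdict(list)
    for rec in records:
        key = sector(rec["postcode"])
        if key:  # records without a usable sector are left for other passes
            blocks[key].append(rec)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if levenshtein(a["surname"].upper(), b["surname"].upper()) <= max_distance:
                pairs.append((a["id"], b["id"]))
    return pairs
```

The default `max_distance=2` mirrors the "Levenshtein distance <3" rule in the strategy table. Requiring an exact full-postcode match inside each block, as that table does, is a one-line extra filter.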
Specialist data bureaus
Where the CRM lacks a reliable anchor field such as email or telephone, a specialist bureau adds genuine value. Bureaus match your records against reference files that include Companies House data, Royal Mail's PAF, and electoral roll-derived consumer reference data, producing a probabilistic match score and a suggested survivor record without you needing to build the matching infrastructure. The trade-off is cost: bureau deduplication for a 500,000-record B2C file typically runs from £800 to £2,500 depending on complexity, though that cost is usually recovered quickly in reduced enrichment charges.
Post-enrichment deduplication: a necessary second pass
Deduplicating before enrichment removes the obvious candidates. Enrichment itself, however, appends new fields that can expose duplicate pairs that were invisible beforehand. Two records for the same consumer with different name spellings and no email address will not match on a fuzzy-name-plus-postcode pass if one record has a full postcode and the other has only the outward code. Once enrichment appends the full postcode to both, a second dedup pass will catch the pair.
Post-enrichment deduplication typically finds 1%–3% of additional duplicates. Run it as a lighter pass: exact email on the newly appended email addresses, then fuzzy name plus now-complete postcode. The composite scoring step is usually not worth repeating unless you have significant concerns about data quality in the appended fields.
The total sequence is therefore: pre-enrichment dedup (full four-strategy pass), enrich the survivor file, post-enrichment dedup (lighter two-strategy pass), then propagate the final clean fields across your CRM.
