What duplicate rate should I expect in a UK B2B marketing list?

Typical UK B2B lists hold 5% to 15% duplicates when multiple data sources have been merged. B2C consumer files run higher, at 8% to 20%, because the same household frequently appears under multiple name variants and historical addresses. The rate climbs sharply after any CRM migration or file merge.

What is the best match rule for deduplicating a UK B2B list?

Start with exact email address as a match key, then layer in name plus postcode as a second pass, and name plus company name as a third. Composite rules, combining all three signals, catch the duplicates that single-field matching misses. Standardise casing, remove punctuation, and strip common noise words like Ltd and Limited before comparing.

Should I delete duplicate records or keep them flagged?

Flag and suppress, do not delete. Deleting records removes the audit evidence you need to respond to a GDPR Subject Access Request or a right-to-erasure request. The suppressed non-survivor records show what data you held, when you received it, and what happened to it. Physically deleting them while you still have a legal need for that evidence creates a compliance gap.

How does the GDPR right to erasure interact with deduplication?

When a contact exercises their right to erasure under Article 17 UK GDPR, you must erase all copies of their record, including suppressed duplicates, not just the survivor. Your dedup audit log should record which raw records were matched together so you can locate every copy. A suppression flag alone does not satisfy erasure; the data must be deleted from active and archived files once the legal basis for holding it no longer exists.

When should I run deduplication in my data workflow?

Run deduplication before enrichment, not after. Enriching duplicates means paying the API fee more than once for the same contact, and it risks overwriting clean data with conflicting values. Re-run dedup after any file merge, CRM migration, or fresh data purchase. For live CRM databases receiving regular inbound leads, a monthly automated dedup pass is the minimum; weekly is better for high-volume B2C files.

Can I use SQL for deduplication or do I need specialist software?

SQL window functions (ROW_NUMBER with PARTITION BY) handle straightforward exact-match deduplication on millions of records without any additional tooling. Python with pandas or the dedupe library handles fuzzy matching and probabilistic linking when exact keys are unreliable. Specialist dedup bureaus add value primarily when your volumes exceed a few million records, when the data is dirty enough that fuzzy matching at scale is impractical in-house, or when you need a defensible third-party audit trail.

Data deduplication best practices

Why duplicates are more expensive than they look

A duplicate contact is not just a wasted send. At the direct-mail rate of around £0.80 per piece, 3,000 duplicates in a 25,000-record B2C file cost £2,400 in print and postage before you have touched the creative. For a telemarketing campaign, dialling the same number twice in a week is a compliance event waiting to happen under the Telephone Preference Service (TPS) rules. For B2B email, a contact who receives the same message twice from two slightly different list segments is more likely to unsubscribe or report it as spam, which hurts future deliverability.
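
The direct-mail arithmetic above is easy to rerun for your own file. A minimal sketch, in which the function name is illustrative and the inputs are the figures quoted above:

```python
def duplicate_mailing_cost(duplicate_count: int, cost_per_piece: float) -> float:
    """Return the print-and-postage spend wasted on duplicate records."""
    return round(duplicate_count * cost_per_piece, 2)


# Figures from the example above: 3,000 duplicates at about £0.80 per piece.
wasted = duplicate_mailing_cost(3000, 0.80)
duplicate_rate = 3000 / 25000  # share of the 25,000-record file

print(f"£{wasted:,.2f} wasted, duplicate rate {duplicate_rate:.0%}")
```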

Then there is the data enrichment problem. If you run CRM appending on a file that still contains duplicates, you will hit the same contact twice, spend the per-record fee twice, and potentially overwrite a complete record with a less complete one from the duplicate row. In our experience, clients who dedup before enrichment reduce their enrichment spend by 10% to 18% compared with those who do not.

The GDPR dimension is often overlooked. Under Article 5(1)(c) of UK GDPR, personal data must be "adequate, relevant and limited to what is necessary" for the purpose. Holding ten copies of the same contact's mobile number across different tables is hard to justify as minimal. The Information Commissioner's Office (ICO) guidance on data minimisation is explicit: if you hold the same data twice without a documented reason, you are carrying unnecessary risk.

What are the five deduplication best practices?

1. Define your match rules before the job runs

The single biggest mistake in dedup projects is letting the tool decide what counts as a duplicate. You must specify the match rules in advance and document them. The standard approach for UK marketing lists is a cascade of three passes, applied in order.

The first pass uses exact email address. Two records sharing the same email address are duplicates regardless of how the name fields are spelled. Email is the strongest single key for B2B and B2C alike, though it fails when the same person appears on different files with different addresses.

The second pass applies name plus postcode. Standardise to uppercase, strip punctuation, and reduce the postcode to the outward code (the first half) before comparing, so that a mistyped inward code does not hide a match. Because the outward code covers a wide area, compare forenames or initials as well as surnames, otherwise legitimate siblings at the same address get wrongly collapsed. In a Manchester-based B2C file, "Jane Smith, M1 3AB" and "J Smith, M1 3AX" are almost certainly the same person if no other distinctions exist, but "Jane Smith, M1 3AB" and "Jane Davies, M1 3AB" at the same postcode are probably not.

The third pass uses name plus company name for B2B files. This catches contacts who changed personal email addresses but stayed at the same employer. Standardise company suffixes first: strip "Ltd", "Limited", "PLC", "LLP", and their variants to a root name before comparing. "Acme Ltd" and "Acme Limited" are the same company; failing to normalise them means the pass misses obvious duplicates.

A composite pass, requiring two or more keys to agree, catches edge cases the single-key passes miss and reduces false positives. The right combination depends on your data, but name plus email domain plus postcode sector is a reliable fourth pass for large B2C files.
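
The normalisation steps and the three keys of the cascade can be sketched in a few lines of Python. This is a minimal illustration, not a complete matcher: the function names, the record field names (name, email, postcode, company), and the short suffix list are assumptions made for the example.

```python
import re

# Company suffixes to strip; an illustrative list to extend for your own data.
COMPANY_SUFFIXES = {"LTD", "LIMITED", "PLC", "LLP"}


def clean(text: str) -> str:
    """Uppercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.upper())
    return " ".join(text.split())


def company_root(name: str) -> str:
    """Strip legal suffixes so 'Acme Ltd' and 'Acme Limited' compare equal."""
    words = [w for w in clean(name).split() if w not in COMPANY_SUFFIXES]
    return " ".join(words)


def outward_code(postcode: str) -> str:
    """Return everything before the three-character inward code."""
    compact = clean(postcode).replace(" ", "")
    return compact[:-3] if len(compact) > 3 else compact


def match_keys(record: dict) -> dict:
    """Build one key per pass of the cascade; None when a field is missing."""
    name = clean(record.get("name", ""))
    email = record.get("email", "").strip().lower()
    postcode = record.get("postcode", "")
    company = record.get("company", "")
    return {
        "email": email or None,
        "name_postcode": f"{name}|{outward_code(postcode)}" if name and postcode else None,
        "name_company": f"{name}|{company_root(company)}" if name and company else None,
    }


keys = match_keys({
    "name": "Jane Smith",
    "email": " Jane@Example.com ",
    "postcode": "M1 3AB",
    "company": "Acme Ltd.",
})
print(keys)
```

Records that share a non-None key in an earlier pass would be clustered before the later, looser passes run.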

2. Select survivors by completeness and recency, not row position

Once you have identified a cluster of duplicate records, you need to decide which row to keep. Row position (keeping whichever row was loaded first) is the laziest and worst-performing rule. A record imported in 2019 will frequently be less accurate than one added in 2024, and the older row may have fewer populated fields.

The correct survivor selection logic scores each record in the cluster on two dimensions. Recency: the record with the most recent last_updated or date_added timestamp scores higher. Completeness: count the number of non-null, non-empty fields and score proportionally. The record with the highest combined score becomes the survivor.
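
One way to express that scoring is shown below. It is a sketch under stated assumptions: the two dimensions are weighted equally, recency is scaled between the oldest and newest dates in the cluster, and the field names are hypothetical. Your own weighting may reasonably differ.

```python
from datetime import date


def completeness(record: dict, fields: list[str]) -> float:
    """Share of the listed fields that are populated (non-null, non-empty)."""
    filled = sum(1 for f in fields if str(record.get(f) or "").strip())
    return filled / len(fields)


def recency(record: dict, oldest: date, newest: date) -> float:
    """Scale last_updated to the range 0..1 within the cluster; 1.0 is newest."""
    span = (newest - oldest).days
    if span == 0:
        return 1.0
    return (record["last_updated"] - oldest).days / span


def pick_survivor(cluster: list[dict], fields: list[str]) -> dict:
    """Return the record with the highest combined score (equal weights assumed)."""
    dates = [r["last_updated"] for r in cluster]
    oldest, newest = min(dates), max(dates)
    return max(
        cluster,
        key=lambda r: completeness(r, fields) + recency(r, oldest, newest),
    )


cluster = [
    {"id": 1, "email": "jane@example.com", "phone": "", "job_title": "",
     "last_updated": date(2019, 3, 1)},
    {"id": 2, "email": "jane@example.com", "phone": "01614960000", "job_title": "Buyer",
     "last_updated": date(2024, 6, 1)},
]
survivor = pick_survivor(cluster, ["email", "phone", "job_title"])
print(survivor["id"])
```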

Where records in a cluster hold values the others lack (for example, one has a mobile number and the other does not), consider a merge rather than a straight survivor selection: copy the unique values from the non-survivor into the survivor before suppressing the non-survivor row. This is sometimes called "golden record" creation. It takes more engineering than a simple dedup, but on enriched CRM data it is nearly always worth the effort. See our guide on enrichment and deduplication strategy for a full worked approach.
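
A golden-record merge of this kind can be sketched as a gap-filling function. The function name and field names are illustrative, and the rule that populated survivor values are never overwritten is an assumption chosen for the example.

```python
def merge_golden_record(survivor: dict, non_survivors: list[dict]) -> dict:
    """Fill the survivor's empty fields from the other rows in the cluster.

    Populated survivor values are never overwritten; only gaps are filled.
    """
    golden = dict(survivor)  # copy so the input row is left untouched
    for other in non_survivors:
        for field, value in other.items():
            is_empty = golden.get(field) in (None, "")
            if is_empty and value not in (None, ""):
                golden[field] = value
    return golden


survivor_row = {"id": 2, "email": "jane@example.com", "mobile": "", "job_title": "Buyer"}
older_row = {"id": 1, "email": "old@example.com", "mobile": "07700900000", "job_title": ""}

golden = merge_golden_record(survivor_row, [older_row])
print(golden)
```

Here the survivor keeps its own email and job title and gains only the mobile number it was missing.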

3. Flag, suppress, and log: never delete

This is the rule most often ignored by teams who want a clean database and treat deletion as the obvious outcome. Do not delete duplicate records. Flag them with a status column (for example, dedup_status = 'suppressed'), record the ID of their survivor, and log when the suppression happened and which match rule triggered it.

The audit trail serves three purposes. First, it lets you reverse the dedup if you discover an error. False positives happen: two contacts with the same name at the same postcode who are genuinely different people. Without the audit log, you cannot recover the wrongly suppressed record.

Second, it supports Subject Access Request (SAR) responses. Under Article 15 UK GDPR, a data subject who submits a SAR is entitled to know what data you hold. If you have deleted a duplicate row that contained their historical email address, you cannot confirm or deny holding that data. The suppressed row, still present and flagged, gives you the complete picture.

Third, it supports right-to-erasure requests under Article 17 UK GDPR. When a contact asks to be erased, you must locate and delete every copy of their data, including suppressed duplicates. The audit log tells you how many rows to erase and where they are. Without it, you risk leaving orphaned copies in archived files or backup tables.

GDPR note on suppression versus erasure

Suppression and erasure are not the same thing. A suppressed record is still personal data and remains subject to all UK GDPR obligations. Suppression prevents the record from being used for marketing; it does not constitute deletion. When a right-to-erasure request arrives, suppressed records must be deleted alongside active ones, unless a legitimate retention reason (such as a legal claim) applies.

4. Run deduplication before enrichment, and re-run after file merges

The sequencing matters. Enrichment appends third-party data to your records: telephone numbers, job titles, company firmographics, email addresses. If you enrich first and dedup second, you pay the enrichment fee per record including duplicates. Worse, you may have enriched the non-survivor record with a more current phone number, then suppressed that row, discarding the update.

Run dedup, select survivors, perform the golden-record merge if applicable, then enrich. Your enrichment hit rate will be higher (fewer stale duplicates pulling down the average) and your cost per clean record will drop.

Re-running dedup after any file merge is non-negotiable. A Manchester-based financial services firm buying a vertical-specific B2B file to append to their CRM will almost always introduce duplicates: their existing prospects appear on the bought file under slightly different name formatting or with a previous email address. Running the full cascade of match rules after the merge, before the combined file goes anywhere near a mailing platform, catches those overlaps. The same applies after any CRM migration, system integration, or data import from a third-party platform.

5. Keep the audit trail GDPR-ready

The audit trail requirement has already appeared in best practice 3, but it deserves its own section because the content of the log matters as much as its existence. A minimal GDPR-ready dedup log records, per suppressed record: the original row ID, the survivor row ID, the match rule that triggered the suppression, the date and time of suppression, and the source file or system that contained the record.
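
The five fields listed above map naturally onto a small record type. The sketch below writes entries as CSV using only the standard library. The class name, function names, and sample values are hypothetical, and the UTC ISO 8601 timestamp format is an assumption.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone


@dataclass
class DedupLogEntry:
    original_row_id: str   # the suppressed (non-survivor) row
    survivor_row_id: str   # the row that was kept
    match_rule: str        # which pass of the cascade triggered the match
    suppressed_at: str     # ISO 8601 timestamp in UTC
    source: str            # file or system the suppressed row came from


def log_suppression(original_id, survivor_id, rule: str, source: str) -> DedupLogEntry:
    """Create a log entry stamped with the current UTC time."""
    stamp = datetime.now(timezone.utc).isoformat()
    return DedupLogEntry(str(original_id), str(survivor_id), rule, stamp, source)


def write_log(entries: list[DedupLogEntry], handle) -> None:
    """Write the entries as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(DedupLogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


entry = log_suppression(1, 2, "exact_email", "bought_file_2024.csv")
buffer = io.StringIO()  # stands in for a file in a controlled location
write_log([entry], buffer)
print(buffer.getvalue())
```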

That last field, source file, is particularly important for bought-in data. When a contact bought from a third-party data provider is suppressed as a duplicate of an existing CRM contact, the log must show both origins. If that contact later submits a deletion request, you must delete the record from both the CRM and any archived copy of the original bought file. Without the source field in the log, you may satisfy the deletion request for the survivor but leave the original bought record intact in a file store.

Store the audit log in a system that persists beyond the working file. A dedup log that lives only in the same database as the marketing list gets lost if the list is archived or deleted. A separate audit table, or a CSV export stored to a controlled location, is the minimum. For teams handling large consumer files, a dedicated data governance platform is worth the investment.

Deduplication best practices: comparison table

Best practice	What it covers	Common mistake	UK B2B	UK B2C
Define match rules first	Exact email, name + postcode, name + company, composite	Letting the tool use defaults; skipping normalisation of Ltd/Limited	5%–15% duplicates found	8%–20% duplicates found
Survivor selection by score	Recency + field completeness, with golden-record merge for unique values	Defaulting to first-in or last-in row position	Applies to all sizes	Applies to all sizes
Flag and suppress, not delete	Suppression with audit log; reversible dedup	Hard-deleting non-survivor rows before GDPR obligations are clear	Required under UK GDPR Article 5	Required under UK GDPR Article 5
Dedup before enrichment	Sequencing to avoid double enrichment cost and field overwrites	Enriching the full file including duplicates, then deduping	Saves 10%–18% enrichment cost	Saves 10%–18% enrichment cost
Re-run after merges	Post-merge dedup pass on combined files	Assuming the bought file has no overlap with the existing CRM	Critical after every file merge	Critical after every file merge

What tools should I use for deduplication?

SQL for exact-match dedup at scale

SQL window functions handle exact-match deduplication cleanly on tens of millions of records without any additional software. The standard pattern uses ROW_NUMBER() OVER (PARTITION BY match_key ORDER BY completeness_score DESC, last_updated DESC). Rows where the row number is greater than one are the non-survivors. Set their dedup_status to suppressed, write the survivor ID into a survivor_id column, log the event, and you are done. This works on PostgreSQL, SQL Server, BigQuery, and most modern databases.
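
The pattern can be tried end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: it uses an in-memory SQLite database (version 3.25 or later, for window function support) rather than the databases named above, the table and sample rows are invented, id is added as a final tie-breaker so the ranking is deterministic, and the audit-log write is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    match_key TEXT,
    completeness_score INTEGER,
    last_updated TEXT,
    dedup_status TEXT DEFAULT 'active',
    survivor_id INTEGER
);
INSERT INTO contacts (id, match_key, completeness_score, last_updated) VALUES
    (1, 'jane@example.com', 3, '2019-03-01'),
    (2, 'jane@example.com', 5, '2024-06-01'),
    (3, 'sam@example.com',  4, '2023-01-15');
""")

# Rank rows within each match_key; rank 1 is the survivor of its cluster.
RANKED = """
SELECT id,
       ROW_NUMBER() OVER (
           PARTITION BY match_key
           ORDER BY completeness_score DESC, last_updated DESC, id
       ) AS rn,
       FIRST_VALUE(id) OVER (
           PARTITION BY match_key
           ORDER BY completeness_score DESC, last_updated DESC, id
       ) AS survivor
FROM contacts
"""

non_survivors = conn.execute(
    f"SELECT id, survivor FROM ({RANKED}) WHERE rn > 1"
).fetchall()

# Flag the non-survivors and point each at its survivor; nothing is deleted.
conn.executemany(
    "UPDATE contacts SET dedup_status = 'suppressed', survivor_id = ? WHERE id = ?",
    [(survivor, row_id) for row_id, survivor in non_survivors],
)
conn.commit()

rows = conn.execute(
    "SELECT id, dedup_status, survivor_id FROM contacts ORDER BY id"
).fetchall()
print(rows)
```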

The limit of SQL exact-match dedup is that it cannot catch fuzzy variants: "Jon Smith" and "Jonathan Smith" at the same postcode will not match on a string equality check. For smaller files (under 100,000 records), a manual review pass after exact dedup is feasible. For larger files, you need fuzzy matching.

Python for fuzzy matching

The Python dedupe library uses a trained machine-learning model to compare record pairs and assign match probabilities. It handles name variants, address abbreviations, and other common causes of missed exact-match duplicates. The trade-off is training time: you need to label a sample of pairs as matches or non-matches before the model performs well on your specific data. Budget at least half a day for initial training on a new dataset.

For simpler fuzzy work, the rapidfuzz library provides fast string similarity scoring (Levenshtein, Jaro-Winkler, token set ratio) without the training overhead. Combine it with a blocking step (for instance, comparing only records that share a postcode outward code) to keep runtime manageable on large files.
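
The blocking idea is sketched below. To keep the example free of third-party dependencies it uses difflib.SequenceMatcher from the standard library as a stand-in scorer. It is not rapidfuzz, and in practice you would swap in a rapidfuzz scorer for speed. The 0.7 threshold, the function names, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def outward_code(postcode: str) -> str:
    """Return everything before the three-character inward code."""
    compact = postcode.upper().replace(" ", "")
    return compact[:-3] if len(compact) > 3 else compact


def similarity(a: str, b: str) -> float:
    """Score from 0 to 1; a standard-library stand-in for a rapidfuzz scorer."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def candidate_pairs(records: list[dict], threshold: float = 0.7) -> list[tuple]:
    """Compare names only within a postcode block and keep the likely matches."""
    blocks = defaultdict(list)
    for record in records:
        blocks[outward_code(record["postcode"])].append(record)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = similarity(left["name"], right["name"])
            if score >= threshold:
                pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


records = [
    {"id": 1, "name": "Jon Smith", "postcode": "M1 3AB"},
    {"id": 2, "name": "Jonathan Smith", "postcode": "M1 3AX"},
    {"id": 3, "name": "Priya Patel", "postcode": "M1 4BT"},
    {"id": 4, "name": "Jon Smith", "postcode": "LS1 2AB"},
]
pairs = candidate_pairs(records)
print(pairs)
```

Record 4 shares a name with record 1 but sits in a different block, so the pair is never compared. That is the trade-off of blocking: far fewer comparisons, at the cost of missing matches that fall across blocks.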

When to use a specialist bureau

Specialist data bureaus add value in three situations: your volumes exceed a few million records and in-house fuzzy matching is too slow; the data is dirty enough (inconsistent address formatting, missing postcodes, mixed encoding) that configuring a reliable in-house pipeline is not cost-effective; or you need a third-party audit trail for compliance purposes. For routine dedup on files under 500,000 records, a well-structured SQL or Python pipeline is almost always faster and cheaper than an outsourced job.

For an end-to-end look at combining deduplication with enrichment in a single workflow, read our article on enrichment and deduplication strategy. If your dedup reveals structural gaps in your CRM data (fields consistently missing across large record cohorts), the CRM gap analysis guide covers how to categorise and prioritise those gaps before enrichment.

How does deduplication interact with the GDPR right to erasure?

Article 17 of UK GDPR gives data subjects the right to request deletion of their personal data where the data is no longer necessary for the purpose for which it was collected, where consent has been withdrawn, or where the processing was unlawful. The interaction with deduplication is specific and often mishandled.

When a contact submits an erasure request, the obvious step is to delete their record from the active marketing list. The less obvious step is to delete the suppressed duplicate rows too. Those rows still contain personal data (name, email, address) and are still subject to UK GDPR obligations. A dedup process that created those suppressed rows without a clear audit trail of which rows belong to which individual makes the erasure response much harder to execute cleanly.

The practical answer is to store individual identifiers consistently across the audit log. If your dedup log links every suppressed row back to a person_id or a canonical email address, an erasure request can be resolved with a single parameterised query that locates all rows for that identifier across active, suppressed, and archived tables. Without that link, the erasure team must search manually, which introduces the risk of missing rows and the corresponding risk of an ICO enforcement notice.
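
A lookup of that kind might look like the sketch below, again using sqlite3 for a self-contained example. The three table names and the person_id column are hypothetical, and the sketch runs one parameterised statement per table rather than a single query, which keeps it portable across schemas.

```python
import sqlite3

# Hypothetical table names; each is assumed to carry a person_id column.
TABLES = ("contacts_active", "contacts_suppressed", "contacts_archive")


def erase_person(conn: sqlite3.Connection, person_id: str) -> dict:
    """Delete every row for person_id and return the row count per table."""
    deleted = {}
    for table in TABLES:  # names come from the fixed tuple above, never from user input
        cursor = conn.execute(
            f"DELETE FROM {table} WHERE person_id = ?", (person_id,)
        )
        deleted[table] = cursor.rowcount
    conn.commit()
    return deleted


conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"CREATE TABLE {table} (person_id TEXT, email TEXT)")
conn.execute("INSERT INTO contacts_active VALUES ('p1', 'jane@example.com')")
conn.execute("INSERT INTO contacts_suppressed VALUES ('p1', 'j.smith@example.com')")
conn.execute("INSERT INTO contacts_suppressed VALUES ('p2', 'sam@example.com')")
conn.execute("INSERT INTO contacts_archive VALUES ('p1', 'jane@example.com')")

result = erase_person(conn, "p1")
print(result)
```

The returned counts give the erasure team a record of how many rows were removed from each store, which can itself be logged without retaining the personal data.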

One further point: suppression lists used for TPS or Mailing Preference Service (MPS) compliance are a legitimate retention exception. A record kept solely to prevent future unsolicited contact (a "do not contact" flag) is not personal data held for marketing purposes; it is held to honour an opt-out. The ICO accepts this distinction. But the flag must contain only what is necessary to achieve that purpose: typically a hashed identifier or the minimum data needed to match future inbound records. A full record with job title, mobile number, and company revenue is not a suppression entry; it is a marketing record with a suppression flag, and it is still subject to full UK GDPR obligations.

Data deduplication best practices for marketing lists

Key points