What Is NER and Why Does It Matter for PII?

Named entity recognition (NER) is the task of identifying and classifying spans of text that refer to specific real-world things: people, organisations, locations, dates, monetary values. A sentence like “Sarah Johnson joined Pfizer’s London office in March 2024” contains four entities – a person, an organisation, a location, and a date. NER models label each span with its type.

This is useful in many contexts. It becomes urgent when the text contains information that should not leave your organisation, be sent to a third-party API, or appear in a model’s training data. At that point NER is not a convenience feature – it is the first line of a compliance pipeline.

What counts as PII

Personally identifiable information is any data that can identify a specific individual, directly or in combination with other data. The standard NER entity types – person names, organisations, locations – cover some of it. They do not cover all of it. The categories that matter most for compliance extend considerably further:

Direct identifiers: name, national insurance number, passport number, date of birth, NHS number, driving licence number.
Contact data: email address, phone number, home address, IP address.
Financial identifiers: bank account number, sort code, credit card number.
Health data: diagnoses, prescriptions, clinical notes, any data that can be linked to a medical record.
Identifiers in combination: job title plus employer plus approximate location can uniquely identify a person even when no name appears.
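The last category can be checked mechanically. The sketch below counts how many records share each combination of job title, employer, and location; a combination that occurs once singles out one person even though no name is present. The records and field names are invented for illustration.

```python
from collections import Counter

# Hypothetical records with names already removed; only quasi-identifiers remain.
records = [
    {"job_title": "Consultant cardiologist", "employer": "Trust A", "location": "Leeds"},
    {"job_title": "Staff nurse", "employer": "Trust A", "location": "Leeds"},
    {"job_title": "Staff nurse", "employer": "Trust A", "location": "Leeds"},
    {"job_title": "Staff nurse", "employer": "Trust B", "location": "York"},
]

QUASI_IDENTIFIERS = ("job_title", "employer", "location")


def group_sizes(rows, keys=QUASI_IDENTIFIERS):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_combinations(rows, keys=QUASI_IDENTIFIERS):
    """Return combinations that occur exactly once, i.e. that single out one row."""
    return [combo for combo, n in group_sizes(rows, keys).items() if n == 1]


singled_out = unique_combinations(records)
for combo in singled_out:
    print("Uniquely identifying combination:", combo)
```

In this toy data the cardiologist and the Trust B nurse are each the only record with their combination, so removing names alone would not protect them.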

Standard NER models trained on news corpora handle person names, organisations, and locations reasonably well. They handle the compliance-critical categories inconsistently, because those categories rarely appear in news text and are therefore underrepresented in training data.

The three operations

NER supports three distinct operations on sensitive text, each with different purposes and different error consequences.

Detection establishes whether PII is present in a document. The output is a flag or a confidence score. It is useful for routing decisions – should this document go to a human reviewer before processing? – and for audit logging.

Extraction pulls the identified spans with their types and positions. The output is a structured record: entity text, entity type, character offsets. This feeds downstream systems: audit trails, anonymisation pipelines, data catalogues.

Redaction replaces identified spans in the source text with placeholders or type labels. “Sarah Johnson joined Pfizer’s London office” becomes “[PERSON] joined [ORG]’s [LOC] office”. The redacted text can then be passed to an LLM or other downstream process without the original identifying information.
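A minimal sketch of how detection, extraction, and redaction relate, using the example sentence. The recogniser here is a toy lookup table standing in for a real NER model, and the names Entity, TOY_GAZETTEER, extract, detect, and redact are illustrative assumptions rather than any library's API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


# Toy stand-in for a real NER model: a fixed lookup of known strings.
# A production system would call a trained model here instead.
TOY_GAZETTEER = {
    "Sarah Johnson": "PERSON",
    "Pfizer": "ORG",
    "London": "LOC",
}


def extract(text):
    """Extraction: return every matched span with its type and offsets."""
    found = []
    for surface, label in TOY_GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            found.append(Entity(m.group(), label, m.start(), m.end()))
    return sorted(found, key=lambda e: e.start)


def detect(text):
    """Detection: a yes/no flag, useful for routing and audit logging."""
    return bool(extract(text))


def redact(text):
    """Redaction: replace each span with its type label.

    Spans are replaced right to left so that earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    for ent in reversed(extract(text)):
        text = text[:ent.start] + f"[{ent.label}]" + text[ent.end:]
    return text


sentence = "Sarah Johnson joined Pfizer's London office"
entities = extract(sentence)
redacted = redact(sentence)
print(detect(sentence))
print(entities)
print(redacted)
```

Detection and redaction are both built on extraction here, which is why an extraction error shows up in all three outputs.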

These three operations are often conflated but require separate evaluation. A model that detects PII with high recall may produce so many false positives in extraction that the redacted text is unusable. A model that redacts accurately on news text may miss domain-specific identifiers entirely.

Failure modes

False negatives are the dangerous failure. An entity the model misses passes through unredacted. In a pipeline sending text to an external API or storing outputs in a retrievable system, a missed name or account number represents a real exposure. False negative rates vary significantly by entity type: models recognise person names of the kind common in English-language news corpora well; they do worse on names from other scripts, unusual spellings, and domain-specific terminology.

False positives destroy meaning. Aggressive redaction of a clinical document may remove every date reference, making the document clinically useless. A legal document with all organisation names redacted may lose the context that makes it interpretable. The precision-recall tradeoff in NER has downstream consequences that a simple F1 score does not capture.
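To make the point about a single F1 score concrete, the sketch below scores invented gold and predicted spans per entity type, using exact span matching. The spans, labels, and function names are assumptions for illustration only.

```python
from collections import defaultdict

# Each span is (start, end, label). Values are invented for illustration.
gold = {
    (0, 13, "PERSON"), (21, 27, "ORG"), (40, 50, "DATE"),
    (60, 70, "NHS_NUMBER"), (80, 90, "NHS_NUMBER"),
}
predicted = {
    (0, 13, "PERSON"), (21, 27, "ORG"), (40, 50, "DATE"),
    (100, 110, "DATE"),  # false positive
}


def per_type_scores(gold_spans, predicted_spans):
    """Exact-match precision and recall for each entity label."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in predicted_spans:
        counts[span[2]]["tp" if span in gold_spans else "fp"] += 1
    for span in gold_spans - predicted_spans:
        counts[span[2]]["fn"] += 1

    scores = {}
    for label, c in counts.items():
        predicted_n = c["tp"] + c["fp"]
        gold_n = c["tp"] + c["fn"]
        scores[label] = {
            "precision": c["tp"] / predicted_n if predicted_n else 0.0,
            "recall": c["tp"] / gold_n if gold_n else 0.0,
        }
    return scores


def micro_f1(gold_spans, predicted_spans):
    """Single pooled F1 across all entity types."""
    tp = len(gold_spans & predicted_spans)
    precision = tp / len(predicted_spans) if predicted_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


scores = per_type_scores(gold, predicted)
overall = micro_f1(gold, predicted)
```

On this toy data the pooled F1 is about 0.67, which looks tolerable, while recall on NHS_NUMBER is zero: every NHS number passes through unredacted. Reporting per-type recall alongside the pooled score is what exposes that.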

Domain shift is the most common real-world failure. NER models trained on news text encounter problems when applied to legal documents, clinical records, financial reports, or internal communications. Entity types that are rare in training data are handled poorly. An NHS trust deploying a standard NER model for clinical notes will find that it misses a significant proportion of patient identifiers that do not follow patterns seen in newswire text.

Extraction error propagates. Text entering a memory store or a structured data pipeline rarely enters raw – it passes through NER and other extraction steps first. An extraction error writes incorrect data into the downstream system. A fact extractor that misidentifies a drug name as a person name, or a summariser that drops a negation, produces structured records that are wrong. Those wrong records are then retrieved and acted upon as if they were correct. The pipeline has no mechanism to discover the error unless there is an evaluation layer explicitly checking extraction quality. (The evaluation approaches for extraction pipelines are covered in How to Evaluate an AI Pipeline.)

When NER and PII removal are not enough

NER addresses the surface form of sensitive information. It does not address the information content of the text itself. This distinction matters more than it is usually given credit for.

Consider a pharmaceutical company processing internal documents about a drug in late-stage clinical trials. Removing all person names, dates, and locations leaves a document that still contains the drug mechanism, the trial design, the patient population characteristics, the preliminary efficacy signals, and the supplier relationships that make the project possible. None of that information is a named entity. All of it is sensitive.

The same problem applies in defence, in legal proceedings, in competitive intelligence contexts, and in any domain where the relationships between entities – who is working with whom, on what, at what stage – carry as much risk as the entities themselves. A contract from which all names have been redacted may still reveal the commercial structure of a deal. A research document with all author names removed may still identify the institution through its methodology and references.

NER and PII removal are not a solution to data sensitivity. They are a solution to a specific and relatively narrow problem: preventing directly identifying information from appearing in plaintext in contexts where it should not. For organisations in sectors where the information content of documents is itself sensitive – independent of who authored them or who they mention – the appropriate architecture is to keep processing inside a controlled boundary entirely, rather than relying on extraction to sanitise documents before they cross one.

Models available

The practical landscape runs from lightweight rule-augmented models to large general-purpose classifiers.

spaCy (en_core_web_sm through en_core_web_trf) is the standard production NER tool for English text. The small CNN-based model (sm) runs efficiently on CPU but achieves F1 around 0.83 on newswire benchmarks – adequate for low-stakes detection, not sufficient for compliance-critical work. The transformer-based pipeline (trf) achieves substantially better accuracy at higher memory cost. spaCy’s English pipelines are trained on OntoNotes 5 and label 18 entity types, including PERSON, ORG, GPE, LOC, DATE, and MONEY, which is broader than the four-type CoNLL scheme (person, organisation, location, miscellaneous). They are straightforward to extend or fine-tune on domain-specific data.

dslim/bert-base-NER is a BERT model fine-tuned on CoNLL-2003, widely used as a reference point and as a component in larger pipelines. It handles the four standard entity types and performs well on news-domain text. Its limitations resemble those of spaCy’s transformer pipeline: CoNLL-2003 is a newswire corpus, and performance degrades on out-of-domain text. It does not cover compliance-specific entity types without fine-tuning.

Microsoft Presidio sits above the model layer – it is a framework that combines NER models (including spaCy and BERT-based models) with rule-based recognisers and regular expressions to cover the identifier types that NER models miss: NHS numbers, credit card patterns, email addresses, IBANs. For production PII pipelines in regulated sectors, Presidio is the practical starting point because it handles the compliance-critical categories that neural NER models underperform on.
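The rule-based layer is easy to picture with a small example. The sketch below is not Presidio's API; it is a standalone illustration, under simplified assumptions, of the pattern-plus-checksum idea: a regular expression proposes candidates, and a checksum (modulus 11 for NHS numbers, Luhn for card numbers) rejects digit strings that merely look similar. The regular expressions and function names are assumptions and are deliberately loose.

```python
import re

# Candidate patterns: simplified for illustration, not production-grade.
NHS_CANDIDATE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def nhs_checksum_ok(digits):
    """Modulus 11 check used for ten-digit NHS numbers."""
    if len(digits) != 10 or not digits.isdigit():
        return False
    # Weights 10 down to 2 are applied to the first nine digits.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return check != 10 and check == int(digits[9])


def luhn_ok(digits):
    """Luhn check used for payment card numbers."""
    if not digits.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        n = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


def find_structured_ids(text):
    """Return (label, start, end) for candidates that pass their checksum."""
    hits = []
    for m in NHS_CANDIDATE.finditer(text):
        if nhs_checksum_ok(re.sub(r"[ -]", "", m.group())):
            hits.append(("NHS_NUMBER", m.start(), m.end()))
    for m in CARD_CANDIDATE.finditer(text):
        if luhn_ok(re.sub(r"[ -]", "", m.group())):
            hits.append(("CREDIT_CARD", m.start(), m.end()))
    return hits


# 943 476 5919 and 4111 1111 1111 1111 are widely used test values.
text = "Patient 943 476 5919 paid with card 4111 1111 1111 1111."
hits = find_structured_ids(text)
```

The checksum step is what keeps precision acceptable: a ten-digit order reference that fails the modulus 11 check is not flagged as an NHS number.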

openai/privacy-filter (Apache 2.0, Hugging Face) is a bidirectional token classifier derived from the GPT-OSS architecture, post-trained specifically for PII detection and masking. At 1.5B parameters with 50M active parameters it runs efficiently – in a browser or on a laptop – while offering a 128,000-token context window that handles long documents without chunking. It covers eight PII label categories, supports precision-recall tuning at runtime, and is designed explicitly for on-premises deployment in high-throughput sanitisation workflows. For organisations that need a single model handling long documents with configurable sensitivity, it is currently the strongest available open-weight option.

For high-volume pipelines where throughput matters and entity types are well-defined, fine-tuned BERT-class models remain competitive with much larger models. For out-of-domain text or novel entity types, recent benchmarks suggest that LLM-based NER with few-shot prompting can outperform fine-tuned classifiers – at significantly higher cost per document and lower throughput. The choice follows from volume, domain specificity, and the cost of a missed entity.

Composing NER into pipelines

NER and PII removal work in most production contexts as a preprocessing step: text enters, entities are identified, the redacted form proceeds to the next stage. The challenge is that this step needs to run consistently across every task in a pipeline – not just at the entry point, but at each stage where text is passed between components, written to memory, or returned as output.
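One way to picture redaction at every stage boundary is a wrapper applied to each stage rather than a single filter at the entry point. The sketch below is a generic illustration, not any product's workflow format; the redactor only masks email addresses, and the stage names (enrich, summarise) are hypothetical.

```python
import re
from typing import Callable, List

Stage = Callable[[str], str]

# Placeholder redactor: masks email addresses only.
# A real deployment would call its NER/PII component here.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    return EMAIL.sub("[EMAIL]", text)


def guarded(stage: Stage, redactor: Stage = redact) -> Stage:
    """Wrap a stage so its input and its output both pass through redaction."""
    def wrapped(text: str) -> str:
        return redactor(stage(redactor(text)))
    return wrapped


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Run every stage behind the same redaction guard."""
    for stage in stages:
        text = guarded(stage)(text)
    return text


# Hypothetical stages for illustration.
def enrich(text: str) -> str:
    # Simulates a component that reintroduces an identifier mid-pipeline,
    # for example by joining against a contact table.
    return text + " Contact: j.smith@example.org"


def summarise(text: str) -> str:
    return text.upper()


result = run_pipeline(
    "Ticket from a.jones@example.com about billing.",
    [enrich, summarise],
)
print(result)
```

An entry-point filter alone would have let the address added by enrich through; guarding each boundary catches it before the next stage sees it.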

Marigold allows NER and PII removal to be composed as pipeline stages in a workflow definition. A pipeline that ingests documents, redacts identified PII, runs an extraction or summarisation task, and writes the result to a retrievable store can declare the redaction step as a typed stage with its own inputs, outputs, and error handling. The same redaction configuration applies consistently across every document the pipeline processes.

For organisations where the concern is not just direct identifiers but the information content of documents – the pharmaceutical or defence cases described above – Marigold’s private inference architecture provides the complementary layer: the processing happens inside a controlled boundary rather than relying on sanitisation to make documents safe to export.

(Get in touch to discuss pipeline design for your specific data environment.)