Building Knowledge Graphs from Documents: Extraction Pipelines
Series: Graph RAG | Part: 4 of 10
You have a thousand documents. Product specs. API documentation. Internal wikis. Support tickets. Research papers. A decade of accumulated knowledge locked in unstructured text.
You need a knowledge graph. Which means you need to transform narrative text into structured triples: entities, relationships, and attributes that machines can traverse.
This is the extraction problem. And it's the make-or-break challenge for Graph RAG in production.
Manual graph construction—hiring experts to read every document and hand-code triples—doesn't scale. You need automated extraction pipelines that can process your corpus and produce a usable graph without human annotation for every sentence.
The good news: modern NLP and LLMs have made this tractable. The bad news: extraction pipelines are complex, error-prone, and require careful engineering to produce graphs worth querying.
Here's how the pieces fit together.
The Pipeline Architecture
A complete extraction pipeline has five stages:
- Entity Recognition: Identify which spans of text refer to entities
- Entity Linking: Disambiguate entities and link to canonical identifiers
- Relation Extraction: Identify relationships between entities
- Attribute Extraction: Extract properties and metadata
- Graph Construction: Assemble triples into a queryable graph database
Each stage has multiple implementation strategies, from traditional NLP to modern LLM-based approaches. The architecture you choose depends on your domain, data volume, and quality requirements.
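Concretely, the skeleton of such a pipeline might look like the following sketch. The Triple shape and the five stage functions are placeholders; you pass in whichever implementations you choose (spaCy, an LLM, rules, a graph loader):

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    attributes: dict = field(default_factory=dict)

def run_pipeline(
    documents: list[str],
    recognize: Callable,   # Stage 1: text -> entity mentions
    link: Callable,        # Stage 2: mentions -> canonical entities
    relate: Callable,      # Stage 3: entities + text -> relation triples
    attribute: Callable,   # Stage 4: entities + text -> attribute triples
    load: Callable,        # Stage 5: one triple -> graph database write
) -> int:
    """Wire the five stages together; each stage is whatever implementation you choose."""
    written = 0
    for text in documents:
        entities = link(recognize(text), text)
        for triple in relate(entities, text) + attribute(entities, text):
            load(triple)
            written += 1
    return written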
Stage 1: Entity Recognition
Entity recognition asks: which phrases in this text refer to things that should become nodes?
Traditional approach: Named Entity Recognition (NER)
Classical NER models—trained on labeled corpora—identify entities by type:
Input: "Marie Curie won the Nobel Prize in Physics in 1903."
Output:
- "Marie Curie" → PERSON
- "Nobel Prize in Physics" → AWARD
- "1903" → DATE
State-of-the-art NER models like spaCy's transformer-based pipelines or Flair achieve 90%+ F1 scores on standard benchmarks for common entity types (people, organizations, locations, dates).
But domain-specific entities are harder. "Kubernetes pod" and "Docker container" won't be recognized by a model trained on news articles. You need domain adaptation—fine-tuning on examples from your specific corpus.
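A minimal sketch of the classical route with spaCy (the en_core_web_trf model is just an example; a fine-tuned model would slot in the same way):

import spacy

# Assumes the transformer pipeline has been installed:
#   python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

doc = nlp("Marie Curie won the Nobel Prize in Physics in 1903.")
for ent in doc.ents:
    # Each entity span carries its text and a predicted type label.
    print(ent.text, ent.label_)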
LLM-based approach: Prompted extraction
Recent practice uses LLMs with few-shot prompts:
Extract all entities from the following text. Identify their type.
Text: "The authentication service uses Redis for session storage and
connects to the PostgreSQL user database via port 5432."
Entities:
- "authentication service" (SERVICE)
- "Redis" (DATABASE)
- "session storage" (COMPONENT)
- "PostgreSQL user database" (DATABASE)
- "port 5432" (CONFIGURATION)
LLMs often handle domain-specific entities better than off-the-shelf NER models because they've seen diverse text during pre-training. They can generalize from a few examples in the prompt to identify entities in your domain, even without fine-tuning.
The tradeoff: cost and latency. Running an LLM over every sentence in a large corpus is expensive. Many pipelines use NER for common entities and LLMs for domain-specific or ambiguous cases.
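A sketch of the prompted approach, with call_llm standing in for whichever client you actually use, and JSON output requested so the reply can be parsed rather than scraped:

import json

ENTITY_PROMPT = """Extract all entities from the following text.
Return a JSON list of objects with "text" and "type" fields.

Text: {text}

Entities:"""

def extract_entities_llm(text: str, call_llm) -> list[dict]:
    """call_llm is a placeholder: a function that takes a prompt string and
    returns the model's reply (OpenAI, Anthropic, a local model, ...)."""
    reply = call_llm(ENTITY_PROMPT.format(text=text))
    try:
        return json.loads(reply)   # e.g. [{"text": "Redis", "type": "DATABASE"}]
    except json.JSONDecodeError:
        return []                  # malformed output: skip, retry, or log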
The span problem
A subtle challenge: where do entity boundaries start and end?
Is it "Nobel Prize" or "Nobel Prize in Physics"? Both are valid entities, but they're different nodes. "Nobel Prize in Physics" is a specific award; "Nobel Prize" is a category.
Aggressive extraction creates fine-grained entities. Conservative extraction creates broader ones. The right choice depends on your query patterns. If users ask "Who won the Nobel Prize?" you want the broader category. If they ask "Who won the Nobel Prize in Physics in 1903?" you need the specific instance.
Many pipelines extract both and create a hierarchy: "Nobel Prize in Physics" is a subtype of "Nobel Prize."
Stage 2: Entity Linking
You've identified "Apple" as an entity. But which Apple?
- Apple Inc. (the company)
- Apple (the fruit)
- Apple Records (the label founded by the Beatles)
- Apple Bank (a New York financial institution)
Entity linking resolves this ambiguity by mapping textual mentions to canonical identifiers.
Approaches:
1. Lexical matching to knowledge bases
If you're linking to an existing KB like Wikidata, you can match entity mentions to Wikidata labels and aliases. "Apple Inc." matches Q312 (the Wikidata ID for Apple Inc.). Disambiguation uses context: if the surrounding text mentions "iPhone" and "Tim Cook," it's probably the company.
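A minimal sketch of candidate lookup against Wikidata's public wbsearchentities endpoint; the returned labels and descriptions feed the disambiguation step:

import requests

def wikidata_candidates(mention: str, limit: int = 5) -> list[dict]:
    """Fetch candidate Wikidata entities for a mention; context-based
    disambiguation happens in a later step."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": mention,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"id": hit["id"], "label": hit.get("label", ""), "description": hit.get("description", "")}
        for hit in resp.json().get("search", [])
    ]

# wikidata_candidates("Apple") returns candidates including Q312 (Apple Inc.) and Q89 (the fruit).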
2. Embedding-based linking
Encode the context around the entity mention and compare to embeddings of candidate entities. The candidate with the highest semantic similarity wins.
This works well when candidates have rich descriptions. It fails when entities are new or poorly documented—there's nothing to match against.
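A sketch of embedding-based linking with a sentence encoder (the model name is only an example), scoring each candidate's description against the mention's surrounding context:

from sentence_transformers import SentenceTransformer
import numpy as np

# Any sentence encoder works; this one is small and fast.
model = SentenceTransformer("all-MiniLM-L6-v2")

def link_by_embedding(mention_context: str, candidates: list[dict]) -> dict:
    """Pick the candidate whose description is most similar to the mention's
    context. Each candidate is a dict with "id" and "description" keys."""
    texts = [mention_context] + [c["description"] for c in candidates]
    vecs = model.encode(texts, normalize_embeddings=True)
    scores = vecs[1:] @ vecs[0]          # cosine similarity (vectors are normalized)
    return candidates[int(np.argmax(scores))]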
3. LLM-based disambiguation
Provide context to an LLM and ask it to choose:
Text: "Apple announced record profits in Q4 2023."
Which entity does "Apple" refer to?
A) Apple Inc. (technology company)
B) Apple (fruit)
C) Apple Records (music label)
Answer: A
LLMs excel at using context to disambiguate. They understand that "announced record profits" implies a corporate entity, not a fruit.
Creating new entities
Sometimes the entity doesn't exist in your KB. "The new authentication service we launched last month" isn't in Wikidata. The pipeline needs to decide: create a new entity or link to an existing one.
Pipelines that are conservative about merging create new entities liberally, leading to duplicates. Pipelines that merge aggressively conflate entities that should stay distinct. The right balance is domain-specific and often requires iterative refinement.
Stage 3: Relation Extraction
You've identified entities. Now: how do they relate?
Pattern-based extraction
The oldest approach uses hand-written patterns:
Pattern: <PERSON> won <AWARD> in <YEAR>
Match: "Marie Curie won the Nobel Prize in 1903"
Triple: (Marie Curie, won, Nobel Prize, year=1903)
This is precise but brittle. It only captures relationships matching your patterns. Paraphrase the sentence—"The Nobel Prize was awarded to Marie Curie in 1903"—and the pattern might miss it.
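A minimal sketch of that pattern as a regular expression; real systems constrain the placeholders with NER output rather than loose character classes:

import re

# One hand-written pattern: "<PERSON> won <AWARD> in <YEAR>".
WON_PATTERN = re.compile(
    r"(?P<person>[A-Z][\w .]+?) won (?:the )?(?P<award>[A-Z][\w ]+?) in (?P<year>\d{4})"
)

def extract_won(sentence: str):
    m = WON_PATTERN.search(sentence)
    if m:
        return (m.group("person"), "WON", m.group("award"), {"year": m.group("year")})
    return None

print(extract_won("Marie Curie won the Nobel Prize in 1903"))
# ("Marie Curie", "WON", "Nobel Prize", {"year": "1903"})
print(extract_won("The Nobel Prize was awarded to Marie Curie in 1903"))
# None -- the paraphrase slips past the pattern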
Supervised relation extraction
Train a classifier on labeled examples:
Input: sentence + two marked entities
Output: relationship type or NONE
Training examples:
- "Marie Curie [E1] won [R] the Nobel Prize [E2]" → WON
- "Einstein [E1] developed [R] relativity [E2]" → DEVELOPED
- "The cat [E1] sat on [R] the mat [E2]" → NONE
This generalizes better than patterns but requires labeled training data—expensive to create at scale.
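A toy sketch of the supervised framing with scikit-learn, using the three examples above; a real system needs thousands of labeled sentences and a stronger encoder (typically a fine-tuned transformer), this only shows the shape of the task:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sentences with entity markers, each labeled with a relation type or NONE.
train_sentences = [
    "Marie Curie [E1] won [R] the Nobel Prize [E2]",
    "Einstein [E1] developed [R] relativity [E2]",
    "The cat [E1] sat on [R] the mat [E2]",
]
train_labels = ["WON", "DEVELOPED", "NONE"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_sentences, train_labels)

print(clf.predict(["Bohr [E1] won [R] the Nobel Prize [E2]"]))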
Open Information Extraction (OpenIE)
OpenIE extracts relationships without predefined schemas:
Input: "Marie Curie discovered radium in 1898."
Output: (Marie Curie, discovered, radium, in 1898)
The relationship label comes directly from the text ("discovered"), not from a fixed taxonomy. This produces messy graphs with hundreds of unique edge types, but it captures information that schema-based extraction would miss.
LLM-based extraction
The current frontier uses LLMs:
Extract all relationships from this sentence:
"The authentication service uses Redis for session storage and
connects to the PostgreSQL user database."
Relationships:
- (authentication service, USES, Redis)
- (authentication service, CONNECTS_TO, PostgreSQL user database)
- (Redis, STORES, session storage)
LLMs handle paraphrase, complex syntax, and implicit relationships. They can infer that "for session storage" implies a STORES relationship even though it's not stated as a verb.
The challenge: hallucination. LLMs sometimes extract relationships that aren't in the text, especially when prompted to find many relationships. Validation and filtering are essential.
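One simple guard is to check every extracted triple against the source text and the ontology before accepting it. A sketch of that filter (substring checks are crude, but they catch the obvious fabrications; production pipelines also verify spans or use entailment models):

def filter_hallucinated(triples, source_text, allowed_predicates):
    """Drop triples whose subject or object never appears in the source,
    or whose predicate falls outside the ontology."""
    kept = []
    text = source_text.lower()
    for subj, pred, obj in triples:
        if subj.lower() in text and obj.lower() in text and pred in allowed_predicates:
            kept.append((subj, pred, obj))
    return kept

triples = [
    ("authentication service", "USES", "Redis"),
    ("Redis", "REPLACES", "Memcached"),   # hallucinated: Memcached isn't in the text
]
text = "The authentication service uses Redis for session storage."
print(filter_hallucinated(triples, text, {"USES", "CONNECTS_TO", "STORES"}))
# [("authentication service", "USES", "Redis")]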
Stage 4: Attribute Extraction
Beyond relationships, entities have properties: dates, quantities, descriptions, categories.
Entity: Marie Curie
Attributes:
- birth_date: 1867-11-07
- nationality: Polish, French
- field: Physics, Chemistry
- known_for: radioactivity, polonium, radium
Attribute extraction identifies these key-value pairs from surrounding text.
Approaches:
- Slot filling: Template-based extraction for common attributes
- Dependency parsing: Extract attributes based on syntactic relationships
- LLM prompts: Ask the model to extract properties for a given entity
The challenge is consistency. Different documents might describe the same property in different ways: "born in 1867" vs "birth year: 1867" vs "1867-11-07." Normalization is crucial—converting varied expressions into canonical forms.
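A small sketch of that normalization for birth dates, handling just two of the surface forms above; real pipelines lean on a date parser (e.g. dateutil) plus per-attribute rules:

from datetime import date
import re

def normalize_birth_date(value: str) -> str | None:
    """Map varied surface forms onto one canonical ISO-8601 form."""
    iso = re.search(r"\d{4}-\d{2}-\d{2}", value)
    if iso:
        return date.fromisoformat(iso.group()).isoformat()
    year = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", value)
    if year:
        return year.group()   # year-only precision, kept as "1867"
    return None

for raw in ["born in 1867", "birth year: 1867", "1867-11-07"]:
    print(raw, "->", normalize_birth_date(raw))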
Stage 5: Graph Construction
You have entities, relationships, and attributes. Now assemble them into a graph database.
Storage options:
1. Triple stores (RDF databases)
Systems like Apache Jena, Stardog, or Amazon Neptune store triples in RDF format:
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:MarieCurie a ex:Person ;
    ex:birthDate "1867-11-07"^^xsd:date ;
    ex:won ex:NobelPrizePhysics1903 .
RDF stores support SPARQL queries and ontology reasoning. They're standards-compliant and interoperable but can be slower than property graphs for some query patterns.
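For illustration, a sketch that loads the Turtle above with rdflib and runs a SPARQL query over it; a production triple store would expose the same query through its own endpoint:

from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:MarieCurie a ex:Person ;
    ex:birthDate "1867-11-07"^^xsd:date ;
    ex:won ex:NobelPrizePhysics1903 .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# Who won what?
for row in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?award WHERE { ?person ex:won ?award . }
"""):
    print(row.person, row.award)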
2. Property graph databases
Systems like Neo4j, Amazon Neptune (also supports property graphs), or TigerGraph store graphs as nodes and labeled edges with properties:
CREATE (marie:Person {
    name: "Marie Curie",
    birthDate: date("1867-11-07"),
    nationality: ["Polish", "French"]
})
CREATE (nobel:Award {name: "Nobel Prize in Physics", year: 1903})
CREATE (marie)-[:WON]->(nobel)
Property graphs are often faster for traversal-heavy queries and have more intuitive query languages (Cypher), making them popular for operational systems.
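A minimal loading sketch with the Neo4j Python driver; the connection details are placeholders, and the predicate is stored as a property because Cypher cannot parameterize relationship types:

from neo4j import GraphDatabase

# Connection details are placeholders for your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_triple(tx, subj, pred, obj):
    # MERGE is idempotent: re-running extraction won't duplicate nodes or edges.
    tx.run(
        "MERGE (s:Entity {name: $subj}) "
        "MERGE (o:Entity {name: $obj}) "
        "MERGE (s)-[r:REL {type: $pred}]->(o)",
        subj=subj, pred=pred, obj=obj,
    )

with driver.session() as session:
    session.execute_write(load_triple, "Marie Curie", "WON", "Nobel Prize in Physics")
driver.close()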
Schema enforcement
During construction, enforce schema constraints:
- Validate that relationships connect allowed entity types
- Require mandatory properties
- Detect and merge duplicate entities
- Maintain referential integrity
Without enforcement, extraction errors compound. A single misidentified relationship can create nonsensical graph structures that break downstream queries.
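A toy sketch of the first check, rejecting triples whose endpoint types the ontology does not allow; the entity types and relationship names here are illustrative:

# A toy ontology: which entity types each relationship may connect.
SCHEMA = {
    "WON":         ("PERSON", "AWARD"),
    "USES":        ("SERVICE", "DATABASE"),
    "CONNECTS_TO": ("SERVICE", "DATABASE"),
}

def violates_schema(subj_type, pred, obj_type):
    """Return a reason string if a triple breaks the ontology, else None."""
    if pred not in SCHEMA:
        return f"unknown relationship type: {pred}"
    expected_subj, expected_obj = SCHEMA[pred]
    if subj_type != expected_subj or obj_type != expected_obj:
        return f"{pred} must connect {expected_subj} -> {expected_obj}, got {subj_type} -> {obj_type}"
    return None

# A misidentified triple is rejected before it reaches the graph:
print(violates_schema("PERSON", "WON", "DATABASE"))
# WON must connect PERSON -> AWARD, got PERSON -> DATABASE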
Quality and Iteration
Extraction pipelines are never perfect on the first pass. Expect:
- Precision issues: Extracted triples that don't exist in the source text
- Recall issues: Missed entities and relationships
- Linking errors: Entities incorrectly merged or split
- Schema drift: Relationship types proliferating beyond your ontology
Production pipelines include:
1. Human-in-the-loop validation
Sample extracted triples and have domain experts review them. Use feedback to refine prompts, adjust models, or add correction rules.
2. Consistency checking
Run automated checks:
- Do all PERSON entities have birthdates?
- Are there WON relationships pointing to non-AWARD entities?
- Are dates in valid ranges?
Violations indicate extraction errors.
3. Iterative refinement
Extraction isn't one-and-done. As your corpus grows, re-run extraction on new documents. As your ontology evolves, re-extract to capture new relationship types.
Treat your graph as living infrastructure, not a static artifact.
The Tradeoff Space
Building extraction pipelines requires navigating tradeoffs:
- Precision vs Recall: Extract conservatively and miss relationships, or aggressively and include noise
- Schema-based vs Schema-free: Constrain to a predefined ontology or allow open extraction
- Cost vs Quality: Use expensive LLMs for high-quality extraction or cheaper models for scale
- Batch vs Incremental: Process the entire corpus at once or extract incrementally as documents are added
The right choices depend on your use case. A research graph prioritizes recall—capture everything, filter later. A production system prioritizes precision—wrong information breaks user trust.
The Path to Production
Automatic extraction enables Graph RAG at scale. Instead of manually crafting triples, you build a pipeline that transforms your existing documentation into a queryable knowledge graph.
Once that graph exists, the next question is: what can you do with it that vector search can't?
That's where multi-hop reasoning enters—the ability to answer questions by traversing relationship chains, not just finding similar text. And that's what makes Graph RAG transformative.
Further Reading
- Stanovsky, G. et al. (2018). "Supervised Open Information Extraction." NAACL 2018.
- Hoffart, J. et al. (2011). "Robust Disambiguation of Named Entities in Text." EMNLP 2011.
- Bosselut, A. et al. (2019). "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction." ACL 2019.
This is Part 4 of the Graph RAG series, exploring how knowledge graphs solve the limitations of naive vector retrieval.
Previous: Knowledge Graphs 101: Nodes, Edges, and Semantic Structure
Next: Multi-Hop Reasoning: How Graphs Enable Complex Queries