Building Knowledge Graphs from Documents: Extraction Pipelines
Series: Graph RAG | Part: 4 of 10
You have a thousand documents. Product specs. API documentation. Internal wikis. Support tickets. Research papers. A decade of accumulated knowledge locked in unstructured text.
You need a knowledge graph. Which means you need to transform narrative text into structured triples: entities, relationships, and attributes that machines can traverse.
This is the extraction problem. And it's the make-or-break challenge for Graph RAG in production.
Manual graph construction—hiring experts to read every document and hand-code triples—doesn't scale. You need automated extraction pipelines that can process your corpus and produce a usable graph without human annotation for every sentence.
The good news: modern NLP and LLMs have made this tractable. The bad news: extraction pipelines are complex, error-prone, and require careful engineering to produce graphs worth querying.
Here's how the pieces fit together.
The Pipeline Architecture
A complete extraction pipeline has five stages:
- Entity Recognition: Identify which spans of text refer to entities
- Entity Linking: Disambiguate entities and link to canonical identifiers
- Relation Extraction: Identify relationships between entities
- Attribute Extraction: Extract properties and metadata
- Graph Construction: Assemble triples into a queryable graph database
Each stage has multiple implementation strategies, from traditional NLP to modern LLM-based approaches. The architecture you choose depends on your domain, data volume, and quality requirements.
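Concretely, the skeleton of such a pipeline might look like the following sketch. The Triple shape and the five stage functions are placeholders; you pass in whichever implementations you choose (spaCy, an LLM, rules, a graph loader):

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    attributes: dict = field(default_factory=dict)

def run_pipeline(
    documents: list[str],
    recognize: Callable,   # Stage 1: text -> entity mentions
    link: Callable,        # Stage 2: mentions -> canonical entities
    relate: Callable,      # Stage 3: entities + text -> relation triples
    attribute: Callable,   # Stage 4: entities + text -> attribute triples
    load: Callable,        # Stage 5: one triple -> graph database write
) -> int:
    """Wire the five stages together; each stage is whatever implementation you choose."""
    written = 0
    for text in documents:
        entities = link(recognize(text), text)
        for triple in relate(entities, text) + attribute(entities, text):
            load(triple)
            written += 1
    return written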
Stage 1: Entity Recognition
Entity recognition asks: which phrases in this text refer to things that should become nodes?
Traditional approach: Named Entity Recognition (NER)
Classical NER models—trained on labeled corpora—identify entities by type:
Input: "Marie Curie won the Nobel Prize in Physics in 1903."
Output:
- "Marie Curie" → PERSON
- "Nobel Prize in Physics" → AWARD
- "1903" → DATE
State-of-the-art NER models like spaCy's transformer-based pipelines or Flair achieve 90%+ F1 scores on standard benchmarks for common entity types (people, organizations, locations, dates).
But domain-specific entities are harder. "Kubernetes pod" and "Docker container" won't be recognized by a model trained on news articles. You need domain adaptation—fine-tuning on examples from your specific corpus.
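A minimal sketch of the classical route with spaCy (the en_core_web_trf model is just an example; a fine-tuned model would slot in the same way):

import spacy

# Assumes the transformer pipeline has been installed:
#   python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

doc = nlp("Marie Curie won the Nobel Prize in Physics in 1903.")
for ent in doc.ents:
    # Each entity span carries its text and a predicted type label.
    print(ent.text, ent.label_)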
LLM-based approach: Prompted extraction
Recent practice uses LLMs with few-shot prompts:
Extract all entities from the following text. Identify their type.
Text: "The authentication service uses Redis for session storage and
connects to the PostgreSQL user database via port 5432."
Entities:
- "authentication service" (SERVICE)
- "Redis" (DATABASE)
- "session storage" (COMPONENT)
- "PostgreSQL user database" (DATABASE)
- "port 5432" (CONFIGURATION)
LLMs often handle domain-specific entities better than off-the-shelf NER models because they've seen diverse text during pre-training. They can generalize from a few examples in the prompt to identify entities in your domain, even without fine-tuning.
The tradeoff: cost and latency. Running an LLM over every sentence in a large corpus is expensive. Many pipelines use NER for common entities and LLMs for domain-specific or ambiguous cases.
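A sketch of the prompted approach, with call_llm standing in for whichever client you actually use, and JSON output requested so the reply can be parsed rather than scraped:

import json

ENTITY_PROMPT = """Extract all entities from the following text.
Return a JSON list of objects with "text" and "type" fields.

Text: {text}

Entities:"""

def extract_entities_llm(text: str, call_llm) -> list[dict]:
    """call_llm is a placeholder: a function that takes a prompt string and
    returns the model's reply (OpenAI, Anthropic, a local model, ...)."""
    reply = call_llm(ENTITY_PROMPT.format(text=text))
    try:
        return json.loads(reply)   # e.g. [{"text": "Redis", "type": "DATABASE"}]
    except json.JSONDecodeError:
        return []                  # malformed output: skip, retry, or log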
The span problem
A subtle challenge: where do entity boundaries start and end?
Is it "Nobel Prize" or "Nobel Prize in Physics"? Both are valid entities, but they're different nodes. "Nobel Prize in Physics" is a specific award; "Nobel Prize" is a category.
Aggressive extraction creates fine-grained entities. Conservative extraction creates broader ones. The right choice depends on your query patterns. If users ask "Who won the Nobel Prize?" you want the broader category. If they ask "Who won the Nobel Prize in Physics in 1903?" you need the specific instance.
Many pipelines extract both and create a hierarchy: "Nobel Prize in Physics" is a subtype of "Nobel Prize."
Stage 2: Entity Linking
You've identified "Apple" as an entity. But which Apple?
- Apple Inc. (the company)
- Apple (the fruit)
- Apple Records (the label founded by the Beatles)
- Apple Bank (a New York financial institution)
Entity linking resolves this ambiguity by mapping textual mentions to canonical identifiers.
Approaches:
1. Lexical matching to knowledge bases
If you're linking to an existing KB like Wikidata, you can match entity mentions to Wikidata labels and aliases. "Apple Inc." matches Q312 (the Wikidata ID for Apple Inc.). Disambiguation uses context: if the surrounding text mentions "iPhone" and "Tim Cook," it's probably the company.
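A minimal sketch of candidate lookup against Wikidata's public wbsearchentities endpoint; the returned labels and descriptions feed the disambiguation step:

import requests

def wikidata_candidates(mention: str, limit: int = 5) -> list[dict]:
    """Fetch candidate Wikidata entities for a mention; context-based
    disambiguation happens in a later step."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": mention,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"id": hit["id"], "label": hit.get("label", ""), "description": hit.get("description", "")}
        for hit in resp.json().get("search", [])
    ]

# wikidata_candidates("Apple") returns candidates including Q312 (Apple Inc.) and Q89 (the fruit).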
2. Embedding-based linking
Encode the context around the entity mention and compare to embeddings of candidate entities. The candidate with the highest semantic similarity wins.
This works well when candidates have rich descriptions. It fails when entities are new or poorly documented—there's nothing to match against.
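A sketch of embedding-based linking with a sentence encoder (the model name is only an example), scoring each candidate's description against the mention's surrounding context:

from sentence_transformers import SentenceTransformer
import numpy as np

# Any sentence encoder works; this one is small and fast.
model = SentenceTransformer("all-MiniLM-L6-v2")

def link_by_embedding(mention_context: str, candidates: list[dict]) -> dict:
    """Pick the candidate whose description is most similar to the mention's
    context. Each candidate is a dict with "id" and "description" keys."""
    texts = [mention_context] + [c["description"] for c in candidates]
    vecs = model.encode(texts, normalize_embeddings=True)
    scores = vecs[1:] @ vecs[0]          # cosine similarity (vectors are normalized)
    return candidates[int(np.argmax(scores))]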
3. LLM-based disambiguation
Provide context to an LLM and ask it to choose:
Text: "Apple announced record profits in Q4 2023."
Which entity does "Apple" refer to?
A) Apple Inc. (technology company)
B) Apple (fruit)
C) Apple Records (music label)
Answer: A
LLMs excel at using context to disambiguate. They understand that "announced record profits" implies a corporate entity, not a fruit.
Creating new entities
Sometimes the entity doesn't exist in your KB. "The new authentication service we launched last month" isn't in Wikidata. The pipeline needs to decide: create a new entity or link to an existing one.
Pipelines that are conservative about merging create new entities liberally, leading to duplicates. Pipelines that merge aggressively conflate entities that should stay distinct. The right balance is domain-specific and often requires iterative refinement.
Stage 3: Relation Extraction
You've identified entities. Now: how do they relate?
Pattern-based extraction
The oldest approach uses hand-written patterns:
Pattern: <PERSON> won <AWARD> in <YEAR>
Match: "Marie Curie won the Nobel Prize in 1903"
Triple: (Marie Curie, won, Nobel Prize, year=1903)
This is precise but brittle. It only captures relationships matching your patterns. Paraphrase the sentence—"The Nobel Prize was awarded to Marie Curie in 1903"—and the pattern might miss it.
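A minimal sketch of that pattern as a regular expression; real systems constrain the placeholders with NER output rather than loose character classes:

import re

# One hand-written pattern: "<PERSON> won <AWARD> in <YEAR>".
WON_PATTERN = re.compile(
    r"(?P<person>[A-Z][\w .]+?) won (?:the )?(?P<award>[A-Z][\w ]+?) in (?P<year>\d{4})"
)

def extract_won(sentence: str):
    m = WON_PATTERN.search(sentence)
    if m:
        return (m.group("person"), "WON", m.group("award"), {"year": m.group("year")})
    return None

print(extract_won("Marie Curie won the Nobel Prize in 1903"))
# ("Marie Curie", "WON", "Nobel Prize", {"year": "1903"})
print(extract_won("The Nobel Prize was awarded to Marie Curie in 1903"))
# None -- the paraphrase slips past the pattern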
Supervised relation extraction
Train a classifier on labeled examples:
Input: sentence + two marked entities
Output: relationship type or NONE
Training examples:
- "Marie Curie [E1] won [R] the Nobel Prize [E2]" → WON
- "Einstein [E1] developed [R] relativity [E2]" → DEVELOPED
- "The cat [E1] sat on [R] the mat [E2]" → NONE
This generalizes better than patterns but requires labeled training data—expensive to create at scale.
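A toy sketch of the supervised framing with scikit-learn, using the three examples above; a real system needs thousands of labeled sentences and a stronger encoder (typically a fine-tuned transformer), this only shows the shape of the task:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sentences with entity markers, each labeled with a relation type or NONE.
train_sentences = [
    "Marie Curie [E1] won [R] the Nobel Prize [E2]",
    "Einstein [E1] developed [R] relativity [E2]",
    "The cat [E1] sat on [R] the mat [E2]",
]
train_labels = ["WON", "DEVELOPED", "NONE"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_sentences, train_labels)

print(clf.predict(["Bohr [E1] won [R] the Nobel Prize [E2]"]))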
Open Information Extraction (OpenIE)
OpenIE extracts relationships without predefined schemas:
Input: "Marie Curie discovered radium in 1898."
Output: (Marie Curie, discovered, radium, in 1898)
The relationship label comes directly from the text ("discovered"), not from a fixed taxonomy. This produces messy graphs with hundreds of unique edge types, but it captures information that schema-based extraction would miss.
LLM-based extraction
The current frontier uses LLMs:
Extract all relationships from this sentence:
"The authentication service uses Redis for session storage and
connects to the PostgreSQL user database."
Relationships:
- (authentication service, USES, Redis)
- (authentication service, CONNECTS_TO, PostgreSQL user database)
- (Redis, STORES, session storage)
LLMs handle paraphrase, complex syntax, and implicit relationships. They can infer that "for session storage" implies a STORES relationship even though it's not stated as a verb.
The challenge: hallucination. LLMs sometimes extract relationships that aren't in the text, especially when prompted to find many relationships. Validation and filtering are essential.
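One simple guard is to check every extracted triple against the source text and the ontology before accepting it. A sketch of that filter (substring checks are crude, but they catch the obvious fabrications; production pipelines also verify spans or use entailment models):

def filter_hallucinated(triples, source_text, allowed_predicates):
    """Drop triples whose subject or object never appears in the source,
    or whose predicate falls outside the ontology."""
    kept = []
    text = source_text.lower()
    for subj, pred, obj in triples:
        if subj.lower() in text and obj.lower() in text and pred in allowed_predicates:
            kept.append((subj, pred, obj))
    return kept

triples = [
    ("authentication service", "USES", "Redis"),
    ("Redis", "REPLACES", "Memcached"),   # hallucinated: Memcached isn't in the text
]
text = "The authentication service uses Redis for session storage."
print(filter_hallucinated(triples, text, {"USES", "CONNECTS_TO", "STORES"}))
# [("authentication service", "USES", "Redis")]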
Stage 4: Attribute Extraction
Beyond relationships, entities have properties: dates, quantities, descriptions, categories.
Entity: Marie Curie
Attributes:
- birth_date: 1867-11-07
- nationality: Polish, French
- field: Physics, Chemistry
- known_for: radioactivity, polonium, radium
Attribute extraction identifies these key-value pairs from surrounding text.
Approaches:
- Slot filling: Template-based extraction for common attributes
- Dependency parsing: Extract attributes based on syntactic relationships
- LLM prompts: Ask the model to extract properties for a given entity
The challenge is consistency. Different documents might describe the same property in different ways: "born in 1867" vs "birth year: 1867" vs "1867-11-07." Normalization is crucial—converting varied expressions into canonical forms.
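A small sketch of that normalization for birth dates, handling just two of the surface forms above; real pipelines lean on a date parser (e.g. dateutil) plus per-attribute rules:

from datetime import date
import re

def normalize_birth_date(value: str) -> str | None:
    """Map varied surface forms onto one canonical ISO-8601 form."""
    iso = re.search(r"\d{4}-\d{2}-\d{2}", value)
    if iso:
        return date.fromisoformat(iso.group()).isoformat()
    year = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", value)
    if year:
        return year.group()   # year-only precision, kept as "1867"
    return None

for raw in ["born in 1867", "birth year: 1867", "1867-11-07"]:
    print(raw, "->", normalize_birth_date(raw))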
Stage 5: Graph Construction
You have entities, relationships, and attributes. Now assemble them into a graph database.
Storage options:
1. Triple stores (RDF databases)
Systems like Apache Jena, Stardog, or Amazon Neptune store triples in RDF format:
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:MarieCurie a ex:Person ;
    ex:birthDate "1867-11-07"^^xsd:date ;
    ex:won ex:NobelPrizePhysics1903 .
RDF stores support SPARQL queries and ontology reasoning. They're standards-compliant and interoperable but can be slower than property graphs for some query patterns.
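For illustration, a sketch that loads the Turtle above with rdflib and runs a SPARQL query over it; a production triple store would expose the same query through its own endpoint:

from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:MarieCurie a ex:Person ;
    ex:birthDate "1867-11-07"^^xsd:date ;
    ex:won ex:NobelPrizePhysics1903 .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# Who won what?
for row in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?award WHERE { ?person ex:won ?award . }
"""):
    print(row.person, row.award)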
2. Property graph databases
Systems like Neo4j, Amazon Neptune (also supports property graphs), or TigerGraph store graphs as nodes and labeled edges with properties:
CREATE (marie:Person {
    name: "Marie Curie",
    birthDate: date("1867-11-07"),
    nationality: ["Polish", "French"]
})
CREATE (nobel:Award {name: "Nobel Prize in Physics", year: 1903})
CREATE (marie)-[:WON]->(nobel)
Property graphs are often faster for traversal-heavy queries and have more intuitive query languages (Cypher), making them popular for operational systems.
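A minimal loading sketch with the Neo4j Python driver; the connection details are placeholders, and the predicate is stored as a property because Cypher cannot parameterize relationship types:

from neo4j import GraphDatabase

# Connection details are placeholders for your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_triple(tx, subj, pred, obj):
    # MERGE is idempotent: re-running extraction won't duplicate nodes or edges.
    tx.run(
        "MERGE (s:Entity {name: $subj}) "
        "MERGE (o:Entity {name: $obj}) "
        "MERGE (s)-[r:REL {type: $pred}]->(o)",
        subj=subj, pred=pred, obj=obj,
    )

with driver.session() as session:
    session.execute_write(load_triple, "Marie Curie", "WON", "Nobel Prize in Physics")
driver.close()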
Schema enforcement
During construction, enforce schema constraints:
- Validate that relationships connect allowed entity types
- Require mandatory properties
- Detect and merge duplicate entities
- Maintain referential integrity
Without enforcement, extraction errors compound. A single misidentified relationship can create nonsensical graph structures that break downstream queries.
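A toy sketch of the first check, rejecting triples whose endpoint types the ontology does not allow; the entity types and relationship names here are illustrative:

# A toy ontology: which entity types each relationship may connect.
SCHEMA = {
    "WON":         ("PERSON", "AWARD"),
    "USES":        ("SERVICE", "DATABASE"),
    "CONNECTS_TO": ("SERVICE", "DATABASE"),
}

def violates_schema(subj_type, pred, obj_type):
    """Return a reason string if a triple breaks the ontology, else None."""
    if pred not in SCHEMA:
        return f"unknown relationship type: {pred}"
    expected_subj, expected_obj = SCHEMA[pred]
    if subj_type != expected_subj or obj_type != expected_obj:
        return f"{pred} must connect {expected_subj} -> {expected_obj}, got {subj_type} -> {obj_type}"
    return None

# A misidentified triple is rejected before it reaches the graph:
print(violates_schema("PERSON", "WON", "DATABASE"))
# WON must connect PERSON -> AWARD, got PERSON -> DATABASE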
Quality and Iteration
Extraction pipelines are never perfect on the first pass. Expect:
- Precision issues: Extracted triples that don't exist in the source text
- Recall issues: Missed entities and relationships
- Linking errors: Entities incorrectly merged or split
- Schema drift: Relationship types proliferating beyond your ontology
Production pipelines include:
1. Human-in-the-loop validation
Sample extracted triples and have domain experts review them. Use feedback to refine prompts, adjust models, or add correction rules.
2. Consistency checking
Run automated checks:
- Do all PERSON entities have birthdates?
- Are there WON relationships pointing to non-AWARD entities?
- Are dates in valid ranges?
Violations indicate extraction errors.
3. Iterative refinement
Extraction isn't one-and-done. As your corpus grows, re-run extraction on new documents. As your ontology evolves, re-extract to capture new relationship types.
Treat your graph as living infrastructure, not a static artifact.
The Tradeoff Space
Building extraction pipelines requires navigating tradeoffs:
- Precision vs Recall: Extract conservatively and miss relationships, or aggressively and include noise
- Schema-based vs Schema-free: Constrain to a predefined ontology or allow open extraction
- Cost vs Quality: Use expensive LLMs for high-quality extraction or cheaper models for scale
- Batch vs Incremental: Process the entire corpus at once or extract incrementally as documents are added
The right choices depend on your use case. A research graph prioritizes recall—capture everything, filter later. A production system prioritizes precision—wrong information breaks user trust.
The Path to Production
Automatic extraction enables Graph RAG at scale. Instead of manually crafting triples, you build a pipeline that transforms your existing documentation into a queryable knowledge graph.
Once that graph exists, the next question is: what can you do with it that vector search can't?
That's where multi-hop reasoning enters—the ability to answer questions by traversing relationship chains, not just finding similar text. And that's what makes Graph RAG transformative.
Further Reading
- Stanovsky, G. et al. (2018). "Supervised Open Information Extraction." NAACL 2018.
- Hoffart, J. et al. (2011). "Robust Disambiguation of Named Entities in Text." EMNLP 2011.
- Bosselut, A. et al. (2019). "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction." ACL 2019.
This is Part 4 of the Graph RAG series, exploring how knowledge graphs solve the limitations of naive vector retrieval.
Previous: Knowledge Graphs 101: Nodes, Edges, and Semantic Structure
Next: Multi-Hop Reasoning: How Graphs Enable Complex Queries