Intro
Deep Dive

Elasticsearch
Internals

From Lucene segments to distributed clusters —
understanding the search engine from the ground up.

Part I
Lucene Internals
The engine under the hood
Part II
Limitations
What Lucene can't do alone
Part III
Elasticsearch
The distributed solution
Press → or Space to navigate
Part I

Apache Lucene

A Java library for full-text search.
The foundation everything else is built on.

First released in 1999 by Doug Cutting — named after his wife's middle name.

Core Data Structure

The Inverted Index

Think of it like the index at the back of a textbook — but automatically built for every word in every document.

DOC 1
"Elasticsearch is fast and scalable"
DOC 2
"Lucene is fast and powerful"
DOC 3
"Elasticsearch runs on Lucene"
Forward Index "Which words are in this doc?"
Doc 1
elasticsearch fast scalable
Doc 2
lucene fast powerful
Doc 3
elasticsearch lucene
Search: "lucene"
1

Scan Doc 1: "elasticsearch is fast and scalable" — no match

2

Scan Doc 2: "lucene is fast and powerful" — match!

3

Scan Doc 3: "elasticsearch runs on lucene" — match!

!

Must scan every document in the entire corpus.

O(N) — 100M docs = scan all 100M
Inverted Index "Which docs contain this word?"
elasticsearch
Doc 1 Doc 3
fast
Doc 1 Doc 2
lucene
Doc 2 Doc 3
powerful
Doc 2
scalable
Doc 1
Search: "lucene"
1

Look up "lucene" in term dictionary — found!

2

Read posting list → Doc 2, Doc 3 — done!

Direct lookup. Zero scanning.

O(k) — k = term length, independent of index size
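The two lookup strategies above can be sketched in a few lines of Python: build the inverted index once at write time, then answer the query with a single dictionary lookup (the stop list is a toy one matching the slide's example):

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is fast and scalable",
    2: "lucene is fast and powerful",
    3: "elasticsearch runs on lucene",
}
stop = {"is", "and", "runs", "on"}   # toy stop list matching the slide

inverted = defaultdict(list)         # term -> sorted posting list of doc IDs
for doc_id in sorted(docs):
    for term in dict.fromkeys(docs[doc_id].split()):  # dedupe, keep order
        if term not in stop:
            inverted[term].append(doc_id)

# Direct lookup, zero scanning:
# inverted["lucene"] -> [2, 3]
```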
Text Processing

The Analyzer Pipeline

Before text enters the inverted index, it passes through a 3-stage pipeline that transforms raw text into searchable terms.

INPUT "The Quick Brown Foxes are <b>FAST</b>!"
1 Char Filters

Transform raw characters before tokenizing.

html_strip
<b>FAST</b> → FAST
pattern_replace
remove special chars, normalize
2 Tokenizer

Splits text into individual tokens.

The Quick Brown Foxes are FAST
Only one tokenizer per analyzer
3 Token Filters

Transform, remove, or add tokens.

lowercase: Quick → quick, FAST → fast
stop: "the", "are" removed
stemmer: foxes → fox
OUTPUT Terms for inverted index → quick brown fox fast
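A minimal Python sketch of the 3-stage pipeline. The regex html_strip, the stop list, and the strip-"es" stemmer are toy stand-ins for Lucene's real char filters, tokenizer, and token filters:

```python
import re

STOP_WORDS = {"the", "are", "is", "and", "a", "an"}

def char_filter(text):
    # stage 1 (html_strip stand-in): remove tags like <b>...</b>
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    # stage 2: standard-ish tokenizer, split into letter runs
    return re.findall(r"[A-Za-z]+", text)

def token_filters(tokens):
    # stage 3: lowercase -> stop -> (toy) stemmer
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        if tok.endswith("es"):       # toy stemmer, illustration only
            tok = tok[:-2]
        out.append(tok)
    return out

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

# analyze("The Quick Brown Foxes are <b>FAST</b>!")
#   -> ["quick", "brown", "fox", "fast"]
```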

Standard (default)

Unicode tokenizer + lowercase filter

"Foxes" → "foxes"

Custom

Your own char_filter + tokenizer + filter chain

"Foxes" → "fox" (with stemmer)

Keyword

No-op — entire input is one token

"[email protected]" → "[email protected]"

Language

Language-specific stemmer + stop words

"running" → "run" (english)
Core Data Structure

Anatomy of the Inverted Index

Three layers working together to make full-text search fast.

1. Term Dictionary

A sorted list of every unique term → disk pointer. Stored as an FST (Finite State Transducer) — a compressed automaton that shares prefixes and suffixes.

elasticsearch → block 1
fast → block 1
lucene → block 2

Lookup: O(length of term) — independent of index size.

2. Posting List

 

3. Per-Field Indexes

 

Deep Dive

FST: The Term Dictionary Engine

Why can't we just use a HashMap? Because with 100M unique terms, a HashMap would eat all your RAM.

The Problem

Terms: fast faster fastest fun funny

A HashMap stores each term separately. At 100M unique terms, that's ~6 GB of RAM just for one field's dictionary.

HashMap — 100M terms ~6 GB RAM
6,000 MB — each string stored separately

The Solution: FST

Like a trie that also shares suffixes. Common prefixes AND suffixes stored once.

"fast" and "faster" share f→a→s→t nodes
• Accept nodes store a disk pointer to the posting list

FST for our 5 terms

graph LR S((" ")) --> F(("f")) F --> A(("a")) F --> U(("u")) A --> AS(("s")) AS --> T(("t")) T --> E(("e")) E --> R(("r")) E --> ES2(("s")) ES2 --> EST(("t")) U --> UN(("n")) UN --> FUN((" ")) UN --> UNN(("n")) UNN --> FUNNY(("y")) T -.-|"✓ fast"| T_OUT[ ] R -.-|"✓ faster"| R_OUT[ ] EST -.-|"✓ fastest"| EST_OUT[ ] FUN -.-|"✓ fun"| FUN_OUT[ ] FUNNY -.-|"✓ funny"| FUNNY_OUT[ ] style S fill:#12121A,stroke:#FEC514,color:#FEC514,stroke-width:3px style F fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style A fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style U fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style AS fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style T fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style E fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style R fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style ES2 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style EST fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style UN fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style FUN fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style UNN fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style FUNNY fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style T_OUT fill:none,stroke:none style R_OUT fill:none,stroke:none style EST_OUT fill:none,stroke:none style FUN_OUT fill:none,stroke:none style FUNNY_OUT fill:none,stroke:none

start Entry   a-z Character   green Accept = valid term

Lookup "fast": startfast (accept!) → read posting list pointer

Lookup "fan": start→f→a→no "n" child!term doesn't exist

FST — 100M terms ~300 MB RAM
300 MB
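A real FST also shares suffixes and stores outputs along arcs; the prefix-sharing half is easy to sketch with a plain trie, assuming accept nodes simply hold a posting-list pointer:

```python
class Node:
    def __init__(self):
        self.children = {}        # char -> child node (shared prefixes)
        self.posting_ptr = None   # set only on accept nodes

def build_trie(terms_with_ptrs):
    root = Node()
    for term, ptr in terms_with_ptrs:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, Node())
        node.posting_ptr = ptr    # mark accept node
    return root

def lookup(root, term):
    # O(length of term), independent of how many terms are stored
    node = root
    for ch in term:
        node = node.children.get(ch)
        if node is None:
            return None           # no child edge -> term doesn't exist
    return node.posting_ptr       # None unless this is an accept node
```

"fast" and "faster" share the f→a→s→t nodes exactly as in the diagram; "fan" dies at the missing "n" edge.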
Core Data Structure

Anatomy of the Inverted Index

Three layers working together to make full-text search fast.

1. Term Dictionary

FST: term → disk pointer

Lookup: O(term length)

2. Posting List

For each term: the sorted list of document IDs + frequency + positions. Compressed with delta encoding + bit packing.

lucene:
  doc IDs:   [2, 3] ← sorted!
  freq:      [1, 1] ← per doc
  positions: [[0], [2]] ← for phrase queries

Stores: doc IDs, term frequency, positions, offsets.

3. Per-Field Indexes

 

Deep Dive

Posting List: Compressed Doc IDs

Doc ID vs _id. Doc ID: Lucene's internal auto-assigned integer per segment (0, 1, 2, ...), used in posting lists & scoring. _id: the string you provide ("product-42"), stored in _source, mapped via a separate lookup.
Query: "fast AND scalable". Posting lists are sorted by doc ID — this enables 3 key compression & query optimizations:
fast 2 5 8 13 21 34 55 89 144 (9 docs)
scalable 8 34 144 512 (4 docs)
STEP 1 Delta Encoding Store gaps instead of absolute IDs
Problem: Raw doc IDs can be huge numbers (e.g. 1000000, 1000002, 1000005). Each needs 32 bits.
Solution: Store the difference between consecutive IDs. Since the list is sorted, deltas are always small positive numbers.
delta[0] = 2            (first value kept as-is)
delta[1] = 5 - 2 = 3
delta[2] = 8 - 5 = 3
delta[3] = 13 - 8 = 5
delta[4] = 21 - 13 = 8
delta[5] = 34 - 21 = 13
delta[6] = 55 - 34 = 21
delta[7] = 89 - 55 = 34
delta[8] = 144 - 89 = 55
Raw Doc IDs
2 5 8 13 21 34 55 89 144
max = 144 → needs 8 bits per value
Delta Encoded
2 3 3 5 8 13 21 34 55
max = 55 → needs only 6 bits per value
Decode (cumulative sum)
2 → 2+3=5 → 5+3=8 → 8+5=13 → 13+8=21 → ...
Just add deltas to reconstruct original IDs
Real-world: IDs like 1000000, 1000002, 1000005 → deltas 1000000, 2, 3 — dramatically smaller numbers!
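Encoding and decoding are each a couple of lines; this sketch reproduces the table above:

```python
def delta_encode(doc_ids):
    # keep the first value as-is, then store gaps between consecutive sorted IDs
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    # cumulative sum reconstructs the original IDs
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# delta_encode([2, 5, 8, 13, 21, 34, 55, 89, 144])
#   -> [2, 3, 3, 5, 8, 13, 21, 34, 55]
```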
STEP 2 Bit Packing (Frame of Reference / FOR) 128 deltas per block
Problem: Deltas are small, but they'd still occupy 32-bit ints. 3 doesn't need 32 bits!
Solution: Group 128 deltas into a block. Find the max delta, calculate minimum bits needed, pack all values using that many bits.
// Step A: Find max delta in block
max_delta = 55

// Step B: Calculate bits needed
bits = ceil(log2(55+1)) = 6 bits

// Step C: Pack each delta into 6 bits
Store 1 byte header: bits_per_value = 6
Each delta → 6-bit binary
2
000010
3
000011
5
000101
8
001000
13
001101
21
010101
34
100010
55
110111
... ×128 values per block
BEFORE: int32[]
128 × 32 bit = 512 bytes
AFTER: packed 6-bit
96 B
5.3× compression
Why 128? Fits perfectly in CPU cache lines. SIMD instructions can decode an entire block in one operation. Leftover docs (<128) use VInt encoding.
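A simplified Frame-of-Reference sketch. Real Lucene packs into a byte buffer with SIMD-friendly layouts; packing everything into one Python integer keeps the idea visible:

```python
import math

def pack_block(deltas):
    # Step A + B: bits = minimum width for the largest delta in the block
    bits = max(1, math.ceil(math.log2(max(deltas) + 1)))
    # Step C: place each value in its own bits-wide slot
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * bits)
    return bits, packed

def unpack_block(bits, packed, count):
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]
```

For the slide's deltas (max = 55), `pack_block` picks 6 bits per value, matching the 512 B → 96 B compression for a full 128-value block.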
STEP 3 Skip List Fast AND / OR intersection
Problem: For fast AND scalable, we need the intersection of two posting lists. Scanning all elements = O(n+m).
Solution: Use skip pointers to jump over blocks. Start with the shorter list as driver.
Skip List Structure (fast)
Level 2: 2 ─────────── 34 ─────────── 144
Level 1: 2 ─── 13 ─── 34 ─── 89 ─── 144
Level 0: 2 5 8 13 21 34 55 89 144
Upper levels let us skip over large ranges without scanning each ID
AND Intersection: step by step
Driver: scalable (shorter, 4 docs). For each ID, skip-search in fast:
1 scalable[0] = 8 → search in fast
2 5 8 ✓ MATCH!
2 scalable[1] = 34 → skip from 8
skip L2: 34≥13 34 ✓ MATCH! skipped 13, 21
3 scalable[2] = 144 → skip from 34
skip L2: 144 144 ✓ MATCH! skipped 55, 89
4 scalable[3] = 512 → skip from 144
144 END ✗ No match — list exhausted
Result: 8, 34, 144. Four iterations plus skip jumps instead of scanning all 9 IDs — O(log n)
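Lucene's skip lists use multi-level pointers; a binary search that resumes from the previous match position (via bisect) is a stand-in with the same skip-ahead effect:

```python
from bisect import bisect_left

def intersect(driver, other):
    # driver = shorter posting list; for each ID, skip-search in the longer one
    result, lo = [], 0
    for doc in driver:
        lo = bisect_left(other, doc, lo)   # resume from the last position
        if lo == len(other):
            break                          # longer list exhausted
        if other[lo] == doc:
            result.append(doc)             # ID present in both lists
    return result

# intersect([8, 34, 144, 512], [2, 5, 8, 13, 21, 34, 55, 89, 144])
#   -> [8, 34, 144]
```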
RAW IDs
[2, 5, 8, ... 144]
8 bits/value
STEP 1: DELTA
[2, 3, 3, ... 55]
smaller numbers
STEP 2: BIT PACK
6-bit blocks
5.3× compression
STEP 3: SKIP LIST
O(log n) queries
fast intersection
RESULT
ms-level search
on billions of docs
Core Data Structure

Anatomy of the Inverted Index

Three layers working together to make full-text search fast.

1. Term Dictionary

FST: term → disk pointer

Lookup: O(term length)

2. Posting List

Sorted doc IDs + metadata

Compression: delta + bit packing

3. Per-Field Indexes

Each field gets its own inverted index. Completely independent.

title: laptop → [1, 3, 7]
body:  laptop → [1, 2, 3, 5, 7]

Why: title:laptop only scans the title index.

Key insight: The inverted index trades write-time work (analyzing text, building data structures) for read-time speed (instant lookups). This is the fundamental trade-off of every search engine.

Core Concept

Segments: Immutable Mini-Indexes

A Lucene index is not a single giant inverted index. It's a collection of smaller, immutable pieces called segments.

graph TB subgraph LUCENE_INDEX["Lucene Index"] CP["Commit Point
segments_N"] SEG0["Seg 0 · 50K docs"] SEG1["Seg 1 · 30K docs"] SEG2["Seg 2 · 12K docs"] SEG3["Seg 3 · 800 docs
uncommitted"] CP --> SEG0 CP --> SEG1 CP --> SEG2 CP -.->|"not committed"| SEG3 end style LUCENE_INDEX fill:#0a0a0f,stroke:#FEC514,color:#E8E8F0 style CP fill:#12121A,stroke:#FEC514,color:#FEC514 style SEG0 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style SEG1 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style SEG2 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style SEG3 fill:#12121A,stroke:#00BFB3,color:#00BFB3

Why Immutable?

Once written, never modified. This enables:

1. No Locking

Multiple threads read simultaneously — zero contention.

2. Cache Friendly

OS page cache never invalidated — data never changes.

3. Compression

Compress once, read forever — no decompress for updates.

4. Easy Cleanup

Delete segment = delete file. No garbage collection.

Deep Dive

Inside a Segment

A segment is a self-contained mini-index on disk. Inside, the same data is stored in 3 different layouts — each optimized for a different access pattern.

Inverted Index

Question: "Which docs contain this word?"

Term Dictionary (FST, in RAM)
laptop → disk:0x4A20
phone → disk:0x7F10
Posting Lists (on disk, sorted by ID)
laptop 2 5 13 89 144
phone 1 3 5 8

Stored Fields

Question: "Give me the original JSON of doc 5."

_source (row-oriented, on disk)
doc 2 {"name":"laptop pro", "price":999}
doc 5 {"name":"laptop air", "price":1299}
doc 13 {"name":"gaming laptop", "price":1899}

Doc Values

Question: "Sort these results by price."

ROW-BASED — entire row is read
doc name category price
0 laptop pro electronics 999
1 laptop air electronics 1299
2 gaming lap gaming 1899

3 fields read, only 1 was needed
Random I/O — each row on a different disk block

vs
COLUMN-BASED (Doc Values)
price.dvd
999
1299
1899
doc 0
doc 1
doc 2

Only price column read, contiguous block
Sequential I/O — single disk read

One column file per field → ideal for sorting & aggregations. Same-type values stored together → better compression.
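The columnar layout in miniature: sorting touches only the price array, never the other fields:

```python
# Column-oriented "doc values": one array per field, array index = doc ID
name  = ["laptop pro", "laptop air", "gaming laptop"]
price = [999, 1299, 1899]

# Sort doc IDs by price (descending) without reading the name column
order = sorted(range(len(price)), key=price.__getitem__, reverse=True)
top_hit = name[order[0]]   # only now fetch the stored field for the winner
# order -> [2, 1, 0], top_hit -> "gaming laptop"
```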

.liv File — "which docs are still alive?"

Immutable segments can't delete in-place — a .liv bitset (1 bit/doc) tracks live docs. Reclaimed on merge.

1 1 0 1 0 1
1=live   0=deleted
Write Path

Writing to Lucene

How documents go from your application into a searchable index.

Application
IndexWriter
Buffer
Segment
Disk
addDocument(doc)
Analyze + build inverted index In RAM only. Not searchable.
FLUSH Write buffer → segment files OS page cache. Searchable (NRT)!
COMMIT fsync segments + write segments_N Durable + crash-safe.
Durability

Where Does the Data Actually Live?

Data passes through three layers, each with different trade-offs between speed and safety.

JVM Heap
In-Memory Buffer
New documents analyzed and indexed entirely in application memory.
Not searchable
Not durable
Lost on process crash
flush
OS Page Cache (RAM)
Flushed Segments
Segment files in OS-managed memory. NRT reader can open and search them.
Searchable (NRT)
Not durable
Lost on OS crash / power failure
fsync
(commit)
Persistent Storage
Disk
Segment files physically written via fsync. Survives everything.
Searchable
Durable
Crash-safe & permanent

The problem: fsync is slow (~10ms per call). Calling it after every document kills throughput. But without it, data is lost on crash. Elasticsearch solves this with the translog.

Read Path

Searching in Lucene

A search must visit every segment. Per-segment search is the fundamental unit of work.

Search: "laptop"
Analyzer Term: "laptop" tokenize + lowercase (same analyzer as index time)
PER-SEGMENT Seg 0 → FST Seg 1 → FST Seg 2 → FST
Postings: [5, 12] Postings: [88, 91] Postings: [200]
.liv: 12 deleted → [5] .liv: all live → [88, 91] .liv: all live → [200]
BM25: 5→4.2 BM25: 88→5.7 91→3.1 BM25: 200→4.8
Merge → Top N
Relevance

How Lucene Scores Results: BM25

Every matching document gets a relevance score. Lucene uses BM25 — an evolution of TF-IDF with term frequency saturation.

score(q,d) = Σ IDF(t) × TF(t,d) × norm(d)
summed for each query term
IDF Inverse Document Frequency

How rare is the term? Rare terms score much higher.

"the" — 99% of docs
IDF ≈ 0.01
"elasticsearch" — 2% of docs
IDF ≈ 3.9
TF Term Frequency (with saturation)

More occurrences = higher score, but with diminishing returns.

tf=1 → 0.55 tf=3 → 0.77 tf=10 → 0.89 tf=100 → 0.99
NORM Field Length Normalization

Shorter fields rank higher. "laptop" in a 3-word title beats "laptop" in a 500-word body.

b = 0.75 controls length sensitivity (0 = ignore length)

Example: "laptop"

Doc A title: "Laptop Pro Max"
8.2
IDF
3.2
TF
1× (0.55)
NORM
3 words ↑↑
Doc B body: "This store sells ...laptop... (500 words)"
1.4
IDF
3.2
TF
1× (0.55)
NORM
500 words ↓↓
Key Insights
• Score is computed per segment (each segment has its own IDF stats)
k1 = 1.2 controls TF saturation, b = 0.75 field-length impact
• Use "explain": true in your query to see the full breakdown
• Replaced TF-IDF as default in Lucene 6+ / ES 5+
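A sketch of the scoring function, following Lucene's BM25 variant (log(1 + (N - df + 0.5)/(df + 0.5)) for IDF, length-normalized saturating TF); exact constants and factors vary across Lucene versions, so treat this as an illustration:

```python
import math

def bm25(tf, df, num_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # IDF: rare terms (small df) score much higher
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # field-length normalization: shorter-than-average fields boost TF
    norm = 1 - b + b * (doc_len / avg_len)
    # TF with saturation: diminishing returns as tf grows
    return idf * tf / (tf + k1 * norm)
```

The assertions below mirror the slide: rare beats common, short title beats long body, and 100 occurrences score far less than 100× one occurrence.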
Immutability Consequences

Delete & Update in Lucene

Since segments are immutable, neither delete nor update can modify existing data in-place.

Delete

Document stays in the segment. A .liv bitset tracks who's alive.

Segment 0 — 5 docs
doc 1
✓ live
doc 2
✗ deleted
doc 3
✓ live
doc 4
✗ deleted
doc 5
✓ live
.liv
1, 0, 1, 0, 1
filtered at search time

Bytes only reclaimed during segment merge.

Update = Delete + Re-index

No in-place update. Old version marked deleted, new version indexed fresh.

Segment 0 Step 1: mark deleted
doc 42 v1
price: 100
Segment 3 (new) Step 2: re-index
doc 42 v2
price: 150

Old bytes remain until merge. Frequent updates = many dead copies across segments.

Maintenance

Segment Merge

The essential background process that keeps a Lucene index healthy.

graph LR subgraph BEFORE["Before Merge"] S0["Seg 0
50K docs"] S1["Seg 1
30K docs"] S2["Seg 2
20K docs
8K deleted"] S3["Seg 3
10K docs
2K deleted"] S4["Seg 4
5K docs"] end subgraph DURING["Merge Thread"] M["Select segments
Read live docs
Write new segment
No deleted docs!"] end subgraph AFTER["After Merge"] S0B["Seg 0
50K docs"] BIG["Seg 5 (merged)
55K live docs"] end S1 --> M S2 --> M S3 --> M S4 --> M M --> BIG style BEFORE fill:#1a1a2e,stroke:#FF8B4A,color:#E8E8F0 style DURING fill:#1a1a2e,stroke:#FEC514,color:#E8E8F0 style AFTER fill:#1a1a2e,stroke:#00D68F,color:#E8E8F0 style M fill:#12121A,stroke:#FEC514,color:#FEC514 style BIG fill:#12121A,stroke:#00D68F,color:#00D68F style S0 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style S0B fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style S1 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style S2 fill:#12121A,stroke:#F04E98,color:#F04E98 style S3 fill:#12121A,stroke:#F04E98,color:#F04E98 style S4 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A
Maintenance

Why Merge & Merge Policy

Merging is essential for keeping search performance high and reclaiming disk space.

Why Merge?

1

Fewer Segments = Faster Search

Each search visits every segment. 50 segments means 10× the per-search overhead of 5 segments.

2

Reclaim Deleted Docs

Deleted docs waste disk and slow searches. Merging rewrites only live docs — dead docs are gone forever.

3

Better Compression

Larger segments compress more efficiently. Delta encoding and bit packing work better with more data.

TieredMergePolicy (default)

Similar-Size Segments

Groups segments of roughly equal size for merging. Avoids merging tiny segments into huge ones.

Prefers High-Delete Segments

Segments with many deleted docs are prioritized — more space to reclaim per merge.

Max 5 GB • Background Threads

Merged segment won't exceed 5 GB. Runs on background threads — does not block indexing or search.

Part II

Lucene's Limitations

Lucene is a brilliant library, but it's just a library.
Using it in production exposes several hard problems.

The Hard Problems

What Lucene Can't Do

1. Single-Node Only

Your data must fit on one machine. No built-in distribution.

2. No High Availability

One server down = total outage. No replication, no failover.

3. Commit is Expensive

fsync after every doc? ~100 docs/sec. Unusable.

4. Not Real-Time

Docs invisible until next commit or reader reopen.

5. No HTTP API

Java library only. Need your own server layer for other languages.

6. No Schema Management

You handle field types, backward compatibility, reindexing yourself.

These are exactly the problems Elasticsearch was built to solve. ↓
Part III

Enter Elasticsearch

A distributed system built on top of Lucene that solves
every limitation we just discussed.

Solutions

How ES Solves Each Problem

Single-Node Only

Lucene runs on one machine.

Sharding → Horizontal Scaling

Split into shards (each a Lucene index), distribute across nodes. 10TB? 10 nodes with 1TB each.

No High Availability

One server down = total outage.

Replication → Fault Tolerance

Replica shards on different nodes. Primary dies? Replica promoted automatically. Zero downtime.

Commit is Expensive

fsync after every doc is too slow.

Translog → Cheap Durability

Append-only transaction log, cheap to fsync. Full Lucene commits happen infrequently in the background.

Not Real-Time

Docs invisible until commit.

Refresh → Near Real-Time

New segments written to OS page cache every 1s. Searchable via NRT reader, no fsync needed.

Architecture

Cluster, Nodes, Shards

CLUSTER: "my-app"
Node 1 (master+data)
P0 primary
Seg0 Seg1 Seg2
R1 replica
Seg0 Seg1
Node 2 (data)
P1 primary
Seg0 Seg1
R0 replica
Seg0 Seg1 Seg2
Node 3 (data)
P2 primary
Seg0
R2 replica
Seg0
Primary shard Replica shard SegLucene segment

Cluster

One or more nodes working together. Identified by a cluster name. Nodes auto-discover each other.

Node Roles

master cluster state   data stores shards
ingest pipelines   coordinating routes requests

Shard

Each shard = one Lucene index. ES distributes many Lucene indexes across machines. Each contains its own segments.

Routing

shard = hash(_routing) % num_primary_shards
Default: _routing = _id. This formula is why the primary shard count is fixed at index creation: changing num_primary_shards would route existing documents to different shards.
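The routing formula in miniature. ES actually hashes _routing with murmur3; crc32 below is a deterministic stand-in for illustration:

```python
import zlib

def route(routing_key, num_primary_shards):
    # shard = hash(_routing) % num_primary_shards
    # (real ES uses murmur3; crc32 is just a deterministic stand-in)
    return zlib.crc32(routing_key.encode()) % num_primary_shards
```

Changing the shard count sends the same _id to a different shard for most keys, so every document would have to be re-routed: hence the count is fixed at creation.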

NRT Search

Refresh: Near Real-Time Search

Lucene's trick: new segments can be opened from the filesystem cache without a full commit.

Client
ES
Buffer
Translog
Segment
Disk
PUT /products/_doc/1
Add to buffer Append translog (fsync!)
201 Created
Search → NOT FOUND YET
REFRESH refresh_interval = 1s Write new segment No fsync needed! Just FS cache.
Search again → FOUND!
CRASH
Replay from translog Lucene can't do this — ES translog adds durability
FLUSH fsync to disk Translog cleared
Durability

Translog: The Transaction Log

Refresh only writes to the filesystem cache — not durable. If the node crashes, unflushed data is lost. The translog prevents this.

graph TB subgraph WRITE["Every Write Operation"] DOC["New Document"] DOC --> BUF["In-Memory
Buffer"] DOC --> TLOG["Translog
(fsync to disk!)"] end subgraph REFRESH_S["After Refresh"] BUF2["Buffer emptied"] SEG["New Segment
(FS cache only)"] TLOG2["Translog
still kept!"] end subgraph FLUSH_S["After Flush"] SEG2["Segment on disk
(fsync'd)"] TLOG3["New empty
translog"] end BUF -->|"refresh"| SEG TLOG -->|"kept"| TLOG2 SEG -->|"flush (fsync)"| SEG2 TLOG2 -->|"truncated"| TLOG3 CRASH["💥 Crash?
Replay translog → zero data loss"] TLOG2 -.->|"recovery"| CRASH style CRASH fill:#12121A,stroke:#FF4D6A,color:#FF4D6A style WRITE fill:#1a1a2e,stroke:#FEC514,color:#E8E8F0 style REFRESH_S fill:#1a1a2e,stroke:#00BFB3,color:#E8E8F0 style FLUSH_S fill:#1a1a2e,stroke:#00D68F,color:#E8E8F0 style DOC fill:#12121A,stroke:#FEC514,color:#FEC514 style BUF fill:#12121A,stroke:#F04E98,color:#F04E98 style TLOG fill:#12121A,stroke:#F04E98,color:#F04E98 style BUF2 fill:#12121A,stroke:#555577,color:#555577 style SEG fill:#12121A,stroke:#00BFB3,color:#00BFB3 style TLOG2 fill:#12121A,stroke:#00BFB3,color:#00BFB3 style SEG2 fill:#12121A,stroke:#00D68F,color:#00D68F style TLOG3 fill:#12121A,stroke:#00D68F,color:#00D68F

How It Works

1

Dual write

Every operation goes to both in-memory buffer and translog.

2

Translog is fsync'd

Fsync'd after every op by default. Append-only writes stay fast.

3

Survives refresh

Translog kept after refresh — segment isn't committed yet.

4

Cleared on flush

Truncated when Lucene commit (flush) makes segments durable.
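The four steps above as a toy model (not the real implementation; real ES fsyncs the translog file and opens NRT readers over segment files):

```python
class Shard:
    """Toy model of buffer / segments / translog."""
    def __init__(self):
        self.buffer = []      # JVM heap: indexed but not searchable
        self.segments = []    # searchable after refresh (FS cache, then disk)
        self.translog = []    # append-only log, fsync'd on every op in real ES

    def index(self, doc):
        self.buffer.append(doc)       # dual write:
        self.translog.append(doc)     # ...buffer + translog

    def search(self, doc):
        return any(doc in seg for seg in self.segments)

    def refresh(self):
        if self.buffer:
            self.segments.append(list(self.buffer))  # new segment, no fsync
            self.buffer.clear()                      # translog is kept!

    def flush(self):
        self.refresh()
        self.translog.clear()   # segments fsync'd, so translog is truncated

    def recover(self):
        # toy recovery: assume nothing was flushed; replay the translog
        self.buffer, self.segments = list(self.translog), []
        self.refresh()
```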

Write Path

A Write Request: End to End

Client
Coord Node
Primary
Replica 1
Replica 2
PUT /products/_doc/42
hash(_id) % shards
VALIDATE version + mapping check
INDEX LOCALLY buffer + translog (fsync)
PARALLEL Replicate Replicate
ACK ✓ ACK ✓
All in-sync replicas OK 201 Created
Coord Routes via hash(_id) % shards
Primary Validates, indexes, replicates in parallel
In-Sync Only replicas in the in-sync set must ACK — lagging replicas get removed from the set, not block writes
Read Path

A Search Request: Scatter & Gather (Adaptive Replica Selection)

Client
Coord Node
Shard 0
Shard 1
Shard 2
GET /products/_search q=laptop
PHASE 1 — QUERY (SCATTER) top N? top N? top N?
FST→postings→BM25 FST→postings→BM25 FST→postings→BM25
[doc IDs + scores] [doc IDs + scores] [doc IDs + scores]
Merge → global top N
PHASE 2 — FETCH (GATHER) Fetch _source Fetch _source
_source docs _source docs
200 OK {hits:[...]}
Query Each shard returns doc IDs + scores only — lightweight
Fetch Only winning shards return _source — avoids unnecessary I/O
ARS Each shard copy ranked by response_time — routes to fastest replica
Smart Routing

Adaptive Replica Selection

Each shard has multiple copies (primary + replicas). Which copy should handle the search? ARS picks the fastest one.

BEFORE Round Robin ES < 6.1
Shard 0 (R)
Node 2
5ms
Shard 0 (P)
Node 1 (busy!)
200ms
Shard 0 (R)
Node 3
8ms
Round robin hits Node 1 every 3rd request → p99 latency spikes to 200ms
AFTER Adaptive Replica Selection ES 6.1+
Shard 0 (R)
Node 2
5ms
SELECTED
Shard 0 (P)
Node 1 (busy!)
200ms
SKIPPED
Shard 0 (R)
Node 3
8ms
STANDBY
ARS avoids the slow node → consistent low latency

How ARS Ranks Each Copy

The coordinating node maintains a score for every shard copy, updated after each response.

1 Response Time

EWMA (exponentially weighted moving average) of past response times. Recent performance weighted more heavily.

2 Queue Size

Number of in-flight requests to that node. High queue = node is overloaded, penalized in ranking.

3 Service Time

EWMA of time the node actually spent processing (excludes network). Separates slow nodes from slow networks.

Formula: rank(s) = queue_size(s) × service_time(s) + response_time(s)
Lowest rank wins. Automatically adapts to GC pauses, hot nodes, disk I/O spikes, and uneven shard sizes.
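The slide's simplified ranking as code; the production formula also weights the queue estimate non-linearly, so treat this as an illustration:

```python
def ewma(prev, sample, alpha=0.3):
    # exponentially weighted moving average: recent samples count more
    return alpha * sample + (1 - alpha) * prev

def ars_rank(queue_size, service_time, response_time):
    # slide's simplified formula; lowest rank wins
    return queue_size * service_time + response_time

# Hypothetical numbers from the previous slide (times in ms)
copies = {
    "node1": ars_rank(queue_size=8, service_time=150.0, response_time=200.0),
    "node2": ars_rank(queue_size=1, service_time=4.0,   response_time=5.0),
    "node3": ars_rank(queue_size=2, service_time=6.0,   response_time=8.0),
}
best = min(copies, key=copies.get)   # the busy node1 is skipped
```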

Fault Tolerance & Summary

When Things Go Wrong

Failure Scenarios

Primary Shard Dies

Master promotes replica to primary → retries op → allocates new replica elsewhere.

Replica Shard Dies

Removed from in-sync set → primary continues → master rebuilds replica on another node.

Stale Primary (Network Partition)

Replicas reject writes (term mismatch) → old primary discovers replacement → ops rerouted.

Read Failure

Coordinator retries on another shard copy → if all fail, returns partial results.

Putting It All Together

Lucene Concept | ES Layer
Inverted Index | Core search data structure in every shard
Segment (immutable) | Written on refresh (every 1s), merged in background
Commit Point | Created during flush, not after every write
- | Translog fills the durability gap between refreshes and commits
Lucene Index | = 1 ES Shard (primary or replica)
- | ES Index = collection of shards distributed across nodes

The Big Picture

Lucene gives us the search. Translog gives us durability. Refresh gives us near real-time.
Shards give us scale. Replicas give us availability. Cluster ties it all together.

Thank You

Questions?