Intro
Deep Dive

Elasticsearch
Internals

From Lucene segments to distributed clusters —
understanding the search engine from the ground up.

Part I
Lucene Internals
The engine under the hood
Part II
Limitations
What Lucene can't do alone
Part III
Elasticsearch
The distributed solution
Press → or Space to navigate
Part I

Apache Lucene

A Java library for full-text search.
The foundation everything else is built on.

First released in 1999 by Doug Cutting — named after his wife's middle name.

Core Data Structure

The Inverted Index

Think of it like the index at the back of a textbook — but automatically built for every word in every document.

DOC 1
"Elasticsearch is fast and scalable"
DOC 2
"Lucene is fast and powerful"
DOC 3
"Elasticsearch runs on Lucene"
Forward Index "Which words are in this doc?"
Doc 1
elasticsearch fast scalable
Doc 2
lucene fast powerful
Doc 3
elasticsearch lucene
Search: "lucene"
1

Scan Doc 1: "elasticsearch is fast and scalable" — no match

2

Scan Doc 2: "lucene is fast and powerful" — match!

3

Scan Doc 3: "elasticsearch runs on lucene" — match!

!

Must scan every document in the entire corpus.

O(N) — 100M docs = scan all 100M
Inverted Index "Which docs contain this word?"
elasticsearch
Doc 1 Doc 3
fast
Doc 1 Doc 2
lucene
Doc 2 Doc 3
powerful
Doc 2
scalable
Doc 1
Search: "lucene"
1

Look up "lucene" in term dictionary — found!

2

Read posting list → Doc 2, Doc 3 — done!

Direct lookup. Zero scanning.

O(k) — k = term length, independent of index size
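The two lookup strategies above can be sketched in a few lines of Python: build the inverted index once at write time, then answer the query with a single dictionary lookup (the stop list is a toy one matching the slide's example):

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is fast and scalable",
    2: "lucene is fast and powerful",
    3: "elasticsearch runs on lucene",
}
stop = {"is", "and", "runs", "on"}   # toy stop list matching the slide

inverted = defaultdict(list)         # term -> sorted posting list of doc IDs
for doc_id in sorted(docs):
    for term in dict.fromkeys(docs[doc_id].split()):  # dedupe, keep order
        if term not in stop:
            inverted[term].append(doc_id)

# Direct lookup, zero scanning:
# inverted["lucene"] -> [2, 3]
```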
Text Processing

The Analyzer Pipeline

Before text enters the inverted index, it passes through a 3-stage pipeline that transforms raw text into searchable terms.

INPUT "The Quick Brown Foxes are <b>FAST</b>!"
1 Char Filters

Transform raw characters before tokenizing.

html_strip
<b>FAST</b> → FAST
pattern_replace
remove special chars, normalize
2 Tokenizer

Splits text into individual tokens.

The Quick Brown Foxes are FAST
Only one tokenizer per analyzer
3 Token Filters

Transform, remove, or add tokens.

lowercase: Quick → quick, FAST → fast
stop: "the", "are" removed
stemmer: foxes → fox
OUTPUT Terms for inverted index → quick brown fox fast
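A minimal Python sketch of the 3-stage pipeline. The regex html_strip, the stop list, and the strip-"es" stemmer are toy stand-ins for Lucene's real char filters, tokenizer, and token filters:

```python
import re

STOP_WORDS = {"the", "are", "is", "and", "a", "an"}

def char_filter(text):
    # stage 1 (html_strip stand-in): remove tags like <b>...</b>
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    # stage 2: standard-ish tokenizer, split into letter runs
    return re.findall(r"[A-Za-z]+", text)

def token_filters(tokens):
    # stage 3: lowercase -> stop -> (toy) stemmer
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        if tok.endswith("es"):       # toy stemmer, illustration only
            tok = tok[:-2]
        out.append(tok)
    return out

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

# analyze("The Quick Brown Foxes are <b>FAST</b>!")
#   -> ["quick", "brown", "fox", "fast"]
```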

Standard (default)

Unicode tokenizer + lowercase filter

"Foxes" → "foxes"

Custom

Your own char_filter + tokenizer + filter chain

"Foxes" → "fox" (with stemmer)

Keyword

No-op — entire input is one token

"[email protected]" → "[email protected]"

Language

Language-specific stemmer + stop words

"running" → "run" (english)
Core Data Structure

Anatomy of the Inverted Index

Three layers working together to make full-text search fast.

1. Term Dictionary

A sorted list of every unique term → disk pointer. Stored as an FST (Finite State Transducer) — a compressed automaton that shares prefixes and suffixes.

elasticsearch → block 1
fast → block 1
lucene → block 2

Lookup: O(length of term) — independent of index size.

2. Posting List

 

3. Per-Field Indexes

 

Deep Dive

FST: The Term Dictionary Engine

Why can't we just use a HashMap? Because with 100M unique terms, a HashMap would eat all your RAM.

The Problem

Terms: fast faster fastest fun funny

A HashMap stores each term separately. At 100M unique terms, that's ~6 GB of RAM just for one field's dictionary.

HashMap — 100M terms ~6 GB RAM
6,000 MB — each string stored separately

The Solution: FST

Like a trie that also shares suffixes. Common prefixes AND suffixes stored once.

"fast" and "faster" share f→a→s→t nodes
• Accept nodes store a disk pointer to the posting list

FST for our 5 terms

graph LR S((" ")) --> F(("f")) F --> A(("a")) F --> U(("u")) A --> AS(("s")) AS --> T(("t")) T --> E(("e")) E --> R(("r")) E --> ES2(("s")) ES2 --> EST(("t")) U --> UN(("n")) UN --> FUN((" ")) UN --> UNN(("n")) UNN --> FUNNY(("y")) T -.-|"✓ fast"| T_OUT[ ] R -.-|"✓ faster"| R_OUT[ ] EST -.-|"✓ fastest"| EST_OUT[ ] FUN -.-|"✓ fun"| FUN_OUT[ ] FUNNY -.-|"✓ funny"| FUNNY_OUT[ ] style S fill:#12121A,stroke:#FEC514,color:#FEC514,stroke-width:3px style F fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style A fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style U fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style AS fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style T fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style E fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style R fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style ES2 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style EST fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style UN fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style FUN fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style UNN fill:#12121A,stroke:#FF8B4A,color:#FF8B4A,stroke-width:2px style FUNNY fill:#12121A,stroke:#00D68F,color:#00D68F,stroke-width:3px style T_OUT fill:none,stroke:none style R_OUT fill:none,stroke:none style EST_OUT fill:none,stroke:none style FUN_OUT fill:none,stroke:none style FUNNY_OUT fill:none,stroke:none

start Entry   a-z Character   green Accept = valid term

Lookup "fast": startfast (accept!) → read posting list pointer

Lookup "fan": start→f→a→no "n" child!term doesn't exist

FST — 100M terms ~300 MB RAM
300 MB
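A real FST also shares suffixes and stores outputs along arcs; the prefix-sharing half is easy to sketch with a plain trie, assuming accept nodes simply hold a posting-list pointer:

```python
class Node:
    def __init__(self):
        self.children = {}        # char -> child node (shared prefixes)
        self.posting_ptr = None   # set only on accept nodes

def build_trie(terms_with_ptrs):
    root = Node()
    for term, ptr in terms_with_ptrs:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, Node())
        node.posting_ptr = ptr    # mark accept node
    return root

def lookup(root, term):
    # O(length of term), independent of how many terms are stored
    node = root
    for ch in term:
        node = node.children.get(ch)
        if node is None:
            return None           # no child edge -> term doesn't exist
    return node.posting_ptr       # None unless this is an accept node
```

"fast" and "faster" share the f→a→s→t nodes exactly as in the diagram; "fan" dies at the missing "n" edge.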
Core Data Structure

Anatomy of the Inverted Index

Three layers working together to make full-text search fast.

1. Term Dictionary

FST: term → disk pointer

Lookup: O(term length)

2. Posting List

For each term: the sorted list of document IDs + frequency + positions. Compressed with delta encoding + bit packing.

lucene:
  doc IDs:   [2, 3] ← sorted!
  freq:      [1, 1] ← per doc
  positions: [[0], [2]] ← for phrase queries

Stores: doc IDs, term frequency, positions, offsets.

3. Per-Field Indexes

 

Deep Dive

Posting List: Compressed Doc IDs

Doc ID vs _id. Doc ID: Lucene's internal auto-assigned integer per segment (0, 1, 2, ...), used in posting lists & scoring. _id: the string you provide ("product-42"), stored in _source, mapped via a separate lookup.
Query: "fast AND scalable". Posting lists are sorted by doc ID — this enables 3 key compression & query optimizations:
fast 2 5 8 13 21 34 55 89 144 (9 docs)
scalable 8 34 144 512 (4 docs)
STEP 1 Delta Encoding Store gaps instead of absolute IDs
Problem: Raw doc IDs can be huge numbers (e.g. 1000000, 1000002, 1000005). Each needs 32 bits.
Solution: Store the difference between consecutive IDs. Since the list is sorted, deltas are always small positive numbers.
delta[0] = 2            (first value kept as-is)
delta[1] = 5 - 2 = 3
delta[2] = 8 - 5 = 3
delta[3] = 13 - 8 = 5
delta[4] = 21 - 13 = 8
delta[5] = 34 - 21 = 13
delta[6] = 55 - 34 = 21
delta[7] = 89 - 55 = 34
delta[8] = 144 - 89 = 55
Raw Doc IDs
2 5 8 13 21 34 55 89 144
max = 144 → needs 8 bits per value
Delta Encoded
2 3 3 5 8 13 21 34 55
max = 55 → needs only 6 bits per value
Decode (cumulative sum)
2 → 2+3=5 → 5+3=8 → 8+5=13 → 13+8=21 → ...
Just add deltas to reconstruct original IDs
Real-world: IDs like 1000000, 1000002, 1000005 → deltas 1000000, 2, 3 — dramatically smaller numbers!
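Encoding and decoding are each a couple of lines; this sketch reproduces the table above:

```python
def delta_encode(doc_ids):
    # keep the first value as-is, then store gaps between consecutive sorted IDs
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    # cumulative sum reconstructs the original IDs
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# delta_encode([2, 5, 8, 13, 21, 34, 55, 89, 144])
#   -> [2, 3, 3, 5, 8, 13, 21, 34, 55]
```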
STEP 2 Bit Packing (Frame of Reference / FOR) 128 deltas per block
Problem: Deltas are small, but they'd still occupy 32-bit ints. 3 doesn't need 32 bits!
Solution: Group 128 deltas into a block. Find the max delta, calculate minimum bits needed, pack all values using that many bits.
// Step A: Find max delta in block
max_delta = 55

// Step B: Calculate bits needed
bits = ceil(log2(55+1)) = 6 bits

// Step C: Pack each delta into 6 bits
Store 1 byte header: bits_per_value = 6
Each delta → 6-bit binary
2
000010
3
000011
5
000101
8
001000
13
001101
21
010101
34
100010
55
110111
... ×128 values per block
BEFORE: int32[]
128 × 32 bit = 512 bytes
AFTER: packed 6-bit
96 B
5.3× compression
Why 128? Fits perfectly in CPU cache lines. SIMD instructions can decode an entire block in one operation. Leftover docs (<128) use VInt encoding.
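A simplified Frame-of-Reference sketch. Real Lucene packs into a byte buffer with SIMD-friendly layouts; packing everything into one Python integer keeps the idea visible:

```python
import math

def pack_block(deltas):
    # Step A + B: bits = minimum width for the largest delta in the block
    bits = max(1, math.ceil(math.log2(max(deltas) + 1)))
    # Step C: place each value in its own bits-wide slot
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * bits)
    return bits, packed

def unpack_block(bits, packed, count):
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]
```

For the slide's deltas (max = 55), `pack_block` picks 6 bits per value, matching the 512 B → 96 B compression for a full 128-value block.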
STEP 3 Skip List Fast AND / OR intersection
Problem: For fast AND scalable, we need the intersection of two posting lists. Scanning all elements = O(n+m).
Solution: Use skip pointers to jump over blocks. Start with the shorter list as driver.
Skip List Structure (fast)
Level 2: 2 ─────────── 34 ─────────── 144
Level 1: 2 ─── 13 ─── 34 ─── 89 ─── 144
Level 0: 2 5 8 13 21 34 55 89 144
Upper levels let us skip over large ranges without scanning each ID
AND Intersection: step by step
Driver: scalable (shorter, 4 docs). For each ID, skip-search in fast:
1 scalable[0] = 8 → search in fast
2 5 8 ✓ MATCH!
2 scalable[1] = 34 → skip from 8
skip L2: 34≥13 34 ✓ MATCH! skipped 13, 21
3 scalable[2] = 144 → skip from 34
skip L2: 144 144 ✓ MATCH! skipped 55, 89
4 scalable[3] = 512 → skip from 144
144 END ✗ No match — list exhausted
Result: 8, 34, 144. Four iterations plus skip jumps instead of scanning all 9 IDs — O(log n)
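Lucene's skip lists use multi-level pointers; a binary search that resumes from the previous match position (via bisect) is a stand-in with the same skip-ahead effect:

```python
from bisect import bisect_left

def intersect(driver, other):
    # driver = shorter posting list; for each ID, skip-search in the longer one
    result, lo = [], 0
    for doc in driver:
        lo = bisect_left(other, doc, lo)   # resume from the last position
        if lo == len(other):
            break                          # longer list exhausted
        if other[lo] == doc:
            result.append(doc)             # ID present in both lists
    return result

# intersect([8, 34, 144, 512], [2, 5, 8, 13, 21, 34, 55, 89, 144])
#   -> [8, 34, 144]
```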
RAW IDs
[2, 5, 8, ... 144]
8 bits/value
STEP 1: DELTA
[2, 3, 3, ... 55]
smaller numbers
STEP 2: BIT PACK
6-bit blocks
5.3× compression
STEP 3: SKIP LIST
O(log n) queries
fast intersection
RESULT
ms-level search
on billions of docs
Core Data Structure

Anatomy of the Inverted Index

Three layers working together to make full-text search fast.

1. Term Dictionary

FST: term → disk pointer

Lookup: O(term length)

2. Posting List

Sorted doc IDs + metadata

Compression: delta + bit packing

3. Per-Field Indexes

Each field gets its own inverted index. Completely independent.

title: laptop → [1, 3, 7]
body:  laptop → [1, 2, 3, 5, 7]

Why: title:laptop only scans the title index.

Key insight: The inverted index trades write-time work (analyzing text, building data structures) for read-time speed (instant lookups). This is the fundamental trade-off of every search engine.

Core Concept

Segments: Immutable Mini-Indexes

A Lucene index is not a single giant inverted index. It's a collection of smaller, immutable pieces called segments.

graph TB subgraph LUCENE_INDEX["Lucene Index"] CP["Commit Point
segments_N"] SEG0["Seg 0 · 50K docs"] SEG1["Seg 1 · 30K docs"] SEG2["Seg 2 · 12K docs"] SEG3["Seg 3 · 800 docs
uncommitted"] CP --> SEG0 CP --> SEG1 CP --> SEG2 CP -.->|"not committed"| SEG3 end style LUCENE_INDEX fill:#0a0a0f,stroke:#FEC514,color:#E8E8F0 style CP fill:#12121A,stroke:#FEC514,color:#FEC514 style SEG0 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style SEG1 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style SEG2 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style SEG3 fill:#12121A,stroke:#00BFB3,color:#00BFB3

Why Immutable?

Once written, never modified. This enables:

1. No Locking

Multiple threads read simultaneously — zero contention.

2. Cache Friendly

OS page cache never invalidated — data never changes.

3. Compression

Compress once, read forever — no decompress for updates.

4. Easy Cleanup

Delete segment = delete file. No garbage collection.

Deep Dive

Inside a Segment

A segment is a self-contained mini-index on disk. Inside, the same data is stored in 3 different layouts — each optimized for a different access pattern.

Inverted Index

Question: "Which docs contain this word?"

Term Dictionary (FST, in RAM)
laptop → disk:0x4A20
phone → disk:0x7F10
Posting Lists (on disk, sorted by ID)
laptop 2 5 13 89 144
phone 1 3 5 8

Stored Fields

Question: "Give me the original JSON of doc 5."

_source (row-oriented, on disk)
doc 2 {"name":"laptop pro", "price":999}
doc 5 {"name":"laptop air", "price":1299}
doc 13 {"name":"gaming laptop", "price":1899}

Doc Values

Question: "Sort these results by price."

ROW-BASED — entire row is read
doc name category price
0 laptop pro electronics 999
1 laptop air electronics 1299
2 gaming lap gaming 1899

3 fields read, only 1 was needed
Random I/O — each row on a different disk block

vs
COLUMN-BASED (Doc Values)
price.dvd
999
1299
1899
doc 0
doc 1
doc 2

Only price column read, contiguous block
Sequential I/O — single disk read

One column file per field → ideal for sorting & aggregations. Same-type values stored together → better compression.
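The columnar layout in miniature: sorting touches only the price array, never the other fields:

```python
# Column-oriented "doc values": one array per field, array index = doc ID
name  = ["laptop pro", "laptop air", "gaming laptop"]
price = [999, 1299, 1899]

# Sort doc IDs by price (descending) without reading the name column
order = sorted(range(len(price)), key=price.__getitem__, reverse=True)
top_hit = name[order[0]]   # only now fetch the stored field for the winner
# order -> [2, 1, 0], top_hit -> "gaming laptop"
```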

.liv File — "which docs are still alive?"

Immutable segments can't delete in-place — a .liv bitset (1 bit/doc) tracks live docs. Reclaimed on merge.

1 1 0 1 0 1
1=live   0=deleted
Write Path

Writing to Lucene

How documents go from your application into a searchable index.

Application
IndexWriter
Buffer
Segment
Disk
addDocument(doc)
Analyze + build inverted index In RAM only. Not searchable.
FLUSH Write buffer → segment files OS page cache. Searchable (NRT)!
COMMIT fsync segments + write segments_N Durable + crash-safe.
Durability

Where Does the Data Actually Live?

Data passes through three layers, each with different trade-offs between speed and safety.

JVM Heap
In-Memory Buffer
New documents analyzed and indexed entirely in application memory.
Not searchable
Not durable
Lost on process crash
flush
OS Page Cache (RAM)
Flushed Segments
Segment files in OS-managed memory. NRT reader can open and search them.
Searchable (NRT)
Not durable
Lost on OS crash / power failure
fsync
(commit)
Persistent Storage
Disk
Segment files physically written via fsync. Survives everything.
Searchable
Durable
Crash-safe & permanent

The problem: fsync is slow (~10ms per call). Calling it after every document kills throughput. But without it, data is lost on crash. Elasticsearch solves this with the translog.

Read Path

Searching in Lucene

A search must visit every segment. Per-segment search is the fundamental unit of work.

Search: "laptop"
Analyzer Term: "laptop" tokenize + lowercase (same analyzer as index time)
PER-SEGMENT Seg 0 → FST Seg 1 → FST Seg 2 → FST
Postings: [5, 12] Postings: [88, 91] Postings: [200]
.liv: 12 deleted → [5] .liv: all live → [88, 91] .liv: all live → [200]
BM25: 5→4.2 BM25: 88→5.7 91→3.1 BM25: 200→4.8
Merge → Top N
Relevance

How Lucene Scores Results: BM25

Every matching document gets a relevance score. Lucene uses BM25 — an evolution of TF-IDF with term frequency saturation.

score(q,d) = Σ IDF(t) × TF(t,d) × norm(d)
summed for each query term
IDF Inverse Document Frequency

How rare is the term? Rare terms score much higher.

"the" — 99% of docs
IDF ≈ 0.01
"elasticsearch" — 2% of docs
IDF ≈ 3.9
TF Term Frequency (with saturation)

More occurrences = higher score, but with diminishing returns.

tf=1 → 0.55 tf=3 → 0.77 tf=10 → 0.89 tf=100 → 0.99
NORM Field Length Normalization

Shorter fields rank higher. "laptop" in a 3-word title beats "laptop" in a 500-word body.

b = 0.75 controls length sensitivity (0 = ignore length)

Example: "laptop"

Doc A title: "Laptop Pro Max"
8.2
IDF
3.2
TF
1× (0.55)
NORM
3 words ↑↑
Doc B body: "This store sells ...laptop... (500 words)"
1.4
IDF
3.2
TF
1× (0.55)
NORM
500 words ↓↓
Key Insights
• Score is computed per segment (each segment has its own IDF stats)
k1 = 1.2 controls TF saturation, b = 0.75 field-length impact
• Use "explain": true in your query to see the full breakdown
• Replaced TF-IDF as default in Lucene 6+ / ES 5+
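A sketch of the scoring function, following Lucene's BM25 variant (log(1 + (N - df + 0.5)/(df + 0.5)) for IDF, length-normalized saturating TF); exact constants and factors vary across Lucene versions, so treat this as an illustration:

```python
import math

def bm25(tf, df, num_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # IDF: rare terms (small df) score much higher
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # field-length normalization: shorter-than-average fields boost TF
    norm = 1 - b + b * (doc_len / avg_len)
    # TF with saturation: diminishing returns as tf grows
    return idf * tf / (tf + k1 * norm)
```

The assertions below mirror the slide: rare beats common, short title beats long body, and 100 occurrences score far less than 100× one occurrence.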
Immutability Consequences

Delete & Update in Lucene

Since segments are immutable, neither delete nor update can modify existing data in-place.

Delete

Document stays in the segment. A .liv bitset tracks who's alive.

Segment 0 — 5 docs
doc 1
✓ live
doc 2
✗ deleted
doc 3
✓ live
doc 4
✗ deleted
doc 5
✓ live
.liv
1, 0, 1, 0, 1
filtered at search time

Bytes only reclaimed during segment merge.

Update = Delete + Re-index

No in-place update. Old version marked deleted, new version indexed fresh.

Segment 0 Step 1: mark deleted
doc 42 v1
price: 100
Segment 3 (new) Step 2: re-index
doc 42 v2
price: 150

Old bytes remain until merge. Frequent updates = many dead copies across segments.

Maintenance

Segment Merge

The essential background process that keeps a Lucene index healthy.

graph LR subgraph BEFORE["Before Merge"] S0["Seg 0
50K docs"] S1["Seg 1
30K docs"] S2["Seg 2
20K docs
8K deleted"] S3["Seg 3
10K docs
2K deleted"] S4["Seg 4
5K docs"] end subgraph DURING["Merge Thread"] M["Select segments
Read live docs
Write new segment
No deleted docs!"] end subgraph AFTER["After Merge"] S0B["Seg 0
50K docs"] BIG["Seg 5 (merged)
55K live docs"] end S1 --> M S2 --> M S3 --> M S4 --> M M --> BIG style BEFORE fill:#1a1a2e,stroke:#FF8B4A,color:#E8E8F0 style DURING fill:#1a1a2e,stroke:#FEC514,color:#E8E8F0 style AFTER fill:#1a1a2e,stroke:#00D68F,color:#E8E8F0 style M fill:#12121A,stroke:#FEC514,color:#FEC514 style BIG fill:#12121A,stroke:#00D68F,color:#00D68F style S0 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style S0B fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style S1 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A style S2 fill:#12121A,stroke:#F04E98,color:#F04E98 style S3 fill:#12121A,stroke:#F04E98,color:#F04E98 style S4 fill:#12121A,stroke:#FF8B4A,color:#FF8B4A
Maintenance

Why Merge & Merge Policy

Merging is essential for keeping search performance high and reclaiming disk space.

Why Merge?

1

Fewer Segments = Faster Search

Each search visits every segment. 50 segments means 10× the per-search overhead of 5 segments.

2

Reclaim Deleted Docs

Deleted docs waste disk and slow searches. Merging rewrites only live docs — dead docs are gone forever.

3

Better Compression

Larger segments compress more efficiently. Delta encoding and bit packing work better with more data.

TieredMergePolicy (default)

Similar-Size Segments

Groups segments of roughly equal size for merging. Avoids merging tiny segments into huge ones.

Prefers High-Delete Segments

Segments with many deleted docs are prioritized — more space to reclaim per merge.

Max 5 GB • Background Threads

Merged segment won't exceed 5 GB. Runs on background threads — does not block indexing or search.

Part II

Lucene's Limitations

Lucene is a brilliant library, but it's just a library.
Using it in production exposes several hard problems.

The Hard Problems

What Lucene Can't Do

1. Single-Node Only

Your data must fit on one machine. No built-in distribution.

2. No High Availability

One server down = total outage. No replication, no failover.

3. Commit is Expensive

fsync after every doc? ~100 docs/sec. Unusable.

4. Not Real-Time

Docs invisible until next commit or reader reopen.

5. No HTTP API

Java library only. Need your own server layer for other languages.

6. No Schema Management

You handle field types, backward compatibility, reindexing yourself.

These are exactly the problems Elasticsearch was built to solve. ↓
Part III

Enter Elasticsearch

A distributed system built on top of Lucene that solves
every limitation we just discussed.

Solutions

How ES Solves Each Problem

Single-Node Only

Lucene runs on one machine.

Sharding → Horizontal Scaling

Split into shards (each a Lucene index), distribute across nodes. 10TB? 10 nodes with 1TB each.

No High Availability

One server down = total outage.

Replication → Fault Tolerance

Replica shards on different nodes. Primary dies? Replica promoted automatically. Zero downtime.

Commit is Expensive

fsync after every doc is too slow.

Translog → Cheap Durability

Append-only transaction log, cheap to fsync. Full Lucene commits happen infrequently in the background.

Not Real-Time

Docs invisible until commit.

Refresh → Near Real-Time

New segments written to OS page cache every 1s. Searchable via NRT reader, no fsync needed.

Architecture

Cluster, Nodes, Shards

CLUSTER: "my-app"
Node 1 (master+data)
P0 primary
Seg0 Seg1 Seg2
R1 replica
Seg0 Seg1
Node 2 (data)
P1 primary
Seg0 Seg1
R0 replica
Seg0 Seg1 Seg2
Node 3 (data)
P2 primary
Seg0
R2 replica
Seg0
Primary shard Replica shard SegLucene segment

Cluster

One or more nodes working together. Identified by a cluster name. Nodes auto-discover each other.

Node Roles

master cluster state   data stores shards
ingest pipelines   coordinating routes requests

Shard

Each shard = one Lucene index. ES distributes many Lucene indexes across machines. Each contains its own segments.

Routing

shard = hash(_routing) % num_primary_shards
Default: _routing = _id. This formula is why the primary shard count is fixed at index creation: changing num_primary_shards would route existing documents to different shards.
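The routing formula in miniature. ES actually hashes _routing with murmur3; crc32 below is a deterministic stand-in for illustration:

```python
import zlib

def route(routing_key, num_primary_shards):
    # shard = hash(_routing) % num_primary_shards
    # (real ES uses murmur3; crc32 is just a deterministic stand-in)
    return zlib.crc32(routing_key.encode()) % num_primary_shards
```

Changing the shard count sends the same _id to a different shard for most keys, so every document would have to be re-routed: hence the count is fixed at creation.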

NRT Search

Refresh: Near Real-Time Search

Lucene's trick: new segments can be opened from the filesystem cache without a full commit.

Client
ES
Buffer
Translog
Segment
Disk
PUT /products/_doc/1
Add to buffer Append translog (fsync!)
201 Created
Search → NOT FOUND YET
REFRESH refresh_interval = 1s Write new segment No fsync needed! Just FS cache.
Search again → FOUND!
CRASH
Replay from translog Lucene can't do this — ES translog adds durability
FLUSH fsync to disk Translog cleared
Durability

Translog: The Transaction Log

Refresh only writes to the filesystem cache — not durable. If the node crashes, unflushed data is lost. The translog prevents this.

graph TB subgraph WRITE["Every Write Operation"] DOC["New Document"] DOC --> BUF["In-Memory
Buffer"] DOC --> TLOG["Translog
(fsync to disk!)"] end subgraph REFRESH_S["After Refresh"] BUF2["Buffer emptied"] SEG["New Segment
(FS cache only)"] TLOG2["Translog
still kept!"] end subgraph FLUSH_S["After Flush"] SEG2["Segment on disk
(fsync'd)"] TLOG3["New empty
translog"] end BUF -->|"refresh"| SEG TLOG -->|"kept"| TLOG2 SEG -->|"flush (fsync)"| SEG2 TLOG2 -->|"truncated"| TLOG3 CRASH["💥 Crash?
Replay translog → zero data loss"] TLOG2 -.->|"recovery"| CRASH style CRASH fill:#12121A,stroke:#FF4D6A,color:#FF4D6A style WRITE fill:#1a1a2e,stroke:#FEC514,color:#E8E8F0 style REFRESH_S fill:#1a1a2e,stroke:#00BFB3,color:#E8E8F0 style FLUSH_S fill:#1a1a2e,stroke:#00D68F,color:#E8E8F0 style DOC fill:#12121A,stroke:#FEC514,color:#FEC514 style BUF fill:#12121A,stroke:#F04E98,color:#F04E98 style TLOG fill:#12121A,stroke:#F04E98,color:#F04E98 style BUF2 fill:#12121A,stroke:#555577,color:#555577 style SEG fill:#12121A,stroke:#00BFB3,color:#00BFB3 style TLOG2 fill:#12121A,stroke:#00BFB3,color:#00BFB3 style SEG2 fill:#12121A,stroke:#00D68F,color:#00D68F style TLOG3 fill:#12121A,stroke:#00D68F,color:#00D68F

How It Works

1

Dual write

Every operation goes to both in-memory buffer and translog.

2

Translog is fsync'd

Fsync'd after every op by default. Append-only writes stay fast.

3

Survives refresh

Translog kept after refresh — segment isn't committed yet.

4

Cleared on flush

Truncated when Lucene commit (flush) makes segments durable.
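The four steps above as a toy model (not the real implementation; real ES fsyncs the translog file and opens NRT readers over segment files):

```python
class Shard:
    """Toy model of buffer / segments / translog."""
    def __init__(self):
        self.buffer = []      # JVM heap: indexed but not searchable
        self.segments = []    # searchable after refresh (FS cache, then disk)
        self.translog = []    # append-only log, fsync'd on every op in real ES

    def index(self, doc):
        self.buffer.append(doc)       # dual write:
        self.translog.append(doc)     # ...buffer + translog

    def search(self, doc):
        return any(doc in seg for seg in self.segments)

    def refresh(self):
        if self.buffer:
            self.segments.append(list(self.buffer))  # new segment, no fsync
            self.buffer.clear()                      # translog is kept!

    def flush(self):
        self.refresh()
        self.translog.clear()   # segments fsync'd, so translog is truncated

    def recover(self):
        # toy recovery: assume nothing was flushed; replay the translog
        self.buffer, self.segments = list(self.translog), []
        self.refresh()
```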

Write Path

A Write Request: End to End

Client
Coord Node
Primary
Replica 1
Replica 2
PUT /products/_doc/42
hash(_id) % shards
VALIDATE version + mapping check
INDEX LOCALLY buffer + translog (fsync)
PARALLEL Replicate Replicate
ACK ✓ ACK ✓
All in-sync replicas OK 201 Created
Coord Routes via hash(_id) % shards
Primary Validates, indexes, replicates in parallel
In-Sync Only replicas in the in-sync set must ACK — lagging replicas get removed from the set, not block writes
Read Path

A Search Request: Scatter & Gather (Adaptive Replica Selection)

Client
Coord Node
Shard 0
Shard 1
Shard 2
GET /products/_search q=laptop
PHASE 1 — QUERY (SCATTER) top N? top N? top N?
FST→postings→BM25 FST→postings→BM25 FST→postings→BM25
[doc IDs + scores] [doc IDs + scores] [doc IDs + scores]
Merge → global top N
PHASE 2 — FETCH (GATHER) Fetch _source Fetch _source
_source docs _source docs
200 OK {hits:[...]}
Query Each shard returns doc IDs + scores only — lightweight
Fetch Only winning shards return _source — avoids unnecessary I/O
ARS Each shard copy ranked by response_time — routes to fastest replica
Smart Routing

Adaptive Replica Selection

Each shard has multiple copies (primary + replicas). Which copy should handle the search? ARS picks the fastest one.

BEFORE Round Robin ES < 6.1
Shard 0 (R)
Node 2
5ms
Shard 0 (P)
Node 1 (busy!)
200ms
Shard 0 (R)
Node 3
8ms
Round robin hits Node 1 every 3rd request → p99 latency spikes to 200ms
AFTER Adaptive Replica Selection ES 6.1+
Shard 0 (R)
Node 2
5ms
SELECTED
Shard 0 (P)
Node 1 (busy!)
200ms
SKIPPED
Shard 0 (R)
Node 3
8ms
STANDBY
ARS avoids the slow node → consistent low latency

How ARS Ranks Each Copy

The coordinating node maintains a score for every shard copy, updated after each response.

1 Response Time

EWMA (exponentially weighted moving average) of past response times. Recent performance weighted more heavily.

2 Queue Size

Number of in-flight requests to that node. High queue = node is overloaded, penalized in ranking.

3 Service Time

EWMA of time the node actually spent processing (excludes network). Separates slow nodes from slow networks.

Formula: rank(s) = queue_size(s) × service_time(s) + response_time(s)
Lowest rank wins. Automatically adapts to GC pauses, hot nodes, disk I/O spikes, and uneven shard sizes.
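The slide's simplified ranking as code; the production formula also weights the queue estimate non-linearly, so treat this as an illustration:

```python
def ewma(prev, sample, alpha=0.3):
    # exponentially weighted moving average: recent samples count more
    return alpha * sample + (1 - alpha) * prev

def ars_rank(queue_size, service_time, response_time):
    # slide's simplified formula; lowest rank wins
    return queue_size * service_time + response_time

# Hypothetical numbers from the previous slide (times in ms)
copies = {
    "node1": ars_rank(queue_size=8, service_time=150.0, response_time=200.0),
    "node2": ars_rank(queue_size=1, service_time=4.0,   response_time=5.0),
    "node3": ars_rank(queue_size=2, service_time=6.0,   response_time=8.0),
}
best = min(copies, key=copies.get)   # the busy node1 is skipped
```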

Fault Tolerance & Summary

When Things Go Wrong

Failure Scenarios

Primary Shard Dies

Master promotes replica to primary → retries op → allocates new replica elsewhere.

Replica Shard Dies

Removed from in-sync set → primary continues → master rebuilds replica on another node.

Stale Primary (Network Partition)

Replicas reject writes (term mismatch) → old primary discovers replacement → ops rerouted.

Read Failure

Coordinator retries on another shard copy → if all fail, returns partial results.

Putting It All Together

Lucene Concept | ES Layer
Inverted Index | Core search data structure in every shard
Segment (immutable) | Written on refresh (every 1s), merged in background
Commit Point | Created during flush, not after every write
- | Translog fills the durability gap between refreshes and commits
Lucene Index | = 1 ES Shard (primary or replica)
- | ES Index = collection of shards distributed across nodes

The Big Picture

Lucene gives us the search. Translog gives us durability. Refresh gives us near real-time.
Shards give us scale. Replicas give us availability. Cluster ties it all together.

Thank You

Questions?