Ability to withstand or recover quickly from difficult conditions.
Until recently, not very well documented...

1 ElasticSearch Index <=> N shards
five-node cluster (ElasticSearch version: 1.4.0)
JMeter (slightly) randomized test plan
single-node cluster for monitoring
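As an illustration, a minimal sketch of checking the test cluster over the REST API, assuming a node is reachable on localhost:9200 (the address is not part of the original setup):

```python
import requests

ES_URL = "http://localhost:9200"  # hypothetical address of one of the five nodes

# _cluster/health reports the status (green/yellow/red), node count and shard counts
health = requests.get(ES_URL + "/_cluster/health").json()
print(health["status"], health["number_of_nodes"], health["active_shards"])
```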
IRL: server unplugged, hard-disk drive failure, etc.
Simulation: Stopping a node
Nada!
Two types of shards: primaries and replicas
Cluster status
What if both nodes hosting PRIMARY and REPLICAS go down?
> Unless they all go down within a short timeframe, you are fine!
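For reference, a hedged sketch of creating an index with explicit primary and replica counts; the index name and the counts are made up for the example:

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# Hypothetical index with 5 primary shards, each with 1 replica.
# A primary and its replica are never allocated on the same node,
# so losing a single node does not lose any data.
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}
requests.put(ES_URL + "/demo", json=settings)
```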
If we zoom in on one node and then on one shard...
| term | matching documents |
|---|---|
| dog | doc1, doc45 |
| eat | doc1, doc6, doc30 |
| whatever | doc5 |
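To make the structure concrete, a toy version of the same inverted index in plain Python, mirroring the table above:

```python
# A toy inverted index: each term maps to the documents that contain it.
inverted_index = {
    "dog": ["doc1", "doc45"],
    "eat": ["doc1", "doc6", "doc30"],
    "whatever": ["doc5"],
}

# Finding the documents matching a term is a single lookup.
print(inverted_index.get("eat", []))  # ['doc1', 'doc6', 'doc30']
```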

New documents in the in-memory buffer, ready to commit
A new segment is added to the commit point and the buffer is cleared

Documents are being written to the buffer AND the translog

The reader has been refreshed: the buffer is cleared, but not the translog

Other documents being ingested

At some point, the translog is fsync'ed = "commit"
Sure, you could fsync more frequently… …at the cost of indexing performance! And it is unlikely to provide more reliability at the end of the day
Writes to a file will not survive a reboot/crash
until the file has been fsync'ed to disk
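As an illustration, a hedged sketch of triggering a commit by hand with the flush API (index name and address are assumptions); in normal operation ElasticSearch schedules flushes itself:

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# _flush performs a Lucene commit and clears the translog for the index.
# Documents indexed since the last commit still survive a crash, because
# they sit in the translog, which is itself fsync'ed regularly.
requests.post(ES_URL + "/demo/_flush")
```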
The conjunction of the following mechanisms lowers the odds of data loss:
Maintain the global cluster state:
(in critical clusters)
We want to prevent a split brain from emerging
minimum_master_nodes = (n / 2) + 1
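A hedged sketch of applying that setting at runtime through the cluster settings API, assuming the five-node cluster above (it can also be set in elasticsearch.yml):

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# With 5 master-eligible nodes: (5 / 2) + 1 = 3.
# A master can only be elected when at least 3 master-eligible nodes are
# reachable, so a minority partition cannot form a second cluster.
body = {"persistent": {"discovery.zen.minimum_master_nodes": 3}}
requests.put(ES_URL + "/_cluster/settings", json=body)
```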
The brain-melting issue
Issue #2488 on GitHub
Fixed in v1.4.0
What if a node takes a long time to answer?
> It is considered down.
Memory-hungry software
Result: very large heaps result in long GCs
Reason: it caches lots of data
Example: filters
Used for:
Recommendations:
Use doc_values instead of fielddata (see the mapping sketch after this list)
Beware of cache sizes!
Don't make it swap!
Tests, tests, tests!
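A hedged sketch of the doc_values recommendation above; the index name, type and field are hypothetical:

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# Hypothetical mapping: a not_analyzed string field backed by doc_values,
# so sorting and aggregating on it read from disk instead of loading
# fielddata onto the heap.
mapping = {
    "mappings": {
        "event": {
            "properties": {
                "user": {"type": "string", "index": "not_analyzed", "doc_values": True}
            }
        }
    }
}
requests.put(ES_URL + "/events", json=mapping)
```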
OutOfMemoryError = node killer
ElasticSearch lets you avoid it: circuit-breakers
- v1.0: fielddata circuit-breaker
- v1.4: request and parent (global) circuit-breakers
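A hedged sketch of tuning the breaker limits at runtime; the percentages shown are the v1.4 defaults, made explicit for the example:

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# Circuit-breaker limits (v1.4 setting names): a request that would push
# memory usage past a limit is rejected instead of killing the node.
body = {
    "persistent": {
        "indices.breaker.fielddata.limit": "60%",
        "indices.breaker.request.limit": "40%",
        "indices.breaker.total.limit": "70%",
    }
}
requests.put(ES_URL + "/_cluster/settings", json=body)
```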
Hardware failure, JVM bug, travelling over the wire...
If not detected, could be replicated everywhere
Let's try with a corrupted snapshot...
Error! (and red status)
Since Lucene v4.8...
Lucene 4.8 <=> ElasticSearch 1.2.0
1.3.x: snapshot/restore and recovery
1.4.0: translog!
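For reference, a hedged sketch of the snapshot/restore calls behind a test like the one above; the repository name, path and snapshot name are made up:

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# Register a shared-filesystem repository, take a snapshot, then restore it.
# With checksum verification on snapshot/restore (1.3.x), a corrupted
# segment is reported as an error instead of being restored silently.
repo = {"type": "fs", "settings": {"location": "/mnt/backups/demo"}}
requests.put(ES_URL + "/_snapshot/demo_repo", json=repo)

requests.put(ES_URL + "/_snapshot/demo_repo/snap_1?wait_for_completion=true")

# (in practice the target index must be closed or absent before the restore)
requests.post(ES_URL + "/_snapshot/demo_repo/snap_1/_restore")
```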
That's really not good for the node
Feature added in v0.90.4
Enabled by default since v1.3.0
The cluster is aware of disk usage
There are 2 tiers: a low watermark and a high watermark
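A hedged sketch of adjusting the two watermarks through the cluster settings API; the values shown are the defaults, made explicit:

```python
import requests

ES_URL = "http://localhost:9200"  # assumed node address

# Disk-based shard allocation thresholds:
#  - low watermark: no new shards are allocated to a node above this usage
#  - high watermark: shards are actively moved away from the node
body = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}
requests.put(ES_URL + "/_cluster/settings", json=body)
```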
Since mid-2014:
From:
To:
Use cases summary