Running Elasticsearch Cluster On Spot Instances

Contributor: Azitabh Ajit

Elasticsearch is great. In fact, it is the most used datastore in its category as per db-engines. It has comfortably outshone apache-solr in terms of adoption since its launch in 2010.

It has done so despite using the same underlying technology, i.e. apache-lucene. Along with many other reasons, it has been able to do so because scalability and fault tolerance are built into its core. These are the qualities that enabled us to consider running an elasticsearch cluster on spot instances.

The need

Elasticsearch, like apache-solr, is resource intensive. It needs a good amount of memory to deliver the performance required for powering a low-latency website like Housing. We at Housing rely on elasticsearch to deliver the most critical pages on our website. This requires a cluster of several nodes, each with 16 GB of memory (this works best for us). If we ran all these nodes on on-demand ec2 instances, it would cost us a few thousand dollars every month. We had a couple of options at our disposal to bring this recurring cost down:

    1. Use reserved instances.
    2. Use spot instances.

To maximize the savings without compromising on availability, we decided to use a mixed strategy. Those not aware of what spot/reserved instances are or how they offer cost benefits can go through their respective documentation: spot-documentation and reserved-documentation.
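To make the trade-off concrete, here is a back-of-the-envelope sketch of the monthly cost under each pricing model. The hourly rates are hypothetical placeholders, not actual AWS prices:

```python
# Rough monthly cost per data node under each pricing model.
# All hourly rates below are made-up placeholders for illustration.
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(hourly_rate: float, nodes: int = 1) -> float:
    """Monthly cost in dollars for `nodes` instances at `hourly_rate` $/hr."""
    return hourly_rate * HOURS_PER_MONTH * nodes


on_demand = monthly_cost(0.30, nodes=10)  # assumed on-demand rate
reserved = monthly_cost(0.19, nodes=10)   # assumed 1-year reserved rate
spot = monthly_cost(0.09, nodes=10)       # assumed average spot rate

print(f"on-demand: ${on_demand:.0f}/mo")
print(f"reserved:  ${reserved:.0f}/mo")
print(f"spot:      ${spot:.0f}/mo")
```

Even with half the fleet on reserved instances and half on spot, numbers in this ballpark explain how the bill can drop to well under half of the all-on-demand figure.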

The checkpoints

While trying to reduce cost, we had to be absolutely sure of the following:

    1. The cluster should be able to auto-heal itself in the worst possible scenario presented by the termination of spot instances.
    2. There should be no compromise on the availability of the cluster.
    3. There should never be any data loss.

Based on the above requirements, we decided to use different strategies for different components of the cluster.

Treatment meted out to different components

As per the best practices recommended here, we have two types of nodes in our cluster:

    1. Master-eligible nodes: These are a set of 3 (to avoid split-brain) t2.small instances capable of acting as master. These instances don’t hold any data and don’t do any heavy computation, but are critical for the availability of the cluster. Since these instances don’t incur much cost either, we didn’t move them to spot but used reserved instances instead.

    2. Data nodes: These instances store all the data and coordinate queries involving multiple shards for all user requests. It’s these instances that are chiefly responsible for the cost incurred. We decided to use a mixed approach here: we kept some of these nodes on reserved instances and moved the rest to spot instances. The exact ratio will become clear in the sections ahead.
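A minimal sketch of how the two roles map to elasticsearch.yml, assuming a pre-7.x cluster where roles are set via the node.master/node.data flags (values here are illustrative):

```yaml
# Master-eligible nodes (t2.small, reserved): can be elected master,
# hold no data, serve no queries.
node.master: true
node.data: false

# Data nodes (reserved + spot mix) would instead use:
#   node.master: false
#   node.data: true

# With 3 master-eligible nodes, a quorum of 2 avoids split-brain
# (pre-7.x setting; newer versions manage the quorum automatically).
discovery.zen.minimum_master_nodes: 2
```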

Ensuring availability

Availability has two components:

    1. Read availability: The cluster is able to serve read requests if at least one copy of every shard is available.
    2. Write availability: Similarly, elasticsearch by default is available for writes if one copy (the primary) of every shard is available. This can be changed to enhance consistency, but we have kept the default behaviour.

Our use case is very read-heavy, so we keep 3 replicas of every shard. This means we have 4 copies of the data. Elasticsearch by design never places two copies of the same shard on the same node, so we can afford the simultaneous failure of 3 nodes in our cluster while maintaining availability. We can, in fact, afford to lose more nodes if the failures aren’t simultaneous: when a node fails, elasticsearch creates fresh copies of its shards from the surviving copies and places them on other nodes in the cluster, as long as there are nodes that don’t already hold the same copy.

The takeaway: We can afford the simultaneous failure of 3 nodes in the cluster.
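The replica count behind this takeaway is a per-index setting, applied with a request like `PUT /<index>/_settings` (the index name is yours to fill in). The request body is simply:

```json
{
  "index": {
    "number_of_replicas": 3
  }
}
```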

Mixed approach for data nodes

We decided to keep half of our data nodes on reserved instances. This is no magic number; we are just making our move gradually, so the number of instances on spot might increase in the future.

We moved the other half to a fleet of spot instances. We took care of the following to ensure availability:

    1. We added instances of various types and capacities in the fleet and allowed the creation of spots in all available availability zones.
    2. We set the bid price to the highest value, to suffer the fewest terminations.
    3. We kept the allocation strategy to diversified.

Looking at the price history of spot instances, it is highly unlikely that the prices of multiple instance types will ever touch the bid price simultaneously in all availability zones. A few terminations can still happen one by one, and elasticsearch is capable of self-healing in such scenarios.
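The three points above can be sketched as a spot fleet request config. The instance types, AMI ID, role ARN, zones, bid price, and capacity below are placeholders for illustration, not our actual values:

```json
{
  "SpotFleetRequestConfig": {
    "AllocationStrategy": "diversified",
    "SpotPrice": "0.30",
    "TargetCapacity": 4,
    "IamFleetRole": "arn:aws:iam::123456789012:role/my-fleet-role",
    "LaunchSpecifications": [
      {
        "InstanceType": "r4.xlarge",
        "ImageId": "ami-xxxxxxxx",
        "Placement": { "AvailabilityZone": "ap-south-1a" }
      },
      {
        "InstanceType": "m4.xlarge",
        "ImageId": "ami-xxxxxxxx",
        "Placement": { "AvailabilityZone": "ap-south-1b" }
      }
    ]
  }
}
```

With "diversified" allocation, the fleet spreads capacity across all the listed specifications, so a price spike in one instance type or zone takes out only a slice of the fleet.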

The outcome

We have been able to reduce the cost to less than half of the original value. We have been running this way for quite some time now and are yet to face any issues.