Managing Cloudbees ElasticSearch

ElasticSearch is one of the most popular NoSQL databases there is. It’s fast, it’s easy to set up and it’s easy to get data in and out of your cluster. Coupled with Kibana, it’s just as easy to visualize your data with nice charts and graphs. All of this is undoubtedly why Cloudbees chose to couple their Jenkins setup with the ELK stack. More importantly, it’s one of the reasons we chose to use it in our dev environment.

Allow me to set the stage a bit. On a given day our dev team runs around 1400-1500 Jenkins jobs. Some are stand-alone builds and a good number of them are pipeline jobs. That’s the average for us. When we’re crazy busy, that number increases well past 2000 jobs. The analytics provided by Kibana are critical for us, but the charting is lightly used (mostly in monthly cadence or executive meetings). This makes our cluster primarily write-heavy.

Unfortunately the default setup for Cloudbees ElasticSearch is not tailored for speed or growth. After a while your servers will become bogged down with unnecessary garbage collection and the JVMs will crater under pressure. If you don’t mind wasting company resources you can just add more servers to the cluster. Under the instructed default setup, my cluster of three VMs was crashing regularly and the datastore was less than 10GB. I refused to believe that three four-core VMs with 32GB of RAM couldn’t handle 10GB, so I rolled my sleeves up and started digging into the configuration.

I’m specifying Cloudbees ElasticSearch in this article because they’ve coupled Jenkins with ES version 1.7.5. To put that in context, the current stable version (as of this writing) is 6.4.0 and the advice given here may or may not apply to the current release. Without further ado, let’s dig in.

How To Configure Your JVM

The amount of RAM you configure for Java can be a double-edged sword. Too little RAM and your JVM crashes during garbage collection. Too much memory and you endure the long pauses of application death. The trick is finding that sweet spot. For our environment, it is 32GB of RAM. That’s 20 for the JVM and the remaining 12 for the [Linux] OS. You can configure the heap using the ES_HEAP_SIZE variable or with the ES_JAVA_OPTS variable. If you want to err on the side of caution you can aim on the high end; just be aware that you’ll want to deallocate what you’re not using later. You are monitoring your JVMs, right?
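
Here’s a rough sketch of what that looks like, assuming your service picks up its environment from a file like /etc/default/elasticsearch (the exact file depends on how your Cloudbees service is launched):

export ES_HEAP_SIZE=20g
# or, depending on how the startup scripts assemble the JVM options:
# export ES_JAVA_OPTS="-Xms20g -Xmx20g"

ES_HEAP_SIZE pins the minimum and maximum heap to the same value, which keeps the heap from resizing itself under load.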

In addition to finding the best memory values, you’ll want to add the following java startup parameters:

-XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark

These options are not part of the default CB setup so you have to add them yourself. They instruct the garbage collector to collect the young generation before performing a full garbage collection or initiating the CMS remark phase. This heavily improves GC performance since there’s no need to check references between the young generation and old generation pools. The GC can get right to work collecting what it needs to.
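
One place to add them is ES_JAVA_OPTS, which the 1.x startup scripts append to the JVM arguments (adjust for however your Cloudbees service builds its command line):

export ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark"

Both flags assume the CMS collector, which is what ES 1.x runs with by default.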

Configure Your Storage Disk and Disk Options

Are you running SSDs? If not, you should probably beg for them. They’re worth the money. Sadly, we have to be budget-conscious and our environment is virtualized on a SAN array. It’s still fast, but SSDs would make a physical cluster the shiny sports car it deserves to be. Because we run VMs without traditional disks, the following Linux disk option is essential:

echo noop > /sys/block/hda/queue/scheduler

Setting your disk scheduler to “noop” allows the hypervisor to fully manage the disk activity as opposed to letting Linux do it. Since your hypervisor is going to do it anyway, why have two scheduling algorithms competing with each other? The heavier your write activity, the more you’ll benefit from this setting.
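
Two caveats to check on your own hosts: the block device may be sda or vda rather than hda, and the echo above only lasts until the next reboot. A sketch of verifying and persisting it, assuming a grub-based boot setup:

cat /sys/block/sda/queue/scheduler   # the active scheduler is shown in brackets
echo noop > /sys/block/sda/queue/scheduler   # takes effect immediately, but is lost on reboot
# to persist it, add elevator=noop to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate the grub config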

The most important thing you can configure for your ElasticSearch disk storage is the index sharding value in your yml file. “Shards” are ElasticSearch’s way of splitting up indexes for better performance. If you’re only using your ElasticSearch cluster with Jenkins, the default number_of_shards setting of 5 isn’t just wrong for that workload, it’s insane. As I mentioned before, we run around 1400 jobs on average per day. Even with that number of jobs, our indexes are typically 1-3MB since they just contain metadata for the builds and your Jenkins performance statistics. Some indexes will go a bit higher, but not by much, and we even have indexes that come in as low as a few hundred KB. Think about that last part for a moment: a few hundred kilobytes for an index. If such a small index is split into 5 shards, what’s that going to do to your JVM overhead?

Configuring our number_of_shards variable with a value of 1 was absolutely essential in stabilizing the cluster. Two shards is probably optimal, but I wouldn’t tread past three without carefully evaluating the index sizes and read/write needs of the cluster. Five shards makes sense if your indexes run into the hundreds of megabytes, but keep this in mind: the greater the shard count, the greater the burden on your JVM and the larger your cluster needs to be.
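
In ES 1.x the default can be set cluster-wide in elasticsearch.yml (newer releases dropped this setting from the yml file in favor of index templates). A minimal sketch:

index.number_of_shards: 1

Keep in mind this only applies to indexes created after the change; existing indexes keep the shard count they were created with.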

Is Your Cluster Growing?

Once you hit a 3-node cluster due to growth, do yourself a favor and configure a fourth node to be the dedicated master. The master should *not* store data; just allow it to manage all of the get and put requests. With a node configured for this purpose, you stand a much greater chance of stability. Let any other node in the cluster crash, but you want your master alive at all times to field requests to the living nodes.
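
In 1.x that boils down to a pair of settings in elasticsearch.yml on the node you want to dedicate (a sketch; your config layout may differ):

node.master: true
node.data: false

The data-only nodes are typically the mirror image (node.master: false, node.data: true).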

If you’re using your ElasticSearch cluster to index data beyond Jenkins, you should probably upgrade to a 6.x version. Cloudbees won’t support it, but they do offer a new plugin for connectivity. With a newer ES version, you’ll have a cluster that’s less buggy and offers better memory management.