We have been working with Elasticsearch for a number of months now. We noticed that sometimes CPU usage spikes to 100% for a single index, and the only way to solve the issue is to delete and rebuild the index. The last time this happened to me, I decided to dig deeper. What I found is that Elasticsearch appears to be sensitive to improper shutdown, in other words, when the process is killed instead of being allowed to shut down gracefully by stopping the service.
The symptom, as I mentioned, is very high CPU usage. I ran the following command:
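One way to list shard states is the `_cat/shards` API; the host and port below are assumptions about a default local install:

```shell
# List every shard with its state (shards stuck recovering show up as INITIALIZING).
curl -s 'http://localhost:9200/_cat/shards?v'
```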
I found that some indexes were in the initializing state; it appears that they were corrupted. I went to the folder for those shards. They are located under the Elasticsearch install folder, then the data folder, then nodes, then a node number such as 0, then indices, then the index name, then a number which is the shard number, then the translog folder. Something like the following:
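Putting the steps above together, the path looks like this (placeholder names in angle brackets):

```
<elasticsearch install folder>/data/nodes/0/indices/<index name>/<shard number>/translog
```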
What I found there was a translog file with the word “recovering” appended to the name. I stopped the service and deleted all the translog files ending in “recovering” for every shard stuck in the initializing state, then restarted the Elasticsearch service. Voila, the CPU issue was gone. There is almost certainly some data loss associated with this fix; however, based on the small translog size and the data I queried afterwards, the loss appears to be minimal, if any.
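The cleanup can be sketched as follows. The directory layout and file names are assumptions based on the description above, and the sketch runs against a throwaway mock tree so it is safe to try; point `ES_DATA` at your real data folder only after stopping the service:

```shell
# Mock data directory mirroring the layout described above (an assumption,
# not a real install); swap in your actual data folder once the service is stopped.
ES_DATA="$(mktemp -d)/data"
mkdir -p "$ES_DATA/nodes/0/indices/myindex/0/translog"
touch "$ES_DATA/nodes/0/indices/myindex/0/translog/translog-1" \
      "$ES_DATA/nodes/0/indices/myindex/0/translog/translog-1.recovering"

# Dry run first: list the leftover recovery files without touching anything.
find "$ES_DATA" -path "*/translog/*" -name "*recovering*" -print

# Stop the service before this step (e.g. `sudo systemctl stop elasticsearch`),
# then delete the stuck recovery files. Regular translog files are untouched.
find "$ES_DATA" -path "*/translog/*" -name "*recovering*" -delete

# Verify: only the normal translog file remains.
find "$ES_DATA" -path "*/translog/*" -type f
```

Restart the service afterwards and watch CPU usage for the affected index.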
I hope this helps someone else.
Thanks and enjoy,