This is the first post describing our ongoing experiments with dev/ops for containerized, clustered backend systems.
Oops - I was looking this morning at one of my pre-production Ochothon clusters and noticed something was off:
This alert is triggered by the automated cluster sanity check that is part of my Zookeeper image (the coordinating pod issues an MNTR command to each server within its cluster and makes sure its state is either leader or follower).
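For reference, that check boils down to sending ZooKeeper's four-letter mntr command to each node and looking at the reported zk_server_state. Here is a minimal sketch of the idea in Python; the node list and port are placeholders, and the real pod obviously does more than this:

```python
import socket

def zk_state(host, port=2181, timeout=5.0):
    """Send ZooKeeper's 'mntr' four-letter command and return zk_server_state."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b'mntr')
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # mntr replies with tab-separated key/value pairs, one per line
    stats = dict(line.split('\t', 1)
                 for line in b''.join(chunks).decode().splitlines()
                 if '\t' in line)
    return stats.get('zk_server_state')

# hypothetical node list - the real pod discovers its peers dynamically
nodes = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
states = {host: zk_state(host) for host in nodes}

# the ensemble is healthy only if every node reports leader or follower
assert all(state in ('leader', 'follower') for state in states.values()), states
```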
Sure enough, my Kafka brokers started to complain as well. So, what do you do in such a case? Well, in the prehistoric times before Mesos, Marathon, Kubernetes and Ochopod we would have logged into each VM and tinkered around ... probably breaking stuff along the way. Not anymore!
First, I checked which node was having an issue by simply asking for the internal pod log from my CLI:
Exit 1 ... probably some local hiccup, typically corruption of the snapshot files in /var/lib. The other nodes are fine. OK, let's stop node #1, wipe out its data and restart it gracefully. Still from my CLI:
At this point node #1 is still up (i.e. the container and its corresponding Mesos task are still up & running), but the sub-process it manages is turned off. The CLI port command is handy to find out which Mesos slaves my containers are running on:
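If you ever want to double-check that mapping outside the CLI, the Mesos master exposes the same information through its state endpoint. A rough sketch, assuming the master answers on the default port 5050 (the hostname below is a placeholder):

```python
import json
import urllib.request

# hypothetical master address - replace with your own
MASTER = 'http://mesos-master:5050'

# the master's state endpoint lists every framework, its tasks and the slave each runs on
with urllib.request.urlopen(MASTER + '/master/state.json') as response:
    state = json.loads(response.read().decode())

# map slave ids back to hostnames
slaves = {slave['id']: slave['hostname'] for slave in state['slaves']}

for framework in state['frameworks']:
    for task in framework['tasks']:
        print(task['name'], '->', slaves.get(task['slave_id'], '?'))
```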
After logging into the container (which is no longer running its /bin/zkServer.sh process) and removing /var/lib/zookeeper/version-2, we can switch it back on. Let's look at the summary again:
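The wipe itself is nothing fancy: with the sub-process off, you blow away ZooKeeper's version-2 directory (where it keeps its snapshots and transaction logs) and turn the node back on. Something along these lines, assuming the dataDir layout used in this image:

```python
import shutil
from pathlib import Path

# ZooKeeper keeps snapshots and transaction logs under <dataDir>/version-2;
# this path matches the image described in this post
data_dir = Path('/var/lib/zookeeper/version-2')

if data_dir.exists():
    # only safe while the zkServer.sh sub-process is switched off
    shutil.rmtree(data_dir)
```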
Phew, that did the trick and the ensemble is healthy again! I might actually hook the health check up to an alerting system such as PagerDuty soon.
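As a teaser, wiring the sanity check into PagerDuty would be little more than posting a trigger event whenever a node reports something other than leader or follower. A hedged sketch against the PagerDuty Events API v2 (the routing key is obviously a placeholder, and this is not part of the pod today):

```python
import json
import urllib.request

def page(summary, routing_key):
    """Fire a PagerDuty trigger event via the Events API v2."""
    event = {
        'routing_key': routing_key,        # placeholder - your integration key
        'event_action': 'trigger',
        'payload': {
            'summary': summary,
            'source': 'zookeeper-sanity-check',
            'severity': 'critical',
        },
    }
    request = urllib.request.Request(
        'https://events.pagerduty.com/v2/enqueue',
        data=json.dumps(event).encode(),
        headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(request)

# e.g. page('zookeeper node #1 is neither leader nor follower', '<integration key>')
```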