Clusterf*ck

The Clustering Clinic

Presented by Dani Traphagen / @dtrapezoid, Michael Perrone & Sebastian Estevez / @syllogistic

Introduction

There were some serious issues with this cluster of pain. We rolled up our sleeves and got down to bidness, troubleshooting this hot mess.

Arriving at AHA MOMENT #1

We first viewed OpsCenter and the cluster topology, and next we...

Ran JMeter

Just to check we could fire it up.

cassandra.yaml

The seed node lists weren't the same on every node, but that wasn't the problem, because the seed list only matters during bootstrapping.
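Even though the seeds turned out to be a red herring, spotting the mismatch by eye across many nodes is tedious. Here is a minimal sketch of how that comparison could be automated, assuming the text of each node's cassandra.yaml has already been collected into a dict. The function names and the regex-based parsing are our own illustration, not part of any Cassandra tooling.

```python
import re


def parse_seeds(yaml_text):
    """Return the set of seed addresses found in the text of a cassandra.yaml."""
    # Matches an uncommented line such as:   - seeds: "10.0.0.1,10.0.0.2"
    match = re.search(r'^\s*-\s*seeds:\s*"?([^"\n]*)"?\s*$', yaml_text, re.MULTILINE)
    if match is None:
        return set()
    return {addr.strip() for addr in match.group(1).split(",") if addr.strip()}


def seed_mismatches(configs):
    """Return {node: seeds} for nodes whose seed set differs from the most common one.

    `configs` maps a node name (hypothetical labels) to its cassandra.yaml text.
    """
    seeds_by_node = {node: frozenset(parse_seeds(text)) for node, text in configs.items()}
    counts = {}
    for seeds in seeds_by_node.values():
        counts[seeds] = counts.get(seeds, 0) + 1
    majority = max(counts, key=counts.get)
    return {node: set(seeds) for node, seeds in seeds_by_node.items() if seeds != majority}
```

An empty result means every node agrees on its seed list. Anything else is the set of nodes worth a second look.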

Red Herring

When in doubt, CHECK ALL THE FILES!

Checked the Cassandra log and ran Cassandra in the foreground. AHA, HUGEFILE is HUGE!
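The slide doesn't say which file was huge or where it lived, so the following is only a generic sketch of how one might hunt for oversized files under a directory. The function name and the choice of root directory are assumptions for illustration.

```python
import os


def largest_files(root, limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(sizes, reverse=True)[:limit]
```

Point `root` at whichever data or log directory is under suspicion on the node.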

When we got the down node up:

Datacenter 2 only showed 6 of 8 agents, so we checked that the DataStax agent was running (and restarted it in case it wasn't).

Check moar things!

  • DataStax agent log
  • nodetool status
  • Agent config
  • cassandra.yaml
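For the nodetool status step, a short sketch of how the output could be summarized per datacenter, so a missing or down node stands out. It assumes the usual layout of `nodetool status` (a `Datacenter:` header followed by rows starting with a two-letter status/state code such as UN or DN). The function names are ours.

```python
import re

# A node row starts with a status/state code and an address, e.g. "UN  10.0.0.1 ..."
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def summarize_status(output):
    """Group node addresses by datacenter and status code from `nodetool status` text."""
    summary = {}
    datacenter = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            summary[datacenter] = {}
            continue
        match = NODE_LINE.match(line)
        if match and datacenter is not None:
            code, address = match.groups()
            summary[datacenter].setdefault(code, []).append(address)
    return summary


def down_nodes(summary):
    """Return (datacenter, address) pairs for every node reported as Down."""
    return [
        (dc, addr)
        for dc, codes in summary.items()
        for code, addrs in codes.items()
        if code.startswith("D")
        for addr in addrs
    ]
```

Counting the addresses per datacenter gives a quick cross-check against the number of agents OpsCenter shows.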

Arriving at AHA MOMENT #2

  • The snitch is a lie
  • It was set to Ec2Snitch
  • It should have been Ec2MultiRegionSnitch
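A check like this is easy to script. The sketch below reads the `endpoint_snitch` setting out of cassandra.yaml text and flags the single-region Ec2Snitch when the cluster is known to span regions. The function names and the `spans_regions` flag are hypothetical, and whether a cluster spans regions is something the caller has to supply.

```python
import re


def endpoint_snitch(yaml_text):
    """Return the endpoint_snitch value from cassandra.yaml text, or None if unset."""
    match = re.search(r"^endpoint_snitch:\s*(\S+)", yaml_text, re.MULTILINE)
    return match.group(1) if match else None


def snitch_warning(yaml_text, spans_regions):
    """Return a warning string if the snitch looks wrong for the topology, else None."""
    snitch = endpoint_snitch(yaml_text)
    if snitch is None:
        return "endpoint_snitch is not set"
    name = snitch.rsplit(".", 1)[-1]  # tolerate a fully qualified class name
    if spans_regions and name == "Ec2Snitch":
        return "Ec2Snitch is single-region; a multi-region cluster needs Ec2MultiRegionSnitch"
    return None
```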

Our final solo AHA MOMENT #3

  • Heap size is a lie
  • All the heap settings were commented out
  • Except for 2, which were set to different values
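To make that kind of drift visible at a glance, here is a rough sketch that groups nodes by their effective heap settings. It assumes the heaps in question are the `MAX_HEAP_SIZE` and `HEAP_NEWSIZE` lines in each node's cassandra-env.sh (the slides don't name the file), and that the file contents have already been gathered into a dict. The function names are ours.

```python
import re

# Matches an uncommented assignment such as: MAX_HEAP_SIZE="4G"
HEAP_LINE = re.compile(r'^\s*(MAX_HEAP_SIZE|HEAP_NEWSIZE)="?([^"\s#]+)"?')


def heap_settings(env_text):
    """Return the uncommented heap settings found in cassandra-env.sh text."""
    settings = {}
    for line in env_text.splitlines():
        match = HEAP_LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def heap_report(env_by_node):
    """Group node names by their heap settings.

    The key is a sorted tuple of (name, value) pairs; the empty tuple means
    both lines are commented out on that node.
    """
    groups = {}
    for node, text in env_by_node.items():
        key = tuple(sorted(heap_settings(text).items()))
        groups.setdefault(key, []).append(node)
    return groups
```

More than one group in the report means the nodes don't agree on heap size, which is the situation described above.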