Cassandra Monitoring

This section describes some of the metrics that Meridian collects from a Cassandra cluster. JMX must be enabled on the Cassandra nodes and made accessible from Meridian in order to collect these metrics (see Enabling JMX Authentication and Authorization in the Cassandra documentation).

The data collection process is bound to the agent IP interface whose service name is JMX-Cassandra. The JMXCollector retrieves MBean entities from the Cassandra node.
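
As a rough illustration of what the JMXCollector addresses on the node, the sketch below builds JMX object names for some of the metrics described in this section. It assumes Cassandra's usual layout of a `type` key, optional extra keys, and a `name` key under the `org.apache.cassandra.metrics` domain. The helper `metric_object_name` is hypothetical and is not part of Meridian or Cassandra.

```python
def metric_object_name(metric_type, name, **keys):
    """Build a JMX object name string in the org.apache.cassandra.metrics domain.

    Hypothetical helper for illustration only. Extra keyword arguments
    (for example scope=...) are added as additional key properties.
    """
    parts = [f"type={metric_type}"]
    parts += [f"{key}={value}" for key, value in keys.items()]
    parts.append(f"name={name}")
    return "org.apache.cassandra.metrics:" + ",".join(parts)


if __name__ == "__main__":
    # Metric types and names taken from the tables in this section.
    for metric_type, name in [
        ("Client", "connectedNativeClients"),
        ("Compaction", "PendingTasks"),
        ("Storage", "Load"),
    ]:
        print(metric_object_name(metric_type, name))
```

A JMX browser connected to the node can be used to confirm the exact object names that a given Cassandra version exposes.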

Client connections

Meridian collects the number of active client connections from org.apache.cassandra.metrics.Client:

Name	Description
connectedNativeClients	Metrics for connected native clients.
connectedThriftClients	Metrics for connected thrift clients.

Compacted bytes

Meridian collects the following compaction manager metric from org.apache.cassandra.metrics.Compaction:

Name	Description
BytesCompacted	Number of bytes compacted since node started.
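
Because BytesCompacted counts bytes since the node started, a single reading says little on its own; the change between two readings is usually more informative. The following is a minimal sketch of that calculation. The function name and the way a node restart is handled are assumptions made for illustration.

```python
def compaction_rate(prev_bytes, curr_bytes, interval_seconds):
    """Return bytes compacted per second between two BytesCompacted samples.

    Hypothetical helper. Returns None when the counter went backwards,
    which most likely means the node restarted between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_bytes < prev_bytes:
        return None  # counter was reset, so no meaningful rate
    return (curr_bytes - prev_bytes) / interval_seconds


if __name__ == "__main__":
    # Two samples taken 300 seconds apart (illustrative values).
    print(compaction_rate(1_000_000, 4_000_000, 300))
```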

Compaction tasks

Meridian collects the following compaction manager metrics from org.apache.cassandra.metrics.Compaction:

Name	Description
CompletedTasks	Estimated number of completed compaction tasks.
PendingTasks	Estimated number of pending compaction tasks.

Storage load

Meridian collects the following storage load metric from org.apache.cassandra.metrics.Storage:

Name	Description
Load	Total disk space (in bytes) that this node uses.

Storage exceptions

Meridian collects the following storage exception metric from org.apache.cassandra.metrics.Storage:

Name	Description
Exceptions	Number of unhandled exceptions since start of this Cassandra instance.

Dropped messages

Messages that are processed after their timeout (set per message type) are discarded. The number of dropped messages across the different message queues is a good indication of whether a cluster can handle its load.

Meridian tracks this by collecting metrics from org.apache.cassandra.metrics.DroppedMessage:

Name	Description	Stage
Mutation	If a write message is processed after its timeout (`write_request_timeout_in_ms`), it either sent a failure message to the client, or it met its requested consistency level and will rely on hinted handoff and read repairs to do the mutation if it succeeded.	MutationStage
Counter_Mutation	If a write message is processed after its timeout (`write_request_timeout_in_ms`), it either sent a failure message to the client, or it met its requested consistency level and will rely on hinted handoff and read repairs to do the mutation if it succeeded.	MutationStage
Read_Repair	Times out after `write_request_timeout_in_ms`.	MutationStage
Read	Times out after `read_request_timeout_in_ms`. After this point, an error is returned to the client and no further messages should be read.	ReadStage
Range_Slice	Times out after `range_request_timeout_in_ms`.	ReadStage
Request_Response	Times out after `request_timeout_in_ms`. Indicates that the response was completed and sent back, but not before the timeout.	RequestResponseStage
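
To judge whether a cluster is keeping up with its load, it can help to look at how many messages of each type were dropped since the previous poll rather than at the running totals. The sketch below assumes that each collected value is a cumulative count per message type. The function name and the sample values are hypothetical.

```python
def dropped_since_last_poll(previous, current):
    """Return the per-type increase in dropped messages between two polls.

    Hypothetical helper. Both arguments map a message type (for example
    "Mutation") to its cumulative dropped count. Types with no new drops
    are left out of the result.
    """
    deltas = {}
    for message_type, count in current.items():
        before = previous.get(message_type, 0)
        # A lower count than before suggests a node restart; in that case
        # everything counted since the restart is treated as newly dropped.
        delta = count - before if count >= before else count
        if delta > 0:
            deltas[message_type] = delta
    return deltas


if __name__ == "__main__":
    previous = {"Mutation": 10, "Read": 5}
    current = {"Mutation": 25, "Read": 5, "Range_Slice": 2}
    print(dropped_since_last_poll(previous, current))
```

Any non-empty result means that at least one stage discarded work during the polling interval.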

Thread pools

Apache Cassandra is based on a staged event-driven architecture (SEDA). This separates different operations into stages. Each stage is loosely coupled using a messaging service, and each uses queues and thread pools to group and run its tasks.

The documentation for Cassandra thread pool monitoring originated from the Pythian Guide to Cassandra Thread Pools.

Collected thread pool metrics
Name	Description
ActiveTasks	Tasks that are currently running.
CompletedTasks	Tasks that have finished.
CurrentlyBlockedTasks	Tasks that are blocked due to a full queue.
PendingTasks	Tasks that are queued to run.
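
One way to read these four values together is sketched below: blocked tasks mean that the queue is already full, while a growing number of pending tasks means that the pool is falling behind. The class, the function, and the default threshold of 15 pending tasks are all assumptions chosen for illustration, not values recommended by Meridian or Cassandra.

```python
from dataclasses import dataclass


@dataclass
class ThreadPoolSample:
    """One poll of the thread pool metrics listed above (hypothetical type)."""
    active_tasks: int
    completed_tasks: int
    currently_blocked_tasks: int
    pending_tasks: int


def pool_status(sample, pending_threshold=15):
    """Classify a thread pool sample as "blocked", "backlogged", or "ok".

    The pending_threshold default is an arbitrary illustrative value.
    """
    if sample.currently_blocked_tasks > 0:
        return "blocked"  # the queue is full and tasks cannot be submitted
    if sample.pending_tasks > pending_threshold:
        return "backlogged"  # work is queuing faster than it is run
    return "ok"


if __name__ == "__main__":
    sample = ThreadPoolSample(active_tasks=2, completed_tasks=1200,
                              currently_blocked_tasks=0, pending_tasks=40)
    print(pool_status(sample))
```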

Memtable FlushWriter

The Memtable FlushWriter thread pool sorts and writes memtables to disk. Its metrics are collected from org.apache.cassandra.metrics.ThreadPools.

Most of the time, memtable issues are caused by overrunning disk capability. Sorting can also cause issues; these are usually accompanied by a high load but a small number of actual flushes (as seen in cfstats). Large rows with long column names, or something inserting many large values into a CQL collection, can cause these problems.

If disk capabilities are being overrun, you should either add nodes to reduce the load, or update the node’s configuration.