High memory usage with collectd
collectd itself is intended as a lightweight collecting agent for metrics and events. In larger infrastructures, the data is sent over the network to a central point, where it is stored and processed further.
This introduces a potential issue: what happens if the remote endpoint is not available? The traditional network plugin uses UDP, which is by definition unreliable.
collectd keeps a queue of values to be written by an output plugin such as write_http or amqp1. When metrics are due to be written, collectd iterates over that queue and tries to write the data to the endpoint. If the write succeeds, the data is removed from the queue. That little word "if" also hints that the data may not get removed. The question is: what happens then, and what should be done?
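To make the failure mode concrete, here is a minimal Python sketch of the behaviour just described. It is a simplified model, not collectd's actual code: the names flush and simulate, the number of intervals and the 100 values per interval are all made up for illustration. Values are appended every interval, and a value only leaves the queue when the write succeeds.

```python
from collections import deque


def flush(queue, write):
    """Try to write queued values in order; stop at the first failure."""
    written = 0
    while queue:
        if not write(queue[0]):
            break            # endpoint unavailable: the value stays queued
        queue.popleft()      # write succeeded: remove the value
        written += 1
    return written


def simulate(intervals, values_per_interval, endpoint_up):
    """Return the queue length after each interval (hypothetical model)."""
    queue = deque()
    lengths = []
    for i in range(intervals):
        # New values arrive every interval, whether or not the endpoint is up.
        queue.extend((i, n) for n in range(values_per_interval))
        flush(queue, lambda value: endpoint_up(i))
        lengths.append(len(queue))
    return lengths


# Assumed outage: the endpoint is down during intervals 2 to 5.
lengths = simulate(8, 100, lambda i: not 2 <= i <= 5)
print(lengths)  # [0, 0, 100, 200, 300, 400, 0, 0]
```

While the endpoint is down, the queue grows by one interval's worth of values each time, and nothing bounds it unless a limit is configured.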
There is no easy answer to the question of what to do with data that could not be written. Some people can live with missed metrics, some cannot. The way to address this is to cap the queue at a given length and to drop values once that cap is reached. The parameters are WriteQueueLimitHigh and WriteQueueLimitLow. If they are unset, the queue is not limited and will grow until memory runs out. For predictability, you should set these two values to the same number; otherwise, while the queue length is between the two limits, collectd drops values at random with a probability that grows with the queue length. Finding the right value requires a bit of experimentation. If values are dropped, this shows up in the log file.
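The effect of the two limits can be sketched in Python. This is a model based on the description in the collectd.conf manual page, not collectd's source code; drop_probability and enqueue are hypothetical names, and the linear ramp between the limits is an approximation. Below the low limit nothing is dropped, at or above the high limit every new value is dropped, and in between the drop probability rises with the queue length.

```python
import random


def drop_probability(queue_length, limit_low, limit_high):
    """Chance that a newly arriving value is dropped (illustrative model)."""
    if queue_length < limit_low:
        return 0.0
    if queue_length >= limit_high:
        return 1.0
    # Only reached when limit_low < limit_high: linear ramp between the limits.
    return (queue_length - limit_low) / (limit_high - limit_low)


def enqueue(queue, value, limit_low, limit_high, rng=random):
    """Append value unless the drop policy rejects it; return True if queued."""
    if rng.random() < drop_probability(len(queue), limit_low, limit_high):
        return False
    queue.append(value)
    return True


# Assumed scenario: the endpoint is down, so nothing ever leaves the queue.
# With both limits at 100 there is no random zone and the cap is exact.
queue = []
dropped = 0
for value in range(250):
    if not enqueue(queue, value, limit_low=100, limit_high=100):
        dropped += 1

print(len(queue), dropped)  # 100 150
```

With equal limits the outcome is deterministic: the queue fills up to the limit and every further value is dropped. With different limits, which values are dropped depends on chance, which makes memory usage and data loss harder to reason about.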
When collectd is configured as part of Red Hat OpenStack Platform, the following config snippet can be used:
parameter_defaults:
  ExtraConfig:
    collectd::write_queue_limit_high: 100
    collectd::write_queue_limit_low: 100
If the amqp1 plugin is used for sending out data, another parameter limits its queue length explicitly: SendQueueLimit. It serves the same purpose, but its value can differ from the global WriteQueueLimitHigh and WriteQueueLimitLow.
parameter_defaults:
  ExtraConfig:
    collectd::plugin::amqp1::send_queue_limit: 7500
In almost all cases, high memory usage by collectd can be traced back to a problem with a write endpoint: it is unavailable, drops data occasionally, and so on.