Debugging OpenStack Metrics
Overview
We'll go through the following steps:
- check if metrics are getting in
- if not, check if ceilometer is running
- check if gnocchi is running
- query gnocchi for metrics
- check alarming
- further debugging
Check that metrics are getting in
First, pick a server to inspect using openstack server list. Example:
$ openstack server list -c ID -c Name -c Status
+--------------------------------------+-------+--------+
| ID | Name | Status |
+--------------------------------------+-------+--------+
| 7e087c31-7e9c-47f7-a4c4-ebcc20034faa | foo-4 | ACTIVE |
| 977fd250-75d4-4da3-a37c-bc7649047151 | foo-2 | ACTIVE |
| a1182f44-7163-4ad9-89ed-36611f75bac7 | foo-5 | ACTIVE |
| a91951b3-ff4e-46a0-a5fd-20064f02afc9 | foo-3 | ACTIVE |
| f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f | foo-1 | ACTIVE |
+--------------------------------------+-------+--------+
We'll use f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f, the UUID of the server foo-1. For convenience, the server UUID and the resource ID used in openstack metric resource are the same.
$ openstack metric resource show f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f
+-----------------------+-------------------------------------------------------------------+
| Field | Value |
+-----------------------+-------------------------------------------------------------------+
| created_by_project_id | 6a180caa64894f04b1d04aeed1cb920e |
| created_by_user_id | 39d9e30374a74fe8b58dee9e1dcd7382 |
| creator | 39d9e30374a74fe8b58dee9e1dcd7382:6a180caa64894f04b1d04aeed1cb920e |
| ended_at | None |
| id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| metrics | cpu: fe84c113-f98d-42ee-84fd-bdf360842bdb |
| | disk.ephemeral.size: ad79f268-5f56-4ff8-8ece-d1f170621217 |
| | disk.root.size: 6e021f8c-ead0-46e4-bd26-59131318e6a2 |
| | memory.usage: b768ec46-5e49-4d9a-b00d-004f610c152d |
| | memory: 1a4e720a-2151-4265-96cf-4daf633611b2 |
| | vcpus: 68654bc0-8275-4690-9433-27fe4a3aef9e |
| original_resource_id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| project_id | 8d077dbea6034e5aa45c0146d1feac5f |
| revision_end | None |
| revision_start | 2021-11-09T10:00:46.241527+00:00 |
| started_at | 2021-11-09T09:29:12.842149+00:00 |
| type | instance |
| user_id | 65bfeb6cc8ec4df4a3d53550f7e99a5a |
+-----------------------+-------------------------------------------------------------------+
This list shows the metrics associated with the instance. If the metrics you expect show up here, ingestion works and you are done. If not, continue with the following sections.
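A quick cross-check of which instance resources Gnocchi knows about at all can also help; a minimal sketch (the column selection is optional):
$ openstack metric resource list --type instance -c id -c type
If the server's UUID does not appear in this list, the metrics never made it into Gnocchi, and the next sections help locate where they get lost.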
Checking if ceilometer is running
On controller nodes, there should be ceilometer_agent_central and ceilometer_agent_notification running:
$ ssh controller-0 -l root
$ podman ps --format "{{.Names}} {{.Status}}" | grep ceilometer
ceilometer_agent_central Up 2 hours ago
ceilometer_agent_notification Up 2 hours ago
On compute nodes, there should be a ceilometer_agent_compute container running:
$ podman ps --format "{{.Names}} {{.Status}}" | grep ceilometer
ceilometer_agent_compute Up 2 hours ago
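If one of these containers is not running, restarting it is a sensible first step. A sketch, assuming a TripleO release where the containers are managed by systemd units named tripleo_<container_name> (this naming is an assumption and varies between releases):
$ systemctl status tripleo_ceilometer_agent_compute    # verify the unit exists first; the name is an assumption
$ systemctl restart tripleo_ceilometer_agent_compute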
Ceilometer sends metrics to the remotes defined in /var/lib/config-data/puppet-generated/ceilometer/etc/ceilometer/pipeline.yaml, which may look similar to the following:
---
sources:
    - name: meter_source
      meters:
          - "*"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      publishers:
          - gnocchi://?filter_project=service&archive_policy=ceilometer-high-rate
          - notifier://172.17.1.40:5666/?driver=amqp&topic=metering
In this case, data is sent to both STF and Gnocchi. The next step is to check whether any errors occur. On controllers and computes, the ceilometer logs are found in /var/log/containers/ceilometer/. The agent-notification.log shows logs from publishing data, as well as errors if sending out metrics or logs fails for some reason. If there are errors in this log file, it is likely that metrics are not being delivered to the remote. For example:
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 136, in _send_notification
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging retry=retry)
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_amqp1.py", line 295, in wrap
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging return func(self, *args, **kws)
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_amqp1.py", line 397, in send_notification
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging raise rc
2021-11-16 07:01:07.063 16 ERROR oslo_messaging.notify.messaging oslo_messaging.exceptions.MessageDeliveryFailure: Notify message sent to <Target topic=event.sample> failed: timed out
In this case, sending messages to the STF instance fails. The following example shows the Gnocchi API not responding or not being accessible:
2021-11-16 10:38:07.707 16 ERROR ceilometer.publisher.gnocchi [-] <html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
(HTTP 503): gnocchiclient.exceptions.ClientException: <html><body><h1>503 Service Unavailable</h1>
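To quickly scan the ceilometer logs for errors like the ones above, a plain grep is often enough; a minimal sketch, to be adjusted to the deployment:
$ grep -riE 'error|critical' /var/log/containers/ceilometer/ | tail -n 20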
For more Gnocchi debugging, see the Gnocchi section below.
Gnocchi
Gnocchi sits on the controller nodes and consists of three separate containers: gnocchi_metricd, gnocchi_statsd, and gnocchi_api. The latter handles interaction with the outside world, such as ingesting metrics or returning measurements. gnocchi_metricd re-aggregates metrics, e.g. downsampling them to lower granularities. Gnocchi log files are found under /var/log/containers/gnocchi. The Gnocchi API is hooked into httpd, so its log files are stored under /var/log/containers/httpd/gnocchi-api/; the files there are gnocchi_wsgi_access.log and gnocchi_wsgi_error.log.
In the case above (see the ceilometer section), where ceilometer could not send metrics to Gnocchi, corresponding log output would also show up on the Gnocchi API side.
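To check the Gnocchi side, the same pattern as for ceilometer applies; a short sketch using the container and log file names described above:
$ ssh controller-0 -l root
$ podman ps --format "{{.Names}} {{.Status}}" | grep gnocchi
$ tail -n 20 /var/log/containers/httpd/gnocchi-api/gnocchi_wsgi_error.log
The processing backlog is also worth a look: openstack metric status reports the number of measures still waiting to be processed, and a continuously growing backlog points at gnocchi_metricd problems.
$ openstack metric status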
Retrieving metrics from Gnocchi
For a starter, we need a resource to query. We'll again use the server foo-1 (f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f) from the server list at the beginning.
To show which metrics are stored for the VM foo-1, one would use the following command:
openstack metric resource show f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f --max-width 75
+-----------------------+-------------------------------------------------+
| Field | Value |
+-----------------------+-------------------------------------------------+
| created_by_project_id | 6a180caa64894f04b1d04aeed1cb920e |
| created_by_user_id | 39d9e30374a74fe8b58dee9e1dcd7382 |
| creator | 39d9e30374a74fe8b58dee9e1dcd7382:6a180caa64894f |
| | 04b1d04aeed1cb920e |
| ended_at | None |
| id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| metrics | cpu: fe84c113-f98d-42ee-84fd-bdf360842bdb |
| | disk.ephemeral.size: |
| | ad79f268-5f56-4ff8-8ece-d1f170621217 |
| | disk.root.size: |
| | 6e021f8c-ead0-46e4-bd26-59131318e6a2 |
| | memory.usage: |
| | b768ec46-5e49-4d9a-b00d-004f610c152d |
| | memory: 1a4e720a-2151-4265-96cf-4daf633611b2 |
| | vcpus: 68654bc0-8275-4690-9433-27fe4a3aef9e |
| original_resource_id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| project_id | 8d077dbea6034e5aa45c0146d1feac5f |
| revision_end | None |
| revision_start | 2021-11-09T10:00:46.241527+00:00 |
| started_at | 2021-11-09T09:29:12.842149+00:00 |
| type | instance |
| user_id | 65bfeb6cc8ec4df4a3d53550f7e99a5a |
+-----------------------+-------------------------------------------------+
To view the memory usage between Nov 18 2021 17:00 UTC and 17:05 UTC, one would issue this command, using the memory.usage metric ID from above:
openstack metric measures show --start 2021-11-18T17:00:00 \
                               --stop 2021-11-18T17:05:00 \
                               --aggregation mean \
                               b768ec46-5e49-4d9a-b00d-004f610c152d
+---------------------------+-------------+-------------+
| timestamp | granularity | value |
+---------------------------+-------------+-------------+
| 2021-11-18T17:00:00+00:00 | 3600.0 | 28.87890625 |
| 2021-11-18T17:00:00+00:00 | 60.0 | 28.87890625 |
| 2021-11-18T17:01:00+00:00 | 60.0 | 28.87890625 |
| 2021-11-18T17:02:00+00:00 | 60.0 | 28.87890625 |
| 2021-11-18T17:03:00+00:00 | 60.0 | 28.87890625 |
| 2021-11-18T17:04:00+00:00 | 60.0 | 28.87890625 |
| 2021-11-18T17:00:14+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:00:44+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:01:14+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:01:44+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:02:14+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:02:44+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:03:14+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:03:44+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:04:14+00:00 | 1.0 | 28.87890625 |
| 2021-11-18T17:04:44+00:00 | 1.0 | 28.87890625 |
+---------------------------+-------------+-------------+
This shows that the data is available at granularities of 3600, 60, and 1 second. The memory usage does not change over time, which is why the values stay the same. Note that if you ask for values with a granularity of 300, the query fails, because no such granularity is stored:
$ openstack metric measures show --start 2021-11-18T17:00:00 \
                                 --stop 2021-11-18T17:05:00 \
                                 --aggregation mean \
                                 --granularity 300 \
                                 b768ec46-5e49-4d9a-b00d-004f610c152d
Aggregation method 'mean' at granularity '300.0' for metric b768ec46-5e49-4d9a-b00d-004f610c152d does not exist (HTTP 404)
More details about the metric itself can be listed with:
openstack metric show --resource-id f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f \
memory.usage \
--max-width 75
+--------------------------------+----------------------------------------+
| Field | Value |
+--------------------------------+----------------------------------------+
| archive_policy/name | ceilometer-high-rate |
| creator | 39d9e30374a74fe8b58dee9e1dcd7382:6a180 |
| | caa64894f04b1d04aeed1cb920e |
| id | b768ec46-5e49-4d9a-b00d-004f610c152d |
| name | memory.usage |
| resource/created_by_project_id | 6a180caa64894f04b1d04aeed1cb920e |
| resource/created_by_user_id | 39d9e30374a74fe8b58dee9e1dcd7382 |
| resource/creator | 39d9e30374a74fe8b58dee9e1dcd7382:6a180 |
| | caa64894f04b1d04aeed1cb920e |
| resource/ended_at | None |
| resource/id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| resource/original_resource_id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| resource/project_id | 8d077dbea6034e5aa45c0146d1feac5f |
| resource/revision_end | None |
| resource/revision_start | 2021-11-09T10:00:46.241527+00:00 |
| resource/started_at | 2021-11-09T09:29:12.842149+00:00 |
| resource/type | instance |
| resource/user_id | 65bfeb6cc8ec4df4a3d53550f7e99a5a |
| unit | MB |
+--------------------------------+----------------------------------------+
This shows that the archive policy used in this case is ceilometer-high-rate. Its definition can be displayed with:
openstack metric archive-policy show ceilometer-high-rate --max-width 75
+---------------------+---------------------------------------------------+
| Field | Value |
+---------------------+---------------------------------------------------+
| aggregation_methods | mean, rate:mean |
| back_window | 0 |
| definition | - timespan: 1:00:00, granularity: 0:00:01, |
| | points: 3600 |
| | - timespan: 1 day, 0:00:00, granularity: 0:01:00, |
| | points: 1440 |
| | - timespan: 365 days, 0:00:00, granularity: |
| | 1:00:00, points: 8760 |
| name | ceilometer-high-rate |
+---------------------+---------------------------------------------------+
That means, in this case, the only aggregation methods available for querying the metrics are mean and rate:mean. Other archive policies may offer additional methods such as min or max.
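To see which archive policies exist on a deployment, together with their aggregation methods and definitions, they can simply be listed:
$ openstack metric archive-policy list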
Alarming
Alarms can be retrieved by issuing
$ openstack alarm list
To create an alarm based on, for example, disk.ephemeral.size, one would use something like:
openstack alarm create --alarm-action 'log://' \
--ok-action 'log://' \
--comparison-operator ge \
--evaluation-periods 1 \
--granularity 60 \
--aggregation-method mean \
--metric disk.ephemeral.size \
--resource-id f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f \
--name ephemeral \
-t gnocchi_resources_threshold \
--resource-type instance \
--threshold 1
+---------------------------+----------------------------------------+
| Field | Value |
+---------------------------+----------------------------------------+
| aggregation_method | mean |
| alarm_actions | ['log:'] |
| alarm_id | 994a1710-98e8-495f-89b5-f14349575c96 |
| comparison_operator | ge |
| description | gnocchi_resources_threshold alarm rule |
| enabled | True |
| evaluation_periods | 1 |
| granularity | 60 |
| insufficient_data_actions | [] |
| metric | disk.ephemeral.size |
| name | ephemeral |
| ok_actions | ['log:'] |
| project_id | 8d077dbea6034e5aa45c0146d1feac5f |
| repeat_actions | False |
| resource_id | f0a62c20-2304-4c8c-aaf5-f5bc9d385b5f |
| resource_type | instance |
| severity | low |
| state | insufficient data |
| state_reason | Not evaluated yet |
| state_timestamp | 2021-11-22T10:16:15.250720 |
| threshold | 1.0 |
| time_constraints | [] |
| timestamp | 2021-11-22T10:16:15.250720 |
| type | gnocchi_resources_threshold |
| user_id | 65bfeb6cc8ec4df4a3d53550f7e99a5a |
+---------------------------+----------------------------------------+
The state insufficient data indicates that the data gathered or stored is not sufficient to compare against. A state reason is also given, in this case Not evaluated yet, which explains the state. Another common reason is No datapoint for granularity 60.
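To investigate further why an alarm is in a given state, the alarm details and its state history can be queried with the alarm ID from above; a brief sketch:
$ openstack alarm show 994a1710-98e8-495f-89b5-f14349575c96
$ openstack alarm-history show 994a1710-98e8-495f-89b5-f14349575c96
The history lists the state transitions together with their reasons, which helps to distinguish a missing metric from one that merely lacks datapoints at the requested granularity.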
Further debugging
On OpenStack installations deployed via TripleO (aka OSP Director), the log files are located on the individual nodes under /var/log/containers/<service_name>/. The config files for the services are stored under /var/lib/config-data/puppet-generated/<service_name> and are mounted into the containers.
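Because the config files are bind-mounted, one can verify what a container actually sees, and restart the service after a config change; a sketch, reusing the container name from the ceilometer section (the tripleo_ systemd unit naming is an assumption that depends on the release):
$ podman exec ceilometer_agent_notification cat /etc/ceilometer/pipeline.yaml
$ systemctl restart tripleo_ceilometer_agent_notification    # unit name is an assumption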