Ceilometer & Aodh Deep Dive: The OpenStack Monitoring and Alerting Stack

Evolution of the Monitoring Stack

OpenStack's monitoring stack has been restructured several times:

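Early days (Havana to Mitaka):
  Ceilometer = collection + storage + alarming (monolithic, poor performance)

After the split (Newton+):
  Ceilometer → collection only (Polling + Notification)
  Gnocchi    → time-series storage (replaces MongoDB)
  Aodh       → alarm engine
  Panko      → event storage (deprecated)

Modern approach (recommended for production):
  Ceilometer → collection
  Prometheus → storage + alerting (openstack-exporter)
  Grafana    → visualization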

Ceilometer Architecture

OpenStack services (Nova/Neutron/Cinder...)
    │ oslo.messaging notifications
    ▼
ceilometer-notification-agent (event-driven collection)
    │
    ▼
ceilometer-polling-agent (polling-based collection)
    │ periodically calls each service's API
    ▼
Data pipeline (Pipeline)
    ├── Transformer
    └── Publisher
        ├── Gnocchi (time-series storage)
        ├── Kafka
        └── UDP/HTTP
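
To make the transformer and publisher stages concrete, here is a minimal Python sketch of a pipeline. It is an illustration only, not Ceilometer's implementation: the `Sample` class, the `run_pipeline` function, and the `bytes_to_megabytes` transformer are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    """A single measurement, loosely modelled on a Ceilometer sample."""
    name: str
    resource_id: str
    volume: float
    unit: str


# Hypothetical type aliases for this sketch (not Ceilometer classes).
Transformer = Callable[[Sample], Sample]
Publisher = Callable[[Sample], None]


def run_pipeline(samples: Iterable[Sample],
                 transformers: List[Transformer],
                 publishers: List[Publisher]) -> None:
    """Apply every transformer in order, then hand the result to every publisher."""
    for sample in samples:
        for transform in transformers:
            sample = transform(sample)
        for publish in publishers:
            publish(sample)


def bytes_to_megabytes(sample: Sample) -> Sample:
    """Example transformer: rescale byte counters to megabytes."""
    if sample.unit != "B":
        return sample
    return Sample(sample.name, sample.resource_id, sample.volume / 1_000_000, "MB")


# A list stands in for a real publisher target such as Gnocchi or Kafka.
published: List[Sample] = []
run_pipeline(
    [Sample("disk.read.bytes", "vm-1", 5_000_000, "B")],
    transformers=[bytes_to_megabytes],
    publishers=[published.append],
)
print(published[0])
```

Modelling publishers as plain callables keeps the fan-out explicit: one sample can go to several backends without the collection code knowing about any of them.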

Two Collection Modes

1. Notification (event-driven)

# ceilometer/dispatcher/gnocchi.py
# Listens on the oslo.messaging notification bus.
# Nova emits a notification whenever a VM is created or deleted;
# Ceilometer receives and records it.

# Typical notification events
compute.instance.create.end
compute.instance.delete.end
compute.instance.resize.end
network.create.end
volume.create.end
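
The event-driven model boils down to matching each notification's `event_type` against the event types a consumer cares about. The following Python sketch shows that filtering step with shell-style wildcards. The `SUBSCRIBED_PATTERNS` list and both helper functions are assumptions made for illustration; the only thing taken from oslo.messaging is that a notification carries an `event_type` and a `payload`.

```python
from collections import Counter
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Hypothetical subscription list for this sketch; "*" matches any action.
SUBSCRIBED_PATTERNS: List[str] = [
    "compute.instance.*.end",
    "network.create.end",
    "volume.create.end",
]


def is_subscribed(event_type: str,
                  patterns: Iterable[str] = SUBSCRIBED_PATTERNS) -> bool:
    """Return True if the event type matches at least one pattern."""
    return any(fnmatchcase(event_type, pattern) for pattern in patterns)


def count_events(notifications: Iterable[Dict]) -> Counter:
    """Count the subscribed notifications per event type, ignoring the rest."""
    counts: Counter = Counter()
    for notification in notifications:
        event_type = notification.get("event_type", "")
        if is_subscribed(event_type):
            counts[event_type] += 1
    return counts


notifications = [
    {"event_type": "compute.instance.create.start", "payload": {"instance_id": "vm-1"}},
    {"event_type": "compute.instance.create.end", "payload": {"instance_id": "vm-1"}},
    {"event_type": "compute.instance.delete.end", "payload": {"instance_id": "vm-1"}},
    {"event_type": "image.upload", "payload": {"id": "img-1"}},
]
print(count_events(notifications))
```

Only the two `.end` instance events are counted here; the `.start` event and the image upload fall outside the patterns and are dropped.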

2. Polling

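# ceilometer/polling/manager.py
# Periodically calls each service's API to fetch metrics.

# Typical polled metrics
cpu                     # CPU time used (cumulative, nanoseconds)
cpu_util                # CPU utilization (%); deprecated since Stein in favor of rate aggregation in Gnocchi
memory.usage            # Memory in use (MB)
disk.read.bytes         # Bytes read from disk
network.incoming.bytes  # Inbound network traffic (bytes)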
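
Because `cpu` is a cumulative counter in nanoseconds, a utilization percentage has to be derived from two consecutive samples. The sketch below works through that arithmetic; the `CpuSample` class and `cpu_util_percent` function are hypothetical names for this example, and dividing by the vCPU count is an assumption about how the percentage should be normalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    """One polled value of the cumulative `cpu` meter."""
    timestamp_s: float   # wall-clock time of the poll, in seconds
    cpu_time_ns: int     # cumulative CPU time consumed, in nanoseconds


def cpu_util_percent(prev: CpuSample, curr: CpuSample, vcpus: int) -> float:
    """Derive average CPU utilization (%) between two cumulative samples."""
    elapsed_ns = (curr.timestamp_s - prev.timestamp_s) * 1e9
    if elapsed_ns <= 0:
        raise ValueError("samples must be in increasing time order")
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    used_ns = curr.cpu_time_ns - prev.cpu_time_ns
    # Normalize by the number of vCPUs so a fully busy guest reads 100%.
    return 100.0 * used_ns / (elapsed_ns * vcpus)


# 30 s of CPU time consumed over a 60 s interval on a 2-vCPU instance.
previous = CpuSample(timestamp_s=0.0, cpu_time_ns=0)
current = CpuSample(timestamp_s=60.0, cpu_time_ns=30_000_000_000)
print(cpu_util_percent(previous, current, vcpus=2))
```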

Gnocchi: Time-Series Storage

Gnocchi is a time-series database that was designed for OpenStack:

gnocchi-api (REST API)
│
├── Resource (e.g. a VM)
│   └── Metric (e.g. cpu_util)
│       └── Measure (timestamp + value)
│
└── Archive Policy
    ├── 1-minute granularity, kept for 7 days
    ├── 1-hour granularity, kept for 30 days
    └── 1-day granularity, kept for 1 year

Archive Policies

# Create an archive policy
openstack metric archive-policy create high-res \
  --definition "granularity:60s,timespan:7d" \
  --definition "granularity:3600s,timespan:30d" \
  --definition "granularity:86400s,timespan:365d"

# Query measures for a metric
openstack metric measures show <metric-id> \
  --start 2026-04-01T00:00:00 \
  --stop 2026-04-14T00:00:00 \
  --granularity 3600
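
Each archive policy definition implies a fixed number of stored points per metric: the timespan divided by the granularity. The Python sketch below computes that for the three definitions above. The parser is a simplified assumption that only understands the `granularity:<n><unit>,timespan:<n><unit>` form with `s`, `m`, `h`, and `d` units; it is not Gnocchi's own parser.

```python
import re
from typing import Dict, List

# Seconds per unit suffix accepted by this simplified parser.
UNIT_SECONDS: Dict[str, int] = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert a string such as '60s' or '7d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def points_per_metric(definition: str) -> int:
    """Number of points kept for one definition: timespan / granularity."""
    fields = dict(item.split(":", 1) for item in definition.split(","))
    return parse_duration(fields["timespan"]) // parse_duration(fields["granularity"])


definitions: List[str] = [
    "granularity:60s,timespan:7d",
    "granularity:3600s,timespan:30d",
    "granularity:86400s,timespan:365d",
]
for definition in definitions:
    print(definition, "->", points_per_metric(definition), "points")
print("total:", sum(points_per_metric(d) for d in definitions))
```

For the `high-res` policy this works out to 10080, 720, and 365 points respectively, so the one-minute tier dominates the storage cost of every metric that uses the policy.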

Aodh: The Alarm Engine

Aodh triggers alarms based on the metric data stored in Gnocchi:

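aodh-evaluator (periodically evaluates alarm rules)
│
├── Query metric data from Gnocchi
├── Compare against the threshold
└── Trigger alarm actions
    ├── HTTP callback (Webhook)
    ├── Heat autoscaling signal
    └── Zaqar message queue notification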
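
The evaluation loop above can be sketched in a few lines of Python. This is a simplified model and not Aodh's code: the `evaluate_threshold` function is a hypothetical name, and the rule that a mixed window keeps the current state is an assumption about the evaluator's behavior. The state names `ok`, `alarm`, and `insufficient data` are the ones Aodh uses.

```python
import operator
from typing import Callable, Dict, Sequence

COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}


def evaluate_threshold(values: Sequence[float],
                       threshold: float,
                       comparison: str = "gt",
                       evaluation_periods: int = 1,
                       current_state: str = "ok") -> str:
    """Return the next alarm state for a window of aggregated values."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(values) < evaluation_periods:
        return "insufficient data"
    compare = COMPARATORS[comparison]
    recent = values[-evaluation_periods:]
    breaches = [compare(value, threshold) for value in recent]
    if all(breaches):
        return "alarm"          # every period crossed the threshold
    if not any(breaches):
        return "ok"             # no period crossed the threshold
    return current_state        # mixed window: assumed to keep the state


# Per-minute mean cpu_util values, oldest first.
print(evaluate_threshold([70, 85, 90, 95], threshold=80,
                         comparison="gt", evaluation_periods=3))
```

Requiring every one of the last `evaluation_periods` values to breach is what keeps a single spike from firing the alarm.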

Alarm Types

# 1. Threshold alarm (based on a Gnocchi metric)
openstack alarm create \
  --type gnocchi_resources_threshold \
  --name cpu-high-alarm \
  --metric cpu_util \
  --threshold 80 \
  --comparison-operator gt \
  --aggregation-method mean \
  --evaluation-periods 3 \
  --granularity 60 \
  --resource-type instance \
  --resource-id <instance-uuid> \
  --alarm-action http://webhook.example.com/scale-up

# 2. Composite alarm (AND/OR combination)
openstack alarm create \
  --type composite \
  --name complex-alarm \
  --composite-rule '{
    "or": [
      {"threshold": 90, "metric": "cpu_util", ...},
      {"threshold": 95, "metric": "memory.usage", ...}
    ]
  }'
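
A composite rule is a tree whose inner nodes are `and`/`or` lists and whose leaves are individual threshold rules. The recursive Python sketch below shows how such a tree could be evaluated. It is illustrative only: `evaluate_composite` and `leaf_breached` are hypothetical helpers, the leaves are reduced to a metric name and a threshold, and the latest values are supplied from a plain dict rather than queried from Gnocchi.

```python
from typing import Any, Callable, Dict

Rule = Dict[str, Any]


def evaluate_composite(rule: Rule, leaf_fires: Callable[[Rule], bool]) -> bool:
    """Recursively evaluate an and/or rule tree; leaves are judged by `leaf_fires`."""
    if "and" in rule:
        return all(evaluate_composite(sub, leaf_fires) for sub in rule["and"])
    if "or" in rule:
        return any(evaluate_composite(sub, leaf_fires) for sub in rule["or"])
    return leaf_fires(rule)


# Latest aggregated values per metric; made-up numbers for the example.
latest_values = {"cpu_util": 92.0, "memory.usage": 50.0}


def leaf_breached(rule: Rule) -> bool:
    """A leaf fires when the metric's latest value exceeds its threshold."""
    return latest_values[rule["metric"]] > rule["threshold"]


leaves = [
    {"threshold": 90, "metric": "cpu_util"},
    {"threshold": 95, "metric": "memory.usage"},
]
print(evaluate_composite({"or": leaves}, leaf_breached))   # either condition
print(evaluate_composite({"and": leaves}, leaf_breached))  # both conditions
```

With these values only the CPU leaf breaches, so the `or` rule fires while the `and` rule does not.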

Integrating with Prometheus (Modern Approach)

Production environments increasingly use Prometheus in place of Gnocchi/Aodh:

openstack-exporter

# docker-compose.yml
services:
  openstack-exporter:
    image: ghcr.io/openstack-exporter/openstack-exporter:latest
    volumes:
      - ./clouds.yaml:/etc/openstack/clouds.yaml
    ports:
      - "9180:9180"
    command: --cloud mycloud

# clouds.yaml
clouds:
  mycloud:
    auth:
      auth_url: http://keystone:5000/v3
      username: admin
      password: secret
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

Key Metrics

# Nova metrics
openstack_nova_agent_state{hostname, service, adminState}
openstack_nova_server_status{id, name, status, tenant_id}
openstack_nova_vcpus_available
openstack_nova_vcpus_used
openstack_nova_memory_available_bytes
openstack_nova_memory_used_bytes

# Neutron metrics
openstack_neutron_agent_state{hostname, service}
openstack_neutron_network_count{tenant_id}
openstack_neutron_floating_ips{tenant_id, status}

# Cinder metrics
openstack_cinder_agent_state{hostname, service}
openstack_cinder_volume_status{id, status, size}
openstack_cinder_pool_capacity_gb_free{name, backend}

Prometheus Alerting Rules

# openstack-alerts.yaml
groups:
  - name: openstack.nova
    rules:
      - alert: NovaComputeServiceDown
        expr: openstack_nova_agent_state{service="nova-compute"} == 0
        for: 2m
        labels:
          severity: critical
          component: nova
        annotations:
          summary: "Nova Compute service is down"
          description: "nova-compute on node {{ $labels.hostname }} has been down for more than 2 minutes"

      - alert: NovaHighVCPUUsage
        expr: |
          openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nova vCPU usage is above 90%"

  - name: openstack.neutron
    rules:
      - alert: NeutronAgentDown
        expr: openstack_neutron_agent_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Neutron agent down: {{ $labels.service }} on {{ $labels.hostname }}"

  - name: openstack.cinder
    rules:
      - alert: CinderPoolLowCapacity
        expr: |
          openstack_cinder_pool_capacity_gb_free /
          (openstack_cinder_pool_capacity_gb_free + openstack_cinder_pool_capacity_gb_used) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cinder pool {{ $labels.name }} has less than 10% capacity remaining"
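
The `for:` clause in these rules means an alert only fires once its expression has been continuously true for the given duration; until then it is pending. The Python sketch below models that state machine for a single alert. The `alert_states` function is a hypothetical helper written for this illustration, not part of Prometheus; the state names `inactive`, `pending`, and `firing` follow Prometheus terminology.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(observations: Iterable[Tuple[float, bool]],
                 for_seconds: float) -> List[str]:
    """Map (timestamp, condition_is_true) evaluations to alert states."""
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition in observations:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# One evaluation per minute against a rule with `for: 2m`.
evaluations = [(0, True), (60, True), (120, True), (180, False), (240, True)]
print(alert_states(evaluations, for_seconds=120))
```

The drop at 180 s resets the timer, so the condition at 240 s starts over as pending instead of firing immediately.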

Grafana Dashboard

// Example configuration for the key panels
{
  "title": "OpenStack Cluster Overview",
  "panels": [
    {
      "title": "vCPU Utilization",
      "type": "gauge",
      "targets": [{
        "expr": "openstack_nova_vcpus_used / openstack_nova_vcpus_available * 100"
      }],
      "thresholds": [{"value": 80, "color": "yellow"}, {"value": 90, "color": "red"}]
    },
    {
      "title": "Service Health",
      "type": "stat",
      "targets": [{
        "expr": "count(openstack_nova_agent_state == 0) by (service)"
      }]
    }
  ]
}

Recommendations for a Production Monitoring Stack

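Collection layer:
  openstack-exporter → Prometheus (15s scrape interval)
  node-exporter → Prometheus (host metrics)
  libvirt-exporter → Prometheus (KVM guest metrics)

Storage layer:
  Prometheus (short term, 15 days)
  Thanos / VictoriaMetrics (long term, 1 year+)

Alerting layer:
  Alertmanager (deduplication, grouping, routing)
    → DingTalk / WeCom / PagerDuty

Visualization layer:
  Grafana (Dashboard)
    → OpenStack cluster overview
    → Detailed panels for each component
    → Capacity planning trend charts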