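Ceilometer & Aodh Deep Dive: The OpenStack Monitoring and Alerting Stack
Evolution of the Monitoring Stack
The OpenStack monitoring stack has been restructured several times: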
    Early (Havana ~ Mitaka):
      Ceilometer = collection + storage + alarming (monolithic, poor performance)

    After the split (Newton+):
      Ceilometer → collection only (Polling + Notification)
      Gnocchi → time series storage (replaces MongoDB)
      Aodh → alarm engine
      Panko → event storage (deprecated)

    Modern setup (recommended for production):
      Ceilometer → collection
      Prometheus → storage + alerting (openstack-exporter)
      Grafana → visualization
Ceilometer Architecture
    OpenStack services (Nova/Neutron/Cinder...)
        │ oslo.messaging notifications
        ▼
    ceilometer-notification-agent (event-driven collection)
        │
        ▼
    ceilometer-polling-agent (polling collection)
        │ periodically calls each service's API
        ▼
    Data pipeline (Pipeline)
        ├── Transformer
        └── Publisher
              ├── Gnocchi (time series storage)
              ├── Kafka
              └── UDP/HTTP
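To make the source-to-publisher routing in the diagram concrete, here is a minimal Python sketch. It is not Ceilometer's implementation: the `PIPELINE` dictionary, the source and sink names, and the publisher URLs are hypothetical and only mimic the general shape of a pipeline definition (sources select meters by name pattern, sinks list publishers).

```python
from fnmatch import fnmatch

# Hypothetical pipeline definition; names and URLs are illustrative only.
PIPELINE = {
    "sources": [
        {"name": "cpu_source", "meters": ["cpu", "cpu.*"],
         "sinks": ["gnocchi_sink"]},
        {"name": "net_source", "meters": ["network.*"],
         "sinks": ["gnocchi_sink", "kafka_sink"]},
    ],
    "sinks": {
        "gnocchi_sink": ["gnocchi://"],
        "kafka_sink": ["kafka://broker:9092?topic=metering"],
    },
}


def publishers_for(meter: str, pipeline: dict = PIPELINE) -> list:
    """Return the publisher URLs that a sample with this meter name reaches."""
    targets = []
    for source in pipeline["sources"]:
        # A source handles the sample if any of its meter patterns matches.
        if any(fnmatch(meter, pattern) for pattern in source["meters"]):
            for sink_name in source["sinks"]:
                for url in pipeline["sinks"][sink_name]:
                    if url not in targets:  # avoid publishing twice
                        targets.append(url)
    return targets


if __name__ == "__main__":
    for name in ("cpu", "network.incoming.bytes", "disk.read.bytes"):
        print(name, "->", publishers_for(name))
```

A meter that matches no source (here `disk.read.bytes`) is simply not published anywhere, which is the behavior this sketch assumes for unmatched samples.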
Two Collection Methods
1. Notification (event-driven)
    compute.instance.create.end
    compute.instance.delete.end
    compute.instance.resize.end
    network.create.end
    volume.create.end
2. Polling
    cpu
    cpu_util
    memory.usage
    disk.read.bytes
    network.incoming.bytes
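The `cpu` meter is a cumulative counter of CPU time in nanoseconds, while `cpu_util` is a percentage derived from its rate of change. The following Python sketch shows that derivation under stated assumptions: the function name and parameters are hypothetical, and the formula (CPU time consumed divided by wall-clock time times the number of vCPUs) is the usual rate-of-change calculation rather than a copy of Ceilometer's code.

```python
def cpu_util_percent(prev_cpu_ns: int, cur_cpu_ns: int,
                     elapsed_seconds: float, vcpus: int = 1) -> float:
    """Derive a utilization percentage from two cumulative `cpu` samples.

    prev_cpu_ns / cur_cpu_ns: cumulative CPU time in nanoseconds.
    elapsed_seconds: wall-clock time between the two samples.
    vcpus: number of vCPUs of the instance (assumed known to the caller).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    if cur_cpu_ns < prev_cpu_ns:
        # A cumulative counter should never decrease; treat it as a reset.
        raise ValueError("counter went backwards (instance reset?)")
    cpu_seconds = (cur_cpu_ns - prev_cpu_ns) / 1e9
    return 100.0 * cpu_seconds / (elapsed_seconds * vcpus)


if __name__ == "__main__":
    # 30 s of CPU time consumed over a 60 s polling interval.
    print(cpu_util_percent(0, 30_000_000_000, 60.0, vcpus=1))
    print(cpu_util_percent(0, 30_000_000_000, 60.0, vcpus=2))
```

Dividing by the vCPU count keeps the result in the 0 to 100 range for multi-core instances.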
Gnocchi: Time Series Storage
Gnocchi is a time series database designed for OpenStack:
    gnocchi-api (REST API)
        │
        ├── Resource (e.g. one VM)
        │     └── Metric (e.g. cpu_util)
        │           └── Measure (timestamp + value)
        │
        └── Archive Policy
              ├── 1-minute granularity, kept for 7 days
              ├── 1-hour granularity, kept for 30 days
              └── 1-day granularity, kept for 1 year
Archive Policies
    # Create an archive policy with three retention tiers
    openstack metric archive-policy create high-res \
        --definition "granularity:60s,timespan:7d" \
        --definition "granularity:3600s,timespan:30d" \
        --definition "granularity:86400s,timespan:365d"

    # Query measures at 1-hour granularity
    openstack metric measures show <metric-id> \
        --start 2026-04-01T00:00:00 \
        --stop 2026-04-14T00:00:00 \
        --granularity 3600
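Each definition in an archive policy retains roughly timespan / granularity aggregated points per metric and aggregation method. The Python sketch below works that arithmetic out for the three tiers of the `high-res` policy. The parser is a simplified stand-in written for this example (it accepts only the `<number><s|m|h|d>` form used above) and is not Gnocchi's own parsing code.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(text: str) -> int:
    """Convert strings such as '60s' or '7d' to seconds (simplified parser)."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def points_per_definition(definition: str) -> int:
    """Number of points kept for 'granularity:<g>,timespan:<t>'."""
    fields = dict(item.split(":", 1) for item in definition.split(","))
    return to_seconds(fields["timespan"]) // to_seconds(fields["granularity"])


if __name__ == "__main__":
    definitions = [
        "granularity:60s,timespan:7d",
        "granularity:3600s,timespan:30d",
        "granularity:86400s,timespan:365d",
    ]
    for d in definitions:
        print(d, "->", points_per_definition(d), "points")
    print("total:", sum(points_per_definition(d) for d in definitions))
```

The 1-minute tier dominates the point count, which is why the highest resolution is usually given the shortest retention.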
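Aodh: Alarm Engine
Aodh triggers alarms based on the metric data stored in Gnocchi:
    aodh-evaluator (periodically evaluates alarm rules)
        │
        ├── Query metric data from Gnocchi
        ├── Compare against the threshold
        └── Trigger alarm actions
              ├── HTTP callback (webhook)
              ├── Heat scaling signal
              └── Zaqar message queue notification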
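The evaluation loop above can be sketched in Python as follows. This is a simplified model rather than Aodh's code: the function name and the rule for mixed results (keep the previous state) are assumptions made for illustration, while the state names `ok`, `alarm`, and `insufficient data` are the ones Aodh uses.

```python
import operator

COMPARATORS = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}


def evaluate_threshold(values, threshold, comparison="gt",
                       evaluation_periods=3, previous_state="ok"):
    """Return the alarm state for the most recent aggregated values.

    values: aggregated datapoints, oldest first, one per granularity window.
    """
    recent = list(values)[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "insufficient data"
    compare = COMPARATORS[comparison]
    breaches = [compare(value, threshold) for value in recent]
    if all(breaches):
        return "alarm"      # every period violated the threshold
    if not any(breaches):
        return "ok"         # no period violated the threshold
    return previous_state   # mixed result: assume the state is unchanged


if __name__ == "__main__":
    print(evaluate_threshold([85, 90, 95], threshold=80))
    print(evaluate_threshold([50, 60, 70], threshold=80))
    print(evaluate_threshold([85, 90], threshold=80))
```

Requiring every one of the last `evaluation_periods` windows to breach the threshold is what keeps a single spike from triggering a scale-up.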
Alarm Types
    # Threshold alarm on a Gnocchi resource metric
    openstack alarm create \
        --type gnocchi_resources_threshold \
        --name cpu-high-alarm \
        --metric cpu_util \
        --threshold 80 \
        --comparison-operator gt \
        --aggregation-method mean \
        --evaluation-periods 3 \
        --granularity 60 \
        --resource-type instance \
        --resource-id <instance-uuid> \
        --alarm-action http://webhook.example.com/scale-up

    # Composite alarm combining several rules
    openstack alarm create \
        --type composite \
        --name complex-alarm \
        --composite-rule '{
            "or": [
                {"threshold": 90, "metric": "cpu_util", ...},
                {"threshold": 95, "metric": "memory.usage", ...}
            ]
        }'
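A composite rule is a tree of `and` / `or` nodes whose leaves are individual threshold rules. The Python sketch below shows one way such a tree could be evaluated recursively. It is illustrative only: the leaf dictionaries are trimmed to `metric` and `threshold`, and the `leaf_fires` callback and the `latest` sample values are hypothetical, not part of Aodh's API.

```python
def evaluate_composite(rule: dict, leaf_fires) -> bool:
    """Recursively evaluate an and/or rule tree.

    leaf_fires: callable that takes a leaf rule and returns True if that
    single threshold rule is currently in the alarm condition.
    """
    if "and" in rule:
        return all(evaluate_composite(sub, leaf_fires) for sub in rule["and"])
    if "or" in rule:
        return any(evaluate_composite(sub, leaf_fires) for sub in rule["or"])
    return bool(leaf_fires(rule))


if __name__ == "__main__":
    # Hypothetical latest aggregated values per metric.
    latest = {"cpu_util": 93.0, "memory.usage": 40.0}

    def fires(leaf):
        return latest[leaf["metric"]] > leaf["threshold"]

    leaves = [
        {"metric": "cpu_util", "threshold": 90},
        {"metric": "memory.usage", "threshold": 95},
    ]
    print(evaluate_composite({"or": leaves}, fires))
    print(evaluate_composite({"and": leaves}, fires))
```

Because nodes can nest, the same function handles rules such as "CPU high and (memory high or disk high)" without extra code.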
Integrating with Prometheus (Modern Approach)
Production environments increasingly use Prometheus in place of Gnocchi/Aodh:
openstack-exporter
    services:
      openstack-exporter:
        image: ghcr.io/openstack-exporter/openstack-exporter:latest
        volumes:
          - ./clouds.yaml:/etc/openstack/clouds.yaml
        ports:
          - "9180:9180"
        # The exporter takes the cloud name from clouds.yaml as a positional argument
        command: mycloud
    # clouds.yaml
    clouds:
      mycloud:
        auth:
          auth_url: http://keystone:5000/v3
          username: admin
          password: secret
          project_name: admin
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne
Key Metrics
    # Nova metrics
    openstack_nova_agent_state{hostname, service, adminState}
    openstack_nova_server_status{id, name, status, tenant_id}
    openstack_nova_vcpus_available
    openstack_nova_vcpus_used
    openstack_nova_memory_available_bytes
    openstack_nova_memory_used_bytes

    # Neutron metrics
    openstack_neutron_agent_state{hostname, service}
    openstack_neutron_network_count{tenant_id}
    openstack_neutron_floating_ips{tenant_id, status}

    # Cinder metrics
    openstack_cinder_agent_state{hostname, service}
    openstack_cinder_volume_status{id, status, size}
    openstack_cinder_pool_capacity_gb_free{name, backend}
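Exporters publish these metrics in the Prometheus text exposition format, one `name{labels} value` sample per line. As a rough illustration of what a scrape returns, the Python sketch below parses such text into tuples. The sample payload and its label values are made up for the example, and the parser is deliberately minimal: it skips comment lines and does not handle the optional trailing timestamp.

```python
import re

_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...}
    r'\s+(?P<value>\S+)$'                    # sample value
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this minimal parser cannot read
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels,
                        float(match.group("value"))))
    return samples


if __name__ == "__main__":
    # Illustrative payload only; real label sets may differ.
    payload = '''
# TYPE openstack_nova_vcpus_used gauge
openstack_nova_vcpus_used{hostname="compute-01"} 24
openstack_nova_vcpus_available{hostname="compute-01"} 32
'''
    for sample in parse_samples(payload):
        print(sample)
```

In practice Prometheus does this parsing itself; the sketch is only meant to show the shape of the data behind the metric names listed above.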
Prometheus Alerting Rules
    groups:
      - name: openstack.nova
        rules:
          - alert: NovaComputeServiceDown
            expr: openstack_nova_agent_state{service="nova-compute"} == 0
            for: 2m
            labels:
              severity: critical
              component: nova
            annotations:
              summary: "Nova Compute service down"
              description: "nova-compute on node {{ $labels.hostname }} has been down for more than 2 minutes"

          - alert: NovaHighVCPUUsage
            expr: |
              openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Nova vCPU usage above 90%"

      - name: openstack.neutron
        rules:
          - alert: NeutronAgentDown
            expr: openstack_neutron_agent_state == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Neutron agent down: {{ $labels.service }} on {{ $labels.hostname }}"

      - name: openstack.cinder
        rules:
          - alert: CinderPoolLowCapacity
            expr: |
              openstack_cinder_pool_capacity_gb_free
                / (openstack_cinder_pool_capacity_gb_free + openstack_cinder_pool_capacity_gb_used) < 0.1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Cinder pool {{ $labels.name }} has less than 10% free capacity"
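The `for:` clause in these rules means an alert first becomes pending and only fires once its expression has stayed true for the whole duration. The Python sketch below models that state machine for a single alert. It is a simplification for illustration: the function name is hypothetical, and each sample stands for one rule evaluation with its timestamp in seconds.

```python
def alert_state(samples, for_seconds: float) -> str:
    """Return 'inactive', 'pending', or 'firing' after the last evaluation.

    samples: iterable of (timestamp_seconds, condition_is_true), in time order.
    """
    active_since = None
    state = "inactive"
    for timestamp, condition_is_true in samples:
        if not condition_is_true:
            # Any false evaluation resets the timer.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= for_seconds else "pending"
    return state


if __name__ == "__main__":
    # nova-compute reported down at t=0, 60, 120 with `for: 2m`.
    print(alert_state([(0, True), (60, True), (120, True)], 120))
    # A recovery at t=60 resets the timer.
    print(alert_state([(0, True), (60, False), (120, True)], 120))
```

This is why a 2-minute `for` on NovaComputeServiceDown tolerates a brief heartbeat gap without paging anyone.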
Grafana Dashboard
    {
      "title": "OpenStack Cluster Overview",
      "panels": [
        {
          "title": "vCPU Usage",
          "type": "gauge",
          "targets": [{
            "expr": "openstack_nova_vcpus_used / openstack_nova_vcpus_available * 100"
          }],
          "thresholds": [{"value": 80, "color": "yellow"}, {"value": 90, "color": "red"}]
        },
        {
          "title": "Service Health",
          "type": "stat",
          "targets": [{
            "expr": "count(openstack_nova_agent_state == 0) by (service)"
          }]
        }
      ]
    }
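Recommendations for a Production Monitoring Stack
    Collection layer:
      openstack-exporter → Prometheus (15s scrape interval)
      node-exporter → Prometheus (host metrics)
      libvirt-exporter → Prometheus (KVM guest metrics)

    Storage layer:
      Prometheus (short term, 15 days)
      Thanos / VictoriaMetrics (long term, 1 year+)

    Alerting layer:
      Alertmanager (deduplication, grouping, routing)
        → DingTalk / WeCom / PagerDuty

    Visualization layer:
      Grafana (dashboards)
        → OpenStack cluster overview
        → Detailed panels for each component
        → Capacity planning trend charts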