A Prometheus exporter for NVIDIA GPUs, written in Go, exposing GPU hardware metrics and per-process GPU memory usage.
nvidia_smi_up - nvidia-smi availability (1 = healthy, 0 = failed)
nvidia_gpu_info - GPU device information
nvidia_gpu_utilization_percent - GPU compute utilization
nvidia_gpu_memory_used_bytes - GPU memory used
nvidia_gpu_memory_total_bytes - GPU memory total
nvidia_gpu_memory_free_bytes - GPU memory free
nvidia_gpu_temperature_celsius - GPU temperature
nvidia_gpu_power_watts - GPU power draw
nvidia_gpu_power_limit_watts - GPU power limit
nvidia_gpu_fan_speed_percent - GPU fan speed
nvidia_gpu_process_memory_bytes - per-process GPU memory usage
nvidia_gpu_process_info - process information (command line, username, working directory)
nvidia_gpu_process_count - number of processes per GPU
nvidia_exporter_collect_duration_seconds - collection duration
nvidia_exporter_collect_errors_total - total number of collection errors
nvidia_exporter_last_collect_timestamp_seconds - timestamp of the last collection
Requires the nvidia-smi command.
# Clone the repository
git clone https://github.com/yuwen/gpu-exporter.git
cd gpu-exporter
# Build
go build -o gpu-exporter .
# Run
./gpu-exporter
# Build
make build
# Install (copies the binary to /usr/local/bin and installs the systemd service)
sudo make install
# Uninstall
sudo make uninstall
./gpu-exporter [flags]
Flags:
-web.listen-address string
    Address and port to listen on (default ":9101")
-web.telemetry-path string
    Path under which to expose metrics (default "/metrics")
-nvidia.timeout duration
    Timeout for the nvidia-smi command (default 5s)
-version
    Print version information
# Start on a custom port
./gpu-exporter -web.listen-address=":9102"
# Set the nvidia-smi timeout
./gpu-exporter -nvidia.timeout=10s
# Show the version
./gpu-exporter -version
# Fetch the metrics
curl http://localhost:9101/metrics
# Health check
curl http://localhost:9101/health
Create /etc/systemd/system/gpu-exporter.service:
[Unit]
Description=NVIDIA GPU Exporter
After=network.target
[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/gpu-exporter
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Install the binary, then enable and start the service:
sudo cp gpu-exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/gpu-exporter
sudo systemctl daemon-reload
sudo systemctl enable gpu-exporter
sudo systemctl start gpu-exporter
sudo systemctl status gpu-exporter
View the logs:
sudo journalctl -u gpu-exporter -f
Pre-configured Prometheus configuration files are provided in the prometheus/ directory.
1. Copy the configuration files:
sudo mkdir -p /etc/prometheus
sudo cp prometheus/prometheus.yml /etc/prometheus/
sudo cp prometheus/alerts.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
2. Edit the target addresses in prometheus.yml:
scrape_configs:
  - job_name: 'gpu-exporter'
    static_configs:
      - targets:
          - 'gpu-server-01:9101'  # replace with your GPU server addresses
          - 'gpu-server-02:9101'
3. Restart Prometheus:
sudo systemctl restart prometheus
4. Verify:
Open http://localhost:9090 and check that the gpu-exporter targets are listed as UP under Status -> Targets.
prometheus/prometheus.yml - main Prometheus configuration
prometheus/alerts.yml - GPU monitoring alert rules
prometheus/gpu-servers.yml.example - file-based service discovery example (YAML)
prometheus/gpu-servers.json.example - file-based service discovery example (JSON)
prometheus/PROMETHEUS_SETUP.md - detailed setup guide
The configuration ships with a full set of alert rules:
Health alerts:
Hardware alerts:
Utilization alerts:
Process alerts:
Static configuration (simple):
static_configs:
  - targets: ['gpu-server:9101']
File-based service discovery (recommended):
file_sd_configs:
  - files: ['/etc/prometheus/targets/*.yml']
    refresh_interval: 30s
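Each file matched by the files glob is a plain list of target groups. A minimal sketch of such a targets file (host names and labels are illustrative; prometheus/gpu-servers.yml.example ships the reference format):
- targets:
    - 'gpu-server-01:9101'
    - 'gpu-server-02:9101'
  labels:
    cluster: 'training'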
Consul service discovery:
consul_sd_configs:
  - server: 'consul:8500'
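With Consul you will usually also restrict scraping to the exporter's own service; a sketch, assuming the exporter is registered under the (hypothetical) service name gpu-exporter:
consul_sd_configs:
  - server: 'consul:8500'
    services: ['gpu-exporter']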
See prometheus/PROMETHEUS_SETUP.md for details.
# Check whether nvidia-smi is working
nvidia_smi_up
# GPU memory usage percentage
(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100
# Total GPU memory usage per user
sum by (username) (nvidia_gpu_process_memory_bytes)
# Processes running across multiple GPUs (same PID appears on different GPUs)
count by (pid, process_name, username) (nvidia_gpu_process_memory_bytes) > 1
# GPUs hotter than 70°C
nvidia_gpu_temperature_celsius > 70
# GPU power usage percentage
(nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) * 100
# Top 5 processes by GPU memory usage
topk(5, nvidia_gpu_process_memory_bytes)
# Idle GPUs (low utilization but memory still allocated)
nvidia_gpu_utilization_percent < 5 and nvidia_gpu_memory_used_bytes > 0
# Number of GPUs occupied by each user
count by (username) (
  sum by (gpu_index, username) (nvidia_gpu_process_memory_bytes) > 0
)
# Average utilization across all GPUs
avg(nvidia_gpu_utilization_percent)
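Expressions you query frequently, such as the memory-usage percentage above, can also be precomputed with a Prometheus recording rule; a minimal sketch (the group and rule names are illustrative and not part of the bundled alerts.yml):
groups:
  - name: gpu_recording_rules
    rules:
      - record: gpu:memory_used:percent
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100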
Pre-built Grafana dashboards are provided in the grafana/ directory.
The full dashboard contains the following panels:
Row 1 - Overview:
Row 2 - Hardware metrics:
Row 3 - Resource usage:
Row 4 - Process monitoring:
Row 5 - Detail views:
The simplified dashboard focuses on GPU memory monitoring and process information:
Panel 1 - GPU memory usage trend (line chart)
Panel 2 - GPU memory overview table
Panel 3 - GPU process details
grafana/gpu-dashboard.json - full dashboard
grafana/gpu-memory-dashboard.json - simplified dashboard
# Replace with your Grafana URL and API key
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"
curl -X POST \
-H "Authorization: Bearer ${GRAFANA_API_KEY}" \
-H "Content-Type: application/json" \
-d @grafana/gpu-dashboard.json \
"${GRAFANA_URL}/api/dashboards/db"
Create /etc/grafana/provisioning/dashboards/gpu-exporter.yaml:
apiVersion: 1
providers:
  - name: 'GPU Exporter'
    orgId: 1
    folder: 'Hardware Monitoring'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/gpu-exporter
Copy the dashboard files:
sudo mkdir -p /var/lib/grafana/dashboards/gpu-exporter
sudo cp grafana/*.json /var/lib/grafana/dashboards/gpu-exporter/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards
sudo systemctl restart grafana-server
The dashboards include the following filter variables:
You can customize the dashboards:
Example Prometheus alert rules:
groups:
  - name: gpu_alerts
    rules:
      - alert: NvidiaSmiDown
        expr: nvidia_smi_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "nvidia-smi is unavailable on {{ $labels.instance }}"
      - alert: GPUHighTemperature
        expr: nvidia_gpu_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} temperature is {{ $value }}°C"
      - alert: GPUMemoryFull
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} memory usage is at {{ $value | humanizePercentage }}"
      - alert: GPUHighPower
        expr: (nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) > 0.9
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu_index }} power usage is high"
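For Prometheus to evaluate rules like these, the rule file must be referenced from prometheus.yml (the bundled prometheus/prometheus.yml presumably already wires in alerts.yml); a minimal sketch:
rule_files:
  - /etc/prometheus/alerts.yml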
Make sure the NVIDIA driver is installed and that nvidia-smi is on the PATH:
which nvidia-smi
nvidia-smi
The exporter needs permission to read /proc/{pid}/ to collect process information. Run it as a user with sufficient privileges, or use sudo; note that the bundled systemd unit runs as User=nobody, which typically cannot read other users' /proc/{pid} entries such as the working directory.
If the working directory shows up as "unknown", see grafana/WORKDIR_TROUBLESHOOTING.md.
Increase the Prometheus scrape interval or raise the nvidia-smi timeout:
./gpu-exporter -nvidia.timeout=10s
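If collection regularly takes several seconds, it can also help to give this job a more generous interval and timeout on the Prometheus side; a sketch (values are illustrative, and scrape_timeout must not exceed scrape_interval):
scrape_configs:
  - job_name: 'gpu-exporter'
    scrape_interval: 30s
    scrape_timeout: 15s
    static_configs:
      - targets: ['gpu-server:9101']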
gpu-exporter/
├── main.go                        # Main entry point
├── collector/
│   ├── nvidia.go                  # nvidia-smi wrapper
│   └── collector.go               # Prometheus collector
├── grafana/
│   ├── gpu-dashboard.json         # Full Grafana dashboard
│   ├── gpu-memory-dashboard.json  # Simplified dashboard
│   └── WORKDIR_TROUBLESHOOTING.md # Working-directory troubleshooting
├── prometheus/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── alerts.yml                 # Alert rules
│   ├── gpu-servers.yml.example    # File SD example (YAML)
│   ├── gpu-servers.json.example   # File SD example (JSON)
│   ├── PROMETHEUS_SETUP.md        # Detailed setup guide
│   ├── INTEGRATION_GUIDE.md       # Integration guide
│   └── README.md                  # Prometheus configuration notes
├── systemd/
│   └── gpu-exporter.service       # Systemd service file
├── go.mod                         # Go module dependencies
├── go.sum                         # Go module checksums
├── Makefile                       # Build and install targets
├── REQUIREMENTS.md                # Requirements document
├── README.md                      # English documentation
└── README_zh.md                   # Chinese documentation (this file)
go build -o gpu-exporter .
go test ./...
# Linux AMD64
GOOS=linux GOARCH=amd64 go build -o gpu-exporter-linux-amd64 .
# Linux ARM64
GOOS=linux GOARCH=arm64 go build -o gpu-exporter-linux-arm64 .
MIT License
Contributions are welcome! Feel free to submit a pull request.
yuwen