Prometheus exporter for NVIDIA GPU metrics, including hardware utilization and per-process GPU memory usage.
nvidia_smi_up - nvidia-smi availability (1=up, 0=down)
nvidia_gpu_info - GPU device information
nvidia_gpu_utilization_percent - GPU compute utilization
nvidia_gpu_memory_used_bytes - GPU memory used
nvidia_gpu_memory_total_bytes - GPU total memory
nvidia_gpu_memory_free_bytes - GPU free memory
nvidia_gpu_temperature_celsius - GPU temperature
nvidia_gpu_power_watts - GPU power usage
nvidia_gpu_power_limit_watts - GPU power limit
nvidia_gpu_fan_speed_percent - GPU fan speed
nvidia_gpu_process_memory_bytes - Process GPU memory usage
nvidia_gpu_process_info - Process information (cmdline, username)
nvidia_gpu_process_count - Number of processes per GPU
nvidia_exporter_collect_duration_seconds - Collection duration
nvidia_exporter_collect_errors_total - Collection errors count
nvidia_exporter_last_collect_timestamp_seconds - Last collection timestamp
nvidia-smi command available
# Clone the repository
git clone https://github.com/yuwen/gpu-exporter.git
cd gpu-exporter
# Build
go build -o gpu-exporter .
# Run
./gpu-exporter
Download the latest release from the releases page.
chmod +x gpu-exporter
./gpu-exporter
./gpu-exporter [flags]
Flags:
-web.listen-address string
Address to listen on for web interface and telemetry (default ":9101")
-web.telemetry-path string
Path under which to expose metrics (default "/metrics")
-nvidia.timeout duration
Timeout for nvidia-smi command (default 5s)
-version
Print version information
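These flags map onto a conventional Prometheus-exporter main function. The following Go sketch is not the actual main.go (the /health response body and the collector wiring are assumptions); it only illustrates how the listed flags and paths are typically hooked up with the standard flag package and promhttp:

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Flag names and defaults mirror the table above.
	listenAddress := flag.String("web.listen-address", ":9101", "Address to listen on for web interface and telemetry")
	telemetryPath := flag.String("web.telemetry-path", "/metrics", "Path under which to expose metrics")
	nvidiaTimeout := flag.Duration("nvidia.timeout", 5*time.Second, "Timeout for nvidia-smi command")
	flag.Parse()
	_ = nvidiaTimeout // handed to the nvidia-smi collector in the real exporter

	// Expose the default Prometheus registry on the configured path.
	http.Handle(*telemetryPath, promhttp.Handler())

	// Minimal liveness endpoint (the response body here is an assumption).
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	log.Printf("listening on %s", *listenAddress)
	log.Fatal(http.ListenAndServe(*listenAddress, nil))
}

The examples below exercise the same flags from the command line: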
# Start exporter on custom port
./gpu-exporter -web.listen-address=":9102"
# Use custom timeout for nvidia-smi
./gpu-exporter -nvidia.timeout=10s
# Check version
./gpu-exporter -version
# View metrics
curl http://localhost:9101/metrics
# Check health
curl http://localhost:9101/health
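The same check can be scripted instead of using curl. A small Go client like the following (port and path are the defaults listed above) scrapes /metrics and prints the nvidia_smi_up sample:

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Scrape the exporter's metrics endpoint (default address; adjust if you
	// changed -web.listen-address or -web.telemetry-path).
	resp, err := http.Get("http://localhost:9101/metrics")
	if err != nil {
		log.Fatalf("scrape failed: %v", err)
	}
	defer resp.Body.Close()

	// nvidia_smi_up is 1 when nvidia-smi is reachable, 0 otherwise.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if line := scanner.Text(); strings.HasPrefix(line, "nvidia_smi_up") {
			fmt.Println(line)
			return
		}
	}
	log.Fatal("nvidia_smi_up not found in scrape output")
}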
Create /etc/systemd/system/gpu-exporter.service:
[Unit]
Description=NVIDIA GPU Exporter
After=network.target
[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/gpu-exporter
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo cp gpu-exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/gpu-exporter
sudo systemctl daemon-reload
sudo systemctl enable gpu-exporter
sudo systemctl start gpu-exporter
sudo systemctl status gpu-exporter
View logs:
sudo journalctl -u gpu-exporter -f
Ready-to-use Prometheus configuration files are included in the prometheus/ directory.
1. Copy configuration files:
sudo mkdir -p /etc/prometheus
sudo cp prometheus/prometheus.yml /etc/prometheus/
sudo cp prometheus/alerts.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
2. Update targets in prometheus.yml:
scrape_configs:
  - job_name: 'gpu-exporter'
    static_configs:
      - targets:
          - 'gpu-server-01:9101'  # Replace with your server addresses
          - 'gpu-server-02:9101'
3. Restart Prometheus:
sudo systemctl restart prometheus
4. Verify:
Visit http://localhost:9090 and check that the gpu-exporter targets show as UP on the Targets page and that the alert rules have been loaded.
prometheus/prometheus.yml - Main Prometheus configuration
prometheus/alerts.yml - Alert rules for GPU monitoring
prometheus/gpu-servers.yml.example - File-based service discovery example (YAML)
prometheus/gpu-servers.json.example - File-based service discovery example (JSON)
prometheus/PROMETHEUS_SETUP.md - Detailed setup guide
The configuration includes comprehensive alert rules covering exporter health, hardware, utilization, and processes.
Static Configuration (simple):
static_configs:
  - targets: ['gpu-server:9101']
File-based Discovery (recommended):
file_sd_configs:
  - files: ['/etc/prometheus/targets/*.yml']
    refresh_interval: 30s
Consul Discovery:
consul_sd_configs:
  - server: 'consul:8500'
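Consul discovery assumes each GPU node registers the exporter as a Consul service. The following Go sketch uses the official github.com/hashicorp/consul/api client; the service name, tags, and check interval are illustrative and must match whatever your consul_sd_configs and relabeling select:

package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (CONSUL_HTTP_ADDR is honored by DefaultConfig).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this node's gpu-exporter; name and tags are illustrative.
	reg := &consul.AgentServiceRegistration{
		ID:   "gpu-exporter",
		Name: "gpu-exporter",
		Port: 9101,
		Tags: []string{"prometheus", "gpu"},
		Check: &consul.AgentServiceCheck{
			HTTP:     "http://localhost:9101/health",
			Interval: "30s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
	log.Println("registered gpu-exporter with Consul")
}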
For detailed instructions, see prometheus/PROMETHEUS_SETUP.md.
# Check if nvidia-smi is up
nvidia_smi_up
# GPU memory utilization rate
(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100
# Total memory used by all processes of a specific user
sum by (username) (nvidia_gpu_process_memory_bytes)
# Processes running on multiple GPUs (same PID on different GPUs)
count by (pid, process_name, username) (nvidia_gpu_process_memory_bytes) > 1
# GPU temperature over 70°C
nvidia_gpu_temperature_celsius > 70
# GPU power usage rate
(nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) * 100
# Top 5 processes by memory usage
topk(5, nvidia_gpu_process_memory_bytes)
# Idle GPUs (low utilization but memory occupied)
nvidia_gpu_utilization_percent < 5 and nvidia_gpu_memory_used_bytes > 0
# Number of GPUs per user
count by (username) (
  sum by (gpu_index, username) (nvidia_gpu_process_memory_bytes) > 0
)
# Average GPU utilization across all GPUs
avg(nvidia_gpu_utilization_percent)
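These queries can also be run outside the Prometheus UI via its HTTP API. The sketch below (Prometheus address assumed to be localhost:9090) issues the per-user memory query through /api/v1/query and prints one line per user:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Prometheus instant-query endpoint (adjust the host for your setup).
	base := "http://localhost:9090/api/v1/query"

	// One of the queries above: total GPU memory per user.
	q := url.Values{}
	q.Set("query", "sum by (username) (nvidia_gpu_process_memory_bytes)")

	resp, err := http.Get(base + "?" + q.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Minimal decoding of the instant-query response shape.
	var out struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	for _, r := range out.Data.Result {
		fmt.Printf("%s: %s bytes\n", r.Metric["username"], r.Value[1])
	}
}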
Pre-built Grafana dashboards are included in the grafana/ directory.
The full dashboard (grafana/gpu-dashboard.json) includes the following panels:
Row 1 - Overview Stats:
Row 2 - Hardware Metrics:
Row 3 - Resource Usage:
Row 4 - Process Monitoring:
Row 5 - Detailed View:
The simplified dashboard (grafana/gpu-memory-dashboard.json) focuses on memory monitoring and process information:
Panel 1 - GPU Memory Trend (Line Chart)
Panel 2 - GPU Memory Overview Table
Panel 3 - GPU Process Details
grafana/gpu-dashboard.json - Full-featured dashboard
grafana/gpu-memory-dashboard.json - Simplified dashboard
# Replace with your Grafana URL and API key
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"
curl -X POST \
-H "Authorization: Bearer ${GRAFANA_API_KEY}" \
-H "Content-Type: application/json" \
-d @grafana/gpu-dashboard.json \
"${GRAFANA_URL}/api/dashboards/db"
Create /etc/grafana/provisioning/dashboards/gpu-exporter.yaml:
apiVersion: 1
providers:
  - name: 'GPU Exporter'
    orgId: 1
    folder: 'Hardware Monitoring'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/gpu-exporter
Copy the dashboard file:
sudo mkdir -p /var/lib/grafana/dashboards/gpu-exporter
sudo cp grafana/gpu-dashboard.json /var/lib/grafana/dashboards/gpu-exporter/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards
sudo systemctl restart grafana-server
The dashboard includes the following variables for filtering:
You can customize the dashboard by:
Example Prometheus alerting rules:
groups:
  - name: gpu_alerts
    rules:
      - alert: NvidiaSmiDown
        expr: nvidia_smi_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "nvidia-smi unavailable on {{ $labels.instance }}"
      - alert: GPUHighTemperature
        expr: nvidia_gpu_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} temperature is {{ $value }}°C"
      - alert: GPUMemoryFull
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} memory usage is {{ $value | humanizePercentage }}"
      - alert: GPUHighPower
        expr: (nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) > 0.9
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu_index }} power usage is high"
Ensure NVIDIA drivers are installed and nvidia-smi is in PATH:
which nvidia-smi
nvidia-smi
The exporter needs permission to read /proc/{pid}/ for process information. Run as a user with appropriate permissions or use sudo.
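For context, the cmdline and username labels on nvidia_gpu_process_info come from /proc lookups along these lines (an illustrative sketch, not the exporter's actual code). On hardened systems (for example with hidepid mounted on /proc) these reads can fail for other users' processes, which is why elevated permissions may be needed:

package main

import (
	"fmt"
	"log"
	"os"
	"os/user"
	"strings"
	"syscall"
)

// processInfo shows the kind of /proc lookups needed for the cmdline and
// username labels; the function name and error handling are illustrative.
func processInfo(pid int) (cmdline, username string, err error) {
	// /proc/<pid>/cmdline is NUL-separated; it may be unreadable for other
	// users' processes on hardened systems.
	raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
	if err != nil {
		return "", "", err
	}
	cmdline = strings.TrimRight(strings.ReplaceAll(string(raw), "\x00", " "), " ")

	// The owner of /proc/<pid> is the process's UID; map it to a username.
	var st syscall.Stat_t
	if err := syscall.Stat(fmt.Sprintf("/proc/%d", pid), &st); err != nil {
		return "", "", err
	}
	u, err := user.LookupId(fmt.Sprint(st.Uid))
	if err != nil {
		return cmdline, fmt.Sprint(st.Uid), nil // fall back to the numeric UID
	}
	return cmdline, u.Username, nil
}

func main() {
	cmd, usr, err := processInfo(os.Getpid())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cmdline=%q username=%q\n", cmd, usr)
}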
If the working directory shows as "unknown", refer to grafana/WORKDIR_TROUBLESHOOTING.md.
Increase the Prometheus scrape interval or raise the nvidia-smi timeout:
./gpu-exporter -nvidia.timeout=10s
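If nvidia-smi itself is slow (common on busy multi-GPU hosts), you can reproduce the timeout behaviour manually. The sketch below bounds a query the same way a -nvidia.timeout setting is intended to; the exact wiring inside the exporter is an assumption, but the nvidia-smi flags and the context-based cancellation are standard:

package main

import (
	"context"
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Bound the nvidia-smi invocation with a context deadline.
	timeout := 10 * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=index,utilization.gpu,memory.used",
		"--format=csv,noheader,nounits").Output()
	if ctx.Err() == context.DeadlineExceeded {
		log.Fatalf("nvidia-smi timed out after %s; consider a larger -nvidia.timeout", timeout)
	}
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out))
}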
gpu-exporter/
├── main.go                          # Main entry point
├── collector/
│   ├── nvidia.go                    # nvidia-smi wrapper
│   └── collector.go                 # Prometheus collector
├── grafana/
│   ├── gpu-dashboard.json           # Full-featured Grafana dashboard
│   ├── gpu-memory-dashboard.json    # Simplified dashboard
│   └── WORKDIR_TROUBLESHOOTING.md   # Working directory troubleshooting
├── prometheus/
│   ├── prometheus.yml               # Prometheus configuration
│   ├── alerts.yml                   # Alert rules
│   ├── gpu-servers.yml.example      # File SD example (YAML)
│   ├── gpu-servers.json.example     # File SD example (JSON)
│   ├── PROMETHEUS_SETUP.md          # Detailed setup guide
│   ├── INTEGRATION_GUIDE.md         # Integration guide
│   └── README.md                    # Prometheus configuration README
├── systemd/
│   └── gpu-exporter.service         # Systemd service file
├── go.mod                           # Go module dependencies
├── go.sum                           # Go module checksums
├── Makefile                         # Build and installation scripts
├── REQUIREMENTS.md                  # Requirements documentation
├── README.md                        # English documentation
└── README_zh.md                     # Chinese documentation
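collector/collector.go implements the Prometheus side of the exporter. The sketch below is not the actual implementation (struct, field, and label names are illustrative); it shows the standard prometheus.Collector pattern of emitting const metrics from a fresh nvidia-smi snapshot on every scrape:

package collector

import (
	"github.com/prometheus/client_golang/prometheus"
)

// gpuStats is a stand-in for whatever collector/nvidia.go returns per GPU.
type gpuStats struct {
	Index       string
	Name        string
	Utilization float64
	MemoryUsed  float64
}

type gpuCollector struct {
	utilization *prometheus.Desc
	memoryUsed  *prometheus.Desc
	query       func() ([]gpuStats, error) // wraps nvidia-smi
}

func newGPUCollector(query func() ([]gpuStats, error)) *gpuCollector {
	return &gpuCollector{
		utilization: prometheus.NewDesc("nvidia_gpu_utilization_percent",
			"GPU compute utilization", []string{"gpu_index", "gpu_name"}, nil),
		memoryUsed: prometheus.NewDesc("nvidia_gpu_memory_used_bytes",
			"GPU memory used", []string{"gpu_index", "gpu_name"}, nil),
		query: query,
	}
}

func (c *gpuCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.utilization
	ch <- c.memoryUsed
}

func (c *gpuCollector) Collect(ch chan<- prometheus.Metric) {
	gpus, err := c.query() // one nvidia-smi invocation per scrape
	if err != nil {
		return // the real collector also updates nvidia_smi_up and error counters
	}
	for _, g := range gpus {
		ch <- prometheus.MustNewConstMetric(c.utilization, prometheus.GaugeValue,
			g.Utilization, g.Index, g.Name)
		ch <- prometheus.MustNewConstMetric(c.memoryUsed, prometheus.GaugeValue,
			g.MemoryUsed, g.Index, g.Name)
	}
}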
go build -o gpu-exporter .
go test ./...
# Linux AMD64
GOOS=linux GOARCH=amd64 go build -o gpu-exporter-linux-amd64 .
# Linux ARM64
GOOS=linux GOARCH=arm64 go build -o gpu-exporter-linux-arm64 .
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.
yuwen