
GPU Exporter

English | 中文

Prometheus exporter for NVIDIA GPU metrics, including hardware utilization and per-process GPU memory usage.

Features

  • 🔍 GPU Hardware Metrics: Temperature, utilization, memory, power, fan speed
  • 📊 Process-Level Monitoring: Track GPU memory usage per process with user information
  • 🎯 Multi-GPU Support: Monitor all GPUs on the system
  • ⚡ Lightweight: Low CPU and memory footprint
  • 🔧 Easy Deployment: Single binary with systemd support

Metrics

Health Check

  • nvidia_smi_up - nvidia-smi availability (1=up, 0=down)

GPU Hardware Metrics

  • nvidia_gpu_info - GPU device information
  • nvidia_gpu_utilization_percent - GPU compute utilization
  • nvidia_gpu_memory_used_bytes - GPU memory used
  • nvidia_gpu_memory_total_bytes - GPU total memory
  • nvidia_gpu_memory_free_bytes - GPU free memory
  • nvidia_gpu_temperature_celsius - GPU temperature
  • nvidia_gpu_power_watts - GPU power usage
  • nvidia_gpu_power_limit_watts - GPU power limit
  • nvidia_gpu_fan_speed_percent - GPU fan speed

Process Metrics

  • nvidia_gpu_process_memory_bytes - Process GPU memory usage
  • nvidia_gpu_process_info - Process information (cmdline, username)
  • nvidia_gpu_process_count - Number of processes per GPU

Exporter Metrics

  • nvidia_exporter_collect_duration_seconds - Collection duration
  • nvidia_exporter_collect_errors_total - Total number of collection errors
  • nvidia_exporter_last_collect_timestamp_seconds - Last collection timestamp
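
For reference, a scrape of the /metrics endpoint returns Prometheus text exposition along these lines. The sample below is illustrative: the label names (gpu_index, pid, process_name, username) are inferred from the query examples later in this README, and the values are made up.

# HELP nvidia_smi_up nvidia-smi availability (1=up, 0=down)
# TYPE nvidia_smi_up gauge
nvidia_smi_up 1
nvidia_gpu_utilization_percent{gpu_index="0"} 87
nvidia_gpu_memory_used_bytes{gpu_index="0"} 8.589934592e+09
nvidia_gpu_process_memory_bytes{gpu_index="0",pid="12345",process_name="python",username="alice"} 4.294967296e+09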

Requirements

  • Linux operating system
  • NVIDIA GPU with driver installed
  • nvidia-smi command available
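
Before installing, you can sanity-check that nvidia-smi responds to its query interface. The query fields below are standard nvidia-smi options; the exact queries the exporter issues may differ.

nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv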

Installation

Build from Source

# Clone the repository
git clone https://github.com/yuwen/gpu-exporter.git
cd gpu-exporter

# Build
go build -o gpu-exporter .

# Run
./gpu-exporter

Download Binary

Download the latest release from the releases page.

chmod +x gpu-exporter
./gpu-exporter

Usage

Command Line Options

./gpu-exporter [flags]

Flags:
  -web.listen-address string
        Address to listen on for web interface and telemetry (default ":9101")
  -web.telemetry-path string
        Path under which to expose metrics (default "/metrics")
  -nvidia.timeout duration
        Timeout for nvidia-smi command (default 5s)
  -version
        Print version information

Example

# Start exporter on custom port
./gpu-exporter -web.listen-address=":9102"

# Use custom timeout for nvidia-smi
./gpu-exporter -nvidia.timeout=10s

# Check version
./gpu-exporter -version

Test Metrics

# View metrics
curl http://localhost:9101/metrics

# Check health
curl http://localhost:9101/health

Systemd Service

Create /etc/systemd/system/gpu-exporter.service:

[Unit]
Description=NVIDIA GPU Exporter
After=network.target

[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/gpu-exporter
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo cp gpu-exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/gpu-exporter
sudo systemctl daemon-reload
sudo systemctl enable gpu-exporter
sudo systemctl start gpu-exporter
sudo systemctl status gpu-exporter

View logs:

sudo journalctl -u gpu-exporter -f

Prometheus Configuration

Pre-configured Prometheus configuration files are included in the prometheus/ directory.

Quick Setup

1. Copy configuration files:

sudo mkdir -p /etc/prometheus
sudo cp prometheus/prometheus.yml /etc/prometheus/
sudo cp prometheus/alerts.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus

2. Update targets in prometheus.yml:

scrape_configs:
  - job_name: 'gpu-exporter'
    static_configs:
      - targets:
          - 'gpu-server-01:9101'  # Replace with your server addresses
          - 'gpu-server-02:9101'

3. Restart Prometheus:

sudo systemctl restart prometheus

4. Verify:

Visit http://localhost:9090 and check:

  • Status → Targets: GPU exporter targets should be UP
  • Status → Rules: Alert rules should be loaded
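
At any point you can also validate the configuration and alert rules syntactically with promtool, which ships with Prometheus:

promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/prometheus/alerts.yml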

Configuration Files

  • prometheus/prometheus.yml - Main Prometheus configuration
  • prometheus/alerts.yml - Alert rules for GPU monitoring
  • prometheus/gpu-servers.yml.example - File-based service discovery example (YAML)
  • prometheus/gpu-servers.json.example - File-based service discovery example (JSON)
  • prometheus/PROMETHEUS_SETUP.md - Detailed setup guide

Alert Rules Included

The configuration includes comprehensive alert rules:

Health Alerts:

  • nvidia-smi unavailable
  • GPU Exporter down

Hardware Alerts:

  • High/critical GPU temperature
  • Memory usage warnings
  • Fan failure detection

Utilization Alerts:

  • Idle GPUs with processes
  • Resource underutilization

Process Alerts:

  • Too many processes per GPU
  • Excessive memory consumption
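
As an illustration, a rule in the last category could be written against the process-count metric. The alert name and threshold below are made up; the shipped alerts.yml defines its own:

- alert: GPUTooManyProcesses   # illustrative name, not from alerts.yml
  expr: nvidia_gpu_process_count > 10   # arbitrary threshold
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu_index }} on {{ $labels.instance }} is running {{ $value }} processes"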

Service Discovery Options

Static Configuration (simple):

static_configs:
  - targets: ['gpu-server:9101']

File-based Discovery (recommended):

file_sd_configs:
  - files: ['/etc/prometheus/targets/*.yml']
    refresh_interval: 30s
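
A targets file consumed by this block might look like the following; the hostnames and labels are placeholders, and prometheus/gpu-servers.yml.example ships with the repo:

- targets:
    - 'gpu-server-01:9101'
    - 'gpu-server-02:9101'
  labels:
    env: 'production'   # placeholder label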

Consul Discovery:

consul_sd_configs:
  - server: 'consul:8500'

For detailed instructions, see prometheus/PROMETHEUS_SETUP.md.

PromQL Query Examples

# Check if nvidia-smi is up
nvidia_smi_up

# GPU memory utilization rate
(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100

# Total memory used by all processes of a specific user
sum by (username) (nvidia_gpu_process_memory_bytes)

# Processes running on multiple GPUs (same PID on different GPUs)
count by (pid, process_name, username) (nvidia_gpu_process_memory_bytes) > 1

# GPU temperature over 70°C
nvidia_gpu_temperature_celsius > 70

# GPU power usage rate
(nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) * 100

# Top 5 processes by memory usage
topk(5, nvidia_gpu_process_memory_bytes)

# Idle GPUs (low utilization but memory occupied)
nvidia_gpu_utilization_percent < 5 and nvidia_gpu_memory_used_bytes > 0

# Number of GPUs per user
count by (username) (
  sum by (gpu_index, username) (nvidia_gpu_process_memory_bytes) > 0
)

# Average GPU utilization across all GPUs
avg(nvidia_gpu_utilization_percent)

Grafana Dashboard

Pre-built Grafana dashboards are included in the grafana/ directory.

Full-Featured Dashboard

The full dashboard (grafana/gpu-dashboard.json) includes the following panels:

Row 1 - Overview Stats:

  • nvidia-smi Status (UP/DOWN indicator)
  • Total GPUs count
  • Average GPU Utilization gauge
  • Average Memory Usage gauge

Row 2 - Hardware Metrics:

  • GPU Utilization timeline (all GPUs)
  • GPU Temperature timeline (all GPUs)

Row 3 - Resource Usage:

  • GPU Memory Usage (stacked chart with used/free)
  • GPU Power Usage (with limit lines)

Row 4 - Process Monitoring:

  • Memory Usage by User (pie chart)
  • Top 10 Processes by Memory (table)

Row 5 - Detailed View:

  • GPU Details table (utilization, memory, temperature, power, fan speed)

Simplified Dashboard (Memory Focus)

The simplified dashboard (grafana/gpu-memory-dashboard.json) focuses on memory monitoring and process information:

Panel 1 - GPU Memory Trend (Line Chart)

  • Total memory: Blue dashed line (reference)
  • Used memory: Red solid line
  • Free memory: Green solid line

Panel 2 - GPU Memory Overview Table

  • Total memory, used memory, free memory per GPU
  • Usage percentage with color-coded alerts

Panel 3 - GPU Process Details

  • GPU, PID, process name, username
  • Working directory (pwdx functionality)
  • Memory usage with progress bar
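
The working directory column is resolved the way pwdx does it, by dereferencing the process's cwd symlink under /proc. You can reproduce the lookup manually (the PID is a placeholder):

pwdx 12345
# equivalent to:
sudo readlink /proc/12345/cwd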

Import Dashboard

Method 1: Import via Grafana UI

  1. Open Grafana and navigate to Dashboards → Import
  2. Click Upload JSON file
  3. Select dashboard file:
    • grafana/gpu-dashboard.json - Full-featured dashboard
    • grafana/gpu-memory-dashboard.json - Simplified dashboard
  4. Select your Prometheus datasource
  5. Click Import

Method 2: Import via API

# Replace with your Grafana URL and API key
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"

curl -X POST \
  -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @grafana/gpu-dashboard.json \
  "${GRAFANA_URL}/api/dashboards/db"

Method 3: Provisioning (Recommended for Production)

Create /etc/grafana/provisioning/dashboards/gpu-exporter.yaml:

apiVersion: 1

providers:
  - name: 'GPU Exporter'
    orgId: 1
    folder: 'Hardware Monitoring'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/gpu-exporter

Copy the dashboard file:

sudo mkdir -p /var/lib/grafana/dashboards/gpu-exporter
sudo cp grafana/gpu-dashboard.json /var/lib/grafana/dashboards/gpu-exporter/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards
sudo systemctl restart grafana-server

Dashboard Variables

The dashboard includes the following variables for filtering:

  • Data Source: Select your Prometheus datasource
  • Instance: Filter by GPU server instance
  • GPU: Select specific GPU(s) or view all
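
If you extend or rebuild these variables, Grafana template variables of this kind are typically backed by label_values() queries against the exporter's metrics, along these lines (the exact queries in the shipped dashboards may differ):

label_values(nvidia_gpu_utilization_percent, instance)
label_values(nvidia_gpu_utilization_percent{instance=~"$instance"}, gpu_index)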

Customization

You can customize the dashboard by:

  1. Adjusting refresh interval (default: 5s)
  2. Modifying threshold values for alerts (temperature, memory, etc.)
  3. Adding/removing panels based on your needs
  4. Changing time range (default: last 1 hour)

Alerting Rules

Example Prometheus alerting rules:

groups:
  - name: gpu_alerts
    rules:
      - alert: NvidiaSmiDown
        expr: nvidia_smi_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "nvidia-smi unavailable on {{ $labels.instance }}"

      - alert: GPUHighTemperature
        expr: nvidia_gpu_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} temperature is {{ $value }}°C"

      - alert: GPUMemoryFull
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} memory usage is {{ $value | humanizePercentage }}"

      - alert: GPUHighPower
        expr: (nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) > 0.9
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu_index }} power usage is high"

Troubleshooting

nvidia-smi not found

Ensure NVIDIA drivers are installed and nvidia-smi is in PATH:

which nvidia-smi
nvidia-smi

Permission denied

The exporter needs permission to read /proc/{pid}/ for process information. Run as a user with appropriate permissions or use sudo.
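
If you deployed with the systemd unit above (which runs as User=nobody), one option is a drop-in override. Running as root, shown here purely as an illustration, trades safety for visibility; prefer a dedicated user that can read the relevant /proc entries where possible.

sudo systemctl edit gpu-exporter
# In the editor, add:
#   [Service]
#   User=root
sudo systemctl restart gpu-exporter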

If working directory shows "unknown", refer to grafana/WORKDIR_TROUBLESHOOTING.md.

High CPU usage

Increase the scrape interval in Prometheus or raise the nvidia-smi timeout:

./gpu-exporter -nvidia.timeout=10s
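
On the Prometheus side, a longer per-job scrape_interval means nvidia-smi is invoked less often; 60s below is an illustrative value:

scrape_configs:
  - job_name: 'gpu-exporter'
    scrape_interval: 60s   # Prometheus's default is 15s unless overridden globally
    static_configs:
      - targets: ['gpu-server-01:9101']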

Development

Project Structure

gpu-exporter/
├── main.go                        # Main entry point
├── collector/
│   ├── nvidia.go                  # nvidia-smi wrapper
│   └── collector.go               # Prometheus collector
├── grafana/
│   ├── gpu-dashboard.json         # Full-featured Grafana dashboard
│   ├── gpu-memory-dashboard.json  # Simplified dashboard
│   └── WORKDIR_TROUBLESHOOTING.md # Working directory troubleshooting
├── prometheus/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── alerts.yml                 # Alert rules
│   ├── gpu-servers.yml.example    # File SD example (YAML)
│   ├── gpu-servers.json.example   # File SD example (JSON)
│   ├── PROMETHEUS_SETUP.md        # Detailed setup guide
│   ├── INTEGRATION_GUIDE.md       # Integration guide
│   └── README.md                  # Prometheus configuration README
├── systemd/
│   └── gpu-exporter.service       # Systemd service file
├── go.mod                         # Go module dependencies
├── go.sum                         # Go module checksums
├── Makefile                       # Build and installation scripts
├── REQUIREMENTS.md                # Requirements documentation
├── README.md                      # English documentation
└── README_zh.md                   # Chinese documentation

Build

go build -o gpu-exporter .

Test

go test ./...

Cross-compile

# Linux AMD64
GOOS=linux GOARCH=amd64 go build -o gpu-exporter-linux-amd64 .

# Linux ARM64
GOOS=linux GOARCH=arm64 go build -o gpu-exporter-linux-arm64 .

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

yuwen

Acknowledgments