
GPU Exporter

English | 中文

Prometheus exporter for NVIDIA GPU metrics, including hardware utilization and per-process GPU memory usage.

Features

  • 🔍 GPU Hardware Metrics: Temperature, utilization, memory, power, fan speed
  • 📊 Process-Level Monitoring: Track GPU memory usage per process with user information
  • 🎯 Multi-GPU Support: Monitor all GPUs on the system
  • ⚡ Lightweight: Low CPU and memory footprint
  • 🔧 Easy Deployment: Single binary with systemd support

Metrics

Health Check

  • nvidia_smi_up - nvidia-smi availability (1=up, 0=down)

GPU Hardware Metrics

  • nvidia_gpu_info - GPU device information
  • nvidia_gpu_utilization_percent - GPU compute utilization
  • nvidia_gpu_memory_used_bytes - GPU memory used
  • nvidia_gpu_memory_total_bytes - GPU total memory
  • nvidia_gpu_memory_free_bytes - GPU free memory
  • nvidia_gpu_temperature_celsius - GPU temperature
  • nvidia_gpu_power_watts - GPU power usage
  • nvidia_gpu_power_limit_watts - GPU power limit
  • nvidia_gpu_fan_speed_percent - GPU fan speed

Process Metrics

  • nvidia_gpu_process_memory_bytes - Process GPU memory usage
  • nvidia_gpu_process_info - Process information (cmdline, username)
  • nvidia_gpu_process_count - Number of processes per GPU

Exporter Metrics

  • nvidia_exporter_collect_duration_seconds - Collection duration
  • nvidia_exporter_collect_errors_total - Total number of collection errors
  • nvidia_exporter_last_collect_timestamp_seconds - Last collection timestamp
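
For reference, a scrape of the /metrics endpoint returns Prometheus text exposition along these lines. The sample below is illustrative: the label names (gpu_index, pid, process_name, username) are inferred from the query examples later in this README, and the values are made up.

# HELP nvidia_smi_up nvidia-smi availability (1=up, 0=down)
# TYPE nvidia_smi_up gauge
nvidia_smi_up 1
nvidia_gpu_utilization_percent{gpu_index="0"} 87
nvidia_gpu_memory_used_bytes{gpu_index="0"} 8.589934592e+09
nvidia_gpu_process_memory_bytes{gpu_index="0",pid="12345",process_name="python",username="alice"} 4.294967296e+09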

Requirements

  • Linux operating system
  • NVIDIA GPU with driver installed
  • nvidia-smi command available
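
Before installing, you can sanity-check that nvidia-smi responds to its query interface. The query fields below are standard nvidia-smi options; the exact queries the exporter issues may differ.

nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv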

Installation

Build from Source

# Clone the repository
git clone https://github.com/yuwen/gpu-exporter.git
cd gpu-exporter

# Build
go build -o gpu-exporter .

# Run
./gpu-exporter

Download Binary

Download the latest release from the releases page.

chmod +x gpu-exporter
./gpu-exporter

Usage

Command Line Options

./gpu-exporter [flags]

Flags:
  -web.listen-address string
        Address to listen on for web interface and telemetry (default ":9101")
  -web.telemetry-path string
        Path under which to expose metrics (default "/metrics")
  -nvidia.timeout duration
        Timeout for nvidia-smi command (default 5s)
  -version
        Print version information

Example

# Start exporter on custom port
./gpu-exporter -web.listen-address=":9102"

# Use custom timeout for nvidia-smi
./gpu-exporter -nvidia.timeout=10s

# Check version
./gpu-exporter -version

Test Metrics

# View metrics
curl http://localhost:9101/metrics

# Check health
curl http://localhost:9101/health

Systemd Service

Create /etc/systemd/system/gpu-exporter.service:

[Unit]
Description=NVIDIA GPU Exporter
After=network.target

[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/gpu-exporter
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo cp gpu-exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/gpu-exporter
sudo systemctl daemon-reload
sudo systemctl enable gpu-exporter
sudo systemctl start gpu-exporter
sudo systemctl status gpu-exporter

View logs:

sudo journalctl -u gpu-exporter -f

Prometheus Configuration

Pre-configured Prometheus configuration files are included in the prometheus/ directory.

Quick Setup

1. Copy configuration files:

sudo mkdir -p /etc/prometheus
sudo cp prometheus/prometheus.yml /etc/prometheus/
sudo cp prometheus/alerts.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus

2. Update targets in prometheus.yml:

scrape_configs:
  - job_name: 'gpu-exporter'
    static_configs:
      - targets:
          - 'gpu-server-01:9101'  # Replace with your server addresses
          - 'gpu-server-02:9101'

3. Restart Prometheus:

sudo systemctl restart prometheus

4. Verify:

Visit http://localhost:9090 and check:

  • Status → Targets: GPU exporter targets should be UP
  • Status → Rules: Alert rules should be loaded
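
At any point you can also validate the configuration and alert rules syntactically with promtool, which ships with Prometheus:

promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/prometheus/alerts.yml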

Configuration Files

  • prometheus/prometheus.yml - Main Prometheus configuration
  • prometheus/alerts.yml - Alert rules for GPU monitoring
  • prometheus/gpu-servers.yml.example - File-based service discovery example (YAML)
  • prometheus/gpu-servers.json.example - File-based service discovery example (JSON)
  • prometheus/PROMETHEUS_SETUP.md - Detailed setup guide

Alert Rules Included

The configuration includes comprehensive alert rules:

Health Alerts:

  • nvidia-smi unavailable
  • GPU Exporter down

Hardware Alerts:

  • High/critical GPU temperature
  • Memory usage warnings
  • Fan failure detection

Utilization Alerts:

  • Idle GPUs with processes
  • Resource underutilization

Process Alerts:

  • Too many processes per GPU
  • Excessive memory consumption
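
As an illustration, a rule in the last category could be written against the process-count metric. The alert name and threshold below are made up; the shipped alerts.yml defines its own:

- alert: GPUTooManyProcesses   # illustrative name, not from alerts.yml
  expr: nvidia_gpu_process_count > 10   # arbitrary threshold
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu_index }} on {{ $labels.instance }} is running {{ $value }} processes"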

Service Discovery Options

Static Configuration (simple):

static_configs:
  - targets: ['gpu-server:9101']

File-based Discovery (recommended):

file_sd_configs:
  - files: ['/etc/prometheus/targets/*.yml']
    refresh_interval: 30s
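
A targets file consumed by this block might look like the following; the hostnames and labels are placeholders, and prometheus/gpu-servers.yml.example ships with the repo:

- targets:
    - 'gpu-server-01:9101'
    - 'gpu-server-02:9101'
  labels:
    env: 'production'   # placeholder label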

Consul Discovery:

consul_sd_configs:
  - server: 'consul:8500'

For detailed instructions, see prometheus/PROMETHEUS_SETUP.md.

PromQL Query Examples

# Check if nvidia-smi is up
nvidia_smi_up

# GPU memory utilization rate
(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) * 100

# Total memory used by all processes of a specific user
sum by (username) (nvidia_gpu_process_memory_bytes)

# Processes running on multiple GPUs (same PID on different GPUs)
count by (pid, process_name, username) (nvidia_gpu_process_memory_bytes) > 1

# GPU temperature over 70°C
nvidia_gpu_temperature_celsius > 70

# GPU power usage rate
(nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) * 100

# Top 5 processes by memory usage
topk(5, nvidia_gpu_process_memory_bytes)

# Idle GPUs (low utilization but memory occupied)
nvidia_gpu_utilization_percent < 5 and nvidia_gpu_memory_used_bytes > 0

# Number of GPUs per user
count by (username) (
  sum by (gpu_index, username) (nvidia_gpu_process_memory_bytes) > 0
)

# Average GPU utilization across all GPUs
avg(nvidia_gpu_utilization_percent)

Grafana Dashboard

Pre-built Grafana dashboards are included in the grafana/ directory.

Full-Featured Dashboard

The full dashboard (grafana/gpu-dashboard.json) includes the following panels:

Row 1 - Overview Stats:

  • nvidia-smi Status (UP/DOWN indicator)
  • Total GPUs count
  • Average GPU Utilization gauge
  • Average Memory Usage gauge

Row 2 - Hardware Metrics:

  • GPU Utilization timeline (all GPUs)
  • GPU Temperature timeline (all GPUs)

Row 3 - Resource Usage:

  • GPU Memory Usage (stacked chart with used/free)
  • GPU Power Usage (with limit lines)

Row 4 - Process Monitoring:

  • Memory Usage by User (pie chart)
  • Top 10 Processes by Memory (table)

Row 5 - Detailed View:

  • GPU Details table (utilization, memory, temperature, power, fan speed)

Simplified Dashboard (Memory Focus)

The simplified dashboard (grafana/gpu-memory-dashboard.json) focuses on memory monitoring and process information:

Panel 1 - GPU Memory Trend (Line Chart)

  • Total memory: Blue dashed line (reference)
  • Used memory: Red solid line
  • Free memory: Green solid line

Panel 2 - GPU Memory Overview Table

  • Total memory, used memory, free memory per GPU
  • Usage percentage with color-coded alerts

Panel 3 - GPU Process Details

  • GPU, PID, process name, username
  • Working directory (pwdx functionality)
  • Memory usage with progress bar
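
The working directory column is resolved the way pwdx does it, by dereferencing the process's cwd symlink under /proc. You can reproduce the lookup manually (the PID is a placeholder):

pwdx 12345
# equivalent to:
sudo readlink /proc/12345/cwd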

Import Dashboard

Method 1: Import via Grafana UI

  1. Open Grafana and navigate to Dashboards → Import
  2. Click Upload JSON file
  3. Select dashboard file:
    • grafana/gpu-dashboard.json - Full-featured dashboard
    • grafana/gpu-memory-dashboard.json - Simplified dashboard
  4. Select your Prometheus datasource
  5. Click Import

Method 2: Import via API

# Replace with your Grafana URL and API key
GRAFANA_URL="http://localhost:3000"
GRAFANA_API_KEY="your-api-key"

curl -X POST \
  -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @grafana/gpu-dashboard.json \
  "${GRAFANA_URL}/api/dashboards/db"

Method 3: Provisioning (Recommended for Production)

Create /etc/grafana/provisioning/dashboards/gpu-exporter.yaml:

apiVersion: 1

providers:
  - name: 'GPU Exporter'
    orgId: 1
    folder: 'Hardware Monitoring'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/gpu-exporter

Copy the dashboard file:

sudo mkdir -p /var/lib/grafana/dashboards/gpu-exporter
sudo cp grafana/gpu-dashboard.json /var/lib/grafana/dashboards/gpu-exporter/
sudo chown -R grafana:grafana /var/lib/grafana/dashboards
sudo systemctl restart grafana-server

Dashboard Variables

The dashboard includes the following variables for filtering:

  • Data Source: Select your Prometheus datasource
  • Instance: Filter by GPU server instance
  • GPU: Select specific GPU(s) or view all
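
If you extend or rebuild these variables, Grafana template variables of this kind are typically backed by label_values() queries against the exporter's metrics, along these lines (the exact queries in the shipped dashboards may differ):

label_values(nvidia_gpu_utilization_percent, instance)
label_values(nvidia_gpu_utilization_percent{instance=~"$instance"}, gpu_index)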

Customization

You can customize the dashboard by:

  1. Adjusting refresh interval (default: 5s)
  2. Modifying threshold values for alerts (temperature, memory, etc.)
  3. Adding/removing panels based on your needs
  4. Changing time range (default: last 1 hour)

Alerting Rules

Example Prometheus alerting rules:

groups:
  - name: gpu_alerts
    rules:
      - alert: NvidiaSmiDown
        expr: nvidia_smi_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "nvidia-smi unavailable on {{ $labels.instance }}"

      - alert: GPUHighTemperature
        expr: nvidia_gpu_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} temperature is {{ $value }}°C"

      - alert: GPUMemoryFull
        expr: (nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} memory usage is {{ $value | humanizePercentage }}"

      - alert: GPUHighPower
        expr: (nvidia_gpu_power_watts / nvidia_gpu_power_limit_watts) > 0.9
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu_index }} power usage is high"

Troubleshooting

nvidia-smi not found

Ensure NVIDIA drivers are installed and nvidia-smi is in PATH:

which nvidia-smi
nvidia-smi

Permission denied

The exporter needs permission to read /proc/{pid}/ for process information. Run as a user with appropriate permissions or use sudo.
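
If you deployed with the systemd unit above (which runs as User=nobody), one option is a drop-in override. Running as root, shown here purely as an illustration, trades safety for visibility; prefer a dedicated user that can read the relevant /proc entries where possible.

sudo systemctl edit gpu-exporter
# In the editor, add:
#   [Service]
#   User=root
sudo systemctl restart gpu-exporter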

If working directory shows "unknown", refer to grafana/WORKDIR_TROUBLESHOOTING.md.

High CPU usage

Increase the scrape interval in Prometheus or raise the nvidia-smi timeout:

./gpu-exporter -nvidia.timeout=10s
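
On the Prometheus side, a longer per-job scrape_interval means nvidia-smi is invoked less often; 60s below is an illustrative value:

scrape_configs:
  - job_name: 'gpu-exporter'
    scrape_interval: 60s   # Prometheus's default is 15s unless overridden globally
    static_configs:
      - targets: ['gpu-server-01:9101']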

Development

Project Structure

gpu-exporter/
├── main.go                        # Main entry point
├── collector/
│   ├── nvidia.go                  # nvidia-smi wrapper
│   └── collector.go               # Prometheus collector
├── grafana/
│   ├── gpu-dashboard.json         # Full-featured Grafana dashboard
│   ├── gpu-memory-dashboard.json  # Simplified dashboard
│   └── WORKDIR_TROUBLESHOOTING.md # Working directory troubleshooting
├── prometheus/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── alerts.yml                 # Alert rules
│   ├── gpu-servers.yml.example    # File SD example (YAML)
│   ├── gpu-servers.json.example   # File SD example (JSON)
│   ├── PROMETHEUS_SETUP.md        # Detailed setup guide
│   ├── INTEGRATION_GUIDE.md       # Integration guide
│   └── README.md                  # Prometheus configuration README
├── systemd/
│   └── gpu-exporter.service       # Systemd service file
├── go.mod                         # Go module dependencies
├── go.sum                         # Go module checksums
├── Makefile                       # Build and installation scripts
├── REQUIREMENTS.md                # Requirements documentation
├── README.md                      # English documentation
└── README_zh.md                   # Chinese documentation

Build

go build -o gpu-exporter .

Test

go test ./...

Cross-compile

# Linux AMD64
GOOS=linux GOARCH=amd64 go build -o gpu-exporter-linux-amd64 .

# Linux ARM64
GOOS=linux GOARCH=arm64 go build -o gpu-exporter-linux-arm64 .

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

yuwen

Acknowledgments