Monitoring NVIDIA GPUs
Overview
MetricsHub monitors NVIDIA GPUs to help detect hardware and performance issues that may impact availability and stability in AI, HPC, and accelerated computing environments.
MetricsHub collects NVIDIA GPU metrics related to:
- GPU health and operational status
- GPU utilization and memory usage
- Power consumption and power limits
- Temperature and thermal sensors
- GPU inventory information (model, UUID, serial number)
- GPU-level performance indicators.
MetricsHub provides flexible GPU monitoring using either REST-based or CLI-based connectors:
- NVIDIA DGX (REST) enables GPU monitoring on NVIDIA DGX platforms using REST APIs.
- NVIDIA SMI provides generic GPU monitoring on any NVIDIA-enabled server by executing the nvidia-smi command.
Configuration
In the examples below, MetricsHub is configured to monitor NVIDIA GPUs using either REST (for DGX systems) or NVIDIA SMI.
Note: For optimal performance, it is recommended to explicitly specify the connector(s) to use rather than letting MetricsHub automatically detect the most suitable one(s).
Monitoring NVIDIA GPUs via REST (DGX - Enterprise Edition)
Copy and paste the following configuration into the config/metricshub.yaml file:
resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Replace with the actual host name
          host.type: management
        connectors: [ +NvidiaDGXREST ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # Change if the REST API listens on a different port
            username: <USERNAME> # Replace with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt
Replace <RESOURCE_GROUP>, <HOSTNAME-ID>, <HOSTNAME>, <USERNAME>, and <PASSWORD> with actual values.
Monitoring NVIDIA GPUs via SMI (Generic - Community & Enterprise Editions)
This method applies to any Windows or Linux system on which the NVIDIA drivers are installed and nvidia-smi is available, including DGX systems. Use it when REST APIs are unavailable or not desired.
Linux Systems
Copy and paste the following configuration into the config/metricshub.yaml file:
resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Replace with the actual host name
          host.type: linux
        connectors: [ +NvidiaSmi ] # Optional, to load only this connector
        protocols:
          ssh:
            username: <USERNAME> # Replace with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt
Replace <RESOURCE_GROUP>, <HOSTNAME-ID>, <HOSTNAME>, <USERNAME>, and <PASSWORD> with actual values.
Windows Systems
Copy and paste the following configuration into the config/metricshub.yaml file:
resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Replace with the actual host name
          host.type: win
        connectors: [ +NvidiaSmi ] # Optional, to load only this connector
        protocols:
          wmi:
            username: <USERNAME> # Replace with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt
Replace <RESOURCE_GROUP>, <HOSTNAME-ID>, <HOSTNAME>, <USERNAME>, and <PASSWORD> with actual values.
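The REST and SMI methods can also be used side by side. The following is a minimal sketch, not a verified configuration: it assumes that several resources may be declared under the same resources key of a single resource group, and it reuses only the settings shown in the examples on this page. The <DGX-...> and <LINUX-...> placeholders are hypothetical names introduced here for illustration.

```yaml
resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      # Hypothetical entry: a DGX system monitored through its REST API
      <DGX-HOSTNAME-ID>:
        attributes:
          host.name: <DGX-HOSTNAME>
          host.type: management
        connectors: [ +NvidiaDGXREST ]
        protocols:
          http:
            https: true
            port: 443 # Assumed default HTTPS port; adjust if needed
            username: <DGX-USERNAME>
            password: <DGX-PASSWORD> # Encrypted using metricshub-encrypt
      # Hypothetical entry: a Linux server monitored through nvidia-smi over SSH
      <LINUX-HOSTNAME-ID>:
        attributes:
          host.name: <LINUX-HOSTNAME>
          host.type: linux
        connectors: [ +NvidiaSmi ]
        protocols:
          ssh:
            username: <LINUX-USERNAME>
            password: <LINUX-PASSWORD> # Encrypted using metricshub-encrypt
```

Each resource keeps its own connector and protocol settings, so the REST connector is loaded only for the DGX entry and the SMI connector only for the Linux entry.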