nvidia-smi 技巧
查询nvlink互联拓扑:
nvidia-smi topo -m
以下是几个例子
双卡1080ti,普通家用主板:
yxc@hua-System-Product-Name:~$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X PHB 0-11 N/A
GPU1 PHB X 0-11 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
输出结果显示了两个GPU(GPU0和GPU1)的拓扑关系,以及它们与CPU的亲和性(CPU Affinity)。
GPU0 和 GPU1 列表示不同的GPU设备。
X 表示自己,即GPU0对应的是自己,不需要与自己建立连接。
PHB 表示连接是通过一个PCIe主机桥(PCIe Host Bridge)完成的,通常这个桥接器是CPU。这意味着两块GPU通过PCIe总线互联,且它们可能直接连接到CPU上的不同PCIe接口上。
CPU Affinity 列显示的是CPU的亲和性,即哪些CPU核心(逻辑处理器)与各个GPU相连。在这个例子中,GPU0和GPU1都与CPU上的核心0到11相连,意味着两个GPU可以被分配到任何这些CPU核上运行任 务。
NUMA Affinity 列显示了GPU与NUMA(Non-Uniform Memory Access)节点的亲和性,但在这里显示为 N/A(不适用),可能是因为使用的主板不支持NUMA或者CPU架构不是NUMA的。
又解释了Legend有哪些:
SYS:连接跨越PCIe以及NUMA节点之间的SMP互连(例如QPI/UPI)。
NODE:连接跨越PCIe以及一个NUMA节点内的PCIe主机桥之间的互连。
PHB:连接跨越PCIe以及一个PCIe主机桥。
PXB:连接跨越多个PCIe桥,但没有跨越PCIe主机桥。
PIX:连接最多跨越一个PCIe桥。
NV#:连接跨越一组绑定的#个NVLink通道。
8卡A100,nvlink:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-103 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-103 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-103 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-103 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS 0-103 0 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS 0-103 0 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS 0-103 0 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PXB PXB SYS 0-103 0 N/A
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PIX X PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 PXB PXB SYS SYS SYS SYS SYS SYS PXB PXB X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC3 PXB PXB SYS SYS SYS SYS SYS SYS PXB PXB PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS X PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC5 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PIX X PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC7 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC8 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX PXB PXB SYS SYS SYS SYS SYS
NIC9 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X PXB PXB SYS SYS SYS SYS SYS
NIC10 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB X PIX SYS SYS SYS SYS SYS
NIC11 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX X SYS SYS SYS SYS SYS
NIC12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX PXB PXB SYS
NIC13 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X PXB PXB SYS
NIC14 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB X PIX SYS
NIC15 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX X SYS
NIC16 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_2
NIC1: mlx5_3
NIC2: mlx5_4
NIC3: mlx5_5
NIC4: mlx5_6
NIC5: mlx5_7
NIC6: mlx5_8
NIC7: mlx5_9
NIC8: mlx5_10
NIC9: mlx5_11
NIC10: mlx5_12
NIC11: mlx5_13
NIC12: mlx5_14
NIC13: mlx5_15
NIC14: mlx5_16
NIC15: mlx5_17
NIC16: mlx5_bond_0
GPU0 到 GPU7 列表示8个GPU设备。
NIC0 则表示网络接口卡,这些都是具体的网络设备。
NV12 表示两个GPU之间通过12条NVLink连接。每个GPU都与其他所有GPU通过NVLink相连,这表明了一个高密度、高速的GPU网络,非常适合执行并行计算密集型任务。
PXB 表示连接跨越了多个PCIe桥(但没有跨越PCIe主机桥),PIX 则表示连接最多跨越了一个PCIe桥。这些连接类型通常用于描述GPU与NICs之间的关系。
SYS 表示连接跨越了PCIe以及SMP互连,包括NUMA节点之间的连接(例如QPI/UPI)。SYS 连接类型通常表明连接是通过系统总线,可能涉及更长的物理距离和潜在的更高延迟。
CPU Affinity 列显示了所有GPU都与CPU的逻辑处理器0到103相连,表示这个系统有多核CPU,并且所有GPU都可以与任何CPU核心通信。
NUMA Affinity 列显示了所有GPU都与NUMA节点0相关联,并且 GPU NUMA ID 列为 N/A,这可能意味着系统可能不支持或没有启用NUMA或者所有的GPU都属于同一个NUMA节点。
NIC Legend 提供了关于网络接口卡的额外信息,说明每个NIC的型号。
总结来说,在这个8卡A100 GPU系统中,所有GPU都通过NVLink互连,并能够与CPU上的所有核心通信。网络接口卡通过PCIe连接到系统
4卡A10, 无nvlink
$nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS SYS 0-127 0 N/A
GPU1 SYS X SYS SYS SYS SYS 0-127 0 N/A
GPU2 SYS SYS X SYS SYS SYS 0-127 0 N/A
GPU3 SYS SYS SYS X SYS SYS 0-127 0 N/A
NIC0 SYS SYS SYS SYS X PIX
NIC1 SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
GPU0 到 GPU3 列分别表示系统中的四个GPU设备。
NIC0 和 NIC1 表示网络接口卡,这些是连接到系统网络的设备。在这个例子中,它们是Mellanox的网络接口卡(由NIC Legend说明)。
SYS 表示连接跨越了PCIe以及可能的NUMA节点之间的SMP互连(例如QPI/UPI)。在这个例子中,所有的GPU与其他GPU之间的连接都是 SYS 类型,这意味着它们都通过PCIe总线连接但没有使用高速NVLink。
NUMA Affinity 列显示了所有GPU都与NUMA节点0相关联。GPU NUMA ID 列显示为 N/A,这可能意味着系统可能不支持或没有启用NUMA,或者所有GPU都属于同一个NUMA节点。
NIC Legend 提供了有关网络接口卡的额外信息,mlx5_0 和 mlx5_1 表示两个Mellanox的网络接口卡。
总结来说,这个系统中的四块A10 GPU是通过系统的PCIe总线连接的,并且与CPU的所有核心都有通信能力。虽然它们之间没有NVLink,但这样的配置对于一些不需要密集GPU间通信的计算任务来说是足够的。
显示内存、gpu的占用率,动态显示成列表形式,并输出到csv表格中
lms是毫秒为粒度刷新 如果秒的话,参数设置为l就可以了
nvidia-smi -lms --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv | tee gpu-log.csv
$nvidia-smi -lms --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv | tee gpu-log.csv
timestamp, pstate, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2024/01/16 14:44:16.977, P8, 25, 0 %, 0 %, 23028 MiB, 22494 MiB, 21 MiB
2024/01/16 14:44:16.981, P8, 24, 0 %, 0 %, 23028 MiB, 22508 MiB, 7 MiB
2024/01/16 14:44:16.986, P8, 23, 0 %, 0 %, 23028 MiB, 22508 MiB, 7 MiB
2024/01/16 14:44:16.990, P8, 26, 0 %, 0 %, 23028 MiB, 22508 MiB, 7 MiB
2024/01/16 14:44:17.095, P8, 25, 0 %, 0 %, 23028 MiB, 22494 MiB, 21 MiB
PCIE带宽
nvidia-smi dmon -i 0 -s mutc -d 1 -o TD
$nvidia-smi dmon -i 0 -s mutc -d 1 -o TD
#Date Time gpu fb bar1 ccpm sm mem enc dec jpg ofa rxpci txpci mclk pclk
#YYYYMMDD HH:MM:SS Idx MB MB MB % % % % % % MB/s MB/s MHz MHz
20240104 13:54:35 0 239 3 0 6 1 0 0 0 0 5 10 6250 1695
20240104 13:54:36 0 74 10 0 0 0 0 0 0 0 217 677 6250 1695
20240104 13:54:37 0 7 2 0 3 0 0 0 0 0 0 3 6250 1695
20240104 13:54:38 0 478 6 0 13 0 0 0 0 0 44 30 6250 1695
20240104 13:54:39 0 67 10 0 0 0 0 0 0 0 30 40 6250 1695
20240104 13:54:40 0 7 2 0 3 0 0 0 0 0 0 0 6250 1695
如果需要监控其他的指标,可以用这个命令查看支持的属性
命令:nvidia-smi --help-query-gpu
$nvidia-smi --help-query-gpu
List of valid properties to query for the switch "--query-gpu":
"timestamp"
The timestamp of when the query was made in format "YYYY/MM/DD HH:MM:SS.msec".
"driver_version"
The version of the installed NVIDIA display driver. This is an alphanumeric string.
Section about vgpu_driver_capability properties
Retrieves information about driver level caps.
"vgpu_driver_capability.heterogenous_multivGPU"
Whether heterogeneuos multi-vGPU is supported by driver.
"count"
The number of NVIDIA GPUs in the system.
"name" or "gpu_name"
The official product name of the GPU. This is an alphanumeric string. For all products.
"serial" or "gpu_serial"
This number matches the serial number physically printed on each board. It is a globally unique immutable alphanumeric value.
"uuid" or "gpu_uuid"
This value is the globally unique immutable alphanumeric identifier of the GPU. It does not correspond to any physical label on the board.
"pci.bus_id" or "gpu_bus_id"
PCI bus id as "domain:bus:device.function", in hex.
"pci.domain"
PCI domain number, in hex.
"pci.bus"
PCI bus number, in hex.
"pci.device"
PCI device number, in hex.
"pci.device_id"
PCI vendor device id, in hex
"pci.sub_device_id"
PCI Sub System id, in hex
Section about vgpu_device_capability properties
Retrieves information about device level caps.
"vgpu_device_capability.fractional_multiVgpu"
Fractional vGPU profiles on this GPU can be used in multi-vGPU configurations.
"vgpu_device_capability.heterogeneous_timeSlice_profile"
Supports concurrent execution of timesliced vGPU profiles of differing types.
"vgpu_device_capability.heterogeneous_timeSlice_sizes"
Supports concurrent execution of timesliced vGPU profiles of differing framebuffer sizes.
"pcie.link.gen.current"
The current PCI-E link generation. These may be reduced when the GPU is not in use. Deprecated, use pcie.link.gen.gpucurrent instead.
"pcie.link.gen.gpucurrent"
The current PCI-E link generation. These may be reduced when the GPU is not in use.
"pcie.link.gen.max"
The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports then this reports the system PCIe generation.
"pcie.link.gen.gpumax"
The maximum PCI-E link generation supported by this GPU.
"pcie.link.gen.hostmax"
The maximum PCI-E link generation supported by the root port corresponding to this GPU.
"pcie.link.width.current"
The current PCI-E link width. These may be reduced when the GPU is not in use.
"pcie.link.width.max"
The maximum PCI-E link width possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports then this reports the system PCIe generation.
"index"
Zero based index of the GPU. Can change at each boot.
"display_mode"
A flag that indicates whether a physical display (e.g. monitor) is currently connected to any of the GPU's connectors. "Enabled" indicates an attached display. "Disabled" indicates otherwise.
"display_active"
A flag that indicates whether a display is initialized on the GPU's (e.g. memory is allocated on the device for display). Display can be active even when no monitor is physically attached. "Enabled" indicates an active display. "Disabled" indicates otherwise.
"persistence_mode"
A flag that indicates whether persistence mode is enabled for the GPU. Value is either "Enabled" or "Disabled". When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist. This minimizes the driver load latency associated with running dependent apps, such as CUDA programs. Linux only.
"addressing_mode"
A flag that indicates the type of addressing mode enabled for the GPU. Value is either "HMM" or "ATS" or "None". When the mode is HMM, system allocated memory (malloc, mmap) is addressable from the device (GPU), via software-based mirroring of the CPU's page tables, on the GPU. When the mode is ATS, system allocated memory (malloc, mmap) is addressable from the device (GPU), via Address Translation Services. This means that there is (effectively) a single set of page tables, and the CPU and GPU both use them. The mode is None when neither HMM nor ATS is active. Linux only.
"accounting.mode"
A flag that indicates whether accounting mode is enabled for the GPU. Value is either "Enabled" or "Disabled". When accounting is enabled statistics are calculated for each compute process running on the GPU.Statistics can be queried during the lifetime or after termination of the process.The execution time of process is reported as 0 while the process is in running state and updated to actualexecution time after the process has terminated. See --help-query-accounted-apps for more info.
"accounting.buffer_size"
The size of the circular buffer that holds list of processes that can be queried for accounting stats. This is the maximum number of processes that accounting information will be stored for before information about oldest processes will get overwritten by information about new processes.
Section about driver_model properties
On Windows, the TCC and WDDM driver models are supported. The driver model can be changed with the (-dm) or (-fdm) flags. The TCC driver model is optimized for compute applications. I.E. kernel launch times will be quicker with TCC. The WDDM driver model is designed for graphics applications and is not recommended for compute applications. Linux does not support multiple driver models, and will always have the value of "N/A". Only for selected products. Please see feature matrix in NVML documentation.
"driver_model.current"
The driver model currently in use. Always "N/A" on Linux.
"driver_model.pending"
The driver model that will be used on the next reboot. Always "N/A" on Linux.
"vbios_version"
The BIOS of the GPU board.
Section about inforom properties
Version numbers for each object in the GPU board's inforom storage. The inforom is a small, persistent store of configuration and state data for the GPU. All inforom version fields are numerical. It can be useful to know these version numbers because some GPU features are only available with inforoms of a certain version or higher.
"inforom.img" or "inforom.image"
Global version of the infoROM image. Image version just like VBIOS version uniquely describes the exact version of the infoROM flashed on the board in contrast to infoROM object version which is only an indicator of supported features.
"inforom.oem"
Version for the OEM configuration data.
"inforom.ecc"
Version for the ECC recording data.
"inforom.pwr" or "inforom.power"
Version for the power management data.
Section about reset_status properties
GPU reset status information. Reports if there is a GPU reset required or drain and reset recommended to recover from a bad state. 'N/A' indicates that the field is not supported on the current device or device configuration. An error message indicates that retrieving the field failed.
"reset_status.reset_required"
Checks if a GPU reset is required.
"reset_status.drain_and_reset_recommended"
Checks if a GPU drain and reset is recommended.
Section about gom properties
GOM allows to reduce power usage and optimize GPU throughput by disabling GPU features. Each GOM is designed to meet specific user needs.
In "All On" mode everything is enabled and running at full speed.
The "Compute" mode is designed for running only compute tasks. Graphics operations are not allowed.
The "Low Double Precision" mode is designed for running graphics applications that don't require high bandwidth double precision.
GOM can be changed with the (--gom) flag.
"gom.current" or "gpu_operation_mode.current"
The GOM currently in use.
"gom.pending" or "gpu_operation_mode.pending"
The GOM that will be used on the next reboot.
"fan.speed"
The fan speed value is the percent of the product's maximum noise tolerance fan speed that the device's fan is currently intended to run at. This value may exceed 100% in certain cases. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
"pstate"
The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
Section about clocks_event_reasons properties
Retrieves information about factors that are reducing the frequency of clocks. If all event reasons are returned as "Not Active" it means that clocks are running as high as possible.
"clocks_event_reasons.supported" or "clocks_throttle_reasons.supported"
Bitmask of supported clock event reasons. See nvml.h for more details.
"clocks_event_reasons.active" or "clocks_throttle_reasons.active"
Bitmask of active clock event reasons. See nvml.h for more details.
"clocks_event_reasons.gpu_idle" or "clocks_throttle_reasons.gpu_idle"
Nothing is running on the GPU and the clocks are dropping to Idle state. This limiter may be removed in a later release.
"clocks_event_reasons.applications_clocks_setting" or "clocks_throttle_reasons.applications_clocks_setting"
GPU clocks are limited by applications clocks setting. E.g. can be changed by nvidia-smi --applications-clocks=
"clocks_event_reasons.sw_power_cap" or "clocks_throttle_reasons.sw_power_cap"
SW Power Scaling algorithm is reducing the clocks below requested clocks because the GPU is consuming too much power. E.g. SW power cap limit can be changed with nvidia-smi --power-limit=
"clocks_event_reasons.hw_slowdown" or "clocks_throttle_reasons.hw_slowdown"
HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of:
HW Thermal Slowdown: temperature being too high
HW Power Brake Slowdown: External Power Brake Assertion is triggered (e.g. by the system power supply)
* Power draw is too high and Fast Trigger protection is reducing the clocks
* May be also reported during PState or clock change
* This behavior may be removed in a later release
"clocks_event_reasons.hw_thermal_slowdown" or "clocks_throttle_reasons.hw_thermal_slowdown"
HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of temperature being too high
"clocks_event_reasons.hw_power_brake_slowdown" or "clocks_throttle_reasons.hw_power_brake_slowdown"
HW Power Brake Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of External Power Brake Assertion being triggered (e.g. by the system power supply)
"clocks_event_reasons.sw_thermal_slowdown" or "clocks_throttle_reasons.sw_thermal_slowdown"
SW Thermal capping algorithm is reducing clocks below requested clocks because GPU temperature is higher than Max Operating Temp.
"clocks_event_reasons.sync_boost" or "clocks_throttle_reasons.sync_boost"
Sync Boost This GPU has been added to a Sync boost group with nvidia-smi or DCGM in
* order to maximize performance per watt. All GPUs in the sync boost group
* will boost to the minimum possible clocks across the entire group. Look at
* the event reasons for other GPUs in the system to see why those GPUs are
* holding this one at lower clocks.
Section about memory properties
On-board memory information. Reported total memory is affected by ECC state. If ECC is enabled the total available memory is decreased by several percent, due to the requisite parity bits. The driver may also reserve a small amount of memory for internal use, even without active work on the GPU.
"memory.total"
Total installed GPU memory.
"memory.reserved"
Total memory reserved by the NVIDIA driver and firmware.
"memory.used"
Total memory allocated by active contexts.
"memory.free"
Total free memory.
"compute_mode"
The compute mode flag indicates whether individual or multiple compute applications may run on the GPU.
"0: Default" means multiple contexts are allowed per device.
"1: Exclusive_Thread", deprecated, use Exclusive_Process instead
"2: Prohibited" means no contexts are allowed per device (no compute apps).
"3: Exclusive_Process" means only one context is allowed per device, usable from multiple threads at a time.
"compute_cap"
The CUDA Compute Capability, represented as Major DOT Minor.
Section about utilization properties
Utilization rates report how busy each GPU is over time, and can be used to determine how much an application is using the GPUs in the system.
Note: On MIG-enabled GPUs, querying the utilization of encoder, decoder, jpeg, ofa, gpu, and memory is not currently supported.
"utilization.gpu"
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
The sample period may be between 1 second and 1/6 second depending on the product.
"utilization.memory"
Percent of time over the past sample period during which global (device) memory was being read or written.
The sample period may be between 1 second and 1/6 second depending on the product.
"utilization.encoder"
Percent of time over the past sample period during which one or more kernels was executing on the Encoder Engine.
The sample period may be between 1 second and 1/6 second depending on the product.
"utilization.decoder"
Percent of time over the past sample period during which one or more kernels was executing on the Decoder Engine.
The sample period may be between 1 second and 1/6 second depending on the product.
"utilization.jpeg"
Percent of time over the past sample period during which one or more kernels was executing on the Jpeg Engine.
The sample period may be between 1 second and 1/6 second depending on the product.
"utilization.ofa"
Percent of time over the past sample period during which one or more kernels was executing on the Optical Flow Accelerator Engine.
The sample period may be between 1 second and 1/6 second depending on the product.
Section about encoder.stats properties
Encoder stats report number of encoder sessions, average FPS and average latency in us for given GPUs in the system.
"encoder.stats.sessionCount"
Number of encoder sessions running on the GPU.
"encoder.stats.averageFps"
Average FPS of all sessions running on the GPU.
"encoder.stats.averageLatency"
Average latency in microseconds of all sessions running on the GPU.
Section about ecc.mode properties
A flag that indicates whether ECC support is enabled. May be either "Enabled" or "Disabled". Changes to ECC mode require a reboot. Requires Inforom ECC object version 1.0 or higher.
"ecc.mode.current"
The ECC mode that the GPU is currently operating under.
"ecc.mode.pending"
The ECC mode that the GPU will operate under after the next reboot.
Section about ecc.errors properties
NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC errors are either single or double bit, where single bit errors are corrected and double bit errors are uncorrectable. Texture memory errors may be correctable via resend or uncorrectable if the resend fails. These errors are available across two timescales (volatile and aggregate). Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected. Please see the ECC documents on the web for information on compute application behavior when double bit errors occur. Volatile error counters track the number of errors detected since the last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter.
"ecc.errors.corrected.volatile.device_memory"
Errors detected in global device memory.
"ecc.errors.corrected.volatile.dram"
Errors detected in global device memory.
"ecc.errors.corrected.volatile.register_file"
Errors detected in register file memory.
"ecc.errors.corrected.volatile.l1_cache"
Errors detected in the L1 cache.
"ecc.errors.corrected.volatile.l2_cache"
Errors detected in the L2 cache.
"ecc.errors.corrected.volatile.texture_memory"
Parity errors detected in texture memory.
"ecc.errors.corrected.volatile.cbu"
Parity errors detected in CBU.
"ecc.errors.corrected.volatile.sram"
Errors detected in global SRAMs.
"ecc.errors.corrected.volatile.total"
Total errors detected across entire chip.
"ecc.errors.corrected.aggregate.device_memory"
Errors detected in global device memory.
"ecc.errors.corrected.aggregate.dram"
Errors detected in global device memory.
"ecc.errors.corrected.aggregate.register_file"
Errors detected in register file memory.
"ecc.errors.corrected.aggregate.l1_cache"
Errors detected in the L1 cache.
"ecc.errors.corrected.aggregate.l2_cache"
Errors detected in the L2 cache.
"ecc.errors.corrected.aggregate.texture_memory"
Parity errors detected in texture memory.
"ecc.errors.corrected.aggregate.cbu"
Parity errors detected in CBU.
"ecc.errors.corrected.aggregate.sram"
Errors detected in global SRAMs.
"ecc.errors.corrected.aggregate.total"
Total errors detected across entire chip.
"ecc.errors.uncorrected.volatile.device_memory"
Errors detected in global device memory.
"ecc.errors.uncorrected.volatile.dram"
Errors detected in global device memory.
"ecc.errors.uncorrected.volatile.register_file"
Errors detected in register file memory.
"ecc.errors.uncorrected.volatile.l1_cache"
Errors detected in the L1 cache.
"ecc.errors.uncorrected.volatile.l2_cache"
Errors detected in the L2 cache.
"ecc.errors.uncorrected.volatile.texture_memory"
Parity errors detected in texture memory.
"ecc.errors.uncorrected.volatile.cbu"
Parity errors detected in CBU.
"ecc.errors.uncorrected.volatile.sram"
Errors detected in global SRAMs.
"ecc.errors.uncorrected.volatile.total"
Total errors detected across entire chip.
"ecc.errors.uncorrected.aggregate.device_memory"
Errors detected in global device memory.
"ecc.errors.uncorrected.aggregate.dram"
Errors detected in global device memory.
"ecc.errors.uncorrected.aggregate.register_file"
Errors detected in register file memory.
"ecc.errors.uncorrected.aggregate.l1_cache"
Errors detected in the L1 cache.
"ecc.errors.uncorrected.aggregate.l2_cache"
Errors detected in the L2 cache.
"ecc.errors.uncorrected.aggregate.texture_memory"
Parity errors detected in texture memory.
"ecc.errors.uncorrected.aggregate.cbu"
Parity errors detected in CBU.
"ecc.errors.uncorrected.aggregate.sram"
Errors detected in global SRAMs.
"ecc.errors.uncorrected.aggregate.total"
Total errors detected across entire chip.
Section about retired_pages properties
NVIDIA GPUs can retire pages of GPU device memory when they become unreliable. This can happen when multiple single bit ECC errors occur for the same page, or on a double bit ECC error. When a page is retired, the NVIDIA driver will hide it such that no driver, or application memory allocations can access it.
"retired_pages.single_bit_ecc.count" or "retired_pages.sbe"
The number of GPU device memory pages that have been retired due to multiple single bit ECC errors.
"retired_pages.double_bit.count" or "retired_pages.dbe"
The number of GPU device memory pages that have been retired due to a double bit ECC error.
"retired_pages.pending"
Checks if any GPU device memory pages are pending retirement on the next reboot. Pages that are pending retirement can still be allocated, and may cause further reliability issues.
"temperature.gpu"
Core GPU temperature. in degrees C.
"temperature.gpu.tlimit"
GPU T.Limit temperature. in degrees C.
"temperature.memory"
HBM memory temperature. in degrees C.
"power.management"
A flag that indicates whether power management is enabled. Either "Supported" or "[Not Supported]". Requires Inforom PWR object version 3.0 or higher or Kepler device.
"power.draw"
The last measured power draw for the entire board, in watts. On Ampere or newer devices, returns average power draw over 1 sec. On older devices, returns instantaneous power draw. Only available if power management is supported. This reading is accurate to within +/- 5 watts.
"power.draw.average"
The last measured average power draw for the entire board, in watts. Only available if power management is supported and Ampere (except GA100) or newer devices. This reading is accurate to within +/- 5 watts.
"power.draw.instant"
The last measured instant power draw for the entire board, in watts. Only available if power management is supported. This reading is accurate to within +/- 5 watts.
"power.limit"
The software power limit in watts. Set by software like nvidia-smi. On Kepler devices Power Limit can be adjusted using [-pl | --power-limit=] switches.
"enforced.power.limit"
The power management algorithm's power ceiling, in watts. Total board power draw is manipulated by the power management algorithm such that it stays under this value. This value is the minimum of various power limiters.
"power.default_limit"
The default power management algorithm's power ceiling, in watts. Power Limit will be set back to Default Power Limit after driver unload.
"power.min_limit"
The minimum value in watts that power limit can be set to.
"power.max_limit"
The maximum value in watts that power limit can be set to.
"clocks.current.graphics" or "clocks.gr"
Current frequency of graphics (shader) clock.
"clocks.current.sm" or "clocks.sm"
Current frequency of SM (Streaming Multiprocessor) clock.
"clocks.current.memory" or "clocks.mem"
Current frequency of memory clock.
"clocks.current.video" or "clocks.video"
Current frequency of video encoder/decoder clock.
Section about clocks.applications properties
User specified frequency at which applications will be running at. Can be changed with [-ac | --applications-clocks] switches.
"clocks.applications.graphics" or "clocks.applications.gr"
User specified frequency of graphics (shader) clock.
"clocks.applications.memory" or "clocks.applications.mem"
User specified frequency of memory clock.
Section about clocks.default_applications properties
Default frequency at which applications will be running at. Application clocks can be changed with [-ac | --applications-clocks] switches. Application clocks can be set to default using [-rac | --reset-applications-clocks] switches.
"clocks.default_applications.graphics" or "clocks.default_applications.gr"
Default frequency of applications graphics (shader) clock.
"clocks.default_applications.memory" or "clocks.default_applications.mem"
Default frequency of applications memory clock.
Section about clocks.max properties
Maximum frequency at which parts of the GPU are design to run.
"clocks.max.graphics" or "clocks.max.gr"
Maximum frequency of graphics (shader) clock.
"clocks.max.sm" or "clocks.max.sm"
Maximum frequency of SM (Streaming Multiprocessor) clock.
"clocks.max.memory" or "clocks.max.mem"
Maximum frequency of memory clock.
Section about mig.mode properties
A flag that indicates whether MIG mode is enabled. May be either "Enabled" or "Disabled". Changes to MIG mode require a GPU reset.
"mig.mode.current"
The MIG mode that the GPU is currently operating under.
"mig.mode.pending"
The MIG mode that the GPU will operate under after reset.
Section about gsp.mode properties
A flag that indicates whether GSP firmware is enabled.May be either "Enabled" or "Disabled".
"gsp.mode.current"
The current status of GSP firmware.
"gsp.mode.default"
The default status of GSP firmware.
Section about protected_memory properties
On-board protected memory information.
"protected_memory.total"
Total installed GPU conf compute protected memory.
"protected_memory.used"
Total conf compute protected memory allocated by active contexts.
"protected_memory.free"
Total free conf compute protected memory.
"fabric.state"
Current state of GPU fabric registration process.
"fabric.status"
Error status, valid only if gpu fabric registration state is "completed"
nvidia-smi dmon -s m
优点:显示如下,这个fb与nvidia-smi查询的内存占用数量是一样的,具体如下图,
[yxc01841111@NGIS-QTFCR20230280 /home/yxc01841111]
$nvidia-smi dmon -s m
# gpu fb bar1 ccpm
# Idx MB MB MB
0 4 1 0
1 4 1 0
2 4 1 0
3 4 1 0
4 4 1 0
5 4 1 0
6 4 1 0
7 4 1 0
0 4 1 0
1 4 1 0
2 4 1 0
3 4 1 0
4 4 1 0
5 4 1 0
6 4 1 0
7 4 1 0
缺点:1、最小粒度为秒,不能更细粒度显示 2、没有找到gpu利用率的参数,不能完全满足监控的需要
nvidia-smi dmon其他参数解释
指定显示哪些监控指标(默认为puc),其中:
p:电源使用情况和温度(pwr:功耗,temp:温度)
u:GPU使用率(sm:流处理器,mem:显存,enc:编码资源,dec:解码资源)
c:GPU处理器和GPU内存时钟频率(mclk:显存频率,pclk:处理器频率)
v:电源和热力异常
m:FB内存和Bar1内存
e:ECC错误和PCIe重显错误个数
t:PCIe读写带宽 ### nvidia-smi -a 查看所有参数的情况,
$nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Thu Jan 4 14:41:13 2024
Driver Version : 535.129.03
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:2F:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121035405
GPU UUID : GPU-faae5314-b87e-87c0-400b-b7cfb9e7b310
Minor Number : 0
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x2f00
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 4
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x2F
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:2F:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 40 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 52.12 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:35:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121013540
GPU UUID : GPU-20af7308-7522-d66f-fe01-fb6f0551a4b9
Minor Number : 1
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x3500
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 2
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x35
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:35:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 22 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 53.16 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:62:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121015220
GPU UUID : GPU-55be19c4-22e1-4622-c7c5-a617a8001574
Minor Number : 2
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x6200
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 8
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x62
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:62:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 21 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 38 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 51.79 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:67:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121015264
GPU UUID : GPU-8029ef2b-47a3-15f7-cca6-5777783049b4
Minor Number : 3
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x6700
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 6
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x67
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:67:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 44 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 52.33 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 725.000 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:9F:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322921084044
GPU UUID : GPU-a62346e8-6045-319e-845d-0bf10385caaf
Minor Number : 4
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x9f00
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 7
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x9F
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:9F:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 23 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 22 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 53.05 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:A5:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121035431
GPU UUID : GPU-926d0f1b-0ac4-e9cb-ec83-520d975c12cf
Minor Number : 5
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0xa500
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 5
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xA5
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:A5:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 22 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 55.01 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:C8:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121035302
GPU UUID : GPU-aac22a4c-e9dd-6fb5-46e5-0f14b9374044
Minor Number : 6
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0xc800
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 3
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xC8
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:C8:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 23 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 52.17 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Fabric
State : N/A
Status : N/A
Processes : None
GPU 00000000:CD:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322921056183
GPU UUID : GPU-1a011247-3834-ff88-6682-d0776a7d7a1e
Minor Number : 7
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0xcd00
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xCD
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:CD:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 36 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 53.10 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Fabric
State : N/A
Status : N/A
Processes : None
5、nvidia-smi -l 按秒刷新追加显示
nvidia-smi 使用注意点:
nvidia-smi 只是统计 sm 的加权平均, 也就是gpu-utils 的统计
nvidia-smi 的采集粒度偏大
指定板卡id,查看gpu状态,nvidia-smi -i 0
查看gpu 的详细状态信息 nvidia-smi -q nvidia-smi -i 0 -q
$nvidia-smi -i 0 -q
==============NVSMI LOG==============
Timestamp : Thu Jan 4 14:45:46 2024
Driver Version : 535.129.03
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:2F:00.0
Product Name : NVIDIA A100-SXM4-40GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323121035405
GPU UUID : GPU-faae5314-b87e-87c0-400b-b7cfb9e7b310
Minor Number : 0
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x2f00
Board Part Number : 692-2G506-0200-002
GPU Part Number : 20B0-884-A1
FRU Part Number : N/A
Module ID : 4
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x2F
Device : 0x00
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:2F:00.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40960 MiB
Reserved : 620 MiB
Used : 4 MiB
Free : 40334 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 22 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 40 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 52.45 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1215 MHz
Video : 585 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Fabric
State : N/A
Status : N/A
Processes : None
查看gpu 的编码器状态
nvidia-smi -q | grep -i enc
nvidia-smi -i 0 -q | grep -i enc
$nvidia-smi -i 0 -q | grep -i enc
Persistence Mode : Disabled
Encoder : 0 %
Encoder Stats
Average Latency : 0
Average Latency : 0
nvidia-smi dmon 相关
设备监控命令,以滚动条形式显示GPU设备统计信息 nvidia-smi dmon
GPU统计信息以一行的滚动格式显示,要监控的指标可以基于终端窗口的宽度进行调整。 监控所有的GPU
附加选项:
nvidia-smi dmon -i xxx 用逗号分隔GPU索引,PCI总线ID或UUID
nvidia-smi dmon -d xxx 指定刷新时间(默认为1秒)
nvidia-smi dmon -c xxx 显示指定数目的统计信息并退出
nvidia-smi dmon -s xxx 指定显示哪些监控指标(默认为puc),其中: p:电源使用情况和温度(pwr:功耗,temp:温度) u:GPU使用率(sm:流处理器,mem:显存,enc:编码资源,dec:解码资源) c:GPU处理器和GPU内存时钟频率(mclk:显存频率,pclk:处理器频率) v:电源和热力异常 m:FB内存和Bar1内存 e:ECC错误和PCIe重显错误个数 t:PCIe读写带宽
nvidia-smi dmon –o D/T 指定显示的时间格式D:YYYYMMDD,THH:MM:SS
nvidia-smi dmon –f xxx 将查询的信息输出到具体的文件中,不在终端显示
nvidia-smi pmon 相关
进程监控命令,以滚动条形式显示GPU进程状态信息。 nvidia-smi pmon GPU进程统计信息以一行的滚动格式显示,此工具列出了GPU所有进程的统计信息。要监控的指标可以基于终端窗口的宽度进行调整。 附加选项: nvidia-smi pmon -i xxx 用逗号分隔GPU索引,PCI总线ID或UUID
nvidia-smi pmon -d xxx 指定刷新时间(默认为1秒,最大为10秒)
nvidia-smi pmon -c xxx 显示指定数目的统计信息并退出
nvidia-smi pmon -s xxx 指定显示哪些监控指标(默认为u),其中: u:GPU使用率 m:FB内存使用情况