Version: Next

Cluster device allocation endpoint

You can get an overview of cluster device allocation and limits by visiting {scheduler node ip}:31993/metrics, or by adding that address to Prometheus as a scrape endpoint. For example:

curl {scheduler node ip}:31993/metrics
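
The same request can be made from a script. The following is a minimal sketch using only the Python standard library; the helper names `metrics_url` and `fetch_metrics` are hypothetical, and plain HTTP is assumed because the curl command above gives no scheme.

```python
from urllib.request import urlopen


def metrics_url(scheduler_ip: str, port: int = 31993) -> str:
    """Build the metrics URL. 31993 is the port used in the curl command above.

    Assumption: the endpoint is served over plain HTTP.
    """
    return f"http://{scheduler_ip}:{port}/metrics"


def fetch_metrics(scheduler_ip: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition served by the scheduler."""
    with urlopen(metrics_url(scheduler_ip), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (requires network access to the scheduler node):
#   print(fetch_metrics("<scheduler node ip>"))
```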

The endpoint exposes the following metrics:

| Metric | Description | Example |
| --- | --- | --- |
| hami_gpu_core_limit_ratio | Device core limit for a certain GPU | {device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"} 100 |
| hami_gpu_memory_limit_bytes | Device memory limit for a certain GPU | {device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"} 3.4359738368e+10 |
| hami_gpu_core_allocated_ratio | Device core allocated for a certain GPU | {device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"} 45 |
| hami_gpu_memory_allocated_bytes | Device memory allocated for a certain GPU | {device_cores="0",device_index="0",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",zone="vGPU"} 3.221225472e+09 |
| hami_gpu_shared_count | Number of containers sharing this GPU | {device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"} 1 |
| hami_vgpu_core_allocated_ratio | vGPU core allocated from a container | {container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"} 50 |
| hami_vgpu_memory_allocated_bytes | vGPU memory allocated from a container | {container_index="Ascend310P",device_uuid="aio-node74-arm-Ascend310P-0",node="aio-node74-arm",pod="ascend310p-pod",namespace="default",zone="vGPU"} 3.221225472e+09 |
| hami_resource_quota_used | ResourceQuota usage for a certain device | {quota_name="nvidia.com/gpucores", namespace="default",limit="200",zone="vGPU"} 100 |
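
As an illustration of how these series can be combined, the sketch below parses two of the example samples listed above and computes, per GPU, the fraction of the core limit that has been allocated. It is a simplified parser written for this example, not an official client: it only handles lines of the form `name{labels} value`, and the function names are hypothetical.

```python
import re

# The two core-related example samples from the metrics list above.
SAMPLE = """\
hami_gpu_core_limit_ratio{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"} 100
hami_gpu_core_allocated_ratio{device_index="0",device_uuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",node="aio-node67",zone="vGPU"} 45
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')          # key="value"


def parse_samples(text):
    """Yield (metric name, labels dict, float value) for each labelled sample."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = LINE_RE.match(line)
        if match is None:  # simplification: unlabelled samples are ignored
            continue
        name, raw_labels, value = match.groups()
        yield name, dict(LABEL_RE.findall(raw_labels)), float(value)


def core_allocation_fraction(text):
    """Map (node, device_uuid) to allocated core / core limit."""
    limits, allocated = {}, {}
    for name, labels, value in parse_samples(text):
        key = (labels.get("node"), labels.get("device_uuid"))
        if name == "hami_gpu_core_limit_ratio":
            limits[key] = value
        elif name == "hami_gpu_core_allocated_ratio":
            allocated[key] = value
    # GPUs with no allocation sample count as 0; zero limits are skipped.
    return {k: allocated.get(k, 0.0) / v for k, v in limits.items() if v > 0}


for (node, uuid), fraction in core_allocation_fraction(SAMPLE).items():
    print(f"{node} {uuid}: {fraction:.0%} of core limit allocated")
```

The same join on `node` and `device_uuid` would apply to hami_gpu_memory_limit_bytes and hami_gpu_memory_allocated_bytes.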

If you are using HAMi DRA, the metrics will be:

| Metric | Description | Example |
| --- | --- | --- |
| GPUDeviceCoreLimit | Device core limit for a certain GPU | {devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"} 100 |
| GPUDeviceMemoryLimit | Device memory limit for a certain GPU | {devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"} 8192 |
| GPUDeviceCoreAllocated | Device core allocated for a certain GPU | {devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"} 0 |
| GPUDeviceMemoryAllocated | Device memory allocated for a certain GPU | {devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-1",deviceproductname="Tesla P4",deviceuuid="GPU-3ab1-179d-d6dd",nodeid="k8s-node01"} 0 |
| vGPUDeviceCoreAllocated | vGPU core allocated from a container | {devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-0",deviceproductname="Tesla P4",deviceuuid="GPU-82be-83fe-3068",nodeid="k8s-node01",podname="pod-0",podnamespace="default"} 100 |
| vGPUDeviceMemoryAllocated | vGPU memory allocated from a container | {devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-0",deviceproductname="Tesla P4",deviceuuid="GPU-82be-83fe-3068",nodeid="k8s-node01",podname="pod-0",podnamespace="default"} 4000 |
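
The per-container DRA series carry `podname` and `podnamespace` labels, so they can be summed per pod. The sketch below does this for the two vGPU example samples above. The function name and the simplified line parser are assumptions made for illustration, and the values are summed as-is because the source does not state their units.

```python
import re
from collections import defaultdict

# The two per-container DRA example samples from the list above.
SAMPLE = """\
vGPUDeviceCoreAllocated{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-0",deviceproductname="Tesla P4",deviceuuid="GPU-82be-83fe-3068",nodeid="k8s-node01",podname="pod-0",podnamespace="default"} 100
vGPUDeviceMemoryAllocated{devicebrand="Tesla",deviceidx="0",devicename="hami-gpu-0",deviceproductname="Tesla P4",deviceuuid="GPU-82be-83fe-3068",nodeid="k8s-node01",podname="pod-0",podnamespace="default"} 4000
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')          # key="value"

# Metric name -> short key used in the per-pod totals.
FIELDS = {
    "vGPUDeviceCoreAllocated": "core",
    "vGPUDeviceMemoryAllocated": "memory",
}


def per_pod_allocation(text):
    """Sum vGPU core and memory allocations per (namespace, pod)."""
    totals = defaultdict(lambda: {"core": 0.0, "memory": 0.0})
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:  # skips blanks, comments, and unlabelled samples
            continue
        name, raw_labels, value = match.groups()
        if name not in FIELDS:
            continue
        labels = dict(LABEL_RE.findall(raw_labels))
        pod = (labels.get("podnamespace"), labels.get("podname"))
        totals[pod][FIELDS[name]] += float(value)
    return dict(totals)


for (namespace, pod), usage in per_pod_allocation(SAMPLE).items():
    print(f"{namespace}/{pod}: core={usage['core']:g} memory={usage['memory']:g}")
```
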
Note: This is an overview of device allocation, NOT real-time device usage metrics. For those, see real-time device usage.

HAMi is a CNCF Sandbox project