A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?
You are using BCM to configure an active-passive high-availability (HA) cluster for a firewall system. To ensure seamless failover, what is one best practice related to session synchronization between the active and passive nodes?
A system administrator notices that jobs are failing intermittently in a Base Command Manager (BCM) cluster due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly.
How should they troubleshoot this issue?
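A minimal sketch of the kind of check involved, assuming the node name node001 is a placeholder and that scontrol, srun, and nvidia-smi are on the PATH: compare the GRES resources Slurm advertises for the node with the GPUs the driver actually reports, then run a one-GPU test step.

```python
import subprocess

NODE = "node001"  # hypothetical node where jobs have been failing

# What does Slurm think the node offers? Look for the Gres=gpu:... field.
subprocess.run(["scontrol", "show", "node", NODE], check=True)

# What GPUs does the driver actually expose there? A one-task test step also
# confirms that GRES allocation works end to end.
subprocess.run(["srun", "-w", NODE, "--gres=gpu:1", "nvidia-smi", "-L"],
               check=True)

# If the two views disagree, the usual suspects are gres.conf and the
# Gres= entry for that node in slurm.conf.
```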
What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?
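One quick check is the interconnect topology between the GPUs; a minimal sketch, assuming nvidia-smi is installed on the node (the nccl-tests path is a placeholder for wherever all_reduce_perf was built):

```python
import subprocess

# Show how GPUs are connected (NVLink vs. PCIe, NUMA affinity). Links that
# fall back to PCIe or cross NUMA domains are a common cause of slow transfers.
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)

# Optionally measure collective bandwidth with nccl-tests (path is hypothetical).
subprocess.run(["/opt/nccl-tests/build/all_reduce_perf",
                "-b", "8", "-e", "128M", "-f", "2", "-g", "8"], check=False)
```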
A system administrator needs to lower latency for an AI application by utilizing GPUDirect Storage.
What two (2) bottlenecks are avoided with this approach? (Choose two.)
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:ai.
To automate repetitive administrative tasks and manage resources efficiently across multiple nodes, what is essential when using the Run:ai Administrator CLI in environments where automation or scripting is required?
When troubleshooting Slurm job scheduling issues, a common source of problems is jobs getting stuck in a pending state indefinitely.
Which Slurm command can be used to view detailed information about all pending jobs and identify the cause of the delay?
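For reference, a minimal sketch of listing pending jobs together with the scheduler's stated reason (the fields are standard squeue format codes):

```python
import subprocess

# %R prints why a pending job is not running (e.g. Resources, Priority,
# ReqNodeNotAvail); the other fields identify the job.
subprocess.run(
    ["squeue", "--states=PENDING",
     "--format=%.18i %.9P %.12j %.10u %.10M %R"],
    check=True,
)

# For one stubborn job, `scontrol show job <jobid>` gives the full record,
# including the Reason= field and the resources it requested.
```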
A Slurm user needs to display real-time information about the running processes and resource usage of a Slurm job.
Which command should be used?
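A minimal sketch of querying a running job's live usage with sstat (the job ID is a placeholder):

```python
import subprocess

JOB_ID = "12345"  # hypothetical running job

# sstat reports statistics for the steps of a *running* job; for a batch job
# you may need to target the batch step explicitly (e.g. "12345.batch").
subprocess.run(
    ["sstat", "-j", JOB_ID,
     "--format=JobID,AveCPU,AveRSS,MaxRSS,MaxVMSize,NTasks"],
    check=True,
)
```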
A cloud engineer is looking to deploy a digital fingerprinting pipeline using NVIDIA Morpheus and the NVIDIA AI Enterprise Virtual Machine Image (VMI).
Where would the cloud engineer find the VMI?
Your Kubernetes cluster is running a mixture of AI training and inference workloads. You want to ensure that inference services have higher priority than training jobs during peak resource usage.
How would you configure Kubernetes to prioritize inference workloads?
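One common pattern is a PriorityClass referenced by the inference Pods; a minimal sketch, where the class name and value are illustrative and preemption behaviour follows the cluster's defaults:

```python
import subprocess

PRIORITY_CLASS = """\
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high          # hypothetical name
value: 100000                   # higher value = scheduled (and kept) first
globalDefault: false
description: "Priority for latency-sensitive inference services"
"""

# Create the class, then set `priorityClassName: inference-high` in each
# inference Pod spec; training Pods get a lower-value class (or none), so
# they can be preempted when resources are tight.
subprocess.run(["kubectl", "apply", "-f", "-"],
               input=PRIORITY_CLASS, text=True, check=True)
```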
A system administrator needs to scale a Kubernetes Job to 4 replicas.
What command should be used?
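A minimal sketch with a hypothetical Job name; for a Job, the replica count corresponds to .spec.parallelism, so patching that field is an equivalent route on kubectl versions where scaling Jobs directly is not supported:

```python
import subprocess

JOB = "job/example-training-job"  # hypothetical Job name

# Commonly cited form:
subprocess.run(["kubectl", "scale", JOB, "--replicas=4"], check=False)

# Equivalent patch of the underlying field:
subprocess.run(["kubectl", "patch", JOB, "-p",
                '{"spec": {"parallelism": 4}}'], check=True)
```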
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
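As context, a minimal sketch of how core dumps might be enabled for a container; the core_pattern path and image name are placeholders, and core_pattern is a host-wide kernel setting because containers share the host kernel:

```python
import subprocess

# Write cores to a known host directory (hypothetical path).
subprocess.run(
    ["sudo", "sh", "-c",
     "echo '/tmp/cores/core.%e.%p' > /proc/sys/kernel/core_pattern"],
    check=True,
)

# Run the container with an unlimited core-file size and the dump directory
# mounted in, so the segfaulting process leaves a core file behind.
subprocess.run(
    ["docker", "run", "--ulimit", "core=-1",
     "-v", "/tmp/cores:/tmp/cores", "my-crashing-image"],  # hypothetical image
    check=False,
)

# The resulting core can then be opened with gdb against the same binary to
# recover the stack trace at the moment of the segmentation fault.
```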
A system administrator needs to configure and manage multiple installations of NVIDIA hardware, ranging from a single DGX BasePOD to a DGX SuperPOD.
Which software stack should be used?
You are configuring networking for a new AI cluster in your data center. The cluster will handle large-scale distributed training jobs that require fast communication between servers.
What type of networking architecture can maximize performance for these AI workloads?
After completing the installation of a Kubernetes cluster on your NVIDIA DGX systems using BCM, how can you verify that all worker nodes are properly registered and ready?
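A minimal sketch of the verification step, assuming kubectl is configured against the new cluster:

```python
import subprocess

# Every worker node should appear in the list with STATUS "Ready".
subprocess.run(["kubectl", "get", "nodes", "-o", "wide"], check=True)

# For any node stuck in NotReady, the Conditions section of
# `kubectl describe node <name>` usually points at the failing component.
```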
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
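A minimal sketch of the configuration involved; the node name, GPU type, and device files are illustrative. Only the devices listed in gres.conf (and matched by the node's Gres= entry in slurm.conf) are handed to jobs, so the display GPU is simply left out:

```python
# Illustrative gres.conf / slurm.conf fragments, printed for reference.

GRES_CONF = """\
# /etc/slurm/gres.conf on node001: offer only the two compute GPUs,
# omitting /dev/nvidia2, which drives the console/display.
NodeName=node001 Name=gpu Type=a100 File=/dev/nvidia[0-1]
"""

SLURM_CONF_LINE = """\
# Matching node definition in slurm.conf:
NodeName=node001 Gres=gpu:a100:2
"""

print(GRES_CONF)
print(SLURM_CONF_LINE)
```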