For an NVIDIA Enterprise AI Factory with 256 GPUs, which storage solution characteristic is most critical to validate during scaling tests?
After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?
Which of the following tests should be used to check for the lowest possible latency between two nodes in a fabric?
A media company is developing an AI platform for video content analysis that requires storing and processing large volumes of unstructured video data. The platform must support high throughput for data ingestion and provide efficient access for real-time analytics. Given these requirements, which storage strategy should the company implement?
You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node?
Pick the 2 correct responses below.
When updating the firmware on an NVLink switch transceiver, how can an engineer apply new firmware without interrupting the network?
A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?
A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?
After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?
A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?
ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?
During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?
During a multi-day NeMo burn-in, intermittent " GPU fell off bus " errors occur. Which diagnostic approach isolates hardware faults?
An engineer is tasked with configuring Out-of-Band management for a DGX BasePOD deployment. Which network design will best ensure secure and reliable Out-of-Band management operations?
An infrastructure engineer runs an NCCL burn-in on an eight-node GPU cluster. Over a 12-hour period, all GPUs are tested with repeated all-reduce collectives. Monitoring tools show the following observations:
Aggregate bandwidth remains within 5% of documented reference for the hardware on every run.
No errors or timeouts are reported in NCCL logs.
On three occasions, one GPU logged single-run bandwidth dips of 15–20% compared to its normal performance, but performance recovered on the next run and stayed stable afterward. System logs show no hardware or driver errors.
Two minor NCCL WARN-level messages about “unexpected latency spike” appear in system logs for separate nodes, but could not be reproduced.
Which conclusion is the best strategy before releasing the cluster to production?
A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?
You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize to develop more accurate energy-efficiency metrics?
After running a 24-hour stress test on a DGX node, the administrator should verify which two key metrics to ensure system stability?
An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?
A 24-hour HPL burn-in fails with " illegal value " errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?
A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?
What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?
An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?
After NCCL burn-in reports " transport retry count exceeded, " which corrective action addresses the underlying fabric issue?
You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?
A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?
An engineer is reimaging a DGX system in a large cluster. Which method ensures the most efficient and secure remote installation without physical access?
An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?
A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?
A leaf switch shows " FW Version Mismatch " alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?
A system administrator has upgraded the firmware of the DPU. What will be the state of the firmware after the upgrade?
A user encounters " permission denied " errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?
You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?