NCP-AII NVIDIA AI Infrastructure Questions and Answers

Questions 4

For an NVIDIA Enterprise AI Factory with 256 GPUs, which storage solution characteristic is most critical to validate during scaling tests?

Options:

A.

Consistent per-node throughput > 8 GiB/s.

B.

Single-node write performance during idle clusters.

C.

RAID rebuild times under disk failure.

D.

Maximum 4K random read IOPS exceeding 1 million.
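To put the per-node figure in option A in context, here is a rough sizing sketch. It assumes 8 GPUs per node, which the question does not state, so the resulting node count and aggregate number are illustrative only.

```python
# Rough aggregate-throughput arithmetic for a scaling test.
# ASSUMPTION: 8 GPUs per node (not given in the question).
GPUS_TOTAL = 256        # from the question
GPUS_PER_NODE = 8       # hypothetical value for illustration
PER_NODE_GIB_S = 8      # per-node threshold quoted in option A


def aggregate_throughput_gib_s(gpus_total, gpus_per_node, per_node_gib_s):
    """Return the aggregate throughput the storage must sustain when
    every node reads at the per-node rate at the same time."""
    if gpus_per_node <= 0 or gpus_total % gpus_per_node != 0:
        raise ValueError("GPU count must be a positive multiple of GPUs per node")
    nodes = gpus_total // gpus_per_node
    return nodes * per_node_gib_s


print(aggregate_throughput_gib_s(GPUS_TOTAL, GPUS_PER_NODE, PER_NODE_GIB_S))
```

Under these assumptions the cluster has 32 nodes, so holding the per-node rate steady as nodes are added is what the scaling test has to demonstrate.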

Questions 5

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

Options:

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Questions 6

Which of the following tests should be used to check for the lowest possible latency between two nodes in a fabric?

Options:

A.

ib_read_bw

B.

ib_read_lat

C.

ib_write_bw

D.

ib_write_lat

Questions 7

A media company is developing an AI platform for video content analysis that requires storing and processing large volumes of unstructured video data. The platform must support high throughput for data ingestion and provide efficient access for real-time analytics. Given these requirements, which storage strategy should the company implement?

Options:

A.

Tape storage for its cost-effectiveness and archival capabilities

B.

Block storage for low latency and high performance

C.

File storage for hierarchical organization and easy navigation

D.

Object storage for scalability and metadata management

Questions 8

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node?

Pick the 2 correct responses below.

Options:

A.

Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.

B.

Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.

C.

Set the desired time zone and configure NTP synchronization during the OS installation wizard.

D.

Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.

Questions 9

When updating the firmware on an NVLink switch transceiver, how can an engineer apply new firmware without interrupting the network?

Options:

A.

mlxfwreset -d -lid 27 reset --yes to reset the transceiver

B.

Physically disconnect and reconnect the transceiver.

C.

flint -d -lid 27 --linkx --linkx_auto_update --activate

D.

nv action reboot system to force immediate activation.

Questions 10

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

Options:

A.

Implement redundant switches with spanning tree protocol.

B.

MLAG for bonded interfaces across redundant switches.

C.

Use only one switch for all management and storage traffic.

D.

Disable VLANs and use unmanaged switches.

Questions 11

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

Options:

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

Questions 12

After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?

Options:

A.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 systemctl

B.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 ls -la

C.

sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

D.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Questions 13

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

Options:

A.

ipmitool raw 0x32 0x6a 1

B.

systemctl restart rshim

C.

systemctl enable bmc-rshim.service

D.

scp <path_to_bfb> root@<bmc_ip>:/dev/rshim0/boot

Questions 14

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

D.

Inconclusive; rerun with --stress=cpu to validate.

Questions 15

During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?

Options:

A.

nvidia-smi --query-disk-space

B.

du -sh /home/*

C.

df -h | grep ' /var '

D.

lsof +L1
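The scripted alert this question describes can also be sketched in Python. This is a minimal illustration, not a DGX-specific tool: the function names and the 10% threshold are hypothetical, and `shutil.disk_usage` from the standard library stands in for parsing `df` output.

```python
import shutil


def free_fraction(path="."):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def below_threshold(path, min_free_fraction):
    """True when free space on `path` has dropped under the threshold."""
    return free_fraction(path) < min_free_fraction


# Hypothetical threshold: alert when less than 10% of the filesystem is free.
if below_threshold(".", 0.10):
    print("ALERT: free disk space is below 10%")
else:
    print("Disk space OK")
```

In a maintenance script the path would point at the filesystem of interest (for example /var), and the print call would be replaced by whatever notification mechanism the team uses.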

Questions 16

During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?

Options:

A.

Enable HPL_USE_NVSHMEM for alternative memory sharing.

B.

Run DCGM diagnostics alongside burn-in to monitor GPU health metrics.

C.

Switch from BERT to GPT models for simpler computations.

D.

Reduce blocksize to 500MB to lower memory pressure.

Questions 17

An engineer is tasked with configuring Out-of-Band management for a DGX BasePOD deployment. Which network design will best ensure secure and reliable Out-of-Band management operations?

Options:

A.

Use a single VLAN for both Out-of-Band management and compute fabric to simplify network design.

B.

Configure Out-of-Band management interfaces to be accessible from any subnet within the data center for maximum flexibility.

C.

Connect Out-of-Band management ports to the same switch as user traffic for easier troubleshooting.

D.

Place all BMC and management interfaces on an isolated Out-of-Band network with access restricted by firewall rules.

Questions 18

An infrastructure engineer runs an NCCL burn-in on an eight-node GPU cluster. Over a 12-hour period, all GPUs are tested with repeated all-reduce collectives. Monitoring tools show the following observations:

Aggregate bandwidth remains within 5% of documented reference for the hardware on every run.

No errors or timeouts are reported in NCCL logs.

On three occasions, one GPU logged single-run bandwidth dips of 15–20% compared to its normal performance, but performance recovered on the next run and stayed stable afterward. System logs show no hardware or driver errors.

Two minor NCCL WARN-level messages about “unexpected latency spike” appear in system logs for separate nodes, but could not be reproduced.

Which course of action is best before releasing the cluster to production?

Options:

A.

Proceed, since all bandwidth targets are met, issues were transient and self-resolved, and there are no persistent errors or timeouts across repeated burn-ins.

B.

Recommend proactive maintenance, because any bandwidth drop, even if transient and unreproducible, shows the burn-in failed; clusters must not show performance variance above 10% for any GPU even once.

C.

Approve for AI workload use, but flag affected nodes for manual exclusion from distributed training jobs, as nodes showing any anomaly should be isolated whenever possible.
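The observations in this scenario (aggregate bandwidth within 5% of reference, occasional single-run dips of 15 to 20% that recover on the next run) can be expressed as simple checks over per-run bandwidth samples. The sketch below is illustrative: the function names, the sample values, and the dip threshold are assumptions, and real burn-in logs would need their own parsing.

```python
def within_tolerance(measured, reference, tolerance=0.05):
    """True when a measurement is within `tolerance` of the reference value."""
    return abs(measured - reference) <= tolerance * reference


def transient_dips(samples, baseline, dip_fraction=0.15):
    """Return indices of runs that fell at least `dip_fraction` below the
    baseline and recovered on the very next run."""
    floor = baseline * (1 - dip_fraction)
    dips = []
    for i, bandwidth in enumerate(samples):
        dipped = bandwidth <= floor
        recovered = i + 1 < len(samples) and samples[i + 1] > floor
        if dipped and recovered:
            dips.append(i)
    return dips


# Hypothetical per-run bandwidth for one GPU, normalised so baseline = 100.
runs = [100, 100, 82, 100, 99]
print(transient_dips(runs, baseline=100))
print(within_tolerance(96, 100))
```

Separating "dipped and recovered" from "dipped and stayed low" makes it easier to tell a one-off event from a persistent degradation when reviewing a long burn-in.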

Questions 19

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

Options:

A.

ngc config get

B.

ngc init

C.

ngc config set

D.

ngc config update

Questions 20

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize to develop more accurate energy-efficiency metrics?

Options:

A.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

B.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Use watts-used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

Questions 21

After running a 24-hour stress test on a DGX node, which two key metrics should the administrator verify to confirm system stability?

Options:

A.

Average CPU usage > 80% and Docker container uptime.

B.

No thermal throttling events and consistent GPU utilization > 95% throughout the test.

C.

SSD write endurance and RAM capacity.

D.

Total energy consumption and NVLink bandwidth.
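The criteria named in option B (no thermal throttling events and GPU utilization above 95% throughout) can be written as a pass/fail check over collected samples. This is a sketch under stated assumptions: the `GpuSample` type and its field names are hypothetical, and how the samples are gathered from the node is left out.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One monitoring sample for a GPU (hypothetical record layout)."""
    utilization_pct: float
    thermal_throttled: bool


def stress_test_passed(samples, min_utilization_pct=95.0):
    """True only if every sample is free of thermal throttling and
    stays above the utilization floor. An empty run does not pass."""
    if not samples:
        return False
    return all(
        not s.thermal_throttled and s.utilization_pct > min_utilization_pct
        for s in samples
    )


healthy = [GpuSample(98.5, False), GpuSample(97.0, False)]
throttled = [GpuSample(98.5, False), GpuSample(99.0, True)]
print(stress_test_passed(healthy), stress_test_passed(throttled))
```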

Questions 22

An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?

Options:

A.

tcpdump

B.

ib_write_lat

C.

ibdiagnet

D.

perfmon

Questions 23

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?

Options:

A.

Switch from FP64 to FP32 precision.

B.

Disable GPU affinity.

C.

Reduce test duration to 12 hours.

D.

Verify the matrix size is divisible by block size.
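The divisibility check mentioned in option D is simple arithmetic on the HPL problem size N and block size NB. A minimal sketch follows; the helper names and the example values are hypothetical and are not taken from any particular HPL.dat.

```python
def is_valid_hpl_size(n, nb):
    """True when the matrix size N is a positive multiple of block size NB."""
    return n > 0 and nb > 0 and n % nb == 0


def round_down_to_block(n, nb):
    """Largest multiple of NB that does not exceed N."""
    return (n // nb) * nb


# Hypothetical values for illustration.
print(is_valid_hpl_size(264192, 1024))
print(round_down_to_block(100000, 384))
```

Rounding N down to a multiple of NB keeps the run within the memory budget that was used to pick N in the first place.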

Questions 24

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.

B.

Use "Topology" view to visually inspect cable icons.

C.

Run mlxlink -d lid-<LID> -m on each port manually.

D.

Export all switch logs and grep for "FW Version".

Questions 25

What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?

Options:

A.

To benchmark production training speed and ensure all GPUs are running at identical clock speeds.

B.

To stress test the hardware and software stack with representative NeMo workloads, ensuring reliability.

C.

To tune NeMo model hyperparameters for maximum accuracy on user datasets during cluster deployment.

Questions 26

An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?

Options:

A.

Lane power variance < 3dB across all transceivers.

B.

Transceiver model matching QSFP-DD specifications.

C.

Temperature fluctuations > 5°C during validation.

D.

Effective BER > 1.5E-254 during a < 6-hour monitoring window.

Questions 27

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.

Reduce message size to decrease network utilization

C.

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Questions 28

You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?

Options:

A.

Confirm the system fans are running at 100% under all workloads to prevent overheating.

B.

Review the system BIOS to ensure GPU overclocking is enabled for maximum performance.

C.

Use NVSM to disable unused PCIe devices to reduce overall system heat output.

D.

Verify that each DGX system is connected to redundant, properly rated PDUs and that all power supplies are reporting nominal input.

Questions 29

A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?

Options:

A.

Multi-protocol data access with low latency.

B.

Tape backup systems.

C.

Low-cost HDD solutions.

D.

High capacity with moderate speed.

Questions 30

An engineer is reimaging a DGX system in a large cluster. Which method ensures the most efficient and secure remote installation without physical access?

Options:

A.

Use apt-get to upgrade the operating system without rebooting the system.

B.

Create a USB drive with the ISO and manually boot from it on the DGX system.

C.

Build a software image on Base Command Manager and then reimage the system.

D.

Skip ISO verification and directly flash the operating system to the disk via SSH.

Questions 31

An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?

Options:

A.

Increase batch sizes to reduce the frequency of storage access.

B.

Migrate datasets to SATA SSDs with RAID 0 for higher sequential read speeds.

C.

Add more GPUs to the cluster to parallelize data loading tasks.

D.

Implement GPUDirect Storage to enable direct data transfers.

Questions 32

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

Options:

A.

The command output is ignored if the system powers on without errors.

B.

At least half of the GPUs report Status_Health = OK.

C.

All GPUs report Status_Health = OK and Health = OK for each device.

D.

Only the head node's GPUs need to be healthy.
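An "every GPU on every node reports OK" condition, as worded in option C, can be sketched as a check over already-collected health values. The data layout below (node name mapped to a list of per-GPU status strings) is a hypothetical stand-in; parsing actual cmsh output is out of scope here.

```python
def cluster_ready(gpu_health):
    """True only if every node lists at least one GPU and every GPU
    status string is exactly "OK".

    gpu_health: dict mapping node name -> list of per-GPU status strings
    (hypothetical layout for illustration).
    """
    if not gpu_health:
        return False
    return all(
        len(statuses) > 0 and all(status == "OK" for status in statuses)
        for statuses in gpu_health.values()
    )


# Hypothetical node names and statuses.
print(cluster_ready({"node001": ["OK"] * 8, "node002": ["OK"] * 8}))
print(cluster_ready({"node001": ["OK"] * 8, "node002": ["OK"] * 7 + ["FAIL"]}))
```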

Questions 33

A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?

Options:

A.

flint

B.

iblinkinfo

C.

mlxconfig

D.

ethtool

Questions 34

A system administrator has upgraded the firmware of the DPU. What will be the state of the firmware after the upgrade?

Options:

A.

The firmware is installed on the DPU.

B.

The firmware is deleted from the DPU.

C.

The firmware is copied to the DPU but not installed.

D.

The firmware is waiting on reboot to become active.

Questions 35

A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?

Options:

A.

Enroll the MOK and sign NVIDIA kernel modules.

B.

Reinstall Docker without the NVIDIA runtime.

C.

Disable SELinux to relax unnecessary security policies.

D.

Run Docker with sudo for elevated privileges.

Questions 36

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?

Options:

A.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

B.

Use watts used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: May 27, 2026
Questions: 71