
Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks.

Which solution would improve the performance?

A)

B)

C)

D)

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Questions 5

Given the following PySpark code snippet in a Databricks notebook:

filtered_df = spark.read.format("delta").load("/mnt/data/large_table") \
    .filter("event_date > '2024-01-01'")

filtered_df.count()

The data engineer notices from the Query Profiler that the scan operator for filtered_df is reading almost all files, despite the filter being applied.

What is the probable reason for poor data skipping?

Options:

A.

The Delta table lacks optimization that enables dynamic file pruning.

B.

The filter is executed only after the full data scan, preventing data skipping.

C.

The event_date column is outside the table’s partitioning and Z-ordering scheme.

D.

The filter condition involves a data type excluded from data skipping support.

Questions 6

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

Options:

A.

Stage’s detail screen and Executor’s files

B.

Stage’s detail screen and Query’s detail screen

C.

Driver’s and Executor’s log files

D.

Executor’s detail screen and Executor’s log files

Questions 7

The following table consists of items found in user carts within an e-commerce website.

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

How would the following update be handled?

Options:

A.

The update is moved to a separate "restored" column because it is missing a column expected in the target schema.

B.

The new restored field is added to the target schema, and dynamically read as NULL for existing unmatched records.

C.

The update throws an error because changes to existing columns in the target schema are not supported.

D.

The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.

Questions 8

A workspace admin has created a new catalog called finance_data and wants to delegate permission management to a finance team lead without giving them full admin rights.

Which privilege should be granted to the finance team lead?

Options:

A.

ALL PRIVILEGES on the finance_data catalog.

B.

Make the finance team lead a metastore admin.

C.

GRANT OPTION privilege on the finance_data catalog.

D.

MANAGE privilege on the finance_data catalog.

Questions 9

A data engineer manages a production Lakeflow Declarative Pipeline that processes customer transaction data. The pipeline includes several data quality expectations such as transaction_amount > 0 and customer_id IS NOT NULL. These expectations are defined using the EXPECT clause in SQL.

The engineer aims to monitor the pipeline’s data quality by analyzing the number of records that passed or failed each expectation during the latest pipeline update. The Lakeflow Declarative Pipelines event logs are stored in a Delta table named event_log_table.

For the most recent pipeline update, determine a programmatically appropriate approach to extract information like the name of each expectation, associated dataset, count of records that passed the expectation, and count of records that failed the expectation.

Which method retrieves the desired data quality metrics from the Lakeflow Declarative Pipelines event log?

Options:

A.

Access the event_log_table, filter for events where event_type = 'flow_progress', and parse the details.flow_progress.data_quality.expectations field to extract the required metrics.

B.

Use the Lakeflow Declarative Pipelines UI to navigate to the specific pipeline, select the dataset, and view the Data Quality tab to manually retrieve the expectation metrics.

C.

Query the event_log_table for events with event_type = 'data_quality' and directly select the passed_records and failed_records fields.

D.

Access the event_log_table, filter for events where event_type = 'expectation_result', and extract the expectation metrics from the details field.

Questions 10

A facilities-monitoring team is building a near-real-time Power BI dashboard off the Delta table device_readings:

Columns:

    device_id (STRING, unique sensor ID)

    event_ts (TIMESTAMP, ingestion timestamp UTC)

    temperature_c (DOUBLE, temperature in °C)

Requirement:

    For each sensor, generate one row per non-overlapping 5-minute interval, offset by 2 minutes (e.g., 00:02–00:07, 00:07–00:12, …).

    Each row must include interval start, interval end, and average temperature in that slice.

    Downstream BI tools (e.g., Power BI) must use the interval timestamps to plot time-series bars.

Options:

A.

WITH buckets AS (

SELECT device_id,

window(event_ts, '5 minutes', '2 minutes', '5 minutes') AS win,

temperature_c

FROM device_readings

)

SELECT device_id,

win.start AS bucket_start,

win.end AS bucket_end,

AVG(temperature_c) AS avg_temp_5m

FROM buckets

GROUP BY device_id, win

ORDER BY device_id, bucket_start;

B.

SELECT device_id,

event_ts,

AVG(temperature_c) OVER (

PARTITION BY device_id

ORDER BY event_ts

RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW

) AS avg_temp_5m

FROM device_readings

WINDOW w AS (window(event_ts, '5 minutes', '2 minutes'));

C.

SELECT device_id,

date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 2 MINUTES AS bucket_start,

date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 7 MINUTES AS bucket_end,

AVG(temperature_c) AS avg_temp_5m

FROM device_readings

GROUP BY device_id, date_trunc('minute', event_ts - INTERVAL 2 MINUTES)

ORDER BY device_id, bucket_start;

D.

SELECT device_id,

window.start AS bucket_start,

window.end AS bucket_end,

AVG(temperature_c) AS avg_temp_5m

FROM device_readings

GROUP BY device_id, window(event_ts, '5 minutes', '5 minutes', '2 minutes')

ORDER BY device_id, bucket_start;
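
The bucketing requirement in this question can be made concrete with a minimal plain-Python sketch (not Spark SQL). It maps a timestamp to its non-overlapping 5-minute interval shifted by 2 minutes. The function name bucket_bounds and its parameters are illustrative assumptions and do not appear in the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bucket_bounds(ts, width=timedelta(minutes=5), offset=timedelta(minutes=2)):
    """Return (start, end) of the fixed-width interval that contains ts.

    Intervals are aligned to the epoch and then shifted by `offset`,
    so with the defaults they run 00:02-00:07, 00:07-00:12, and so on.
    """
    # Number of whole intervals between the shifted epoch and ts (floored).
    whole_intervals = (ts - EPOCH - offset) // width
    start = EPOCH + offset + whole_intervals * width
    return start, start + width

# Hypothetical reading taken at 00:03 UTC.
reading_ts = datetime(2024, 1, 1, 0, 3, tzinfo=timezone.utc)
start, end = bucket_bounds(reading_ts)
print(start.isoformat(), end.isoformat())
```

A reading taken before the first offset boundary of the day (for example 00:01) falls into the interval that started at 23:57 on the previous day, which is what a fixed offset implies.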

Questions 11

A data engineer is optimizing a managed Delta table that suffers from data skew and frequently changing query filter columns. The engineer wants to avoid costly data rewrites when query patterns evolve. The table size is under 1 TB.

How should the data engineer meet this requirement?

Options:

A.

Apply Z-ordering, since it allows flexible reorganization of data layout without rewriting existing files and adapts easily to new filter columns.

B.

Use Hive-style partitioning, as it provides efficient data skipping and is easy to change partition columns at any time.

C.

Enable liquid clustering, as it efficiently handles data skew, allows clustering keys to be changed without rewriting existing data, and adapts to evolving query patterns.

D.

Combine partitioning and Z-ordering to maximize flexibility and minimize maintenance as query patterns change.

Questions 12

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

to_interval("event_time", "5 minutes").alias("time")

B.

window("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

window("event_time", "10 minutes").alias("time")

E.

lag("event_time", "10 minutes").alias("time")

Questions 13

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads this data takes 10 minutes in total to run.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Options:

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C.

Schedule a job to execute the pipeline once an hour on a new job cluster.

D.

Configure a job that executes every time new data lands in a given directory.

Questions 14

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history, which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Options:

A.

Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.

B.

Overwrite the account_current table with each batch using the results of a query against the account_history table, grouping by user_id and filtering for the max value of last_updated.

C.

Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.

D.

Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.

E.

Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.

Questions 15

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

A.

importlib.resource_path

B.

sys.path

C.

os.path

D.

pypi.path

E.

pylib.source
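
For context on how Python locates modules, the short sketch below inspects the standard library's module search list and extends it at runtime. The directory /tmp/my_modules is a hypothetical path chosen only for illustration.

```python
import sys

# sys.path is an ordinary list; the import system searches its entries in order.
search_dirs = list(sys.path)
print(type(sys.path).__name__, len(search_dirs))

# Appending a directory makes modules stored there importable afterwards.
extra_dir = "/tmp/my_modules"  # hypothetical path, not from the question
if extra_dir not in sys.path:
    sys.path.append(extra_dir)
```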

Questions 16

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Options:

A.

Set the configuration delta.deduplicate = true.

B.

VACUUM the Delta table after each batch completes.

C.

Perform an insert-only merge with a matching condition on a unique key.

D.

Perform a full outer join on a unique key and overwrite existing data.

E.

Rely on Delta Lake schema enforcement to prevent duplicate records.
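
The idea of de-duplicating within a batch and then inserting only unseen keys can be sketched in plain Python. This is a simplified in-memory stand-in for an insert-only merge, not Delta Lake code; the function name insert_only_merge, the "id" key, and the sample rows are assumptions made for illustration.

```python
def insert_only_merge(target, batch, key="id"):
    """Insert rows whose key is absent from target; never touch matched rows."""
    # Step 1: de-duplicate within the batch (first occurrence wins).
    unique = {}
    for row in batch:
        unique.setdefault(row[key], row)
    # Step 2: keep only keys that are not already present in the target.
    new_rows = {k: row for k, row in unique.items() if k not in target}
    target.update(new_rows)
    return len(new_rows)

# Hypothetical target table and late-arriving batch.
target = {1: {"id": 1, "value": "a"}}
batch = [
    {"id": 1, "value": "late duplicate"},
    {"id": 2, "value": "b"},
    {"id": 2, "value": "b again"},
]
inserted = insert_only_merge(target, batch)
print(inserted, sorted(target))
```

Because matched keys are skipped rather than updated, a record that was already processed is left unchanged when it arrives again.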

Questions 17

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Questions 18

A data engineer is building a Lakeflow Declarative Pipelines pipeline to process healthcare claims data. A metadata JSON file defines data quality rules for multiple tables, including:

{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}

The pipeline must dynamically apply these rules to the claims table without hardcoding the rules.

How should the data engineer achieve this?

Options:

A.

Load the JSON metadata, loop through its entries, and apply expectations using dlt.expect_all.

B.

Invoke an external API to validate records against the metadata rules.

C.

Reference each expectation with @dlt.expect decorators in the table declaration.

D.

Use a SQL CONSTRAINT block referencing the JSON file path.
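
As a rough illustration of reading rules from metadata instead of hardcoding them, the plain-Python sketch below turns the JSON from the question into a name-to-constraint mapping. The helper name rules_for is an assumption, and the metadata is embedded as a string so the sketch runs without a pipeline runtime; in practice it would be read from a file.

```python
import json

# Metadata from the question, embedded as a string for this sketch.
metadata_json = """
{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}
"""

def rules_for(table, raw_metadata):
    """Return {expectation_name: constraint} for one table in the metadata."""
    metadata = json.loads(raw_metadata)
    return {rule["name"]: rule["constraint"] for rule in metadata[table]}

claims_rules = rules_for("claims", metadata_json)
print(claims_rules)
```

A mapping of this shape could then be handed to an API that accepts several expectations at once, such as the dlt.expect_all call named in option A; that step is left out here because it only runs inside a pipeline.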

Questions 19

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

Whenever a database is being created, make sure that the location keyword is used.

B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.

Whenever a table is being created, make sure that the location keyword is used.

D.

When tables are created, make sure that the external keyword is used in the create table statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Questions 20

A data engineering team is setting up deployment automation. To deploy workspace assets remotely using the Databricks CLI command, they must configure it with proper authentication.

Which authentication approach will provide the highest level of security?

Options:

A.

Use a service principal with OAuth token federation.

B.

Use a service principal ID and its OAuth client secret.

C.

Use a service principal and its Personal Access Token.

D.

Use a shared user account and its OAuth client secret.

Questions 21

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Options:

A.

preds.write.mode("append").saveAsTable("churn_preds")

B.

preds.write.format("delta").save("/preds/churn_preds")

C)

D)

E)

C.

Option A

D.

Option B

E.

Option C

F.

Option D

G.

Option E

Questions 22

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers
USING (
  SELECT updates.customer_id as merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL as merge_key, updates.*
  FROM updates JOIN customers
  ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

B.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

C.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

D.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

Questions 23

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Options:

A.

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

B.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

C.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

D.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

E.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Questions 24

A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code:

In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.

The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for tables.

How can the data engineer fix this?

Options:

A.

Convert the list of configuration values to a dictionary of table settings, using table names as keys.

B.

Convert the list of configuration values to a dictionary of table settings, using a different input to the for loop.

C.

Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.

D.

Wrap the loop inside another table definition, using generalized names and properties to replace with those from the inner table.

Questions 25

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:
[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,
spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3*"heartrate"))

Which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
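
The expression inside select in the snippet above can be evaluated on its own in plain Python, without Spark, to see what value is passed as the column name. The variable names below are illustrative.

```python
column_name = "heartrate"

# Multiplying a str by an int repeats the string; no column arithmetic happens.
selected_name = 3 * column_name
print(selected_name)
```

To multiply the column's values instead, one would typically build a column expression first, for example with col from pyspark.sql.functions, and multiply that.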

Questions 26

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

A.

Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9.

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Questions 27

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?

Options:

A.

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

B.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

C.

No: the change data feed only tracks inserts and updates, not deleted records.

D.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

Questions 28

A data engineer is configuring a Lakeflow Declarative Pipeline to process CDC (Change Data Capture) data from a source. The source events sometimes arrive out of order, and multiple updates may occur with the same update_timestamp but with different update_sequence_id.

What should the data engineer do to ensure events are sequenced correctly?

Options:

A.

Set track_history_column_list to [event_timestamp, event_id] in AUTO CDC APIs.

B.

Use dropDuplicates() to remove out-of-order and duplicate records in LDP.

C.

Use SEQUENCE BY STRUCT(event_timestamp, update_sequence_id) in AUTO CDC APIs.

D.

Use a window function to sort update_sequence_id within the same partition, i.e., update_timestamp in the LDP pipeline.

Questions 29

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Options:

A.

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

B.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

C.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

D.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

E.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

Questions 30

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

withWatermark("event_time", "10 minutes")

B.

awaitArrival("event_time", "10 minutes")

C.

await("event_time + '10 minutes'")

D.

slidingWindow("event_time", "10 minutes")

E.

delayWrite("event_time", "10 minutes")

Questions 31

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

Options:

A.

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

B.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

C.

Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.

D.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

E.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

Questions 32

A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

Options:

A.

Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command.

B.

Stop the existing pipeline; use the returned settings in a reset command.

C.

Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git.

D.

Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results; parse and use this to create a pipeline.

Questions 33

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customer table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

Options:

A.

model.predict(df, columns)

B.

df.map(lambda x: model(x[columns])).select("customer_id, predictions")

C.

df.select("customer_id", model(*columns).alias("predictions"))

D.

df.apply(model, columns).select("customer_id, predictions")

Questions 34

Which statement regarding Spark configuration on the Databricks platform is true?

Options:

A.

Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.

B.

When the same Spark configuration property is set for an interactive to the same interactive cluster.

C.

Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.

D.

The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.

Questions 35

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

Options:

A.

The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

B.

The write will fail completely because of the constraint violation and no records will be inserted into the target table.

C.

The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

D.

The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

E.

The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.

Questions 36

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

Options:

A.

The connection to the external table will fail; the string "redacted" will be printed.

B.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

C.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

D.

The connection to the external table will succeed; the string value of password will be printed in plain text.

E.

The connection to the external table will succeed; the string "redacted" will be printed.

Questions 37

A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes.

Which technique will fix the skew in this aggregation?

Options:

A.

Use salting by adding a random prefix to skewed keys before aggregation, then aggregate again after removing the prefix.

B.

Increase the Spark driver memory and retry.

C.

Use reduceByKey instead of groupBy to avoid shuffles.

D.

Filter out the skewed users before the aggregation.
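
The salting idea mentioned in option A can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: `salted_count`, `num_salts`, and the sample `events` list are hypothetical names, and an in-memory `Counter` stands in for a distributed groupBy.

```python
import random
from collections import Counter

def salted_count(user_ids, num_salts=8, seed=0):
    """Two-stage count per user_id using a random salt (hypothetical helper)."""
    rng = random.Random(seed)
    # Stage 1: aggregate on (salt, user_id) so a hot key is spread over
    # up to num_salts partial groups instead of a single group.
    partial = Counter((rng.randrange(num_salts), uid) for uid in user_ids)
    # Stage 2: drop the salt and combine the partial counts per user_id.
    final = Counter()
    for (_salt, uid), n in partial.items():
        final[uid] += n
    return partial, final

# One heavily skewed key and two small ones (made-up data).
events = ["u1"] * 1000 + ["u2"] * 3 + ["u3"] * 2
partial, final = salted_count(events)
print(final["u1"], final["u2"], final["u3"])  # 1000 3 2
```

The first stage is where the skewed key gets split into several smaller groups; the second stage works on at most `num_salts` partial rows per key, so it is cheap even for hot keys.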

Questions 38

A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.

Which approach should the data engineer use?

Options:

A.

Implement a custom solution using Databricks Jobs to periodically reprocess all historical data.

B.

Use batch processing and overwrite the entire output table each time to ensure late data is incorporated correctly.

C.

Use an Auto CDC pipeline with batch tables to simplify late data handling.

D.

Use a watermark to specify the allowed lateness to accommodate records that arrive after their expected window, ensuring correct aggregation and state management.
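
To make the watermark idea in option D concrete, here is a simplified, per-record sketch in plain Python. It is not how Spark evaluates watermarks (Spark advances them per micro-batch); `windowed_counts`, `window`, and `allowed_lateness` are hypothetical names chosen for illustration, and event times are plain integers.

```python
from collections import defaultdict

def windowed_counts(events, window=10, allowed_lateness=5):
    """Count events per tumbling window, dropping records that arrive
    after their window has fallen behind the watermark."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for t in events:  # events are event times, in arrival order
        if max_event_time is None or t > max_event_time:
            max_event_time = t
        # The watermark trails the largest event time seen so far.
        watermark = max_event_time - allowed_lateness
        window_start = (t // window) * window
        window_end = window_start + window
        if window_end <= watermark:
            dropped.append(t)  # window is already closed: too late
        else:
            counts[window_start] += 1
    return dict(counts), dropped

counts, dropped = windowed_counts([1, 4, 12, 8, 27, 9, 21])
print(counts)   # {0: 3, 10: 1, 20: 2}
print(dropped)  # [9]
```

The record with event time 8 arrives late but is still counted, because its window has not yet passed the watermark; the record with event time 9 arrives after the watermark has moved past that window and is discarded, which is what lets the state for old windows be released.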

Questions 39

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Options:

A.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: Unlimited

B.

Cluster: New Job Cluster;

Retries: None;

Maximum Concurrent Runs: 1

C.

Cluster: Existing All-Purpose Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

D.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

E.

Cluster: Existing All-Purpose Cluster;

Retries: None;

Maximum Concurrent Runs: 1

Questions 40

Two data engineers are working on the same Databricks notebook in separate branches. Both have edited the same section of code. When one tries to merge the other’s branch into their own using the Databricks Git folders UI, a merge conflict occurs on that notebook file. The UI highlights the conflict and presents options for resolution.

How should the data engineers resolve this merge conflict using Databricks Git folders?

Options:

A.

Abort the merge, discard all local changes, and try the merge operation again without reviewing the conflicting code.

B.

Delete the conflicted notebook file via the Databricks workspace UI, commit the deletion, and recreate the notebook from scratch in a new commit to bypass the conflict entirely.

C.

Use the Git CLI in the cluster’s web terminal to force-push the conflicted merge (git push --force), overriding the remote branch with the local version and discarding changes.

D.

Use the Git folders UI to manually edit the notebook file, selecting the desired lines from both versions and removing the conflict markers, then mark the conflict as resolved.

Questions 41

A healthcare analytics team is implementing a dimensional model in Delta Lake for patient care analysis. They have a date dimension table and are evaluating design options to ensure it supports a wide range of time-based analyses.

Which design approach for the date dimension will support efficient time-based querying and aggregation?

Options:

A.

Store the date as a string in the format YYYY-MM-DD for readability.

B.

Create separate dimension tables for different calendar systems (fiscal, academic, etc.).

C.

Store only the date value and calculate all time attributes dynamically in queries.

D.

Pre-calculate attributes like fiscal_period, quarter, month_name, day_of_week, and holiday.
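
As an illustration of what pre-calculated date attributes (option D) might look like, the following plain-Python sketch builds one date dimension row. The function name `date_dim_row`, the `date_key` format, the July fiscal-year start, and the holiday set are all assumptions made for the example, not part of the question.

```python
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday"]

def date_dim_row(d, fiscal_year_start_month=7, holidays=frozenset()):
    """Build one date dimension row with attributes computed up front."""
    # Months elapsed since the (assumed) start of the fiscal year, 0-based.
    fiscal_month_index = (d.month - fiscal_year_start_month) % 12
    return {
        "date_key": int(d.strftime("%Y%m%d")),
        "date": d,
        "quarter": (d.month - 1) // 3 + 1,
        "month_name": MONTH_NAMES[d.month - 1],
        "day_of_week": DAY_NAMES[d.weekday()],  # weekday(): Monday == 0
        "fiscal_period": fiscal_month_index + 1,
        "holiday": d in holidays,
    }

row = date_dim_row(date(2024, 7, 4), holidays=frozenset({date(2024, 7, 4)}))
print(row["quarter"], row["month_name"], row["day_of_week"], row["fiscal_period"])
# 3 July Thursday 1
```

With the attributes stored in the dimension, queries can filter and group on columns such as `quarter` or `fiscal_period` directly instead of recomputing them from the raw date each time.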

Questions 42

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

B.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

C.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

D.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

E.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Questions 43

A data engineer is using the Lakeflow Declarative Pipelines Expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range, and they are currently flagging those rows with a warning and writing them to the silver table along with the good data. They’ve been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.

This is the existing code for their silver table:

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

What code will satisfy the requirements?

Options:

A.

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

B.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

C.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

D.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
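
The requirement itself (good rows only in silver, bad rows only in quarantine) amounts to splitting the bronze rows on two complementary predicates. The plain-Python sketch below models that split; the local `expect_or_drop` function is a hypothetical stand-in that keeps only rows satisfying a predicate, not the pipeline decorator, and the sample rows are made up.

```python
def expect_or_drop(rows, predicate):
    """Keep only the rows that satisfy the expectation; drop the rest."""
    return [r for r in rows if predicate(r)]

# Made-up bronze rows; 120 is the boundary used in the question.
bronze = [
    {"sensor": "a", "reading": 72},
    {"sensor": "b", "reading": 120},
    {"sensor": "c", "reading": 305},
]

# Silver keeps rows that pass "reading < 120".
silver = expect_or_drop(bronze, lambda r: r["reading"] < 120)
# Quarantine keeps rows that pass the complementary "reading >= 120".
quarantine = expect_or_drop(bronze, lambda r: r["reading"] >= 120)

print([r["sensor"] for r in silver])      # ['a']
print([r["sensor"] for r in quarantine])  # ['b', 'c']
```

Because the two predicates are exact complements for non-null readings, every such bronze row lands in exactly one of the two outputs. Rows with a null reading would need their own rule, which this sketch does not cover.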

Questions 44

A data engineer wants to ingest a large collection of image files (JPEG and PNG) from cloud object storage into a Unity Catalog–managed table for analysis and visualization.

Which two configurations and practices are recommended to incrementally ingest these images into the table? (Choose 2 answers)

Options:

A.

Move files to a volume and read with SQL editor.

B.

Use Auto Loader and set cloudFiles.format to "BINARYFILE".

C.

Use Auto Loader and set cloudFiles.format to "TEXT".

D.

Use Auto Loader and set cloudFiles.format to "IMAGE".

E.

Use the pathGlobFilter option to select only image files (e.g., "*.jpg,*.png").

Questions 45

A data engineer is using a Lakeflow Declarative Pipeline to propagate row deletions from a source bronze table (user_bronze) to a target silver table (user_silver). The engineer wants deletions in user_bronze to automatically delete corresponding rows in user_silver during pipeline execution.

Which configuration ensures deletions in the bronze table are propagated to the silver table?

Options:

A.

Use apply_changes without CDF and filter rows where _soft_deleted is true.

B.

Enable Change Data Feed (CDF) on user_bronze, read its CDF stream, and use apply_changes() with apply_as_deletes=True for user_silver.

C.

Enable CDF on user_silver, read its transaction log, and use MERGE to sync deletions.

D.

Configure VACUUM on user_bronze to delete files, then rebuild user_silver from scratch.

Questions 46

Which statement characterizes the general programming model used by Spark Structured Streaming?

Options:

A.

Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.

B.

Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.

C.

Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.

D.

Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.

E.

Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.

Questions 47

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

Options:

A.

Use %pip install in a notebook cell

B.

Run source env/bin/activate in a notebook setup script

C.

Install libraries from PyPi using the cluster UI

D.

Use %sh install in a notebook cell

Questions 48

A platform engineer is creating catalogs and schemas for the development team to use.

The engineer has created an initial catalog, catalog_A, and initial schema, schema_A. The engineer has also granted USE CATALOG, USE SCHEMA, and CREATE TABLE to the development team so that the engineer can begin populating the schema with new tables.

Despite being owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in schema_A.

What explains the engineer's lack of access to the underlying tables?

Options:

A.

The platform engineer needs to execute a REFRESH statement as the table permissions did not automatically update for owners.

B.

Users granted with USE CATALOG can modify the owner's permissions to downstream tables.

C.

The owner of the schema does not automatically have permission to tables within the schema, but can grant them to themselves at any point.

D.

Permissions explicitly given by the table creator are the only way the platform engineer could access the underlying tables in their schema.

Questions 49

A data engineer is using Auto Loader to read incoming JSON data as it arrives. They have configured Auto Loader to quarantine invalid JSON records but notice that over time, some records are being quarantined even though they are well-formed JSON.

The code snippet is:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("badRecordsPath", "/tmp/somewhere/badRecordsPath")
    .schema("a int, b int")
    .load("/Volumes/catalog/schema/raw_data/"))

What is the cause of the missing data?

Options:

A.

At some point, the upstream data provider switched everything to multi-line JSON.

B.

The badRecordsPath location is accumulating many small files.

C.

The source data is valid JSON but does not conform to the defined schema in some way.

D.

The engineer forgot to set the option "cloudFiles.quarantineMode" = "rescue".

Questions 50

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

Options:

A.

Increase the shuffle partitions to account for additional aggregates

B.

Specify a new checkpointLocation

C.

Run REFRESH TABLE delta.`/item_agg`

D.

Remove .option("mergeSchema", "true") from the streaming write

Questions 51

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

Options:

A.

Size on Disk is > 0

B.

The number of Cached Partitions > the number of Spark Partitions

C.

The RDD Block Name included the ' ' annotation signaling failure to cache

D.

On Heap Memory Usage is within 75% of off Heap Memory usage

Questions 52

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

Options:

A.

The Five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Network I/O never spikes

D.

Total Disk Space remains constant

E.

CPU Utilization is around 75%

Questions 53

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

A.

Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.

B.

Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.

C.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.

D.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.

Questions 54

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 and longitude > -20

Which statement describes how data will be filtered?

Options:

A.

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

C.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
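
The notion of skipping data files based on per-file column statistics can be sketched in plain Python. This is a toy model under stated assumptions: `files_to_scan`, the `file_stats` layout, and the file paths are hypothetical, and real statistics live in the table's transaction log rather than in a Python list.

```python
def files_to_scan(file_stats, column, low, high):
    """Return paths of files whose [min, max] range for `column` could
    contain a value in the open interval (low, high)."""
    keep = []
    for f in file_stats:
        cmin, cmax = f["min"][column], f["max"][column]
        # A file can hold matching rows only if its range overlaps (low, high).
        if cmax > low and cmin < high:
            keep.append(f["path"])
    return keep

# Made-up per-file statistics for the longitude column.
file_stats = [
    {"path": "part-0.parquet", "min": {"longitude": -170.0}, "max": {"longitude": -45.0}},
    {"path": "part-1.parquet", "min": {"longitude": -30.0}, "max": {"longitude": 5.0}},
    {"path": "part-2.parquet", "min": {"longitude": 20.0}, "max": {"longitude": 90.0}},
    {"path": "part-3.parquet", "min": {"longitude": -10.0}, "max": {"longitude": 10.0}},
]

# Filter from the question: longitude < 20 and longitude > -20
print(files_to_scan(file_stats, "longitude", -20, 20))
# ['part-1.parquet', 'part-3.parquet']
```

The statistics only say that a file might contain matching records; the rows inside each kept file still have to be read and filtered individually.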

Questions 55

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Options:

A.

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

B.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

C.

Other than the default "admins" group, only individual users can be granted privileges on jobs.

D.

A user can only transfer job ownership to a group if they are also a member of that group.

E.

Only workspace administrators can grant "Owner" privileges to a group.

Questions 56

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:

A.

return spark.readStream.table("bronze")

B.

return spark.readStream.load("bronze")

C.

D.

return spark.read.option("readChangeFeed", "true").table("bronze")

E.

Questions 57

A data engineer is designing a system to process batch patient encounter data stored in an S3 bucket, creating a Delta table (patient_encounters) with columns encounter_id, patient_id, encounter_date, diagnosis_code, and treatment_cost. The table is queried frequently by patient_id and encounter_date, requiring fast performance. Fine-grained access controls must be enforced. The engineer wants to minimize maintenance and boost performance.

How should the data engineer create the patient_encounters table?

Options:

A.

Create an external table in Unity Catalog, specifying an S3 location for the data files. Enable predictive optimization through table properties, and configure Unity Catalog permissions for access controls.

B.

Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.

C.

Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, schedule jobs to run OPTIMIZE and VACUUM commands daily to achieve best performance.

D.

Create a managed table in Hive Metastore. Configure Hive Metastore permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.

Questions 58

A data engineer is designing a Lakeflow Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id and amount are greater than zero. Invalid records should be dropped.

Which Lakeflow Declarative Pipelines configurations implement this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL")
        .expect_or_drop("valid_amount", "amount > 0")
    )

B.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

C.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect("valid_customer", "customer_id IS NOT NULL")
        .expect("valid_amount", "amount > 0")
    )

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 59

A data organization has adopted Delta Sharing to securely distribute curated datasets from a Unity Catalog-enabled workspace. The data engineering team shares large Delta tables internally via Databricks-to-Databricks and externally via Open Sharing for aggregated reports. While testing, they encounter challenges related to access control, data update visibility, and shareable object types.

What is a limitation of the Delta Sharing protocol or implementation when used with Databricks-to-Databricks or Open Sharing?

Options:

A.

With Open Sharing, recipients cannot access Volumes, Models, or notebooks — only static Delta tables are supported.

B.

Delta Sharing does not support Unity Catalog–enabled tables; only legacy Hive Metastore tables are shareable.

C.

With Databricks-to-Databricks sharing, Unity Catalog recipients must re-ingest data manually using COPY INTO or REST APIs.

D.

Delta Sharing (both Databricks-to-Databricks and Open Sharing) allows recipients to modify the source data if they have select privileges.

Questions 60

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

A.

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
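
For readers unfamiliar with the term, the mechanics of a broadcast join can be sketched in plain Python: the small side is turned into an in-memory lookup that every partition of the large side probes locally, so the large side does not have to be shuffled. The function `broadcast_hash_join` and the sample rows are hypothetical; this is not the Spark implementation.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a lookup
    built once from the small side."""
    # Build the lookup once; conceptually, this is what gets copied
    # to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for partition in large_partitions:
        # Each partition probes its local copy; no rows move between partitions.
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined

large = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    [{"user_id": 1, "amount": 7}, {"user_id": 3, "amount": 1}],
]
small = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "US"}]

result = broadcast_hash_join(large, small, "user_id")
print(len(result))  # 3
```

The approach only pays off when the small side fits comfortably in memory on every worker, which is why the hint applies to a whole DataFrame rather than to a single column.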

Exam Name: Databricks Certified Data Engineer Professional Exam
Last Update: May 12, 2026
Questions: 195