Given a CSV file with the content:
And the following code:
from pyspark.sql.types import *
schema = StructType([
StructField("name", StringType()),
StructField("age", IntegerType())
])
spark.read.schema(schema).csv(path).collect()
What is the resulting output?
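A hedged sketch for experimenting with this behaviour locally (the CSV content above is not shown, so the path here is a placeholder):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

# With an explicit schema, inference is skipped; in the default PERMISSIVE mode,
# values that cannot be cast to the declared type are read back as null.
rows = spark.read.schema(schema).csv("/tmp/people.csv").collect()
print(rows)  # a list of Row objects, e.g. [Row(name=..., age=...)]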
A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?
A) Use the applyInPandas API
B)
C)
D)
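For reference, a hedged sketch of the applyInPandas API (df, user_id, and clicks are hypothetical), which runs the pandas function once per group, in parallel across the executors:
import pandas as pd

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Executed once per user_id group, on the executor holding that group's data.
    pdf["clicks_norm"] = pdf["clicks"] - pdf["clicks"].mean()
    return pdf

result = (df.groupBy("user_id")
            .applyInPandas(normalize, schema="user_id string, clicks long, clicks_norm double"))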
A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.
Which action should the engineer take to resolve this issue?
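For illustration only (the values are placeholders, not a recommendation): memory for the driver and executors is normally raised at submit time, since driver memory cannot be changed after the driver JVM has started, e.g.:
spark-submit --driver-memory 8g --executor-memory 8g my_spark_app.py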
Given a DataFrame df that has 10 partitions, after running the code:
df.repartition(20)
How many partitions will the result DataFrame have?
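A quick sketch to verify (repartition returns a new DataFrame; df itself keeps its 10 partitions):
df2 = df.repartition(20)
print(df.rdd.getNumPartitions())   # 10
print(df2.rdd.getNumPartitions())  # 20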
A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.
Which code fragment should be inserted in line 5 to meet the requirement?
Code context:
spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.[LINE 5] \
.load()
Options:
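For reference (not one of the missing answer choices), the Kafka source selects topics with the subscribe option, so a fragment reading the feed topic could look like:
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "feed")
      .load())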
A data engineer needs to write a Streaming DataFrame as Parquet files.
Given the code:
Which code fragment should be inserted to meet the requirement?
A)
B)
C)
D)
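A hedged sketch of a Parquet file sink (paths are placeholders); the file sink requires a checkpoint location and the append output mode:
query = (streaming_df.writeStream
         .format("parquet")
         .option("path", "/data/output/events")                     # placeholder path
         .option("checkpointLocation", "/data/checkpoints/events")  # placeholder path
         .outputMode("append")
         .start())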
A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is too large to fit entirely in memory.
What is the likely behavior when Spark runs out of memory to store the DataFrame?
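For reference, caching with an explicit storage level; with MEMORY_AND_DISK, partitions that do not fit in memory are spilled to disk instead of being recomputed:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action materializes the cache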
Which components of Apache Spark’s architecture are responsible for carrying out tasks when assigned to them?
A data scientist is analyzing a large dataset and has written a PySpark script that includes several transformations and actions on a DataFrame. The script ends with a collect() action to retrieve the results.
How does Apache Spark™'s execution hierarchy process the operations when the data scientist runs this script?
A data engineer wants to create an external table from a JSON file located at /data/input.json with the following requirements:
Create an external table named users
Automatically infer schema
Merge records with differing schemas
Which code snippet should the engineer use?
Options:
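As a hedged sketch only (the actual answer options are not shown): an external (unmanaged) table over a JSON path can be declared with USING json, letting Spark infer the schema from the files:
spark.sql("""
    CREATE TABLE users
    USING json
    OPTIONS (path '/data/input.json')
""")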
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.
Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.
The PII columns in df_user are name, email, and birthdate.
Which code snippet can be used to meet this requirement?
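For illustration, one straightforward way to keep only the non-PII columns is to drop the PII ones:
df_user_non_pii = df_user.drop("name", "email", "birthdate")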
Which code should be used to display the schema of the Parquet file stored in the location events.parquet?
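For reference, one common way to inspect the schema:
spark.read.parquet("events.parquet").printSchema()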
A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.
Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).
The current code:
from pyspark.sql import functions as F
final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")
However, consumers report poor query performance.
Which change will enable efficient querying by year and month?
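For comparison only (a hedged sketch, not necessarily the intended answer option): partitioning the output on the derived year and month columns lets readers prune files when filtering on year and month:
(df.withColumn("event_year", F.year("event_ts"))
   .withColumn("event_month", F.month("event_ts"))
   .write
   .partitionBy("event_year", "event_month")
   .format("parquet")
   .saveAsTable("events.liveLatest"))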
A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.
What should the developer do to improve cluster utilization?
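For illustration only (the value is a placeholder), the task count of SQL shuffle stages is governed by spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.shuffle.partitions", "400")  # placeholder value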
A data engineer wants to process a streaming DataFrame that receives sensor readings every second with columns sensor_id, temperature, and timestamp. The engineer needs to calculate the average temperature for each sensor over the last 5 minutes while the data is streaming.
Which code implementation achieves the requirement?
Options (from the provided images):
A)
B)
C)
D)
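A hedged sketch of one way to express this (the stream name and the 5-minute watermark are assumptions; the real answer options are not shown):
from pyspark.sql.functions import window, avg

result = (sensor_stream
          .withWatermark("timestamp", "5 minutes")                 # assumed lateness bound
          .groupBy("sensor_id", window("timestamp", "5 minutes"))
          .agg(avg("temperature").alias("avg_temperature")))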
A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.
Which operation results in a shuffle?
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
region_id | region_name
10        | North
12        | East
14        | West
The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.
Which code fragment meets the requirements?
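One hedged way to build such a dictionary (the table path is a placeholder):
rows = (spark.read.parquet("/data/regions")   # placeholder path
        .orderBy("region_id")
        .limit(3)
        .collect())
region_map = {row.region_id: row.region_name for row in rows}
# {10: 'North', 12: 'East', 14: 'West'}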
A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp. The engineer adds:
dropDuplicatesWithinWatermark("event_timestamp", "30 minutes")
What is the result?
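For reference, the documented pattern pairs this call with a watermark on the event-time column and passes only the deduplication key columns (the key column name below is a placeholder):
deduped = (events
           .withWatermark("event_timestamp", "30 minutes")
           .dropDuplicatesWithinWatermark(["key"]))  # placeholder key column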
A data scientist wants each record in the DataFrame to contain:
The entire contents of a file
The full file path
The first attempt does read the text files, but each record contains a single line: the text source reads line by line rather than one full file per record.
Code:
corpus = spark.read.text("/datasets/raw_txt/*") \
.select('*', '_metadata.file_path')
Which change will ensure one record per file?
Options:
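A hedged sketch of the usual fix: the text source's wholetext option makes each file a single record, while the _metadata column still provides the path:
corpus = (spark.read.option("wholetext", "true").text("/datasets/raw_txt/*")
          .select("*", "_metadata.file_path"))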
A Spark application needs to read multiple Parquet files from a directory where the files have differing but compatible schemas.
The data engineer wants to create a DataFrame that includes all columns from all files.
Which code should the data engineer use to read the Parquet files and include all columns using Apache Spark?
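For reference, a hedged sketch of schema merging on read (the directory path is a placeholder):
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("/data/parquet_dir"))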
Given the schema:
event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING
The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.
Options:
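A hedged sketch of deduplicating on exactly those three columns:
deduped = df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])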
A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?
Given the following code snippet in my_spark_app.py:
What is the role of the driver node?
A data engineer has written the following code to join two DataFrames df1 and df2:
df1 = spark.read.csv("sales_data.csv", header=True)
df2 = spark.read.csv("product_data.csv", header=True)
df_joined = df1.join(df2, df1.product_id == df2.product_id)
The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.
Which join strategy will Spark use?
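For reference (a hedged sketch reusing the question's column names): tables under spark.sql.autoBroadcastJoinThreshold (10 MB by default) are eligible for a broadcast hash join, the chosen strategy can be checked in the physical plan, and a broadcast can also be requested explicitly:
from pyspark.sql.functions import broadcast

df_joined = df1.join(broadcast(df2), df1.product_id == df2.product_id)
df_joined.explain()  # look for BroadcastHashJoin in the physical plan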
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
The resulting Python dictionary must contain a mapping of region -> region_id, containing the smallest 3 region_id values.
Which code fragment meets the requirements?
A)
B)
C)
D)
Given this code:
.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()
What happens to data that arrives after the watermark threshold?
Options:
Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?
An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:
df1: employee_id INT, name STRING
df2: emp_id INT, department STRING
The engineer uses:
result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')
What is the behaviour of the code snippet?
An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.
The initial code is:
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')  # loaded on every call
    return df.apply(model)
in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is loaded?
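One commonly suggested pattern (a hedged sketch, reusing the question's hypothetical get_translation_model helper) is the Iterator-of-Series pandas UDF, which lets the model be loaded once per executor process and reused across batches:
from typing import Iterator
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang='es')  # loaded once per worker process
    for batch in batches:
        yield batch.apply(model)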
A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)
A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.
Which combination of Apache Spark modules should the data scientist use in this scenario?
Options:
A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.
Which operation results in a shuffle and a new stage?
A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.
Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.
Which operation is AQE implementing to automatically improve the Spark application's performance?
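For reference, AQE and its main sub-features are configuration-driven (AQE itself is on by default since Spark 3.2):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")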
A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.
The first attempt does read the text files, but each record contains a single line. This code is shown below:
from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)  # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add full path
Which code change can be implemented in a DataFrame that meets the data scientist's requirements?
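A hedged sketch of the usual fix: pass wholetext=True so each file becomes one record, keeping the path column:
from pyspark.sql.functions import input_file_name

df = spark.read.text(txt_path, wholetext=True)      # one row per file
df = df.withColumn("file_path", input_file_name())  # full path of the source file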