Exam DP-600 topic 1 question 51 discussion

Actual exam question from Microsoft's DP-600

Question #: 51
Topic #: 1

You are analyzing customer purchases in a Fabric notebook by using PySpark.
You have the following DataFrames:
transactions: Contains five columns named transaction_id, customer_id, product_id, amount, and date and has 10 million rows, with each row representing a transaction.
customers: Contains customer details in 1,000 rows and three columns named customer_id, name, and country.
You need to join the DataFrames on the customer_id column. The solution must minimize data shuffling.
You write the following code.
from pyspark.sql import functions as F
results =
Which code should you run to populate the results DataFrame?

A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)
B. transactions.join(customers, transactions.customer_id == customers.customer_id).distinct()
C. transactions.join(customers, transactions.customer_id == customers.customer_id)
D. transactions.crossJoin(customers).where(transactions.customer_id == customers.customer_id)

Suggested Answer: A

by Momoanwar at Feb. 17, 2024, 8:25 p.m.

Comments

Momoanwar

Highly Voted 10 months ago

Selected Answer: A

In Apache Spark, broadcasting is an optimization technique for join operations. When you join two DataFrames or RDDs and one of them is significantly smaller than the other, Spark can "broadcast" the smaller table to all nodes in the cluster. Each node then joins its own partitions of the larger table against its local copy, so the rows of the larger table do not have to be shuffled across the network. This can significantly reduce the execution time of the join operation.
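To make the mechanism concrete, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is not Spark code and does not reflect Spark's internals; the function name broadcast_hash_join and the sample rows are hypothetical, and the "partitions" are just lists standing in for data held on different nodes.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large table against a full copy of the small table."""
    # Build a hash lookup from the small side once; this stands in for the broadcast copy.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined in place; no rows of the large side change partition.
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical sample data: two customers, three transactions spread over two partitions.
customers = [
    {"customer_id": 1, "name": "Ana", "country": "PT"},
    {"customer_id": 2, "name": "Ben", "country": "US"},
]
transaction_partitions = [
    [{"transaction_id": 10, "customer_id": 1, "amount": 5.0}],
    [
        {"transaction_id": 11, "customer_id": 2, "amount": 7.5},
        {"transaction_id": 12, "customer_id": 3, "amount": 1.0},  # no matching customer
    ],
]

result = broadcast_hash_join(transaction_partitions, customers, "customer_id")
```

The output keeps the same number of partitions as the input, which is the point: only the small table was copied, and the large table stayed where it was.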

upvoted 31 times

sraakesh95

Highly Voted 9 months, 2 weeks ago

Selected Answer: A

A - Broadcasting places a copy of the smaller DataFrame on every node in the Spark cluster. During the join, each node can therefore match its share of the larger DataFrame locally without fetching rows from other nodes, which reduces the shuffling required.
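A rough back-of-the-envelope comparison, using the row counts from the question, shows why this matters. This is a simplified model, not a measurement: it assumes a shuffle-based join repartitions every row of both sides, it ignores the cost of collecting the small table before broadcasting it, and the executor count of 8 is a hypothetical value chosen only for illustration.

```python
def rows_moved_shuffle_join(large_rows, small_rows):
    # Simplified upper bound: both sides are repartitioned by the join key.
    return large_rows + small_rows


def rows_moved_broadcast_join(small_rows, num_executors):
    # Only the small side moves: one full copy is sent to each executor.
    return small_rows * num_executors


TRANSACTION_ROWS = 10_000_000  # from the question
CUSTOMER_ROWS = 1_000          # from the question
NUM_EXECUTORS = 8              # hypothetical cluster size

shuffle_rows = rows_moved_shuffle_join(TRANSACTION_ROWS, CUSTOMER_ROWS)
broadcast_rows = rows_moved_broadcast_join(CUSTOMER_ROWS, NUM_EXECUTORS)

print(f"shuffle join:   up to {shuffle_rows:,} rows moved")
print(f"broadcast join: about {broadcast_rows:,} rows moved")
```

Under these assumptions the broadcast approach moves on the order of thousands of rows instead of millions, and the gap holds for any realistic executor count because the customers table is so small.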

upvoted 7 times

282b85d

Most Recent 6 months, 2 weeks ago

Selected Answer: A

Broadcasting: The F.broadcast(customers) function is used to broadcast the smaller DataFrame (customers). This ensures that the smaller DataFrame is replicated across all nodes, and each node can perform the join locally with its partition of the larger DataFrame (transactions). This significantly reduces the data movement (shuffling) required during the join operation.

upvoted 1 times

stilferx

7 months, 1 week ago

Selected Answer: A

IMHO, "A" is correct! A broadcast join copies the smaller table to each worker in Spark, which may significantly improve performance by reducing shuffling.

upvoted 3 times

SamuComqi

10 months ago

Selected Answer: A

A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id) is the optimized way to join a very large table with a much smaller one. Source: https://sparkbyexamples.com/spark/broadcast-join-in-spark/

upvoted 2 times
