Exam DP-600 topic 1 question 51 discussion

Actual exam question from Microsoft's DP-600
Question #: 51
Topic #: 1

You are analyzing customer purchases in a Fabric notebook by using PySpark.
You have the following DataFrames:
  • transactions: Contains five columns named transaction_id, customer_id, product_id, amount, and date and has 10 million rows, with each row representing a transaction.
  • customers: Contains customer details in 1,000 rows and three columns named customer_id, name, and country.
You need to join the DataFrames on the customer_id column. The solution must minimize data shuffling.
You write the following code.
from pyspark.sql import functions as F
results =
Which code should you run to populate the results DataFrame?

  • A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)
  • B. transactions.join(customers, transactions.customer_id == customers.customer_id).distinct()
  • C. transactions.join(customers, transactions.customer_id == customers.customer_id)
  • D. transactions.crossJoin(customers).where(transactions.customer_id == customers.customer_id)
Suggested Answer: A

Comments

Momoanwar
Highly Voted 10 months ago
Selected Answer: A
In Apache Spark, broadcasting is an optimization technique for join operations. When you join two DataFrames or RDDs and one of them is significantly smaller than the other, Spark can "broadcast" the smaller table to all nodes in the cluster. Each node then joins its share of the larger table locally, so the rows of the larger table do not need to be shuffled across the network, which can significantly reduce the execution time of the join.
upvoted 31 times
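To put rough numbers on the explanation above, here is a back-of-the-envelope sketch in plain Python (not Spark). The row counts come from the question; the executor count and the two helper functions are assumptions made up for illustration, and the shuffle figure is an upper bound, since some rows may already sit on the right node.

```python
# Toy estimate of how many rows each join strategy sends over the network.
# Row counts are taken from the question; NUM_EXECUTORS is a hypothetical value.

TRANSACTIONS_ROWS = 10_000_000
CUSTOMERS_ROWS = 1_000
NUM_EXECUTORS = 8  # assumption: the question does not state the cluster size


def shuffle_join_rows_moved(big_rows: int, small_rows: int) -> int:
    """Upper bound for a shuffle-based join: both sides are repartitioned by key."""
    return big_rows + small_rows


def broadcast_join_rows_moved(small_rows: int, num_executors: int) -> int:
    """Broadcast join: only the small side moves, one full copy per executor."""
    return small_rows * num_executors


shuffled = shuffle_join_rows_moved(TRANSACTIONS_ROWS, CUSTOMERS_ROWS)
broadcasted = broadcast_join_rows_moved(CUSTOMERS_ROWS, NUM_EXECUTORS)

print(f"shuffle join:   up to {shuffled:,} rows moved")
print(f"broadcast join: about {broadcasted:,} rows moved")
```

Under these assumptions the broadcast side moves a few thousand rows while the shuffle side can move on the order of ten million, which is why option A fits the "minimize data shuffling" requirement.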
sraakesh95
Highly Voted 9 months, 2 weeks ago
Selected Answer: A
A. Broadcasting places a copy of the data on every node in the Spark cluster. During the join, a node therefore does not need to fetch that data from other nodes, which reduces shuffling.
upvoted 7 times
282b85d
Most Recent 6 months, 2 weeks ago
Selected Answer: A
Broadcasting: The F.broadcast(customers) function is used to broadcast the smaller DataFrame (customers). This ensures that the smaller DataFrame is replicated across all nodes, and each node can perform the join locally with its partition of the larger DataFrame (transactions). This significantly reduces the data movement (shuffling) required during the join operation.
upvoted 1 times
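The "each node joins locally with its partition" idea can be sketched in plain Python as a toy model. This is not how Spark is implemented; the function name broadcast_hash_join and all sample rows are invented for illustration. Lists stand in for partitions of transactions, and a dictionary stands in for the broadcast copy of customers.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows, key):
    """Inner-join each partition of the large side against a full copy of the small side."""
    # Build the hash table once; in a real cluster every node would hold its own copy.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in big_partitions:
        # Probe locally: no row of the large side leaves its partition.
        joined = [
            {**big_row, **small_row}
            for big_row in partition
            for small_row in lookup.get(big_row[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Invented sample data, shaped like the DataFrames in the question.
customers = [
    {"customer_id": 1, "name": "Ana", "country": "PT"},
    {"customer_id": 2, "name": "Bo", "country": "SE"},
]
transactions_partitions = [
    [{"transaction_id": 10, "customer_id": 1, "amount": 5.0}],
    [
        {"transaction_id": 11, "customer_id": 2, "amount": 7.5},
        {"transaction_id": 12, "customer_id": 3, "amount": 1.0},  # no matching customer
    ],
]

result = broadcast_hash_join(transactions_partitions, customers, "customer_id")
for index, partition in enumerate(result):
    print(index, partition)
```

The output keeps the same number of partitions as the input, which mirrors the point made above: the large side stays where it is and only the small lookup table is replicated.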
stilferx
7 months, 1 week ago
Selected Answer: A
IMHO, "A" is correct! A broadcast join copies the smaller table to each worker in Spark, which may significantly improve performance by reducing shuffling.
upvoted 3 times
SamuComqi
10 months ago
Selected Answer: A
A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)
This is an optimized way to join a very large table with a much smaller one. Source: https://sparkbyexamples.com/spark/broadcast-join-in-spark/
upvoted 2 times
Community vote distribution
A (35%)
C (25%)
B (20%)
Other