Exam Certified Associate Developer for Apache Spark topic 1 question 53 discussion

Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns?

  • A. on = [a.column1 == b.column1, a.column2 == b.column2]
  • B. on = [col("column1"), col("column2")]
  • C. on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]
  • D. All of these options can be used to perform an inner join with two key columns.
  • E. on = ["column1", "column2"]
Suggested Answer: B
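
One way to check the suggested answer is to run each option against two small aliased DataFrames and record which ones fail. The sketch below is only an illustration: the helper name `try_join_options` is hypothetical, the sample rows are borrowed from a comment in this thread, and it assumes a working local PySpark installation and an existing `SparkSession` passed in as `spark`.

```python
def try_join_options(spark):
    """Try options A, B, C and E from the question on two aliased DataFrames.

    Returns a dict mapping each option letter to the joined row count,
    or to the exception class name if the join could not be evaluated.
    """
    # Imported here so that loading this sketch does not itself require Spark.
    from pyspark.sql.functions import col

    # Sample data is an assumption; both frames share the same column names.
    a = spark.createDataFrame([(1, 2), (3, 4)], ["column1", "column2"]).alias("a")
    b = spark.createDataFrame([(1, 2), (5, 6)], ["column1", "column2"]).alias("b")

    # Each `on` argument is built lazily so one failure does not block the rest.
    options = {
        "A": lambda: [a.column1 == b.column1, a.column2 == b.column2],
        "B": lambda: [col("column1"), col("column2")],
        "C": lambda: [col("a.column1") == col("b.column1"),
                      col("a.column2") == col("b.column2")],
        "E": lambda: ["column1", "column2"],
    }

    results = {}
    for label, build_on in options.items():
        try:
            # count() forces evaluation in case the error is only raised lazily.
            results[label] = a.join(b, on=build_on(), how="inner").count()
        except Exception as exc:
            results[label] = type(exc).__name__
    return results
```

Calling `try_join_options(spark)` should give a row count for A, C and E and an exception name for B. Several commenters below report an `AnalysisException` with `[AMBIGUOUS_REFERENCE]` for that option, since `col("column1")` does not say which side of the join it refers to.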

Comments

NirajBhise
2 months, 3 weeks ago
Selected Answer: A
The correct answer is A. This is because the on parameter in DataFrame.join() expects either a string, a list of strings, or a single expression. The correct way to specify multiple key columns for an inner join is to use a list of column names or column expressions. Options B, C, and E are valid ways to specify the key columns for an inner join.
upvoted 1 times
...
azure_bimonster
12 months ago
Selected Answer: B
B cannot be used as this seems ambiguous
upvoted 1 times
...
Gurdel
1 year, 1 month ago
Selected Answer: B
B throws AnalysisException: [AMBIGUOUS_REFERENCE] Reference `column1` is ambiguous, could be: [`a`.`column1`, `b`.`column1`]
upvoted 1 times
...
juliom6
1 year, 2 months ago
Selected Answer: B
According to the following code, only option B returns an error. The key concept here is that the DataFrames must be "named" AND "aliased".

    from pyspark.sql.functions import col

    a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2'])
    b = spark.createDataFrame([(1, 2), (5, 6)], ['column1', 'column2'])
    a = a.alias('a')
    b = b.alias('b')

    # Option A
    df = a.join(b, on = [a.column1 == b.column1, a.column2 == b.column2])
    display(df)

    # Option B (commented out because it raises an error)
    # df = a.join(b, on = [col("column1"), col("column2")])

    # Option C
    df = a.join(b, on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
    display(df)

    # Option E
    df = a.join(b, on = ["column1", "column2"])
    display(df)
upvoted 3 times
...
newusername
1 year, 3 months ago
Selected Answer: B
100% B. Code to test is below:

    from pyspark.sql import Row

    # Sample data for DataFrame 'a'
    dataA = [Row(column1=1, column2=2), Row(column1=2, column2=4), Row(column1=3, column2=6)]
    dfA = spark.createDataFrame(dataA)
upvoted 3 times
newusername
1 year, 3 months ago
    from pyspark.sql import Row
    from pyspark.sql.functions import col

    # Sample data for DataFrame 'b'
    dataB = [Row(column1=1, column2=2), Row(column1=2, column2=5), Row(column1=3, column2=4)]
    dfB = spark.createDataFrame(dataB)

    # Alias DataFrames as 'a' and 'b'
    a = dfA.alias("a")
    b = dfB.alias("b")
    a.show()
    b.show()

    # Option A
    joinedDF_A = a.join(b, [a.column1 == b.column1, a.column2 == b.column2])
    joinedDF_A.show()

    # Option B
    # joinedDF_B = a.join(b, [col("column1"), col("column2")])
    # joinedDF_B.show()

    # Option C
    joinedDF_C = a.join(b, [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
    joinedDF_C.show()

    # Option E
    joinedDF_E = a.join(b, ["column1", "column2"])
    joinedDF_E.show()
upvoted 3 times
...
...
juadaves
1 year, 3 months ago
I tried all of the options and got two errors:
B: [AMBIGUOUS_REFERENCE] Reference `Category` is ambiguous, could be: [`Category`, `Category`]
C: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df_1`.`Category` cannot be resolved. Did you mean one of the following? [`Category`, `Category`, `Truth`, `Truth`, `Value`].;
upvoted 2 times
Ahmadkt
1 year, 3 months ago
It's B. It seems you didn't alias the DataFrames:

    a = df1.alias("a")
    b = df2.alias("b")
upvoted 1 times
...
...
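
The exchange above suggests that option C only resolves when the DataFrames actually carry the aliases "a" and "b". A minimal sketch of that difference follows. The helper name `qualified_col_needs_alias` and the sample data are hypothetical, and it assumes a local PySpark installation with a `SparkSession` passed in as `spark`.

```python
def qualified_col_needs_alias(spark):
    """Run the option C join condition with and without DataFrame aliases.

    Returns a dict with the joined row count for each attempt, or the
    exception class name if the condition could not be resolved.
    """
    # Imported here so that loading this sketch does not itself require Spark.
    from pyspark.sql.functions import col

    # Sample data is an assumption; neither frame is aliased yet.
    df1 = spark.createDataFrame([(1, 2), (3, 4)], ["column1", "column2"])
    df2 = spark.createDataFrame([(1, 2), (5, 6)], ["column1", "column2"])

    # Option C from the question: names qualified by the aliases "a" and "b".
    on = [col("a.column1") == col("b.column1"),
          col("a.column2") == col("b.column2")]

    def attempt(left, right):
        try:
            # count() forces evaluation in case the error is only raised lazily.
            return left.join(right, on=on, how="inner").count()
        except Exception as exc:
            return type(exc).__name__

    return {
        "without_alias": attempt(df1, df2),
        "with_alias": attempt(df1.alias("a"), df2.alias("b")),
    }
```

Without the aliases, the qualifier `a.` has nothing to refer to, which matches the `[UNRESOLVED_COLUMN.WITH_SUGGESTION]` error quoted above. With `alias("a")` and `alias("b")` applied, the same condition should resolve.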
Singh_Sumit
1 year, 4 months ago
    from pyspark.sql.functions import col

    df2.alias('a').join(
        df3.alias('b'),
        [col("a.name") == col("b.name"), col("a.name") == col("b.name")],
        'full_outer'
    ).select(df2['name'], 'height', 'age').show()

It worked, so every answer is correct.
upvoted 1 times
...
cookiemonster42
1 year, 6 months ago
Selected Answer: C
should be C as in col() we specify only a column name as a string, not a dataframe
upvoted 3 times
...
Jtic
1 year, 8 months ago
Selected Answer: A
A. on = [a.column1 == b.column1, a.column2 == b.column2] This option is valid and can be used to perform an inner join on two key columns. It specifies the key columns using the syntax a.column1 == b.column1 and a.column2 == b.column2.
upvoted 2 times
ZSun
1 year, 8 months ago
I think the question "which one cannot be used to perform inner join" is confusing, because only A works and the rest of the answers are incorrect. The question should be "which one can be used".
upvoted 2 times
...
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other