Exam Certified Associate Developer for Apache Spark topic 1 question 53 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 53
Topic #: 1

Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns?

A. on = [a.column1 == b.column1, a.column2 == b.column2]
B. on = [col("column1"), col("column2")]
C. on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]
D. All of these options can be used to perform an inner join with two key columns.
E. on = ["column1", "column2"]

Suggested Answer: B

by Jtic at May 29, 2023, 3:58 a.m.

Comments

NirajBhise

2 months, 3 weeks ago

Selected Answer: A

The correct answer is A. The on parameter in DataFrame.join() expects a string, a list of strings, or a single expression. The correct way to specify multiple key columns for an inner join is to use a list of column names or column expressions. Options B, C, and E are valid ways to specify the key columns for an inner join.

upvoted 1 time

azure_bimonster

12 months ago

Selected Answer: B

B cannot be used, as the column references seem ambiguous.

upvoted 1 time

Gurdel

1 year, 1 month ago

Selected Answer: B

B throws AnalysisException: [AMBIGUOUS_REFERENCE] Reference `column1` is ambiguous, could be: [`a`.`column1`, `b`.`column1`]

upvoted 1 time

juliom6

1 year, 2 months ago

Selected Answer: B

According to the following code, only option B returns an error. The key concept here is that the DataFrames must be "named" AND "aliased".

    from pyspark.sql.functions import col

    a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2'])
    b = spark.createDataFrame([(1, 2), (5, 6)], ['column1', 'column2'])
    a = a.alias('a')
    b = b.alias('b')

    # Option A
    df = a.join(b, on = [a.column1 == b.column1, a.column2 == b.column2])
    display(df)

    # Option B (commented out because it raises an error)
    # df = a.join(b, on = [col("column1"), col("column2")])

    # Option C
    df = a.join(b, on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
    display(df)

    # Option E
    df = a.join(b, on = ["column1", "column2"])
    display(df)

upvoted 3 times

newusername

1 year, 3 months ago

Selected Answer: B

100% B. Code to test is below:

    from pyspark.sql import Row

    # Sample data for DataFrame 'a'
    dataA = [Row(column1=1, column2=2), Row(column1=2, column2=4), Row(column1=3, column2=6)]
    dfA = spark.createDataFrame(dataA)

upvoted 3 times

newusername

1 year, 3 months ago

    from pyspark.sql.functions import col

    # Sample data for DataFrame 'b'
    dataB = [Row(column1=1, column2=2), Row(column1=2, column2=5), Row(column1=3, column2=4)]
    dfB = spark.createDataFrame(dataB)

    # Alias DataFrames as 'a' and 'b'
    a = dfA.alias("a")
    b = dfB.alias("b")
    a.show()
    b.show()

    # Option A
    joinedDF_A = a.join(b, [a.column1 == b.column1, a.column2 == b.column2])
    joinedDF_A.show()

    # Option B
    # joinedDF_B = a.join(b, [col("column1"), col("column2")])
    # joinedDF_B.show()

    # Option C
    joinedDF_C = a.join(b, [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
    joinedDF_C.show()

    # Option E
    joinedDF_E = a.join(b, ["column1", "column2"])
    joinedDF_E.show()

upvoted 3 times

juadaves

1 year, 3 months ago

I tried all of the options and got two errors:

B: [AMBIGUOUS_REFERENCE] Reference `Category` is ambiguous, could be: [`Category`, `Category`]

C: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df_1`.`Category` cannot be resolved. Did you mean one of the following? [`Category`, `Category`, `Truth`, `Truth`, `Value`].;

upvoted 2 times

Ahmadkt

1 year, 3 months ago

It's B. It seems you didn't alias the DataFrames:

    a = df1.alias("a")
    b = df2.alias("b")

upvoted 1 time

Singh_Sumit

1 year, 4 months ago

    from pyspark.sql.functions import col

    df2.alias('a').join(df3.alias('b'), [col("a.name") == col("b.name"), col("a.name") == col("b.name")], 'full_outer').select(df2['name'], 'height', 'age').show()

It worked, so every answer is correct.

upvoted 1 time

cookiemonster42

1 year, 6 months ago

Selected Answer: C

It should be C, since in col() we specify only a column name as a string, not a DataFrame.

upvoted 3 times

Jtic

1 year, 8 months ago

Selected Answer: A

A. on = [a.column1 == b.column1, a.column2 == b.column2] This option is valid and can be used to perform an inner join on two key columns. It specifies the key columns using the syntax a.column1 == b.column1 and a.column2 == b.column2.

upvoted 2 times

ZSun

1 year, 8 months ago

I think the question, "which one cannot be used to perform an inner join", is confusing, because only A works and the rest of the answers are incorrect. The question should be "which one can be used".

upvoted 2 times
