Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns?
A.
on = [a.column1 == b.column1, a.column2 == b.column2]
B.
on = [col("column1"), col("column2")]
C.
on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]
D.
All of these options can be used to perform an inner join with two key columns.
The correct answer is: A
This is because the on parameter in DataFrame.join() expects either a string, list of strings, or a single expression. The correct way to specify multiple key columns for an inner join is to use a list of column names or column expressions.
Options B, C, and E are valid ways to specify the key columns for an inner join.
According to the following code, only response B returns an error. The key concept here is that dataframes must be "named" AND "aliased".
from pyspark.sql.functions import col
a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2'])
b = spark.createDataFrame([(1, 2), (5, 6)], ['column1', 'column2'])
a = a.alias('a')
b = b.alias('b')
df = a.join(b, on = [a.column1 == b.column1, a.column2 == b.column2])
display(df)
# df = a.join(b, on = [col("column1"), col("column2")])
df = a.join(b, on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
display(df)
df = a.join(b, on = ["column1", "column2"])
display(df)
I tried all of the options and I got 2 errors from:
B
AMBIGUOUS_REFERENCE] Reference `Category` is ambiguous, could be: [`Category`, `Category`]
C:
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df_1`.`Category` cannot be resolved.
Did you mean one of the following? [`Category`, `Category`, `Truth`, `Truth`, `Value`].;
from pyspark.sql.functions import col
df2.alias('a').join(df3.alias('b'),
[col("a.name") == col("b.name"), col("a.name") == col("b.name")],
'full_outer').select(df2['name'],'height','age').show()
It worked. so every answer is correct.
A. on = [a.column1 == b.column1, a.column2 == b.column2]
This option is valid and can be used to perform an inner join on two key columns. It specifies the key columns using the syntax a.column1 == b.column1 and a.column2 == b.column2.
I think the question "which one cannot be used to perform inner join", is confusing,
Because only A works, the rest of answer is incorrect.
The question should be "which one can be used"
A voting comment increases the vote count for the chosen answer by one.
Upvoting a comment with a selected answer will also increase the vote count towards that answer by one.
So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.
NirajBhise
2 months, 3 weeks agoazure_bimonster
12 months agoGurdel
1 year, 1 month agojuliom6
1 year, 2 months agonewusername
1 year, 3 months agonewusername
1 year, 3 months agojuadaves
1 year, 3 months agoAhmadkt
1 year, 3 months agoSingh_Sumit
1 year, 4 months agocookiemonster42
1 year, 6 months agoJtic
1 year, 8 months agoZSun
1 year, 8 months ago