Welcome to ExamTopics
ExamTopics Logo
- Expert Verified, Online, Free.

Unlimited Access

Get Unlimited Contributor Access to the all ExamTopics Exams!
Take advantage of PDF Files for 1000+ Exams along with community discussions and pass IT Certification Exams Easily.

Exam DP-203 topic 1 question 39 discussion

Actual exam question from Microsoft's DP-203
Question #: 39
Topic #: 1
[All DP-203 Questions]

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.

SELECT -
SupplierKey, StockItemKey, COUNT(*)

FROM FactPurchase -

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -
GROUP By SupplierKey, StockItemKey
Which table distribution will minimize query times?

  • A. replicated
  • B. hash-distributed on PurchaseKey
  • C. round-robin
  • D. hash-distributed on DateKey
Show Suggested Answer Hide Answer
Suggested Answer: B 🗳️
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. Round-robin tables are useful for improving loading speed.
Incorrect:
Not D: Do not use a date column. . All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Comments

Chosen Answer:
This is a voting comment (?) , you can switch to a simple comment.
Switch to a voting comment New
AugustineUba
Highly Voted 2 years, 8 months ago
From the documentation the answer is clear enough. B is the right answer. When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."
upvoted 60 times
YipingRuan
2 years, 6 months ago
To minimize data movement, select a distribution column that: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. "PurchaseKey" is not used in the group by
upvoted 8 times
cem_kalender
1 year, 5 months ago
A distribution column should have high cardinality to ensure even distribution over nodes.
upvoted 2 times
...
...
YipingRuan
2 years, 6 months ago
Consider using the round-robin distribution for your table in the following scenarios: When getting started as a simple starting point since it is the default If there is no obvious joining key If there is no good candidate column for hash distributing the table If the table does not share a common join key with other tables If the join is less significant than other joins in the query
upvoted 7 times
...
...
waterbender19
Highly Voted 2 years, 8 months ago
I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype not a DATE datatype.
upvoted 19 times
waterbender19
2 years, 8 months ago
and thet statement that Fact table will be added 1 million rows daily means that each datekey value has an equal amount of rows associated with that value.
upvoted 5 times
Lucky_me
2 years, 3 months ago
But the DateKey is used in the WHERE clause.
upvoted 2 times
kamil_k
2 years, 1 month ago
I agree, date key is int, and besides, even if it was a date, when you query a couple days then 1 million rows per distribution is not that much. So what if you are going to use only a couple distributions to do the job? Isn't it still faster than using all distributions to process all of the records to get the required date range?
upvoted 1 times
...
...
...
AnandEMani
2 years, 7 months ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute this link says date filed , NOT a date Data type. B is correct
upvoted 7 times
...
kamil_k
2 years, 1 month ago
n.b. if we look at the example query itself the date range is 31 days so we will use 31 distributions out of 60, and only process ~31 million records
upvoted 2 times
...
...
kitesh1994
Most Recent 4 days, 3 hours ago
Question 20 and 39 Is same
upvoted 1 times
...
AKTommy
1 month, 2 weeks ago
Selected Answer: B
B is my correct answer
upvoted 2 times
...
pawades
2 months, 1 week ago
Explain me something - if you use Purchasekey as a hash distribution, and then want to do a partition, which column will you use for partition, we mostly date column for partition, but if we use date column then during query execution where you want to query data for let's say Jan month, wouldn't the query will need data from multiple nodes? eventually slowing the results? isn't it easy to keep data on a single node get faster results. Am I missing anything here?
upvoted 1 times
...
moneytime
2 months, 3 weeks ago
For me. I chose D. The reason is that the" datekey" is of tyoe "integer" not " Date" .This qualifies it to be used as a non-auto-incremental surrogate key for the fact table.
upvoted 1 times
lola_mary5
2 months, 2 weeks ago
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choose-a-distribution-column-with-data-that-distributes-evenly
upvoted 1 times
...
...
jongert
4 months ago
Selected Answer: B
Something not immediately clear to me was that distributing and partitioning are different, hence I was confused that one should not distribute over date columns. Bottom line is, do not distribute over date columns but you can partition over them. In this question they specifically ask about distribution method. Query optimization for large tables directly points to hashing.
upvoted 3 times
...
MarkJoh
4 months, 3 weeks ago
You want to distribute by productKey and partition by date. Then all distributions will be looked at in parallel and then, within each distribution, only the desired partitions will be looked at. Thereby, the query is fully scaled out and the quickest it can be.
upvoted 4 times
...
AlejandroU
6 months ago
B) the chosen distribution column should not be used in WHERE clauses; thus, we can discard DateKey (even though it is not a Date data type) to minimize data movement. The chosen distribution column must have many unique values; thus we potentially have 2 candidates: PurchaseKey or PurchaseOrderID; however, the chosen one should have no NULLS or only a few, making PurchaseKey the ideal in order to distribute evenly. https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 3 times
...
Vanq69
6 months, 3 weeks ago
Selected Answer: D
Is there any "official" answer to this? A. Replicated: Replicated tables have copies of the entire table on each distribution. While this option can eliminate data movement, it may not be the most efficient choice for very large tables with frequent updates. B. Hash-Distributed on PurchaseKey: Hash distribution on "PurchaseKey" may lead to data skew if "PurchaseKey" doesn't have a wide range of unique values. Additionally, it doesn't align with the primary filtering condition on "DateKey." C. Round-Robin: Round-robin distribution ensures even data distribution, but it doesn't take advantage of data locality for specific types of queries. D. Hash-Distributed on DateKey: Distributing on "DateKey" aligns with your primary filtering condition, but it's a date column. This could lead to clustering by date, especially if many users filter on the same date. None of the answers seem to fit. D could be the best guess but it's a date column.
upvoted 2 times
...
kkk5566
7 months, 3 weeks ago
Selected Answer: B
B is correct
upvoted 1 times
...
gozdek
9 months, 4 weeks ago
Selected Answer: C
B is total nonsense if PurchaseKey has a unique value for every row it would end up distributing it evenly so same as round-robin. Distributing by date would slow down the query because in a situation presented in the question only 31 out of 60 distributions would be used. So in my opinion C is the correct answer.
upvoted 1 times
...
SHENOOOO
1 year, 2 months ago
Selected Answer: B
B is the correct Answer
upvoted 2 times
...
DindaS
1 year, 3 months ago
To me the answer should be D. A query on the table that has a WHERE clause filtering on column A will perform partition elimination and scan one partition. That same query may run faster in scenario 2 as there are fewer rows to scan in a partition. A query that has a WHERE clause filtering on column B will scan all partitions. The query may run faster in scenario 1 than in scenario 2 as there are fewer partitions to scan. https://learn.microsoft.com/en-us/sql/relational-databases/partitions/partitioned-tables-and-indexes?view=sql-server-ver16 Wanted to hear from the experts here.
upvoted 1 times
...
shakes103
1 year, 3 months ago
Selected Answer: B
B is the obvious answer. Hash is optimized for higher analytical performance while Round-robin is optimized for higher loading speed.
upvoted 2 times
...
Deeksha1234
1 year, 9 months ago
B is right
upvoted 1 times
...
vlad888
1 year, 10 months ago
Anyone who even one time run similar query in Synapse and look into execution plan understand that PurchaseKey doesn't help: there will be shuffle move dms operation! I suppose all these qyestions has mistake here. Because only column from GROUP BY clause will help. Or round_robin (although it will has almost the same cost as PurchaseKey if last one evenly distributed)
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...