Certified Data Engineer Professional Exam: Topic 1, Question 110 Discussion

Actual exam question from the Databricks Certified Data Engineer Professional exam
Question #: 110
Topic #: 1
[All Certified Data Engineer Professional Questions]

A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.

Which of the following solutions would you implement to achieve this requirement?

  • A. Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
  • B. Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
  • C. Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
  • D. Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
  • E. Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.
Suggested Answer: B
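A minimal sketch of what option B looks like in practice, assuming a PySpark notebook with Delta Lake available; the source path and the bronze_events / ingest_ts / ingest_hour names are illustrative assumptions, not part of the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw feed; each of the many pipelines would read its own source.
raw = spark.read.json("/mnt/landing/events/")

# Derive a small time-duration partition column (here one hour) so that
# concurrent pipelines append into different partitions, and therefore
# different data files, of the same Delta table.
events = raw.withColumn("ingest_hour", F.date_trunc("hour", F.col("ingest_ts")))

(events.write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_hour")  # option B: partition the ingestion table by a short time window
    .saveAsTable("bronze_events"))
```

Appends to disjoint time partitions land in separate data files, so many pipelines can commit to the same table concurrently without rewriting each other's files under Delta Lake's optimistic concurrency control.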

Comments

Er5
1 month, 1 week ago
A. Option B is only useful for improving the performance of large table ingestion.
upvoted 1 times
Curious76
2 months, 2 weeks ago
Selected Answer: D
Why not D?
upvoted 2 times
vctrhugo
3 months, 1 week ago
Both options A and B could be relevant depending on the specific details of the use case. If the emphasis is on optimizing concurrent queries and overall data throughput, option A might be more appropriate. If the primary concern is parallel updates of tables with high-volume, high-velocity data, option B is a more targeted approach.
upvoted 1 times
PrincipalJoe
3 months, 2 weeks ago
Selected Answer: B
The best way to deal with high-volume, high-velocity data is to use partitioning.
upvoted 1 times
bacckom
4 months, 1 week ago
Selected Answer: A
Databricks High Concurrency cluster
upvoted 1 times
petrv
5 months, 2 weeks ago
Selected Answer: A
1) Partitioning by time: partitioning tables by a small time duration allows for efficient parallelism in data writes. Each time partition can be processed independently, enabling parallel updates to multiple partitions concurrently.
2) Optimizing for parallelism: by partitioning the tables based on time, data can be ingested and processed in parallel, providing the ability to handle high-volume, high-velocity data effectively.
Regarding option A, Databricks High Concurrency clusters are more focused on supporting a large number of concurrent users, which might not directly address the requirement for parallel updates of many tables with extremely high-volume, high-velocity data (the streaming sketch after this thread illustrates the partitioning pattern).
upvoted 1 times
petrv
5 months, 2 weeks ago
sorry, the selected answer should have been B
upvoted 1 times
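As a rough streaming illustration of the parallel-ingestion pattern petrv describes above, here is a Structured Streaming sketch; the rate source, checkpoint path, and table name are stand-ins, not taken from the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in high-velocity stream; in practice each pipeline would read from
# its own Kafka topic, Auto Loader path, or similar source.
stream = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 1000)
    .load()
    .withColumn("ingest_hour", F.date_trunc("hour", F.col("timestamp"))))

# Each pipeline appends into hourly partitions of its target Delta table;
# hundreds of such queries can run side by side, each writing its own files.
(stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
    .partitionBy("ingest_hour")
    .toTable("bronze_events"))
```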
aragorn_brego
5 months, 4 weeks ago
Selected Answer: A
High Concurrency clusters in Databricks are designed for multiple concurrent users and workloads. They provide fine-grained sharing of cluster resources and are optimized for operations such as running multiple parallel queries and updates. This would be suitable for a solution that involves many pipelines with parallel updates, especially with high volume and high velocity data.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other