Professional Data Engineer exam, Topic 1, Question 42 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 42
Topic #: 1

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than before. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data volume means the batch jobs are falling behind. You have been asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

  • A. Rewrite the job in Pig.
  • B. Rewrite the job in Apache Spark.
  • C. Increase the size of the Hadoop cluster.
  • D. Decrease the size of the Hadoop cluster but also rewrite the job in Hive.
Suggested Answer: A

Comments

jvg637
Highly Voted 4 years, 1 month ago
I would say B since Apache Spark is faster than Hadoop/Pig/MapReduce
upvoted 35 times
Trocinek
1 month, 1 week ago
But it requires much more memory, making it more expensive, which is not what we're aiming for here.
upvoted 1 times
[Removed]
Highly Voted 4 years, 1 month ago
Answer: B. Spark performs in-memory processing and is faster, which optimizes the job's processing time.
upvoted 17 times
axantroff
Most Recent 5 months, 1 week ago
Selected Answer: B
Just a regular Spark. B
upvoted 1 times
DataFrame
5 months, 1 week ago
C. I think it should be C because the intent of the question is to recognize the problem of on-prem scaling, not the optimization we achieve with Spark's in-memory features. It's a GCP exam; they want to highlight that if a Hadoop cluster's commodity hardware can't grow when the data grows, it creates problems, unlike GCP. Hence, migrate to GCP.
upvoted 1 times
itsmynickname
9 months, 3 weeks ago
None. Being a GCP exam, it must be either Dataflow or BigQuery :D
upvoted 7 times
KHAN0007
1 year ago
I would like to take a moment to thank all of you. You guys are awesome!!!
upvoted 3 times
ler_mp
1 year, 3 months ago
Wow, a question that does not recommend using a Google product.
upvoted 14 times
Whoswho
1 year, 4 months ago
Looks like he's trying to spark the company up.
upvoted 7 times
itsmynickname
9 months, 3 weeks ago
It seems he's not well paid.
upvoted 1 times
Krish6488
1 year, 4 months ago
Selected Answer: B
Both Pig and Spark require rewriting the code, so there is additional overhead, but as an architect I would think about a long-lasting solution. Resizing the Hadoop cluster can resolve the problem for the workloads at that point in time, but not in the longer run. So Spark is the right choice; although there is a cost to start with, it will certainly be a long-lasting solution.
upvoted 2 times
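To make the rewriting overhead discussed above concrete, here is a minimal sketch of the canonical MapReduce word-count job expressed with the PySpark RDD API; the app name and HDFS paths are hypothetical placeholders, not from the original discussion.

```python
# Hypothetical sketch: the classic MapReduce word-count job rewritten
# with the PySpark RDD API. Input/output paths are placeholders.
from operator import add

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-rewrite")  # placeholder app name
sc = SparkContext(conf=conf)

counts = (
    sc.textFile("hdfs:///data/input")          # read the same HDFS input
      .flatMap(lambda line: line.split())      # "map" phase: emit words
      .map(lambda word: (word, 1))             # emit (word, 1) pairs
      .reduceByKey(add)                        # "reduce" phase: sum counts
)

counts.saveAsTextFile("hdfs:///data/output")   # action: triggers execution
sc.stop()
```

Here flatMap plus map plays the role of the mapper and reduceByKey plays the reducer, so the same logic carries over directly; the rewrite cost is in translating the job, not in redesigning it.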
Mamta072
1 year, 10 months ago
Ans is B: Apache Spark.
upvoted 2 times
alecuba16
2 years ago
Selected Answer: B
SPARK > hadoop, pig, hive
upvoted 4 times
kped21
2 years, 2 months ago
B - Apache Spark
upvoted 1 times
luamail
1 year ago
https://www.ibm.com/cloud/blog/hadoop-vs-spark
upvoted 2 times
kped21
2 years, 2 months ago
B: Spark for optimization and processing.
upvoted 1 times
sraakesh95
2 years, 3 months ago
Selected Answer: B
B: Spark is suitable for the given operation and is much more powerful.
upvoted 1 times
medeis_jar
2 years, 3 months ago
Selected Answer: B
as explained by pr2web
upvoted 1 times
pr2web
2 years, 4 months ago
Selected Answer: B
Ans B: Spark can be up to 100 times faster because it processes data in memory rather than using Hadoop MapReduce's two-stage, disk-based paradigm.
upvoted 1 times
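To illustrate the in-memory point, a minimal hypothetical sketch: once an RDD is cached, later actions reuse the in-memory copy instead of re-reading and re-parsing the input from disk, which is where much of Spark's advantage over MapReduce's write-to-disk-between-stages model comes from. The input path and field layout below are assumptions for illustration.

```python
# Hypothetical sketch: caching an RDD so repeated actions reuse the
# in-memory copy rather than re-reading and re-parsing from disk.
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")  # placeholder app name

events = (
    sc.textFile("hdfs:///data/events")            # placeholder input path
      .map(lambda line: line.split(","))          # assumed CSV layout
      .filter(lambda fields: fields[0] == "click")
      .cache()                                    # keep the parsed RDD in memory
)

total = events.count()                            # first action: reads disk, fills cache
per_user = events.map(lambda f: (f[1], 1)).reduceByKey(lambda a, b: a + b)
top10 = per_user.take(10)                         # later actions reuse the cached RDD

sc.stop()
```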
MaxNRG
2 years, 5 months ago
B, as Spark can improve performance because it performs lazy, in-memory execution. Spark is important because it does part of its pipeline processing in memory rather than copying from disk. For some applications, this makes Spark extremely fast.
upvoted 1 times
MaxNRG
2 years, 5 months ago
With a Spark pipeline, you have two different kinds of operations: transformations and actions. Spark builds its pipeline using an abstraction called a directed acyclic graph (DAG). Each transformation adds nodes to the graph, but Spark doesn't execute the pipeline until it sees an action. Spark waits until it has the whole story, all the information. This allows Spark to choose the best way to distribute the work and run the pipeline. The process of waiting on transformations and executing on actions is called lazy execution. For a transformation, the input is an RDD and the output is an RDD. When Spark sees a transformation, it registers it in the directed graph and then it waits. An action triggers Spark to process the pipeline; the output is usually a result format, such as a text file, rather than an RDD.
upvoted 1 times
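A minimal sketch of the transformation/action distinction described above (the data values are illustrative): the map and filter calls only extend the DAG, and nothing executes until the count action.

```python
# Hypothetical sketch of lazy execution: transformations only build the
# DAG; the work happens when an action is called.
from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")  # placeholder app name

rdd = sc.parallelize(range(1_000_000))

squares = rdd.map(lambda x: x * x)            # transformation: RDD in, RDD out, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still just graph-building

n = evens.count()   # action: Spark now plans and executes the whole pipeline
print(n)            # 500000 (squares of the even numbers in the range)

sc.stop()
```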
MaxNRG
2 years, 5 months ago
Option A is wrong, as Pig is a wrapper and would initiate MapReduce jobs. Option C is wrong, as it would increase the cost. Option D is wrong, as Hive is also a wrapper and would initiate MapReduce jobs; reducing the cluster size would also reduce performance.
upvoted 3 times
kastuarr
1 year, 6 months ago
Won't Option B increase the cost? The cost of rewriting the job in Spark plus the cost of additional memory?
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other