
Exam DP-201 topic 2 question 28 discussion

Actual exam question from Microsoft's DP-201
Question #: 28
Topic #: 2

You are designing an Azure Data Factory pipeline for processing data. The pipeline will process data that is stored in general-purpose standard Azure storage.
You need to ensure that the compute environment is created on-demand and removed when the process is completed.
Which type of activity should you recommend?

  • A. Databricks Python activity
  • B. Data Lake Analytics U-SQL activity
  • C. HDInsight Pig activity
  • D. Databricks Jar activity
Suggested Answer: C
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand HDInsight cluster.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-hadoop-pig
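
For context, here is a rough sketch (based on the doc linked above) of how a Pig activity can reference an on-demand HDInsight linked service; the names and the script path are illustrative placeholders, not part of the question:

{
    "name": "SamplePigActivity",
    "type": "HDInsightPig",
    "linkedServiceName": {
        "referenceName": "HDInsightOnDemandLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptLinkedService": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "scriptPath": "adfscripts/pigscripts/sample.pig"
    }
}

The on-demand behaviour itself comes from the linked service the activity points to, not from the activity type; the linked service sketch further down shows the relevant timeToLive setting.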

Comments

GabiN
Highly Voted 5 years, 4 months ago
According to Microsoft documentation (https://docs.microsoft.com/en-us/azure/data-factory/transform-data), only four external transformation activities can be executed on-demand: the HDInsight MapReduce activity, HDInsight Hive activity, HDInsight Pig activity, and HDInsight Streaming activity. On-demand means that the computing environment is automatically created by the Data Factory service before a job is submitted to process data and removed when the job is completed. Therefore, the correct answer is C.
upvoted 53 times
...
methodidacte
Highly Voted 5 years, 5 months ago
I agree with answer C: "With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice needs to be processed unless there is an existing live cluster (timeToLive) and is deleted when the processing is done." But why are the others wrong?
upvoted 7 times
...
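
To make the on-demand behaviour that GabiN and methodidacte describe concrete, here is a rough sketch of an on-demand HDInsight linked service along the lines of the compute linked services doc; subscription, tenant, and credential values are placeholders:

{
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "hostSubscriptionId": "<subscription id>",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant id>",
            "clusterResourceGroup": "<resource group>",
            "version": "3.6",
            "osType": "Linux",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}

Data Factory creates the cluster when an activity that uses this linked service runs and deletes it once the timeToLive window after the last job expires.
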
H_S
Most Recent 4 years, 3 months ago
Not in the DP-201 exam any more.
upvoted 6 times
...
Deepu1987
4 years, 4 months ago
I would go with the HDInsight Pig activity rather than option A, as per the given condition in the question about the storage we're using; Databricks is ideally used with ADLS Gen2.
upvoted 1 times
...
syu31svc
4 years, 6 months ago
I would agree with the answer. From https://docs.microsoft.com/en-us/azure/data-factory/v1/data-factory-compute-linked-services#:~:text=When%20the%20job%20is%20finished,cluster%20management%2C%20and%20bootstrapping%20actions.: "Data Factory automatically creates the compute environment before a job is submitted for processing data. When the job is finished, Data Factory removes the compute environment." And: "The Azure Storage linked service to be used by the on-demand cluster for storing and processing data. The HDInsight cluster is created in the same region as this storage account. Currently, you can't create an on-demand HDInsight cluster that uses Azure Data Lake Store as the storage. If you want to store the result data from HDInsight processing in Data Lake Store, use Copy Activity to copy the data from Blob storage to Data Lake Store."
upvoted 1 times
...
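
The Azure Storage linked service that the on-demand cluster uses (the "general-purpose standard Azure storage" from the question) might look roughly like this; the account name and key are placeholders:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
        }
    }
}

As the quote above notes, the on-demand HDInsight cluster is created in the same region as this storage account.
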
GraceCyborg
4 years, 7 months ago
HDInsight is not in DP-201 anymore.
upvoted 2 times
...
Abhilvs
5 years ago
Azure Databricks also supports on-demand compute: when running from Azure Data Factory, the Databricks cluster gets created as an automated cluster and destroyed after completion. The question is ambiguous.
upvoted 2 times
...
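
For comparison, this is roughly what an Azure Databricks linked service configured to use a new job cluster (created per run and terminated afterwards) looks like; the workspace URL, token, runtime version, and node size are illustrative:

{
    "name": "AzureDatabricksLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://eastus.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<access token>"
            },
            "newClusterVersion": "4.2.x-scala2.11",
            "newClusterNumOfWorker": "2",
            "newClusterNodeType": "Standard_D3_v2"
        }
    }
}

This is the behaviour Abhilvs refers to: the job cluster is created for the run and removed afterwards, even though the docs list only the HDInsight activities under "on-demand" compute.
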
Runi
5 years ago
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster (see the Pig activity article for details). The same wording, "on your own or on-demand", is used for the MapReduce, Streaming, and Hive activities, and on-demand is defined as: "On-Demand: In this case, the computing environment is fully managed by Data Factory. It is automatically created by the Data Factory service before a job is submitted to process data and removed when the job is completed. You can configure and control granular settings of the on-demand compute environment for job execution, cluster management, and bootstrapping actions." However, the Python and Jar activities don't offer any on-demand option, so the answer is C.
upvoted 1 times
...
Leonido
5 years, 2 months ago
It's a strange question. Every one of them could meet the requirement.
upvoted 3 times
azurearch
5 years, 1 month ago
The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities. Azure Databricks is a managed platform for running Apache Spark.
upvoted 1 times
...
...
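
A rough sketch of the Databricks Python activity that azurearch describes, which would run against a Databricks linked service like the one sketched earlier; the file path and parameter are illustrative:

{
    "name": "SampleDatabricksPythonActivity",
    "type": "DatabricksSparkPython",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "pythonFile": "dbfs:/scripts/sample.py",
        "parameters": [ "input1" ]
    }
}
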
Narender_Bhadrecha
5 years, 4 months ago
A is also a correct answer.
upvoted 2 times
...
mustaphaa
5 years, 5 months ago
A and D are correct too; you can use the automatically created cluster option in the linked service.
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other.