B is right.
It is very simple to create a file-conversion job in AWS Glue using just three workflow steps, with no code required; the script is generated automatically by Glue (in Scala or Python):
(S3 source data file) --> (data mapping) --> (target transformed data file)
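For reference, here is a minimal sketch of the kind of PySpark script Glue auto-generates for this workflow; the bucket paths and column mappings are hypothetical placeholders:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Step 1: source - read the CSV files from S3 (hypothetical path)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/csv-input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Step 2: data mapping - rename/retype columns as needed (example mappings)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "long"), ("name", "string", "name", "string")],
)

# Step 3: target - write the transformed data back to S3 as Parquet (hypothetical path)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)

job.commit()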
A. NO - a crawler is for populating the Data Catalog, not for converting data
B. YES - leverage serverless Glue for distributed processing
C. NO - although EMR can run Spark like Glue, it is not serverless
D. NO - using the PySpark kernel in a notebook runs on a single instance (inside the notebook)
Option B is better than option A because option A relies on an AWS Glue crawler to convert the file format. A crawler is a component of AWS Glue that scans your data sources and infers the schema, format, partitioning, and other properties of your data. A crawler can create or update a table in the AWS Glue Data Catalog that points to your data source, but it cannot change the format of the data source itself. You still need to write a script or use a tool to convert your CSV files to Parquet files.
Option B.
A - A Glue crawler creates Glue Data Catalog tables from S3 buckets, which can then be queried with Athena.
C, D - not serverless and not generally used for ETL.
AWS Glue is a fully managed ETL service that makes it easy to move data between data stores. AWS Glue can automate the conversion of CSV files to Parquet format with minimal effort: it supports reading data from CSV files, transforming the data, and writing the transformed data to Parquet files.
Option A is incorrect because the AWS Glue crawler is used to infer the schema of data stored in S3 and create AWS Glue Data Catalog tables, not to convert data.
Option C is incorrect because while Amazon EMR can be used to process large amounts of data and perform data conversions, it requires more operational effort than AWS Glue.
Option D is incorrect because Amazon SageMaker is a machine learning service, and while it can be used for data processing, it is not the best option for simple data format conversion tasks.
In a SageMaker notebook you'd have to write the Python code yourself, but the question asks for something easy, so I choose option B: https://blog.searce.com/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
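For comparison, the notebook route means hand-writing something like the sketch below, and it all runs on the single notebook instance. Paths are hypothetical, and reading/writing S3 paths with pandas assumes s3fs and pyarrow are installed:

import pandas as pd

# Manual, single-instance conversion in a SageMaker notebook (hypothetical paths)
df = pd.read_csv("s3://my-bucket/csv-input/data.csv")
df.to_parquet("s3://my-bucket/parquet-output/data.parquet")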
The crawler just creates the Data Catalog entry (the schema); it does not actually convert the data to another format. As described in that article, you create a job whose source is the schema created by the crawler and whose destination is the output S3 location where the formatted data is stored.
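A sketch of that crawler-plus-job pattern, reusing the glueContext setup from the job script above; the database and table names are hypothetical, matching whatever the crawler created:

# Source: the Data Catalog table the crawler created (schema only)
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # hypothetical catalog database
    table_name="csv_input",   # hypothetical table created by the crawler
)

# Destination: the output S3 location where the Parquet data lands
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)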