Exam AWS Certified Big Data - Specialty topic 1 question 28 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 28
Topic #: 1

A customer is collecting clickstream data using Amazon Kinesis and is grouping the events by IP address into 5-minute chunks stored in Amazon S3.
Many analysts in the company use Hive on Amazon EMR to analyze this data. Their queries always reference a single IP address. Data must be optimized for querying based on IP address using Hive running on Amazon EMR.
What is the most efficient method to query the data with Hive?

  • A. Store an index of the files by IP address in the Amazon DynamoDB metadata store for EMRFS.
  • B. Store the Amazon S3 objects with the following naming scheme: bucket_name/source=ip_address/year=yy/month=mm/day=dd/hour=hh/filename.
  • C. Store the data in an HBase table with the IP address as the row key.
  • D. Store the events for an IP address as a single file in Amazon S3 and add metadata with keys: Hive_Partitioned_IPAddress.
Suggested Answer: A

Comments

muhsin
Highly Voted 3 years, 8 months ago
Hi, it is not A: Hive on EMR can only use Aurora/RDS or Glue as an external metastore. It is not C: HBase is a NoSQL database. It is not D: copying all files into one single file does not help partitioning. It is B: partitioning in Hive is supported on S3, typically based on date (in this scenario, IP address).
upvoted 15 times
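To make the option B layout concrete, here is a minimal Python sketch that builds one object key in that naming scheme. The function name `build_key`, the example IP address, the timestamp, and the filename are all hypothetical; only the `source=/year=/month=/day=/hour=` path structure comes from the question.

```python
from datetime import datetime, timezone


def build_key(bucket: str, ip: str, ts: datetime, filename: str) -> str:
    """Build an object key in the option B layout (hypothetical helper).

    The two-digit year/month/day/hour fields follow the yy/mm/dd/hh
    placeholders in the question.
    """
    return (
        f"{bucket}/source={ip}/"
        f"year={ts:%y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
        f"{filename}"
    )


# Illustrative values only: a 5-minute chunk written at 14:25 UTC.
example_key = build_key(
    "bucket_name",
    "203.0.113.7",
    datetime(2019, 3, 5, 14, 25, tzinfo=timezone.utc),
    "events-1425.gz",
)
print(example_key)
```

Because the IP address is the first path segment, all chunks for one address share a single prefix, which is what lets a partition-aware reader skip everything else.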
mattyb123
3 years, 8 months ago
thanks @muhsin
upvoted 1 times
exams
3 years, 8 months ago
I support B
upvoted 3 times
Bulti
Highly Voted 3 years, 7 months ago
After researching more, I think the answer is B. When we create the Hive metadata table for the bucket where these objects are stored, we will be able to query that data in S3 using a single IP address, as the table structure will have an IP address column and columns for year, month, day, and hour.
upvoted 11 times
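The table described in this comment could look roughly like the sketch below, which only generates HiveQL strings in Python and does not talk to a cluster. The table name `clickstream`, the single data column `event STRING`, and the helper names are assumptions for illustration; the question does not state the file format, so no `ROW FORMAT` clause is shown. The partition columns mirror the path segments in option B.

```python
# Partition columns, in the same order as the path segments in option B.
PARTITION_COLUMNS = ["source", "year", "month", "day", "hour"]


def create_table_ddl(table: str, location: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement partitioned like option B.

    The data column (event STRING) is a placeholder, not from the question.
    """
    parts = ", ".join(f"`{col}` STRING" for col in PARTITION_COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  event STRING\n"
        ")\n"
        f"PARTITIONED BY ({parts})\n"
        f"LOCATION '{location}'"
    )


def repair_statement(table: str) -> str:
    """Return the statement that asks Hive to discover key=value partitions."""
    return f"MSCK REPAIR TABLE {table}"


def single_ip_query(table: str, ip: str) -> str:
    """Return a query that filters on the IP partition column only.

    Plain string formatting is used for readability; do not use it with
    untrusted input.
    """
    return f"SELECT * FROM {table} WHERE `source` = '{ip}'"


ddl = create_table_ddl("clickstream", "s3://bucket_name/")
print(ddl)
print(repair_statement("clickstream"))
print(single_ip_query("clickstream", "203.0.113.7"))
```

The generated statements would be run from the Hive CLI or Beeline on the EMR cluster; the `WHERE` clause on `source` is what restricts the read to one IP address's prefix.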
skytango
Most Recent 3 years, 7 months ago
A is correct. "for querying based on IP address using Hive running on Amazon EMR" can be the clue. C cannot be the answer because HBase itself is not related to AWS services.
upvoted 1 times
skytango
3 years, 7 months ago
After researching more, I would choose B. C also can be the alternative but B is simpler to implement.
upvoted 1 times
faloameme
3 years, 7 months ago
Answer: A. A query on a Hive table that is covered by an index avoids a table scan. Hive checks the index first and then goes to the particular column and performs the operation. Using option B will require queries to specify all the partition columns in every WHERE clause.
upvoted 2 times
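On the last point, as far as I understand Hive's partition pruning, a predicate on a subset of the partition columns is already enough to skip non-matching partitions. The sketch below imitates that behaviour in plain Python over a handful of made-up object keys; it is not Hive itself, and the helper names `parse_partitions` and `prune` are hypothetical.

```python
def parse_partitions(key: str) -> dict:
    """Extract the key=value path segments from a Hive-style object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys, **predicates):
    """Keep only keys whose partition values match every given predicate.

    Partition columns that are not mentioned in the predicates are left
    unconstrained, which is the point of the illustration.
    """
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(col) == val for col, val in predicates.items())
    ]


# Made-up keys in the option B layout.
keys = [
    "bucket_name/source=203.0.113.7/year=19/month=03/day=05/hour=14/a.gz",
    "bucket_name/source=203.0.113.7/year=19/month=03/day=06/hour=09/b.gz",
    "bucket_name/source=198.51.100.2/year=19/month=03/day=05/hour=14/c.gz",
]

# Filter on the IP address alone, without naming year, month, day, or hour.
matches = prune(keys, source="203.0.113.7")
print(matches)
```

Here a filter on `source` alone selects both chunks for that address and skips the other one, without naming the date columns.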
Bulti
3 years, 7 months ago
Answer: D. This is a tough one.
Not A: It's not efficient, and it might not even be doable.
Not B: I didn't select B only because the question clearly states that "Their queries always reference a single IP address". So why create more partitions (year, month, day, hour) and introduce latency when reading the metadata catalog in Hive created from S3 while executing the HiveQL, when the intent is to query using the IP address alone?
Not C: This doesn't make sense. Why would you use an HBase table when your data is in S3 and Hive can query S3 objects directly using the metadata stored in EMR?
D is the correct answer: the query will be performant if there are fewer partitions in S3. Also, as per the question, the data only needs to be queried by IP address, so creating metadata with the partition ID as the key is the right option.
upvoted 1 times
san2020
3 years, 7 months ago
my selection B
upvoted 5 times
ME2000
3 years, 7 months ago
This AWS Big Data Blog proves option A is correct. https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/
upvoted 1 times
practicioner
3 years, 7 months ago
No. In our case, we use Kinesis for storing data and we have the opportunity to store it with the necessary structure. In your link, Lambda + DynamoDB are used to create partitions for Hive before loading. That's not our case.
upvoted 1 times
Raju_k
3 years, 7 months ago
I would choose B over C because analysts are currently using Hive, and storing the data in an organized manner on S3 would be an efficient solution.
upvoted 3 times
harry_123
3 years, 7 months ago
Q: What is the most efficient method to query the data with Hive? We can't design a new solution with HBase; we just have to use Hive. B is the answer.
upvoted 1 times
asadao
3 years, 8 months ago
I support C. In my first attempt I chose B and was wrong.
upvoted 1 times
jiedee
3 years, 7 months ago
How did you know that b was wrong?
upvoted 7 times
Shatamjeev
3 years, 8 months ago
Which answer is correct, A or C?
upvoted 1 times
mattyb123
3 years, 8 months ago
It should be C. As per the link: Apache Hadoop is not a perfect big data framework for real-time analytics, and this is when HBase can be used, i.e. for real-time querying of data. HBase is an ideal big data solution if the application requires random read or random write operations or both. If the application requires access to some data in real time, then it can be stored in a NoSQL database. HBase has its own set of wonderful APIs that can be used to pull or push data. HBase can also be integrated perfectly with Hadoop MapReduce for bulk operations like analytics, indexing, etc. The best way to use HBase is to make Hadoop the repository for static data and HBase the data store for data that is going to change in real time after some processing. HBase should be used when: there is a large amount of data; ACID properties are not mandatory but just required; the data model schema is sparse; your application needs to scale gracefully.
upvoted 6 times
jlpl
3 years, 8 months ago
hbase? selected C is answered key?
upvoted 3 times
mattyb123
3 years, 8 months ago
https://www.dezyre.com/article/hive-vs-hbase-different-technologies-that-work-better-together/322
upvoted 2 times