Topic: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?

A.
Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
B.
Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
C.
Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
D.
Write an AWS Glue Python shell job. Use pandas to transform the data.

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

An AWS Glue Python shell job is billed at $0.44 per DPU-Hour.
An AWS Glue PySpark job is billed at $0.29 per DPU-Hour with flexible execution and $0.44 per DPU-Hour with standard execution.
Source: https://aws.amazon.com/glue/pricing/
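
For a rough sense of scale, here is a small cost sketch (the 10-minute daily runtime is an assumption, not from the question; the rates come from the pricing page above, and the DPU counts reflect the 1/16 DPU Python shell option versus the 2-DPU minimum for a Spark job):

# Rough per-run cost for a hypothetical 10-minute daily job
HOURS = 10 / 60

python_shell = 0.0625 * HOURS * 0.44  # Python shell job at 1/16 DPU
pyspark_flex = 2 * HOURS * 0.29       # Spark job, 2-DPU minimum, flexible execution
pyspark_std = 2 * HOURS * 0.44        # Spark job, 2-DPU minimum, standard execution

print(f"Python shell (1/16 DPU): ${python_shell:.4f}")  # ~$0.0046
print(f"PySpark flex (2 DPU):    ${pyspark_flex:.4f}")  # ~$0.0967
print(f"PySpark std (2 DPU):     ${pyspark_std:.4f}")   # ~$0.1467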

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

D.
Even though PySpark has the lower per-DPU-Hour rate, a Glue Spark job requires a minimum of 2 DPUs, which drives the cost up anyway, so I feel D should be correct.

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

D.

While AWS Glue PySpark jobs are scalable and suitable for large workloads, C may be overkill for processing small .csv files (less than 100 MB each). The overhead of using Apache Spark may not be cost-effective for this specific use case.

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

Option D:

Even though the Python shell job is more expensive on a DPU-Hour basis, you can select the "1/16 DPU" option in the job details for a Python shell job, which is definitely cheaper than a PySpark job.
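
As a minimal sketch of that setting (job name, role ARN, and script location are placeholders; this assumes the boto3 Glue client, where "1/16 DPU" corresponds to MaxCapacity=0.0625):

import boto3

glue = boto3.client("glue")

# Hypothetical Python shell job pinned to 1/16 DPU.
glue.create_job(
    Name="daily-csv-etl",                                # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # placeholder role
    Command={
        "Name": "pythonshell",                           # Python shell job type
        "ScriptLocation": "s3://my-etl-bucket/scripts/transform_csv.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=0.0625,  # 1/16 DPU; the only other Python shell option is 1.0
)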

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

https://medium.com/@navneetsamarth/reduce-aws-cost-using-glue-python-shell-jobs-70a955d4359f#:~:text=The%20cheapest%20Glue%20Spark%20ETL,1%2F16th%20of%20a%20DPU.&text=This%20can%20result%20in%20massive,just%20a%20better%20design%20overall!

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

D is cheaper than C. It is not as scalable, but it is cheaper.

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

Option D, writing an AWS Glue Python shell job and using pandas to transform the data, is the most cost-effective solution for the described scenario.

AWS Glue’s Python shell jobs are a good fit for smaller-scale ETL tasks, especially when dealing with .csv files that are less than 100 MB each. The use of pandas, a powerful and efficient data manipulation library in Python, makes it an ideal tool for processing and transforming these types of files. This approach avoids the overhead and additional costs associated with more complex solutions like Amazon EKS or EMR, which are generally more suited for larger-scale, more complex data processing tasks.

Given the requirements – processing daily incoming small-sized .csv files – this solution provides the necessary functionality with minimal resources, aligning well with the goal of cost-effectiveness.
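
A minimal sketch of what such a Python shell job could look like (bucket, keys, and the transformation itself are placeholders; pandas and boto3 ship with Glue Python shell jobs, and objects under 100 MB are small enough to read entirely into memory):

import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Placeholder locations; a real job would receive these via job arguments.
bucket = "my-etl-bucket"
source_key = "incoming/users.csv"
target_key = "processed/users.csv"

# Read the small (<100 MB) CSV straight into a DataFrame.
obj = s3.get_object(Bucket=bucket, Key=source_key)
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Example transformation: normalize column names and drop duplicate rows.
df.columns = [c.strip().lower() for c in df.columns]
df = df.drop_duplicates()

# Write the transformed file back to S3.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
s3.put_object(Bucket=bucket, Key=target_key, Body=buffer.getvalue())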

Re: AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80

AWS Glue is a fully managed ETL service, which means you don't need to manage infrastructure, and it automatically scales to handle your data processing needs. This reduces operational overhead and cost.

PySpark, as a part of AWS Glue, is a powerful and widely-used framework for distributed data processing, and it's well-suited for handling data transformations on a large scale.