Topic: Google Data Engineer topic 1 question 42

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You have been asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

A.
Rewrite the job in Pig.
B.
Rewrite the job in Apache Spark.
C.
Increase the size of the Hadoop cluster.
D.
Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Re: Google Data Engineer topic 1 question 42

I would say B since Apache Spark is faster than Hadoop/Pig/MapReduce

Re: Google Data Engineer topic 1 question 42

But it requires much more memory, making it more expensive, which is not what we're aiming for here.

Re: Google Data Engineer topic 1 question 42

Answer: B
Description: Spark performs in-memory processing and is faster, which reduces the job's processing time.

Re: Google Data Engineer topic 1 question 42

Just a regular Spark. B

Re: Google Data Engineer topic 1 question 42

C. I think it should be C because the intent of the question is to highlight the problem of on-premises scaling, not the optimization we achieve using Spark's in-memory features. It's a GCP exam; they want to show that if a Hadoop cluster's commodity hardware doesn't grow when the data grows, it creates problems, unlike GCP. Hence, migrate to GCP.

Re: Google Data Engineer topic 1 question 42

None. Being a GCP exam, it must be either Dataflow or BigQuery.

Re: Google Data Engineer topic 1 question 42

I would like to take a moment to thank you all guys
You guys are awesome!!!

Re: Google Data Engineer topic 1 question 42

Wow, a question that does not recommend using a Google product.

Re: Google Data Engineer topic 1 question 42

looks like he's trying to spark the company up.

Re: Google Data Engineer topic 1 question 42

It seems he's not well paid.

Re: Google Data Engineer topic 1 question 42

Both Pig and Spark require rewriting the code, so there is additional overhead, but as an architect I would think about a long-lasting solution. Resizing the Hadoop cluster can resolve the problem for the workloads at that point in time, but not in the long run. So Spark is the right choice: although there is an upfront cost, it will certainly be a long-lasting solution.

Re: Google Data Engineer topic 1 question 42

Ans is B: Apache Spark.

Re: Google Data Engineer topic 1 question 42

SPARK > hadoop, pig, hive

Re: Google Data Engineer topic 1 question 42

B - Apache Spark

Re: Google Data Engineer topic 1 question 42

https://www.ibm.com/cloud/blog/hadoop-vs-spark

Re: Google Data Engineer topic 1 question 42

B. Spark, for optimization and faster processing.

Re: Google Data Engineer topic 1 question 42

B: Spark is suitable for the given operation and is much more powerful.

Re: Google Data Engineer topic 1 question 42

as explained by pr2web

Re: Google Data Engineer topic 1 question 42

Ans B:
Spark can be up to 100 times faster and processes data in memory, instead of using Hadoop MapReduce's two-stage paradigm of writing intermediate results to disk.
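A rough PySpark word-count sketch (hypothetical input and output paths) to illustrate the point: the map and reduce stages are chained into one pipeline, and intermediate results between chained transformations stay in memory instead of being persisted to HDFS the way chained MapReduce jobs persist them.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# Transformations are chained into a single pipeline; intermediates between
# stages are not written out to HDFS as separate MapReduce jobs would do.
counts = (sc.textFile("hdfs:///data/input/*.txt")    # hypothetical input path
            .flatMap(lambda line: line.split())      # "map" stage: emit words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))        # "reduce" stage: sum per word

counts.saveAsTextFile("hdfs:///data/output/counts")  # action: triggers the pipeline
sc.stop()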

Re: Google Data Engineer topic 1 question 42

B, as Spark can improve performance because it performs lazy, in-memory execution.
Spark is important because it does part of its pipeline processing in memory rather than copying to and from disk. For some applications, this makes Spark extremely fast.
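A minimal sketch (hypothetical job, log path, and record layout) of that in-memory behaviour: the parsed RDD is cached on the first action, so the second action is served from memory instead of re-reading and re-parsing from disk.

from pyspark import SparkContext

sc = SparkContext(appName="cache-sketch")

events = sc.textFile("hdfs:///data/events/*.log")           # hypothetical path; lazy, nothing read yet
parsed = events.map(lambda line: line.split("\t")).cache()  # ask Spark to keep the parsed RDD in memory

total = parsed.count()                                      # first action: reads disk, populates the cache
errors = parsed.filter(lambda f: f[2] == "ERROR").count()   # second action: served from the in-memory cache
                                                            # (assumes a made-up format with the level in field 3)
print(total, errors)
sc.stop()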

Re: Google Data Engineer topic 1 question 42

With a Spark pipeline, you have two different kinds of operations: transforms and actions. Spark builds its pipeline using an abstraction called a directed graph. Each transform builds additional nodes into the graph, but Spark doesn't execute the pipeline until it sees an action.
Spark waits until it has the whole story, all the information. This allows Spark to choose the best way to distribute the work and run the pipeline. The process of waiting on transforms and executing on actions is called lazy execution. For a transformation, the input is an RDD and the output is an RDD. When Spark sees a transformation, it registers it in the directed graph and then it waits. An action triggers Spark to process the pipeline; the output is usually a result format, such as a text file, rather than an RDD.
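A small sketch (made-up data) of that lazy execution: the map and filter transformations just extend the directed graph, and nothing runs until the collect() action triggers the pipeline.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-sketch")

rdd = sc.parallelize(range(1, 11))            # RDD of 1..10
doubled = rdd.map(lambda x: x * 2)            # transformation: RDD in, RDD out; registered, not run
evens = doubled.filter(lambda x: x % 4 == 0)  # still no work done, the graph just grows

result = evens.collect()                      # action: Spark now plans and executes the pipeline
print(result)                                 # [4, 8, 12, 16, 20]
sc.stop()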

Re: Google Data Engineer topic 1 question 42

Option A is wrong, as Pig is a wrapper and would still initiate MapReduce jobs.
Option C is wrong, as it would increase the cost.
Option D is wrong, as Hive is a wrapper and would also initiate MapReduce jobs. Also, reducing the cluster size would reduce performance.

Re: Google Data Engineer topic 1 question 42

Won't Option B increase the cost? The cost of rewriting the job in Spark, plus the cost of additional memory?