Topic: Google Data Engineer topic 1 question 42

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You have been asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

A.
Rewrite the job in Pig.
B.
Rewrite the job in Apache Spark.
C.
Increase the size of the Hadoop cluster.
D.
Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Re: Google Data Engineer topic 1 question 42

I would say B since Apache Spark is faster than Hadoop/Pig/MapReduce

Re: Google Data Engineer topic 1 question 42

But it requires much more memory, making it more expensive, which is not what we're aiming for here.

Re: Google Data Engineer topic 1 question 42

Answer: B
Description: Spark performs in-memory processing and is faster, which reduces the job's processing time.

Re: Google Data Engineer topic 1 question 42

Just a regular Spark. B

Re: Google Data Engineer topic 1 question 42

C. I think it should be C because the intent of the question is to highlight the problem of on-premises scaling, not the optimization we achieve using Spark's in-memory features. It's a GCP exam; they want to show that if a Hadoop cluster's commodity hardware doesn't grow when the data grows, it creates problems, unlike GCP. Hence, migrate to GCP.

Re: Google Data Engineer topic 1 question 42

None. Being a GCP exam, it must be either Dataflow or BigQuery.

Re: Google Data Engineer topic 1 question 42

I would like to take a moment to thank you all guys
You guys are awesome!!!

Re: Google Data Engineer topic 1 question 42

Wow, a question that does not recommend using a Google product.

Re: Google Data Engineer topic 1 question 42

looks like he's trying to spark the company up.

Re: Google Data Engineer topic 1 question 42

It seems he's not well paid.

Re: Google Data Engineer topic 1 question 42

Both Pig and Spark require rewriting the code, so there is additional overhead, but as an architect I would think about a long-lasting solution. Resizing the Hadoop cluster can resolve the problem for the workloads at that point in time, but not in the long run. So Spark is the right choice: although there is an upfront cost, it will certainly be a long-lasting solution.

Re: Google Data Engineer topic 1 question 42

Ans is B: Apache Spark.

Re: Google Data Engineer topic 1 question 42

SPARK > hadoop, pig, hive

Re: Google Data Engineer topic 1 question 42

B - Apache Spark

Re: Google Data Engineer topic 1 question 42

https://www.ibm.com/cloud/blog/hadoop-vs-spark

Re: Google Data Engineer topic 1 question 42

B. Spark, for optimization and faster processing.

Re: Google Data Engineer topic 1 question 42

B: Spark is suitable for the given operation and is much more powerful.

Re: Google Data Engineer topic 1 question 42

as explained by pr2web

Re: Google Data Engineer topic 1 question 42

Ans B:
Spark can be up to 100 times faster and processes data in memory, instead of using Hadoop MapReduce's two-stage paradigm of writing intermediate results to disk.
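A rough PySpark word-count sketch (hypothetical input and output paths) to illustrate the point: the map and reduce stages are chained into one pipeline, and intermediate results between chained transformations stay in memory instead of being persisted to HDFS the way chained MapReduce jobs persist them.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# Transformations are chained into a single pipeline; intermediates between
# stages are not written out to HDFS as separate MapReduce jobs would do.
counts = (sc.textFile("hdfs:///data/input/*.txt")    # hypothetical input path
            .flatMap(lambda line: line.split())      # "map" stage: emit words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))        # "reduce" stage: sum per word

counts.saveAsTextFile("hdfs:///data/output/counts")  # action: triggers the pipeline
sc.stop()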

Re: Google Data Engineer topic 1 question 42

B, as Spark can improve performance because it performs lazy, in-memory execution.
Spark is important because it does part of its pipeline processing in memory rather than copying to and from disk. For some applications, this makes Spark extremely fast.
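A minimal sketch (hypothetical job, log path, and record layout) of that in-memory behaviour: the parsed RDD is cached on the first action, so the second action is served from memory instead of re-reading and re-parsing from disk.

from pyspark import SparkContext

sc = SparkContext(appName="cache-sketch")

events = sc.textFile("hdfs:///data/events/*.log")           # hypothetical path; lazy, nothing read yet
parsed = events.map(lambda line: line.split("\t")).cache()  # ask Spark to keep the parsed RDD in memory

total = parsed.count()                                      # first action: reads disk, populates the cache
errors = parsed.filter(lambda f: f[2] == "ERROR").count()   # second action: served from the in-memory cache
                                                            # (assumes a made-up format with the level in field 3)
print(total, errors)
sc.stop()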

Re: Google Data Engineer topic 1 question 42

With a Spark pipeline, you have two different kinds of operations: transforms and actions. Spark builds its pipeline using an abstraction called a directed graph. Each transform builds additional nodes into the graph, but Spark doesn't execute the pipeline until it sees an action.
Spark waits until it has the whole story, all the information. This allows Spark to choose the best way to distribute the work and run the pipeline. The process of waiting on transforms and executing on actions is called lazy execution. For a transformation, the input is an RDD and the output is an RDD. When Spark sees a transformation, it registers it in the directed graph and then it waits. An action triggers Spark to process the pipeline; the output is usually a result format, such as a text file, rather than an RDD.
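A small sketch (made-up data) of that lazy execution: the map and filter transformations just extend the directed graph, and nothing runs until the collect() action triggers the pipeline.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-sketch")

rdd = sc.parallelize(range(1, 11))            # RDD of 1..10
doubled = rdd.map(lambda x: x * 2)            # transformation: RDD in, RDD out; registered, not run
evens = doubled.filter(lambda x: x % 4 == 0)  # still no work done, the graph just grows

result = evens.collect()                      # action: Spark now plans and executes the pipeline
print(result)                                 # [4, 8, 12, 16, 20]
sc.stop()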

Re: Google Data Engineer topic 1 question 42

Option A is wrong, as Pig is a wrapper and would still initiate MapReduce jobs.
Option C is wrong, as it would increase the cost.
Option D is wrong, as Hive is a wrapper and would also initiate MapReduce jobs. Also, reducing the cluster size would reduce performance.

Re: Google Data Engineer topic 1 question 42

Won't Option B increase the cost? The cost of rewriting the job in Spark, plus the cost of additional memory?