Topic: Professional Data Engineer topic 1 question 21

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?

A.
Assign globally unique identifiers (GUIDs) to each data entry.
B.
Compute the hash value of each data entry, and compare it with all historical data.
C.
Store each data entry as the primary key in a separate database and apply an index.
D.
Maintain a database table to store the hash value and other metadata for each data entry.

Re: Professional Data Engineer topic 1 question 21

The best answer is A.
Answer D is not as efficient or error-proof, for two reasons:
1. You need to calculate the hash at the sender as well as at the receiver to do the comparison - a waste of computing power.
2. Even if we discount the computing power, we should note that the system is sending inventory information. Two messages sent at different times can denote the same inventory level (and thus have the same hash). Adding the sender timestamp to the hash would defeat the purpose of using a hash, since retried messages would then have a different timestamp and therefore a different hash.
If the timestamp is the message creation timestamp, then it could also serve as a UUID (see the sketch below).
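
A minimal sketch of that reading of A, assuming the proprietary sender can attach the GUID when the message is first created and reuse it verbatim on retries (all names are illustrative):

```python
import uuid

def create_message(payload: dict) -> dict:
    # The GUID is assigned once, at the source, when the message is created.
    # A retransmission resends the same message object, so the GUID is stable.
    return {"id": str(uuid.uuid4()), "payload": payload}

seen_ids = set()

def ingest(message: dict) -> bool:
    """Receiver side: accept a message only if its GUID has not been seen."""
    if message["id"] in seen_ids:
        return False  # retransmission carries the same GUID, so drop it
    seen_ids.add(message["id"])
    return True

msg = create_message({"sku": "A-100", "qty": 42})
assert ingest(msg) is True   # first transmission accepted
assert ingest(msg) is False  # retransmission rejected as duplicate
```

Note this only works if the GUID survives retransmission; a GUID minted on the receiving side would differ per copy, which is the objection raised further down the thread.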

Re: Professional Data Engineer topic 1 question 21

If you add a unique ID, aren't you by definition not getting a duplicate record? Honestly, I hate all these answers.

Re: Professional Data Engineer topic 1 question 21

If the goal is to ensure at least one of each pair of entries is inserted into the DB, then how does assigning a GUID to each entry resolve the duplicates? Keep in mind that if the 1st entry fails, then hopefully the 2nd (duplicate) is successful.

Re: Professional Data Engineer topic 1 question 21

The answer is D. Using hash values we can remove duplicates from a database: hash values will be the same for duplicate data, so duplicates can be easily rejected. Obviously you won't include the timestamp in the hash.
D is better than B because maintaining a separate table avoids the cost of recomputing and comparing hashes across all historical data.

Re: Professional Data Engineer topic 1 question 21

Why can't it be A, where the GUID is a hash value? Why do we need to store the hash with the metadata in a separate database to do the deduplication?

Re: Professional Data Engineer topic 1 question 21

A - In D, the same message with a different timestamp will have a different hash, even though the message content is the same.

Re: Professional Data Engineer topic 1 question 21

Agreed - the key here is "payload of several fields and the timestamp".

Re: Professional Data Engineer topic 1 question 21

"payload of several fields and the timestamp of the transmission"

Re: Professional Data Engineer topic 1 question 21

Hi Max, I also think the hash value would be wrong, because the timestamp is part of the payload and it isn't stated that the hash is generated without the timestamp; but it also isn't stated whether the GUID is tied to the data or to the transmission. This is often the point where the answer is vague: it doesn't specify whether the GUID relates to the data or to the send.

Re: Professional Data Engineer topic 1 question 21

Strong answer: A. Another question in the GCP sample questions reads: "You are building a new real-time data warehouse for your company and will use BigQuery streaming inserts. There is no guarantee that data will only be sent in once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?"
This means you need a unique ID and a timestamp to properly dedupe the data.
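
For context, the widely cited answer to that sample question deduplicates at query time with ROW_NUMBER() partitioned by the unique ID. A hedged sketch using the google-cloud-bigquery Python client; the project, table, and column names (my_project.inventory.events, unique_id, event_ts) are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the latest row per unique_id; streamed duplicates get row_num > 1.
query = """
    SELECT * EXCEPT(row_num)
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY unique_id
                                  ORDER BY event_ts DESC) AS row_num
        FROM `my_project.inventory.events`
    )
    WHERE row_num = 1
"""

for row in client.query(query).result():
    print(row)
```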

Re: Professional Data Engineer topic 1 question 21

You need a unique ID, but in this scenario there is none, so you have to calculate one by hashing some of the fields in the dataset.

A (assigning a GUID on the processing side) will not solve the issue, because you will assign different IDs to retransmitted copies of the same data...

Re: Professional Data Engineer topic 1 question 21

Answer - D
Key statement is  "Transmitted data includes a payload of several fields and the timestamp of the transmission."

So the timestamp is appended to the message while sending; in other words, that field changes if the message is retransmitted. Adding a GUID, however, doesn't help much, because if a message is transmitted twice you will have a different GUID for each copy even though they are the same/duplicate data.

You can simply calculate a hash over a selection of fields rather than all of the data (the payload fields, and definitely excluding the transmission timestamp). By doing so, retransmissions of the same message produce the same hash, while distinct messages produce different hashes.
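
A minimal sketch of that hashing scheme, assuming the records are dicts and using a hypothetical "transmission_ts" field for the transmission timestamp that gets excluded:

```python
import hashlib
import json

def payload_hash(record: dict, exclude=("transmission_ts",)) -> str:
    # Hash only the payload fields; excluding the transmission timestamp
    # makes a retransmission of the same payload hash to the same value.
    stable = {k: v for k, v in record.items() if k not in exclude}
    blob = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

first = {"sku": "A-100", "qty": 42, "transmission_ts": "2024-01-01T00:00:00Z"}
retry = {"sku": "A-100", "qty": 42, "transmission_ts": "2024-01-01T00:05:00Z"}
assert payload_hash(first) == payload_hash(retry)  # retry maps to same hash
```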

Re: Professional Data Engineer topic 1 question 21

Answer: D
Description: Using hash values we can remove duplicates from a database. Hash values will be the same for duplicate data, so duplicates can be easily rejected.

Re: Professional Data Engineer topic 1 question 21

Hash values for the same data will be the same, but in this case the data also contains the timestamp.

Re: Professional Data Engineer topic 1 question 21

When calculating the hash value, we exclude the timestamp.

Re: Professional Data Engineer topic 1 question 21

To deduplicate the data most efficiently, especially in a cloud environment where the data is sent periodically and re-transmissions can occur, the recommended approach would be:

D. Maintain a database table to store the hash value and other metadata for each data entry.

This approach allows you to quickly check if an incoming data entry is a duplicate by comparing hash values, which is much faster than comparing all fields of a data entry. The metadata, which includes the timestamp and possibly other relevant information, can help resolve any ambiguities that may arise if the hash function ever produces collisions.
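
A minimal sketch of such a table, here with SQLite standing in for whatever database the pipeline actually uses; the column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS seen (
           payload_hash  TEXT PRIMARY KEY,  -- deduplication key
           first_seen_ts TEXT,              -- metadata kept for diagnostics
           source        TEXT               -- e.g. which system transmitted it
       )"""
)

def is_new(payload_hash: str, ts: str, source: str) -> bool:
    # The PRIMARY KEY constraint makes a retransmitted hash a no-op insert.
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen VALUES (?, ?, ?)",
        (payload_hash, ts, source),
    )
    return cur.rowcount == 1  # 1 = first sighting, 0 = duplicate

assert is_new("abc123", "2024-01-01T00:00:00Z", "inventory-system") is True
assert is_new("abc123", "2024-01-01T06:00:00Z", "inventory-system") is False
```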

Re: Professional Data Engineer topic 1 question 21

B. Compute the hash value of each data entry, and compare it with all historical data.

Explanation:

Efficiency: Hashing is a fast and efficient operation, and comparing hash values is generally quicker than comparing the entire payload or using other methods.

Space Efficiency: Storing hash values requires less storage space compared to storing entire payloads or using global unique identifiers (GUIDs).

Deduplication: By computing the hash value of each data entry and comparing it with historical data, you can easily identify duplicate transmissions. If the hash value matches an existing one, it indicates that the payload is the same.

Re: Professional Data Engineer topic 1 question 21

I thought the answer was A because it's more efficient. But reading the answer more carefully: the GUID is assigned "to each data entry", and it isn't said that the GUID is assigned by the publisher. If the GUID is assigned at data entry (on the subscriber side), two equal messages can have different GUIDs.
D is not complete either, because it isn't precise about which fields the hash is computed over.
I'm in doubt on this answer :-(

Re: Professional Data Engineer topic 1 question 21

"Data entry" means a record; it is not an action. That means each record will have a unique ID, so assuming our sink will not accept duplicates based on a key, the GUID will work.

Re: Professional Data Engineer topic 1 question 21

D. Maintain a database table to store the hash value and other metadata for each data entry.

Storing a database table with hash values and metadata is an efficient way to deduplicate data. When new data is transmitted, you can calculate the hash of the payload and check whether it already exists in the database. This approach allows for efficient duplicate detection without the need to compare the new data with all historical data. It's a common and scalable technique used to ensure data consistency and avoid processing the same data multiple times.

Options A (assigning GUIDs to each data entry) and C (storing each data entry as the primary key) can work, but they might be less efficient than using hash values when dealing with a large volume of data. Option B (computing the hash value of each data entry and comparing it with all historical data) can be computationally expensive and slow, especially if there's a significant amount of historical data to compare against. Storing hash values in a table allows for fast and efficient deduplication.
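
To make that efficiency contrast concrete, a toy sketch (purely illustrative): option B re-checks all of history on every arrival, while option D's maintained table amounts to an indexed lookup.

```python
# Option B: compare against every historical entry per message -- O(n) each time.
def is_duplicate_b(new_hash: str, historical_hashes: list) -> bool:
    return any(h == new_hash for h in historical_hashes)

# Option D: maintain an indexed set of seen hashes -- O(1) average per lookup.
seen_hashes = set()

def is_duplicate_d(new_hash: str) -> bool:
    if new_hash in seen_hashes:
        return True
    seen_hashes.add(new_hash)
    return False
```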

Re: Professional Data Engineer topic 1 question 21

Why not D? Generate a hash for each payload entry and maintain the value as metadata, then do the validation check in Dataflow. A GUID will generate two different entries for the same payload, so it will not handle the duplication check.

Re: Professional Data Engineer topic 1 question 21

The answer is B - although the timestamp is different for each transmission, the hash value is computed over the payload, not the timestamp, which is just a field added for transmission. So the hash value remains the same for all transmissions of the same data, which is what we can use for comparison.

So it's much more efficient to just directly compare the hash values with the historical data to check for and remove duplicates, instead of wasting additional space storing things, as in option D.

Re: Professional Data Engineer topic 1 question 21

This question is formulated very badly.
From the way A is formulated, you would not deduplicate; rather, the duplicates would merely share the same GUID.
Then we have D, which stores the information (assuming the hash is created without the timestamp), while B does the comparison right away. D only alludes to the actual deduplication, but it would be more efficient.