Topic: Google Professional Data Engineer topic 1 question 158

You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company security policies require that Cloud Dataproc nodes do not have access to the Internet so public initialization actions cannot fetch resources. What should you do?

A.
Deploy the Cloud SQL Proxy on the Cloud Dataproc master
B.
Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet
C.
Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter
D.
Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role

Re: Google Professional Data Engineer topic 1 question 158

Correct: C

If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. Without Internet access, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can then download the dependencies from Cloud Storage over their internal IPs.
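For anyone who wants to see what that looks like in practice, here is a rough sketch of staging the dependencies in a Cloud Storage bucket with the google-cloud-storage Python client. The bucket name and file paths are made-up placeholders, not anything from the question.

    # Stage dependency artifacts in a Cloud Storage bucket so that internal-IP-only
    # Dataproc nodes can fetch them via Private Google Access.
    from google.cloud import storage

    BUCKET_NAME = "my-project-dataproc-deps"   # hypothetical bucket inside the VPC-SC perimeter
    LOCAL_FILES = ["init-action.py", "deps/python-packages.zip", "deps/job-libs.jar"]

    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)

    for path in LOCAL_FILES:
        blob = bucket.blob(path)
        blob.upload_from_filename(path)        # copies the local file to gs://<bucket>/<path>
        print(f"Uploaded gs://{BUCKET_NAME}/{path}")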

Re: Google Professional Data Engineer topic 1 question 158

Thank you for the detailed explanation. C is right.

Re: Google Professional Data Engineer topic 1 question 158

Should be C:

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions

Re: Google Professional Data Engineer topic 1 question 158

C looks good

Re: Google Professional Data Engineer topic 1 question 158

Security Compliance: This option aligns with your company's security policies, which prohibit public Internet access from Cloud Dataproc nodes. Placing the dependencies in a Cloud Storage bucket within your VPC security perimeter ensures that the data remains within your private network.

VPC Security: By placing the dependencies within your VPC security perimeter, you maintain control over network access and can restrict access to the necessary nodes only.

Dataproc Initialization Action: You can use a custom initialization action or script to fetch and install the dependencies from the secure Cloud Storage bucket to the Dataproc cluster nodes during startup.

By copying the dependencies to a secure Cloud Storage bucket and using an initialization action to install them on the Dataproc nodes, you can meet your security requirements while providing the necessary dependencies to your cluster.
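As a rough illustration (not an official script), the initialization action itself could be a small Python file staged in the same bucket, say gs://my-project-dataproc-deps/init-action.py; it runs on every node at startup and installs the pre-staged packages without touching the public Internet. All paths and package names below are hypothetical.

    #!/usr/bin/env python3
    # Sketch of an initialization action: copy dependencies from the in-perimeter
    # bucket (reachable via Private Google Access) and install them locally.
    import subprocess

    BUCKET = "gs://my-project-dataproc-deps"     # hypothetical bucket
    STAGING_DIR = "/tmp/dataproc-deps"

    # Pull the pre-staged artifacts down to the node.
    subprocess.run(["mkdir", "-p", STAGING_DIR], check=True)
    subprocess.run(["gsutil", "-m", "cp", "-r", f"{BUCKET}/deps/*", STAGING_DIR], check=True)

    # Install Python packages only from the local copies, never from PyPI.
    subprocess.run(
        ["python3", "-m", "pip", "install", "--no-index",
         f"--find-links={STAGING_DIR}", "my-internal-package"],   # hypothetical package name
        check=True,
    )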

Re: Google Professional Data Engineer topic 1 question 158

C is correct

Re: Google Professional Data Engineer topic 1 question 158

C seems good

Re: Google Professional Data Engineer topic 1 question 158

Answer C.
This is easier to follow with practical experience. When you create a cluster you usually have dependencies: Python packages stored in a .zip file, a JAR file the cluster application needs (for example, Java libraries required by a Spark session), and some YAML config files.
You can save these dependencies in a Cloud Storage bucket and use them to configure the cluster from the SDK or the API, without going through the UI.
The cluster then reaches these files over the VPC instead of the public Internet.
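To make the SDK/API part concrete, here is a sketch with the google-cloud-dataproc Python client that creates an internal-IP-only cluster and points the initialization action at the dependencies bucket. Project, region, subnet, and bucket names are placeholders for illustration.

    # Create an internal-IP-only Dataproc cluster whose init action pulls
    # dependencies from a Cloud Storage bucket instead of the public Internet.
    from google.cloud import dataproc_v1

    PROJECT = "my-project"
    REGION = "us-central1"

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": PROJECT,
        "cluster_name": "secure-deps-cluster",
        "config": {
            "gce_cluster_config": {
                "internal_ip_only": True,  # no external IPs, per the security policy
                "subnetwork_uri": f"projects/{PROJECT}/regions/{REGION}/subnetworks/private-subnet",
            },
            "initialization_actions": [
                # Runs on every node right after it is set up.
                {"executable_file": "gs://my-project-dataproc-deps/init-action.py"}
            ],
        },
    }

    operation = client.create_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready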

Re: Google Professional Data Engineer topic 1 question 158

C is the answer.

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#and_vpc-sc_networks
With VPC Service Controls, administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.

Re: Google Professional Data Engineer topic 1 question 158

Without access to the internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.

Re: Google Professional Data Engineer topic 1 question 158

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#create_a_cloud_dataproc_cluster_with_internal_ip_address_only

Re: Google Professional Data Engineer topic 1 question 158

When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
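And since the initialization action has already installed the dependencies on every node, a job can be submitted afterwards without bundling them. A minimal sketch with the same Python client library; the cluster, bucket, and script names are again just placeholders.

    # Submit a PySpark job to the cluster; the dependencies were already installed
    # at startup by the initialization action, so nothing is fetched at job time.
    from google.cloud import dataproc_v1

    PROJECT = "my-project"
    REGION = "us-central1"

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "secure-deps-cluster"},
        "pyspark_job": {
            # The driver script lives in the same in-perimeter bucket as the dependencies.
            "main_python_file_uri": "gs://my-project-dataproc-deps/jobs/etl_job.py",
        },
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": PROJECT, "region": REGION, "job": job}
    )
    print(operation.result().driver_output_resource_uri)  # where the driver logs end up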

Re: Google Professional Data Engineer topic 1 question 158

Correct: C

Re: Google Professional Data Engineer topic 1 question 158

C it is!

Re: Google Professional Data Engineer topic 1 question 158

Should be C

Re: Google Professional Data Engineer topic 1 question 158

Should be C

Re: Google Professional Data Engineer topic 1 question 158

I think the correct answer might be C instead, due to https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network#create_a_cloud_dataproc_cluster_with_internal_ip_address_only