
I am running a 6-node Spark cluster on Google Cloud Dataproc, and within a few minutes of launching Spark and performing basic operations, I get the error below:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000fbe00000, 24641536, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 24641536 bytes for committing reserved memory.
An error report file with more information is saved as: /home/chris/hs_err_pid21047.log

The only two commands I ran are the following:

data = (
    spark.read.format("text")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("gs://bucketpath/csv")
)
data.show()

The CSV file is stored in a Google Cloud Storage bucket and is 170 MB in size.

Below are the details of my cluster configuration:

Name    cluster
Region  australia-southeast1
Zone    australia-southeast1-b
Master node 
Machine type    n1-highcpu-4 (4 vCPU, 3.60 GB memory)
Primary disk type   pd-standard
Primary disk size   50 GB
Worker nodes    5
Machine type    n1-highcpu-4 (4 vCPU, 3.60 GB memory)
Primary disk type   pd-standard
Primary disk size   15 GB
Local SSDs  0
Preemptible worker nodes    0
Cloud Storage staging bucket    dataproc-78f5e64b-a26d-4fe4-bcf9-e1b894db9d8f-au-southeast1
Subnetwork  default
Network tags    None
Internal IP only    No
Image version   1.3.14-deb8

This looked like a memory issue, so I tried changing the machine type to n1-highcpu-8 (8 vCPU, 7.2 GB memory). However, I am now unable to launch the instances because I get the following error:

Quota 'CPUS' exceeded. Limit: 24.0 in region australia-southeast1.

So I am not sure what should be done to resolve the issue. I am very new to Google Cloud Platform and I would really appreciate any help in resolving this. This is for a super critical project.

Tushar Mehta

1 Answer


Per the error, you hit the CPU quota limit for your GCP region - australia-southeast1. You have at least two options -

  1. Request a quota increase for Compute Engine CPUs. Visit the Quotas page in IAM, select your region under Location, select "Compute Engine API CPUs", and click "Edit Quota" to request an increase.

    Direct Link (please change "YOUR-GCP-PROJECT-ID") - https://console.cloud.google.com/iam-admin/quotas?project=YOUR-GCP-PROJECT-ID&location=australia-southeast1

  2. Create the Dataproc cluster with a smaller number of worker nodes OR smaller vCPU machine types. If the standard machine types provided don't meet your requirements, try custom machine types (see the sketch below this list).
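
For example, a cluster along those lines could be created with the gcloud CLI. This is only a sketch - the cluster name, worker count, and custom machine type are placeholders rather than values from your question; custom-4-16384 means 4 vCPUs with 16 GB of memory, so 1 master + 3 workers stays at 16 vCPUs, within your 24-CPU quota, while giving each node far more memory than n1-highcpu-4:

# Sketch only: cluster name, worker count, and machine type are assumptions.
# custom-4-16384 = 4 vCPUs / 16 GB RAM; 1 master + 3 workers = 16 vCPUs total.
$ gcloud dataproc clusters create my-cluster \
    --region australia-southeast1 \
    --zone australia-southeast1-b \
    --master-machine-type custom-4-16384 \
    --worker-machine-type custom-4-16384 \
    --num-workers 3

You can then confirm the cluster's shape with "gcloud dataproc clusters describe my-cluster --region australia-southeast1".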

You can also check the CPU quota limit using the gcloud CLI tool -

$ gcloud compute regions list --filter='name=australia-southeast1'
NAME                  CPUS  DISKS_GB  ADDRESSES  RESERVED_ADDRESSES  STATUS  TURNDOWN_DATE
australia-southeast1  0/8   0/2048    0/8        0/1                 UP

Daniel t.
  • Thanks for the suggestion. I have requested a quota increase. Meanwhile, I was wondering what would happen if I created the cluster in a US zone. Would choosing a US zone have a big impact on pricing? – Tushar Mehta Oct 31 '18 at 21:00
  • The US regions/zones would actually be cheaper than the Australian ones for the Compute Engine instances used by Dataproc jobs. To further minimize cost, you can also use preemptible nodes for your Dataproc cluster. Use this link to estimate cost - https://cloud.google.com/dataproc/pricing – Daniel t. Nov 01 '18 at 12:38
  • Thanks for all the suggestions. For now I will be using the US regions/zones. – Tushar Mehta Nov 04 '18 at 23:51