Questions tagged [apache-spark]

29 questions
1 vote, 0 answers

Ambari cluster: when do we need to set block replication to 1?

We get the following in Spark logs: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage DatanodeInfoWithStorage\ The current…
shalom • 451 • 12 • 26
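
For the question above, a minimal Scala sketch of one way to lower HDFS block replication for a single Spark job's output, assuming it is acceptable to write those files with replication 1 (the path and data below are placeholders):

    import org.apache.spark.sql.SparkSession

    object ReplicationOneSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("replication-one-sketch").getOrCreate()

        // dfs.replication is a standard HDFS client property; setting it on the job's
        // Hadoop configuration affects only files this job writes, not the cluster default.
        spark.sparkContext.hadoopConfiguration.set("dfs.replication", "1")

        val df = spark.range(1000).toDF("id")                                   // placeholder data
        df.write.mode("overwrite").parquet("hdfs:///tmp/replication_one_demo")  // placeholder path

        spark.stop()
      }
    }
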
1 vote, 1 answer

How is the number of RDD partitions decided in Apache Spark?

Question How is the number of partitions decided by Spark? Do I need to specify the number of available CPU cores somewhere explicitly so that the number of partitions will be the same (such as numPartition arg of parallelize method, but then need…
mon • 225 • 3 • 9
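
A short Scala sketch illustrating the defaults the question above asks about: sc.defaultParallelism, the optional numSlices argument of parallelize, and getNumPartitions for inspecting the result. Not an authoritative answer, just a way to observe the behaviour on a given cluster:

    import org.apache.spark.sql.SparkSession

    object PartitionCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("partition-count-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Without an explicit argument, parallelize() uses spark.default.parallelism,
        // which on most cluster managers defaults to the total number of executor cores.
        println(s"defaultParallelism = ${sc.defaultParallelism}")

        val implicitRdd = sc.parallelize(1 to 100)
        println(s"implicit partitions = ${implicitRdd.getNumPartitions}")

        // The second argument (numSlices) overrides the default explicitly.
        val explicitRdd = sc.parallelize(1 to 100, 8)
        println(s"explicit partitions = ${explicitRdd.getNumPartitions}")

        spark.stop()
      }
    }
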
1 vote, 0 answers

Throttle Spark Cassandra Connector Reads on a Production Cluster

We're currently running a 24-node Cassandra cluster in production that holds 30 TB of data and handles an average live load of 100k requests per minute, 24/7. We support multiple partners. One of our partners is leaving our org, so we have to filter…
Mano • 11 • 1
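
A hedged Scala sketch of read throttling with the spark-cassandra-connector for the question above. The property names (spark.cassandra.input.readsPerSec, spark.cassandra.input.fetch.sizeInRows) are taken from recent connector releases and should be verified against the reference.conf of the version actually deployed; host, keyspace, table, and limits are placeholders:

    import org.apache.spark.sql.SparkSession

    object ThrottledCassandraRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("throttled-cassandra-read")
          .config("spark.cassandra.connection.host", "cassandra-host")  // placeholder host
          .config("spark.cassandra.input.readsPerSec", "500")           // cap on read requests per core per second
          .config("spark.cassandra.input.fetch.sizeInRows", "500")      // smaller pages per request
          .getOrCreate()

        val df = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // placeholder names
          .load()

        println(df.count())
        spark.stop()
      }
    }
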
1 vote, 0 answers

Fastest way to import files in Spark?

I’m playing around with Spark 3.0.1 and I’m really impressed by the performance of Spark SQL on GBs of data. I’m trying to understand what’s the best way to import multiple JSON files into a Spark dataframe before running the analysis…
int 2Eh • 183 • 1 • 2 • 6
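
A minimal Scala sketch of one common approach to the question above: read all the JSON files in a single call with a glob and supply the schema explicitly so Spark skips the extra inference pass. The path and schema below are assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object MultiJsonImport {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("multi-json-import").getOrCreate()

        // Hypothetical schema; supplying it up front avoids a second scan of the files
        // that Spark would otherwise make to infer it.
        val schema = StructType(Seq(
          StructField("id", LongType),
          StructField("ts", StringType),
          StructField("value", DoubleType)
        ))

        // One read call with a glob picks up every matching file in parallel.
        val df = spark.read
          .schema(schema)
          .json("/data/events/*.json")   // placeholder path

        df.createOrReplaceTempView("events")
        spark.sql("SELECT COUNT(*) FROM events").show()

        spark.stop()
      }
    }
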
1 vote, 1 answer

Zstd parquet decompression

I have a parquet file compressed with zstd. Is it possible to decompress it somehow? I tried to use the zstd command, but without any luck: [x@xyz tmp]# zstd -d part-00016-303a375a-e443-4f86-a59e-b5d82d15bd26.c000.zstd.parquet -o test.parquet zstd:…
Jacfal • 21 • 3
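
In a zstd-compressed parquet file the compression is applied per column chunk inside the parquet container, which is why the standalone zstd CLI rejects the file; a parquet-aware reader such as Spark decompresses it transparently. A minimal sketch, with the paths as placeholders:

    import org.apache.spark.sql.SparkSession

    object ZstdParquetRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("zstd-parquet-read").getOrCreate()

        // Spark's parquet reader handles the per-chunk zstd decompression itself;
        // no external zstd step is needed.
        val df = spark.read.parquet("/tmp/part-00016-*.zstd.parquet")   // placeholder path

        // Optionally rewrite without compression if a plain parquet copy is wanted.
        df.write.option("compression", "none").parquet("/tmp/decompressed_copy")

        spark.stop()
      }
    }
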
1 vote, 0 answers

How to install cosmosdb spark connector in databricks init script

I tried to install the cosmosdb spark connector (https://docs.microsoft.com/en-us/azure/cosmos-db/spark-connector) in Azure Databricks on a cluster via an init script, but got errors and a non-working cluster (one of the uber libraries has a different…
0 votes, 1 answer

Is it possible to mix different RHEL OS versions in a hadoop cluster?

We are using the following HDP cluster with Ambari. List of nodes and their RHEL versions: 3 master machines (with namenode & resource manager), installed on RHEL 7.2; 312 data-node machines, installed on RHEL 7.2; 5 kafka machines, installed…
shalom • 451 • 12 • 26
0 votes, 0 answers

Any benefits of ZFS over EXT4 for data stream processing on top of HDFS?

I'm working on a data stream processing project in which I will be using Apache Flink and Apache Spark, and I want to use HDFS for storage. The development and testing will be done on a single-node cluster with multiple physical disks. I have already…
HUSMEN • 1 • 2
0 votes, 1 answer

Unable to run Spark Cluster on Google DataProc

I am running a 6-node spark cluster on Google Dataproc, and within a few minutes of launching spark and performing basic operations, I get the below error: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000fbe00000, 24641536, 0)…
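
The os::commit_memory warning generally means the JVM asked for more memory than the node could provide. A hedged Scala sketch of the relevant knobs, with placeholder sizes to be adjusted to the Dataproc machine type (executor memory plus overhead must stay under what YARN can grant; driver memory normally has to be set at submit time and is shown only for completeness):

    import org.apache.spark.sql.SparkSession

    object ConservativeMemorySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("conservative-memory-sketch")
          .config("spark.executor.memory", "4g")           // placeholder: keep below the YARN container limit
          .config("spark.executor.memoryOverhead", "512m") // placeholder: off-heap headroom per executor
          .config("spark.driver.memory", "2g")             // effective only when passed at submit time
          .getOrCreate()

        spark.range(1000000L).selectExpr("sum(id)").show()
        spark.stop()
      }
    }
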
0 votes, 1 answer

How to install hadoop 2.4.1 on Windows with spark 2.0.0

I want to set up a cluster using Hadoop in YARN mode. I want to use the Spark API for map-reduce and will use spark-submit to deploy my applications. I want to work on a cluster. Can anyone help me with how to install Hadoop on a cluster using Windows?
0 votes, 0 answers

Apache Spark Web UI on kubernetes not working as expected

Hi, I'm having a problem. I'm deploying the Apache Spark helm chart on Kubernetes (Bitnami chart): helm repo add bitnami https://charts.bitnami.com/bitnami. Normally the Apache Spark web UI is on port 8080; when I access the web UI, here is what I get: what…
0 votes, 0 answers

How to read files from a directory having name "/" in S3 bucket?

Code: val df = spark.read.csv("s3a://sample_bucket//csvFiles/file.csv"); Error: 22/06/23 20:02:57 WARN impl.MetricsConfig: Cannot locate configuration: tried…
0 votes, 0 answers

Suggestion for Non Analytical Distributed Processing Frameworks

Can someone please suggest a tool, framework, or service to perform the below task faster. Input: a CSV file consisting of an identifier and several image columns, with over a million rows. Objective: to check if any…
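
The objective above is truncated, so as an illustration only: a Scala/Spark sketch that distributes a hypothetical per-row check (here, flagging rows whose image columns are missing or blank) over a large CSV. The column names and the check itself are assumptions, not the asker's actual requirement:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CsvImageColumnCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-image-column-check").getOrCreate()

        // Hypothetical layout: an "id" column plus image columns img1..img3.
        val df = spark.read.option("header", "true").csv("/data/images.csv")  // placeholder path

        // Stand-in for whatever per-cell check the task actually needs:
        // flag rows where any image column is null or blank.
        val imageCols = Seq("img1", "img2", "img3")
        val anyMissing = imageCols
          .map(c => col(c).isNull || trim(col(c)) === "")
          .reduce(_ || _)

        df.withColumn("has_missing_image", anyMissing)
          .filter(col("has_missing_image"))
          .select("id")
          .show(20, truncate = false)

        spark.stop()
      }
    }
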
0 votes, 1 answer

Spark-Cassandra-Connector Issue Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/rdd/reader/RowReaderFactory

What is going wrong with the Spark Cassandra Connector? Could you please help to solve this? Scala file: import com.datastax.spark.connector._ import org.apache.spark.sql.SparkSession import org.apache.spark.{SparkConf, SparkContext} object…
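
A NoClassDefFoundError for com/datastax/spark/connector classes at runtime usually means the connector jar is missing from the driver/executor classpath rather than a bug in the code. A build.sbt sketch, with versions as assumptions to be aligned with the cluster's Spark and Scala versions:

    // build.sbt (sketch) -- the versions below are assumptions
    scalaVersion := "2.12.15"

    libraryDependencies ++= Seq(
      // Spark itself is provided by the cluster at runtime.
      "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
      // The connector must be packaged into the application jar (e.g. with sbt-assembly)
      // or supplied at submit time via:
      //   --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0
      "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0"
    )
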