Questions tagged [hdfs]

For questions regarding the Hadoop Distributed File System (HDFS), which is part of the Apache Hadoop project.

73 questions
1
vote
1 answer

What version of HDFS is compatible with HBase stable?

HBase stable is currently hbase-0.90.4. What version(s) of HDFS is it compatible with?
Aleksandr Levchuk
1
vote
1 answer

Processing pre-existing log files with Flume

I have a large set of log files that I need to extract data from. Is it possible to use Flume to read these files and dump them into HDFS (or Cassandra, or another data store) which I can then query? The documentation seems to suggest it's all…
duckus
1
vote
0 answers

HDFS: how to disable the "du -sk" verification on data node disks

We are using an HDP cluster with 182 data node machines (HDP version 2.6.4, Ambari version 2.6.1). We see the following behavior on the data node machines (it happens on all data node machines and on all disks). When we perform the command as above…
King David
0
votes
1 answer

AWS FSx for Lustre with S3 vs EMR (with EMRFS) for Spark jobs

We are currently using EMR for easy submission of our Spark jobs. Recently I came across the "FSx lustre + S3" solution that is being advertised as ideal for HPC situations. EMRFS, however, is also said to be optimized for this particular…
dimisjim
0
votes
1 answer

Is it possible to mix different RHEL OS versions in a Hadoop cluster?

We are using the following HDP cluster with Ambari. List of nodes and their RHEL versions: 3 master machines (with NameNode and ResourceManager), installed on RHEL 7.2; 312 DataNode machines, installed on RHEL 7.2; 5 Kafka machines, installed…
shalom
0
votes
1 answer

HDFS block deletion speed: cause, expectations, tuning?

I have a small (testing) HDFS cluster which I use as snapshot backup space for Flink. Flink creates and deletes roughly 1000 (small) files per second. The namenode seems to handle this without problems at first, but over time the Number of Blocks…
Caesar
0
votes
0 answers

Any benefits of ZFS over EXT4 for data stream processing on top of HDFS?

I'm working on a data stream processing project in which I will be using Apache Flink and Apache Spark, and I want to use HDFS for storage. The development and testing will be done on a single-node cluster with multiple physical disks. I have already…
HUSMEN
0
votes
1 answer

HDFS balancing: how to balance HDFS data?

We have Hadoop version 2.6.4. On the datanode machine we can see that HDFS data isn’t balanced. On some disks we have different used sizes, such as sdb 11G and sdd 17G: /dev/sdd 20G 3.0G 17G 15% /grid/sdd /dev/sdb 20G 11G 9.3G 53% /grid/sdb <-- WHY…
shalom
0
votes
0 answers

Datanode machine disk sizes

Is it important that the disks on worker (datanode) machines are all the same size? For example, we have an Ambari cluster with 3 worker machines (datanode machines); each datanode machine has 10 disks (7 disks of 50G and 3 disks of 48G…
shalom
0
votes
1 answer

What is affected when running hadoop namenode -format

We have an Ambari cluster (version 2.6) with 24 worker machines. We want to run the following commands only on the worker23 machine (because of a problem on worker23). Do these commands affect the file systems of all the workers, or only worker23? $…
jango
0
votes
1 answer

Copying files in HDFS stalls

We have a 35-node cluster with a high number of blocks in it: ≈450K blocks per data node. After a configuration change (which included rack reassignments and a NameNode Xmx increase), HDFS became a problem. It's unable to perform copy operations on random…
inteloid
0
votes
1 answer

How to install Hadoop 2.4.1 on Windows with Spark 2.0.0

I want to set up a cluster using Hadoop in YARN mode. I want to use the Spark API for map-reduce and will use spark-submit to deploy my applications. I want to work on a cluster. Can anyone help me install Hadoop on a cluster using Windows?
0
votes
1 answer

Why does DFSZKFailoverController kill the NameNode process in Hadoop?

I am trying to configure a Hadoop high availability cluster by following this tutorial: http://www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/ Following that article, I face two main problems: 1. hdfs namenode…
Oleksandr
0
votes
1 answer

Flume: error log while using FileChannel

I am using Flume flume-ng-1.5.0 (with CDH 5.4) to collect logs from many servers and sink them to HDFS. Here is my configuration: #Define Source , Sinks, Channel collector.sources = avro collector.sinks = HadoopOut collector.channels = fileChannel #…
Summer Nguyen
0
votes
1 answer

Hadoop: How to configure failover time for a datanode

I need to re-replicate blocks on my HDFS cluster in case a datanode fails. This already appears to happen after a period of about 10 minutes. However, I want to decrease this time and am wondering how to do so. I tried to set…
frlan