Questions tagged [big-data]

28 questions
10
votes
7 answers

How does one check the identity of huge files if hashing is CPU-bound?

For small files hashing is just fine, but with huge ones you quickly find that md5sum is CPU-bound. Is there any hashing algorithm able to scale out across multiple cores? Any workarounds? Ideas? Anything? :)
poige
  • 9,171
  • 2
  • 24
  • 50
6
votes
2 answers

Can we edit the schema of a BigQuery table after creation?

I made the mistake of specifying a field as integer instead of float. I found that I am not able to correct a field once the table is created. I have to delete and re-create the table to make things right. Does anyone know of a better…
Kim
  • 315
  • 1
  • 5
  • 10
4
votes
1 answer

Reclaiming free space in filegroup with single chronological partition

(Moved here from SO; no comments there.) Question: what is a proper way of reclaiming space in a big (hundreds of GBs) filegroup with a single partition of a table that is chronologically ordered and has no index fragmentation and cannot afford no index…
Jan
  • 151
  • 4
3
votes
1 answer

Apache Spark infrastructure - combining compute and storage nodes

I have an infrastructure question around Apache Spark, which I'm looking at rolling out in a greenfield project with (at most) approximately 4 TB of data used for modelling at any given time. Application domain will be analytics, and training of the…
3
votes
2 answers

What is the best way to store images for a website?

We are using a Cassandra database to store website information, but we are not sure how to save images. We can store them in Cassandra, but we can also allocate a server for storing images. Cassandra has good performance for big-data storage but if…
Omid Ebrahimi
  • 143
  • 1
  • 1
  • 6
2
votes
0 answers

Cassandra nodetool repair failure - broken pipe

We're trying to check our Cassandra cluster's data integrity with: nodetool repair but after several minutes (~2-10 min), we got strange connection resets / a broken pipe stack trace on the first node: ERROR [STREAM-OUT-/52.xx.xx.xx] 2016-01-14…
Greg M.
  • 41
  • 3
2
votes
2 answers

Just how bad is Network Attached Storage for certain cloud applications?

I have heard it recommended to stay away from AWS hosting for certain "big data" applications (e.g. Hadoop, Cassandra, Solr) because EC2 instances typically use network-attached storage (though more recently there are some high-I/O instances, but…
1
vote
0 answers

spark.dynamicAllocation + setting the Spark parameters according to the Ambari cluster

We want to find values for the following Spark parameters based on inputs such as the memory on each datanode machine, the CPU cores on each datanode machine, the number of datanode machines, etc.: spark.dynamicAllocation.initialExecutors =…
shalom
  • 451
  • 12
  • 26
1
vote
0 answers

Presto maximum concurrent sessions

Presto can't handle many concurrent sessions. What is the maximum number of concurrent sessions per Presto, and how do I set the parameter for this? And how do I handle its maximum JVM?
user212051
  • 45
  • 1
  • 9
1
vote
1 answer

Cloudera SCM Agent can't heartbeat but port is contactable

I'm trying to add nodes to a Cloudera cluster. When the agent starts I get a Python stack trace saying it can't heartbeat to master-host:7182, however I can connect to that port just fine. The stack trace is from Python and ends saying the connection…
shearn89
  • 3,143
  • 2
  • 14
  • 39
1
vote
0 answers

Range requests time out after 10GB in Apache or Java servlet

There seems to be an issue with serving large (10+GB) files with byte-range requests on our RHEL5 64-bit server. The issue I am noticing is that range requests time out for ranges that cross the 10GB (ten gigabyte) mark, whereas…
Colin D
  • 111
  • 5
1
vote
0 answers

HDFS performance on Apache Spark

I have several issues related to HDFS that may have different roots. I'm posting as much information as I can, in the hope that I can get your opinion on at least some of them. Basically the cases are: HDFS classes not found; Connections with…
Bacon
  • 123
  • 7
1
vote
1 answer

Is there a way to send (by mail) external disks to be installed in Azure?

We have many TBs of data on external disks (WD Passports), and wish to process it using Azure's VMs. Uploading will take forever (and the bandwidth will probably cost too much). Is there a way to send a package with those Passports to an Azure…
Paul Oyster
  • 145
  • 2
0
votes
1 answer

HDFS balancing: how to balance HDFS data?

We have Hadoop version 2.6.4. On the datanode machine we can see that the HDFS data isn’t balanced. Some disks have different used sizes, e.g. sdb 11G and sdd 17G: /dev/sdd 20G 3.0G 17G 15% /grid/sdd /dev/sdb 20G 11G 9.3G 53% /grid/sdb <-- WHY…
shalom
  • 451
  • 12
  • 26
0
votes
0 answers

Avoid Kafka disks becoming 100% used, by cron job

We want to suggest the following, based on our issues with Kafka disks. We have many HDP clusters (based on Ambari, with all machines running Red Hat 7.2). Each cluster includes 3 Kafka machines, and each Kafka machine has a disk of ~15 TB. Because we…
shalom
  • 451
  • 12
  • 26