Questions tagged [hpc]

High Performance Computing encompasses the use of "supercomputers" with large numbers of CPUs, large parallel storage systems, and advanced networks to perform time-consuming calculations. Parallel algorithms and parallel storage are essential to this field, as are issues surrounding complex, fast networking fabrics such as InfiniBand.

High Performance Computing (HPC) encompasses many aspects of traditional computing and is used by a variety of fields including, but not limited to, particle physics, computer animation/CGI for major films, cancer and genomics research, and climate modeling. HPC systems, sometimes called 'supercomputers', are typically large numbers of high-performance servers with many CPUs and cores, interconnected by a high-speed fabric or network.

A list of the 500 fastest computers on the planet (the TOP500) is maintained, as is a list of the 500 most energy-efficient systems. The performance of these systems is measured with the LINPACK benchmark, though a newer benchmark based on a conjugate gradient method (HPCG), which is more representative of modern HPC workloads, has also been introduced. IBM, Cray and SGI are major manufacturers of HPC systems and software, though over 70% of the systems on the TOP500 list are based on Intel platforms.

Interconnect fabric technology is also crucial to HPC systems, many of which rely on internal low-latency, high-bandwidth networks such as InfiniBand. In addition to interconnect technology, GPUs and coprocessors have been gaining popularity for their ability to accelerate certain types of workloads.

Software is an additional concern for HPC systems, as typical programs are not designed to run at such a large scale. Many hardware manufacturers also produce their own software stacks for HPC systems, which include compilers, drivers, parallelization and math libraries, system management interfaces, and profiling tools specifically designed to work with the hardware they produce.
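
As an illustration of the parallelization libraries such stacks provide, below is a minimal sketch of an MPI program in C. It assumes an MPI implementation (for example Open MPI or MPICH) is installed and that the binary is launched with mpirun; the file name and process count are illustrative, not part of any particular vendor stack.

    /* hello_mpi.c - minimal MPI sketch: each process reports its rank,
     * the total number of processes, and the node it is running on.
     * Compile: mpicc hello_mpi.c -o hello_mpi
     * Run:     mpirun -np 4 ./hello_mpi
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, hostlen;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        MPI_Get_processor_name(host, &hostlen); /* name of the node          */

        printf("rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();                         /* shut the runtime down     */
        return 0;
    }

In a real cluster, such a program would typically be launched across many nodes by a resource manager such as Slurm rather than by hand.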

Most HPC systems use a modified Linux kernel stripped down to only the essential components required to run the software on the supplied hardware. Many modern HPC systems are set up in a 'stateless' manner, meaning that no OS data is stored locally on compute nodes; instead, an OS image is loaded into RAM, typically over the network using PXE boot. This allows nodes to be rebooted into a clean, known-good working state, which is desirable because it is sometimes difficult to cleanly terminate processes that were running in parallel across several nodes.

113 questions
11
votes
5 answers

How to allow users to transfer files to other users on linux

We have an environment of a few thousand users running applications on about 40 clusters ranging in size from 20 compute nodes to 98,000 compute nodes. Users on these systems generate massive files (sometimes > 1PB) controlled by traditional unix…
Jon Bringhurst
  • 251
  • 2
  • 8
10
votes
1 answer

Why does requesting GPUs as a generic resource on a cluster running SLURM with the built-in plugin fail?

Disclaimer: This post is quite long as I tried to provide all relevant configuration information. Status and Problem: I administer a GPU cluster and I want to use Slurm for job management. Unfortunately, I cannot request GPUs using the respective…
Pixchem
  • 161
  • 1
  • 9
7
votes
1 answer

XFS Adding Quotas - Skip Quota check on first mount/boot

We run a 14TB XFS fileserver on our cluster and want to add quota support. This is running the 3.9.2-1.el6.elrepo.x86_64 kernel under CentOS 6.3 (Final). The issue is that when we unmount the XFS RAID and re-mount it with quota support added, the mount command…
Adam
  • 366
  • 3
  • 6
6
votes
4 answers

using i7 "gamer" cpu in a HPC cluster

I'm running the WRF weather model. It's a RAM-intensive, highly parallel application, and I need to build an HPC cluster for it. I use a 10GB InfiniBand interconnect. WRF doesn't depend on core count, but on memory bandwidth. That's why a Core i7 3820 or…
user1219721
  • 467
  • 1
  • 6
  • 15
4
votes
1 answer

What makes Lustre faster and more scalable than NFS?

I have read in various places (e.g. here and here) that NFS' I/O performance does not scale, while Lustre's does, and that Lustre can deliver better I/O rates in general. There seem to be various architectural differences between the two, but I…
4
votes
2 answers

View Infiniband routing table generated by OpenSM?

As I understand it, the subnet manager of an Infiniband network calculates the best routes between each pair of nodes on the network and provides these routes to the nodes when they want to communicate. Is there any way to get the subnet manager…
ajdecon
  • 1,291
  • 4
  • 14
  • 21
4
votes
4 answers

Set up simple Infiniband Block Storage (SRP or iSER)

I'm trying to figure out how to set up a simple storage system which exports block storage over Infiniband, using either SRP or iSER. I'm very early in the process, and at the moment I'm basically just looking for a tutorial on the level of, "You…
ajdecon
  • 1,291
  • 4
  • 14
  • 21
4
votes
1 answer

Why are most supercomputers using Linux?

Referring to this BBC article, Supercomputing superpowers: almost all supercomputers use Linux as their operating system. Why is Linux so popular?
sdc
4
votes
3 answers

Is this an HPC or HA MySQL cluster?

Can someone tell me whether this is a High Performance Computing or Highly Available MySQL cluster? There is a picture of the setup. This is part of the config.ini they talk about: [ndbd default] NoOfReplicas=2 # Number of replicas Is it correct…
Louise Hoffman
  • 476
  • 2
  • 6
  • 12
3
votes
1 answer

Microsoft HPC Pack 2012 R2 does not run with Network Direct after joining new domain

I am working with a 13 computer cluster, running on Windows Server 2012 R2, using MS HPC Pack 2012 R2. The headnode is working properly. The servers are connected to the corporate network via IPv4 on standard adapters. The nodes however are also…
3
votes
1 answer

Intel Xeon 6134 + One DIMM per channel or two DIMMs per channel for maximum memory bandwidth?

I'm unable to find this critical piece of information in spec sheets; I'd appreciate any insight. We're purchasing servers for HPC work with Intel Xeon Gold 6134 (Skylake) CPUs. I want maximum memory bandwidth, and am not concerned about the total amount of…
3
votes
3 answers

Randomize Slurm Node Allocation

Has anyone had luck randomizing Slurm node allocations? We have a small cluster of 12 nodes that could be used by anywhere from 1-8 people at a time with jobs of various size/length. When testing our new Slurm setup, jobs always go to the first node…
tnallen
  • 31
  • 1
3
votes
1 answer

MAAS for a diskless computational HPC cluster

I'm considering using MAAS to deploy the OS for a computational cluster. All nodes are diskless; only the head node and (probably) the MAAS rack controller will have hard drives. It seems MAAS has to finish commissioning a node before using it, but how…
rth
  • 135
  • 4
3
votes
2 answers

Does Xeon Phi work with i7 CPUs?

Do Xeon Phi coprocessors work with i7 CPUs? They're advertised for use with Xeons, but for my app (WRF), an i7-3930K performs better and is three times cheaper than high-grade Xeons. So I wonder whether I could use a Xeon Phi with an i7 CPU?
user1219721
  • 467
  • 1
  • 6
  • 15
3
votes
1 answer

Which is the fastest way to move 1 petabyte from one storage system to a new one?

First of all, thanks for reading, and sorry for asking something related to my job. I understand that this is something I should solve by myself, but as you will see, it's a bit difficult. A small description: Now Storage => 1PB using…
Marc Riera
  • 1,587
  • 4
  • 21
  • 38