I am looking for a quick and easy program to estimate FLOPS on my Linux system. I found HPL, but getting it compiled is proving to be irritating. All I need is a ballpark estimate of the FLOPS, without needing to spend a day researching benchmark packages and installing dependent software. Does any such program exist? Would it be sufficient to write a C program that multiplies two floats in a loop?
6 Answers
Apparently there's a "sysbench" benchmark package and command:
sudo apt-get install sysbench
(or brew install sysbench on OS X)
Run it like this:
sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run
Example output for comparison:
total time: 15.3047s
ref: http://www.midwesternmac.com/blogs/jeff-geerling/2013-vps-benchmarks-linode
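Note: newer sysbench releases (1.x) dropped the --test= and --num-threads= options; if I remember the syntax correctly, the equivalent invocation is:
sysbench cpu --cpu-max-prime=20000 --threads=2 run
Either way the result is a prime-computation time (newer versions also print an events-per-second figure), not FLOPS, so treat it as a generic relative CPU benchmark.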
- How does this give the FLOPS? – Martin Thoma Dec 21 '16 at 11:00
- Looks like it's more of a generic "cpu benchmark"; see also http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html – rogerdpack Aug 20 '18 at 13:47
The question is what you mean by FLOPS. If all you care about is how many of the simplest floating point operations the CPU can do per clock, it is probably about 3x your clock speed, but that is about as meaningless as bogomips. Some floating point ops take a long time (divide, for starters); add and multiply are typically quick (one per FP unit per clock).
The next issue is memory performance. There is a reason the last classic CRAY had 31 memory banks: ultimately CPU performance is limited by how fast you can read and write memory, so what level of caching does your problem fit in? Linpack was a real benchmark once; now it fits in cache (L2 if not L1) and is more of a pure theoretical CPU benchmark. And of course, your SSE (etc.) units can add floating point performance too.
What distro do you run?
This looked like a good pointer: http://linuxtoolkit.blogspot.com/2009/04/intel-optimized-linpack-benchmark-for.html
http://onemansjourneyintolinux.blogspot.com/2008/12/show-us-yer-flops.html
http://www.phoronix-test-suite.com/ might be an easier way to install a flops benchmark.
Still, I do wonder why you care and what you are using it for. If you just want a meaningless number, your system's bogomips is still right there in dmesg.
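As for the question's idea of just multiplying two floats in a loop: that can give a very rough ballpark, provided the compiler doesn't optimize the loop away. A minimal, untuned sketch (hypothetical code, scalar and single-threaded, so it will understate what vectorized code such as Linpack achieves):
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Very crude scalar estimate: one multiply + one add per iteration.
       volatile keeps the compiler from folding the loop away.
       Each iteration depends on the previous one, so this measures FP
       latency and will understate SIMD/FMA throughput. */
    const long iters = 200000000L;
    volatile double x = 1.000000001;
    volatile double y = 0.0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        y = y + x * 1.000000002;          /* 2 floating point ops per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("~%.1f MFLOPS (scalar, 1 thread), y=%f\n", 2.0 * iters / secs / 1e6, y);
    return 0;
}
Compile with something like gcc -O2 flops.c -o flops (very old glibc may need -lrt for clock_gettime). The result is per thread and ignores SIMD/FMA, so Linpack or datasheet peak numbers will typically be several times higher per core.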
- Phoronix seems to be exactly what I was looking for - thank you! The only reason I wanted this was because I was filling out a survey that asked how many teraflops of computing power I have. The survey wasn't terribly important, so I wasn't concerned about the accuracy of the answer. Still, it would be kind of neat to be able to say, "Our cluster can do X teraflops." Though as you point out, that number doesn't necessarily have much real-world meaning. – molecularbear Nov 26 '09 at 02:06
For ballpark estimates:
- Raspberry Pi 2: 299.93 * 10^6 FLOPS (source)
- Raspberry Pi 3: 462.07 * 10^6 FLOPS (source)
- GTX Titan Black GPU: 5.1 * 10^12 FLOPS (source)
- Sunway TaihuLight: 93 * 10^15 FLOPS (source, record holder of 2016)
Linpack
- Download it (link)
- Extract it
- cd benchmarks_2017/linux/mkl/benchmarks/linpack
- ./runme_xeon64
- Wait for quite a while (more than 1 hour)
On a Thinkpad T460p (Intel i7-6700HQ CPU), it gives:
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
./runme_xeon64: 33: [: -gt: unexpected operator
Mi 21. Dez 11:50:29 CET 2016
Intel(R) Optimized LINPACK Benchmark data
Current date/time: Wed Dec 21 11:50:29 2016
CPU frequency: 3.491 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=9800701024, at the size=35000
=================== Timing linear equation system solver ===================
Size LDA Align. Time(s) GFlops Residual Residual(norm) Check
1000 1000 4 0.014 46.5838 1.165068e-12 3.973181e-02 pass
1000 1000 4 0.010 64.7319 1.165068e-12 3.973181e-02 pass
1000 1000 4 0.009 77.3583 1.165068e-12 3.973181e-02 pass
1000 1000 4 0.010 67.0096 1.165068e-12 3.973181e-02 pass
2000 2000 4 0.064 83.6177 5.001027e-12 4.350281e-02 pass
2000 2000 4 0.063 84.5568 5.001027e-12 4.350281e-02 pass
5000 5008 4 0.709 117.6800 2.474679e-11 3.450740e-02 pass
5000 5008 4 0.699 119.2350 2.474679e-11 3.450740e-02 pass
10000 10000 4 4.895 136.2439 9.069137e-11 3.197870e-02 pass
10000 10000 4 4.904 135.9888 9.069137e-11 3.197870e-02 pass
15000 15000 4 17.260 130.3870 2.052533e-10 3.232773e-02 pass
15000 15000 4 18.159 123.9303 2.052533e-10 3.232773e-02 pass
18000 18008 4 31.091 125.0738 2.611497e-10 2.859910e-02 pass
18000 18008 4 31.869 122.0215 2.611497e-10 2.859910e-02 pass
20000 20016 4 44.877 118.8622 3.442628e-10 3.047480e-02 pass
20000 20016 4 44.646 119.4762 3.442628e-10 3.047480e-02 pass
22000 22008 4 57.918 122.5811 4.714135e-10 3.452918e-02 pass
22000 22008 4 57.171 124.1816 4.714135e-10 3.452918e-02 pass
25000 25000 4 86.259 120.7747 5.797896e-10 3.297056e-02 pass
25000 25000 4 83.721 124.4356 5.797896e-10 3.297056e-02 pass
26000 26000 4 97.420 120.2906 5.615238e-10 2.952660e-02 pass
26000 26000 4 96.061 121.9924 5.615238e-10 2.952660e-02 pass
27000 27000 4 109.479 119.8722 5.956148e-10 2.904520e-02 pass
30000 30000 1 315.697 57.0225 8.015488e-10 3.159714e-02 pass
35000 35000 1 2421.281 11.8061 1.161127e-09 3.370575e-02 pass
Performance Summary (GFlops)
Size LDA Align. Average Maximal
1000 1000 4 63.9209 77.3583
2000 2000 4 84.0872 84.5568
5000 5008 4 118.4575 119.2350
10000 10000 4 136.1164 136.2439
15000 15000 4 127.1586 130.3870
18000 18008 4 123.5477 125.0738
20000 20016 4 119.1692 119.4762
22000 22008 4 123.3813 124.1816
25000 25000 4 122.6052 124.4356
26000 26000 4 121.1415 121.9924
27000 27000 4 119.8722 119.8722
30000 30000 1 57.0225 57.0225
35000 35000 1 11.8061 11.8061
Residual checks PASSED
End of tests
Done: Mi 21. Dez 12:58:23 CET 2016
- What a nice, convenient tool! I wonder if there's an equivalent test for AMD CPUs so I could compare our two servers (one AMD, one Intel). – Waldir Leoncio Sep 14 '22 at 17:07
I highly recommend the ready-to-run linpack build from Intel: http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download/
As you mention a cluster: we have used the HPCC suite. It takes a bit of effort to set up and tune, but in our case the point wasn't bragging per se; it was part of the acceptance criteria for the cluster. Some performance benchmarking is IMHO vital to ensure that the hardware works as advertised, everything is cabled together correctly, etc.
Now if you just want a theoretical peak FLOPS number, that one is easy. Just check out some article about the CPU (say, on realworldtech.com or somesuch) to get info on how many DP FLOPS a CPU core can do per clock cycle (with current x86 CPUs that's typically 4). Then the total peak FLOPS is just
number of cores * FLOPS/cycle * frequency
Then for a cluster with an IB network you should be able to hit around 80% of the peak FLOPS on HPL (which, BTW, is one of the benchmarks in HPCC).
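For example (hypothetical numbers, purely to illustrate the formula): a quad-core CPU at 3.5 GHz whose cores can each retire 4 DP FLOPS per cycle has a theoretical peak of
4 cores * 4 FLOPS/cycle * 3.5e9 Hz = 56 GFLOPS
Keep in mind that newer x86 cores with AVX/FMA can do considerably more than 4 DP FLOPS per cycle, so look up the figure for your particular microarchitecture.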
One benchmark that has traditionally been used to measure FLOPS is Linpack. Another common FLOPS benchmark is Whetstone.
More reading: the Wikipedia entries on FLOPS, Whetstone, and Linpack.
- I appreciate your answer, however my goal is to obtain a quick n' dirty estimate of flops. Whetstone and Linpack have the same problem as HPL - I start reading about it, then get lost in site after site that all look 20 years old. When I do manage to find source code, I can't seem to compile it without installing a bunch of dependent libraries - even then I run into errors. I could get all this stuff working, but it's not important enough to spend the time. Hopefully there exists some relatively modern software that Just Works for ballparking flops. – molecularbear Nov 25 '09 at 22:32