Why does Nvidia Pascal have both FP32 and FP64 cores? Why can't I use them simultaneously?


I am trying to understand Nvidia's GPU architecture, but I am a bit stuck on something that appears to be quite simple. Each streaming multiprocessor in Pascal (GP100) consists of 64 FP32 cores and 32 FP64 cores. Here are my two questions:

  • Why did Nvidia put both FP32 and FP64 units in the chip? Why not just put in FP64 units capable of performing 2xFP32 operations per instruction (like the SIMD instruction sets in CPUs)?
  • Why can't I use all the FP32 and FP64 units at the same time?

I guess both are hardware design decisions, but I would like to know more details about this topic. Any information on this is more than welcome!

EDIT1:

  • If it is possible to do FP32 and FP64 at the same time, does this mean that a GPU with 8 TFLOPS SP and 4 TFLOPS DP can (theoretically) give you 12 TFLOPS of mixed-precision throughput?
    • In the case of CUDA, how is this achieved? Do I just use doubles and floats at the same time in my kernel? Or do I need to pass some kind of flag to NVCC?

AstrOne

Posted 2017-01-22T10:09:00.620

Reputation: 131

FP64 units probably use significantly more die area and (as a result) draw more power than an FP32 core. Using only FP64 units would waste much more power and give you fewer cores, meaning less performance for plain FP32 tasks in a given chip size. In home consumer situations this would be an unacceptable power waste and performance penalty, as most games do not need FP64. Putting more cores in one chip could compensate, but larger chips mean fewer chips per fabrication wafer and thus a higher cost per chip. – Mokubai – 2017-01-22T10:30:12.373

An FP64 unit that can merge, calculate, and then re-split 2xFP32 instructions probably needs a lot of control logic, either in hardware (more wasted space) or in software (which would lose performance). – Mokubai – 2017-01-22T10:35:23.453

Very few, if any, consumer applications use FP64 functionality. Having full FP64 on consumer cards would drive up manufacturing cost and power consumption because a significant amount of die space is used for functionality which is useless to most consumers. It would also mean lower performance for games; today's high-end GPUs are often power- and heat-constrained and wasting power on unused functionality reduces the power available for useful work. – bwDraco – 2017-02-13T02:14:08.277

It's also partly for market segmentation reasons. Given that it's mostly a specialized market that needs GPU-accelerated FP64, limiting full FP64 performance to special cards designed for these markets would allow them to charge a much higher price for these cards. The higher price covers the additional (very expensive) validation and vendor certifications required for critical business applications and increases profit margins. See also: Why do workstation graphics cards cost far more than equivalent consumer graphics cards?

– bwDraco – 2017-02-13T02:26:28.283

Answers

Why did Nvidia put both FP32 and FP64 units in the chip?

I think it's about market penetration: selling to as many markets as possible. Without FP64, scientific researchers couldn't even try a demo of scientifically important GPGPU software that uses FP64 (and even games could use some double precision on occasion). Without FP32, game physics and simulations would be very slow, or the GPU would need a nuclear reactor. Without FP16, there are no fast neural networks. If there were only FP32, a neural-network simulation would run at half speed, and some FP64 summations wouldn't work.

Who knows, maybe in the future there will be dedicated FP_raytrace cores that do ray tracing ultra fast, so there would be no more painful DX9 → DX11 → DX12 upgrades, and better graphics.

Ultimately, I wouldn't say no to an FPGA-based GPU that could convert some of its cores from FP64 to FP32 (or to special-function units) for one application, then convert everything to FP64 for another, or even merge everything into a single fat core for sequential work (such as compiling shaders). This would benefit people who do many different things on one computer. For example, I may need more multiplications than additions, and an FPGA could help there. But for now, money talks, and it says "fixed function"; the best income is achieved with a mixture of FP64 and FP32 (and, lately, FP16).

Why not just put FP64 units that are capable of performing 2xFP32 operations per instruction (like the SIMD instruction sets in CPUs)?

SIMD always expects the same operation across multiple data elements, which is less fun for scalar GPGPU kernels. Also, making 2xFP32 out of an FP64 unit would need more transistors than pure FP64, more heat, and maybe more latency.

More transistors mean a higher probability of production failure, so a GPU with 1024 plain FP32 units could probably be produced with better yield than one with 512 flexible FP64 units.

Why can't I use all FP32 and FP64 units at the same time?

Mixed-precision computing can be done in CUDA and OpenCL, so you can get even faster by using all the units, but it is only applicable in situations that are not memory-bottlenecked, which are rare and hard to code.

Answer to edit 1:

Here is a detailed source: http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Accelerating_GPU_Computation_Through_Mixed-Precision_Methods.pdf

Long story short, the numbers don't simply add up: there are diminishing returns that prevent 100% scaling across all cores, because extra cycles are needed between calculations at different precisions. When the precisions are not mixed, extra iterations are needed between blocks, which also prevents 100% scaling. It seems more useful as a way of speeding "FP64" up rather than slowing "FP32" down (though having many FP64 cores should still be beneficial for boosting FP32; you could test this with something like an n-body kernel, which is not memory-bottlenecked). FP64 consumes a lot of memory bandwidth (and cache lines, and local memory), which is why I suggested the n-body algorithm: it re-uses some data N times (for N > 64k, for example). My GPU has only 1/24 FP64 rate, so I can't trust my own computer for this test. Do you have a Titan? You should try it; maybe it has 50% more throughput than its advertised GFLOPS value (but the advertised TDP value could limit its frequency that way, or it melts down).

This source: http://www.nvidia.com/content/nvision2008/tech_presentations/NVIDIA_Research_Summit/NVISION08-Mixed_Precision_Methods_on_GPUs.pdf

It says "outstanding performance and accuracy", but I couldn't find a game physics solver using FP32 + FP32 (truncated FP64). Maybe it's money talking again: if someone made one, it might be "outstanding performance and meltdown" in gaming (maybe worse than FurMark blowing up GPUs).

People even use integers (integer dot products) on top of floats here: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

In the case of CUDA, how is this achieved? Do I just use doubles and floats at the same time in my kernel? Or do I need to pass some kind of flag to NVCC?

An example of iterative refinement using FP64 + FP32 in the same function:

https://www.sciencesmaths-paris.fr/upload/Contenu/HM2012/07-dongarra_part2.pdf

pages 26-28.


For the OpenCL part, here is AMD Evergreen (the HD 5000 series), capable of issuing 1 DP FMA + 1 SP (or 1 SF) operation every cycle:

http://www.microway.com/download/whitepaper/gpgpu_architecture_and_performance_comparison_2010.pdf

I'll test something like an n-body kernel tomorrow on my R7 240, whose FP64 rate is 1/24th or 1/26th of its FP32 rate.

Edit: it's working.

__kernel void sumGPU(__global float *a, __global float *b)
{
  int idx = get_global_id(0);
  float a0 = a[idx];

  // A long chain of dependent FP32 additions, used as micro-benchmark filler.
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;

  // Enable exactly one of the next two lines: the first routes one addition
  // through the FP64 units, the second keeps it on the FP32 units.
  //a0 = convert_float(convert_double(a0) + 2.0);
  //a0 += 2.0f;

  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;
  a0 += 2.0f;

  b[idx] = a0;
}

The kernel time switches between 13.02 ms and 12.85 ms depending on which one of the two commented-out lines is enabled (the FP64 conversion vs. the plain FP32 addition).


Note: the cores are not FP32 units by themselves. In fact, there are no "cores": there are schedulers binding hardware resources (FP32 units, FP64 units, special-function units, registers) to the kernel instructions of threads. The threads are not real threads either. So when you use FP32, then FP64, then FP32, then an FP64 square root, the scheduler reserves the necessary resources as they are needed; when they are not needed, they become available to other work-items. (But I suspect a single work-item can't use more than 1-2 FP32 ALUs at once; I don't know, I made that up.)


Edit (2018/03): Is FP_raytrace (from the second paragraph of this answer, above) becoming reality?

(NVIDIA) https://www.geforce.com/whats-new/articles/nvidia-rtx-real-time-game-ray-tracing

(AMD) https://www.gamingonlinux.com/articles/amd-has-announced-radeon-rays-an-open-source-ray-tracing-sdk-using-vulkan.11461

Or is it another marketing gimmick? If there is a hardware side to it, then ray-tracing people can work faster, but it wouldn't help a MOBA gamer or a ray-tracer-less physics simulator. Why would I pay more for those ray tracers if I'm only going to edit some videos? Maybe these will get segmented like the others, but probably for more money.

huseyin tugrul buyukisik

Posted 2017-01-22T10:09:00.620

Reputation: 166

Hello my friend. Thank you so much for your response. I am going to add one more question to my post regarding mixed-precision computing. I wasn't aware that mixed precision was possible. I am going to accept your answer by the end of the day today, regardless of whether you answer the new question or not. Of course, I would be grateful if you did! :) Thanks again. – AstrOne – 2017-02-13T00:30:03.533

Added a source; now looking into the OpenCL part. – huseyin tugrul buyukisik – 2017-02-13T01:01:30.133

I just added my working OpenCL example, on my card with its poor 1/16 or 1/24 FP64 rate. The conversion from float to double must be hidden behind some more FP32 work. A Titan would work with 3-4 lines of mixed code. – huseyin tugrul buyukisik – 2017-02-13T10:11:09.563

It also made the GPU 1°C hotter. – huseyin tugrul buyukisik – 2017-02-13T10:20:37.393

@huseyintugrulbuyukisik: yes, hardware-accelerated ray-tracing is real. The new GeForce RTX 2000 series (Turing architecture) has dedicated ray-tracing cores. – bwDraco – 2019-01-01T20:48:40.107