Implement an IEEE 754 64-bit binary floating point number through integer manipulation



(I've tagged the question "C" for the time being, but if you're aware of another language that supports unions, you can also use that.)

Your task is to build the four standard mathematical operators + - * / for the following struct:

union intfloat{
    double f;
    uint8_t h[8];
    uint16_t i[4];
    uint32_t j[2]; 
    uint64_t k;
    intfloat(double g){f = g;}
    intfloat(){k = 0;}

such that the operations themselves only ever manipulate or access the integer part (so no comparing with the double anytime during the operation either), and the result is exactly the same (or functionally equivalent, in the case of non-numeric results such as NaN) as if the corresponding mathematical operation had been applied straight to the double instead.

You may choose which integer part to manipulate, perhaps even using different ones among different operators. (You can also choose to remove the "unsigned" from any of the fields in the union, although I'm not sure whether you'd want to do that.)

Your score is the sum of the length of code in characters for each of the four operators. Lowest score wins.

For those of us unfamiliar with the IEEE 754 specification, here is an article about it on Wikipedia.


03-06 08:47 Added constructors to the intfloat struct. You are allowed to use them for testing, rather than manually setting the double / etc.

Joe Z.

Posted 2013-03-01T02:15:57.073

Reputation: 30 589

1Yikes! Tell me you have a solution. – dmckee --- ex-moderator kitten – 2013-03-01T04:13:30.133

I'll come up with one eventually. It'll be ungolfed and act as a last-place example entry. – Joe Z. – 2013-03-01T04:22:41.753

4Hmmm...perhaps it would be better to define intstruct in terms of uint8_8, uint16_t and so on as the absolute sizes of short, int and so on are not defined by the standard (each type has a minimum size and there is a strict ordering in size, but that's it). – dmckee --- ex-moderator kitten – 2013-03-01T04:26:53.397

1I guess that this a great (and challenging) practice even if ungolfed – John Dvorak – 2013-03-01T06:49:13.687

3This question could use documentation of how rounding is handled and a good test suite. – Peter Taylor – 2013-03-01T07:59:12.083

Do we need full IEEE-754 compliance, e.g. proper handling of INF, NAN, denormals, etc ? – Paul R – 2013-03-01T10:07:05.807

@dmckee: Noted. I'll change them to their fixed bit sizes. – Joe Z. – 2013-03-01T12:02:04.637

@PeterTaylor Rounding documentation is probably in IEEE 754 (I believe most computers will round half to even). As for a test suite, that's a good question. I'm not sure how I'd go about writing that. – Joe Z. – 2013-03-01T13:26:18.507

@PaulR Yes, if the C/C++ compiler will do the same. – Joe Z. – 2013-03-01T13:29:19.800

4I'm sure it is in the spec, but the real spec is going to cost a few hundred USD. There are probably descriptions which are available for free, but IMO the onus should be on the question-asker to include that kind of details (or at least a link to a site which is likely to still be around in a couple of years) within the question rather than on the answerers to go searching themselves for the necessary materials to know what the question actually wants. – Peter Taylor – 2013-03-01T13:48:09.300

Understood. In that case, you'll have to wait a while because I need to go looking for a spec. I've been relying on the Wikipedia article. – Joe Z. – 2013-03-01T13:52:14.027

I'm currently creating the reference implementation. It's harder than it looks :S – Joe Z. – 2013-03-05T21:59:54.327



C++, ~1500 chars

Expands the floats into an 8000-binary-digit fixed-point representation, does the operations on that, then converts back.

// an "expanded" float.                                                                                                         
// n is nan, i is infinity, z is zero, s is sign.                                                                               
// nan overrides inf, inf overrides zero, zero overrides digits.                                                                
// sign is valid unless nan.                                                                                                    
// We store the number in fixed-point, little-endian.  Binary point is                                                          
// at N/2.  So 1.0 is N/2 zeros, one, N/2-1 zeros.                                                                              
#define N 8000
struct E {
  int n;
  int i;
  int z;
  long s;
  long d[N];
#define V if(r.n|r.i|r.z)return r
// Converts a regular floating-point number to an expanded one.                                                                 
E R(F x){
  long i,e=x.k<<1>>53,m=x.k<<12>>12;
  E r={e==2047&&m!=0,e==2047,e+m==0,x.k>>63};
  if(e)m+=1L<<52;else e++;
  return r;
E A(E x,E y){
  int i,c,v;
  if(x.s>y.s)return A(y,x);
  E r={x.n|y.n|x.i&y.i&(x.s^y.s),x.i|y.i,x.z&y.z,x.i&x.s|y.i&y.s|~x.i&~y.i&x.s&y.s};V;
  return r;
E M(E x, E y){
  int i;
  E r={x.n|y.n|x.i&y.z|x.z&y.i,x.i|y.i,x.z|y.z,x.s^y.s};V;
  E s={0,0,1};
  return s;
// 1/d using Newton-Raphson:                                                                                                    
// x <- x * (2 - d*x)                                                                                                           
E I(E d){
  int i;
  E r={d.n,d.z,d.i,d.s};V;
  E t={};t.d[4001]=1;
  E x={0,0,0,d.s};x.d[N-i]=1;
  return x;
// Convert expanded number back to regular floating point.                                                                      
F C(E x){
  long i,j,e,m=0;
  if(i>0&x.d[i-53]&(j<i-53|x.d[i-52])){E y={0,0,0,x.s};y.d[i-53]=1;return C(A(x,y));}
  else if(i<5024){e=i-2977;for(j=0;j<52;j++)m+=x.d[i+j-52]<<j;}
  else x.i=1;
  F y;y.k=x.s<<63|e<<52|m;return y;
// expand, do op, unexpand                                                                                                      
F A(F x,F y){return C(A(R(x),R(y)));}
F S(F x,F y){y.k^=1L<<63;return A(x,y);}
F M(F x,F y){return C(M(R(x),R(y)));}
F D(F x,F y){return C(M(R(x),I(R(y))));}

I'm too lazy to remove all the spaces and newlines to get an exact golfed count, but it is about 1500 characters.

Keith Randall

Posted 2013-03-01T02:15:57.073

Reputation: 19 865