Implement multiplication by a constant with addition and bit shifts

6

While looking at the ARM instruction set, you notice that the ADD instruction has the so-called "Flexible second operand", which can be abused for quick multiplication. For example, the following instruction multiplies register r1 by 17 (shifting it left by 4 bits and adding to itself):

ADD r1, r1, r1, LSL #4
; LSL means "logical shift left"

The subtraction instruction has that feature too. So multiplication by 15 is also easy to implement:

RSB r1, r1, r1, LSL #4
; RSB means "reverse subtract", i.e. subtract r1 from (r1 << 4)

So this is an excellent optimization opportunity! Certainly, when you write in a C program x *= 15, the compiler will translate that to some inefficient code, while you could replace it by x = (x << 4) - x, and then the compiler will generate better code!

Quickly, you come up with the following cunning plan:

  1. Write a program or a subroutine, in any language, that receives one parameter m, and outputs a C optimizing macro, like the following (for m = 15):

    #define XYZZY15(x) (((x) << 4) - (x))
    

    or (which is the same)

    #define XYZZY15(x) (-(x) + ((x) << 4))
    
  2. ???

  3. Profit!

Notes:

  • The macro's name is essential: magic requires it.
  • Parentheses around x and the whole expression are required: ancient C arcana cannot be taken lightly.
  • You can assume m is between 2 and 65535.
  • The input and output can be passed through function parameters, stdin/stdout or in any other customary way.
  • In C, addition and subtraction have tighter precedence than bit shift <<; use parentheses to clarify order of calculations. More parentheses than necessary is OK.
  • It is absolutely vital that the C expression be optimal, so e.g. x+x+x is not acceptable, because (x<<1)+x is better.
  • Expressions x + x and x << 1 are equivalent, because both require one ARM instruction to calculate. Also, (x << 1) + (x << 2) (multiplication by 6) requires two instructions, and cannot be improved.
  • You are amazed to discover that the compiler detects that (x << 0) is the same as x, so x + (x << 0) is an acceptable expression. However, there is no chance the compiler does any other optimizations on your expression.
  • Even though multiplication by 17 can be implemented in one instruction, multiplication by 17*17 cannot be implemented in two instructions, because there is no C expression to reflect that:

    #define XYZZY289(x) x + (x << 4) + (x + (x << 4) << 4) // 3 operations; optimal
    

    In other words, the compiler doesn't do common sub-expression elimination. Maybe you will fix that bug in version 2...

This is code-golf: the shortest code wins (but it must be correct, i.e. produce correct and optimal expressions for all m)

Please also give example outputs for m = 2, 10, 100, 14043 and 65535 (there is no requirement on the length of the expression - it can contain extra parentheses, whitespace and shift left by 0 bits, if that makes your code simpler).

anatolyg

Posted 2015-03-23T20:05:08.177

Reputation: 10 719

For some 2 <= m <= 65535, there will be a worst case value where the number of bitshifts is a maximum. Is it guaranteed that even for this worst case that it will be more efficient to do this many shifts instead of one multiplication? – Digital Trauma – 2015-03-23T21:00:10.803

1Define “optimal.” I count two operations in both x+x+x and (x<<1)+x. – FUZxxl – 2015-03-23T21:12:44.603

1@FUZxxl: ARM can do as a single instruction x+(y<<k), x-(y<<k), and (y<<k)-x (for registers x,y and any constant k). So there is only one operation in your second example. – Keith Randall – 2015-03-23T21:53:01.983

1It's not currently clear to me whether output which uses sub-macros is required or not. Please add a worked example for something like m = 24 where that would make a difference. – Peter Taylor – 2015-03-23T22:25:14.220

1Does the compiler do common subexpression elimination? If I have ((x<<8)+1)+(((x<<8)+1)<<5), is that 3 ops, or 2? – Keith Randall – 2015-03-24T03:28:19.377

@PeterTaylor, I think 24 is just (x+(x<<1))<<3 – gnibbler – 2015-03-24T04:36:52.083

@KeithRandall, that's not a multiplication. Did you mean ((x<<8)+x)+(((x<<8)+x)<<5)? That can be done in 2 steps ADD r1, r1, r1, LSL #8 and ADD r1, r1, r1, LSL #5 – gnibbler – 2015-03-24T04:46:11.500

@DigitalTrauma This is just a code-golf challenge; the explanation on performance optimization is just for fun. Compilers do this type of optimization since long ago (but don't tell anyone - it's secret!). Suppose the text starts like You tear a leaf from the calendar - today is 8.9.89... – anatolyg – 2015-03-24T07:12:17.797

@PeterTaylor Sub-macros are not required and not allowed. I added an example for 289. – anatolyg – 2015-03-24T07:13:35.960

@gnibbler, yes, my example should have been 27. Essentially it's the same question as Keith's. – Peter Taylor – 2015-03-24T09:17:47.600

I don't think it's right to not optimise 17*17 etc. Why can't you write it in C as x = (x << 4) + x; x = (x << 4) + x – gnibbler – 2015-03-25T01:51:37.343

Can you provide a sample for m = 11? As I understand the question, there is no possible XYZZY11 macro. I assume I'm missing something. – Matt Noonan – 2015-03-25T03:18:32.107

@gnibbler I guess your suggestion will require some sort of BFS. The rules as stated allow both BFS and bit-fiddling - more flexibility in algorithm choice is better. I agree that utilizing common sub-expressions is "more optimized", but let's leave that for version 2.0 (if at all). – anatolyg – 2015-03-25T08:13:46.543

@MattNoonan #define XYZZY11(x) x + (x<<1) + (x<<3) or #define XYZZY11(x) -x + (x<<2) + (x<<3) or #define XYZZY11(x) -x + ((x + (x<<1)) << 2) (but one should add the required additional parentheses) – anatolyg – 2015-03-25T08:15:57.587

@anatolyg Ah, so we can write expressions which would utilize more than one register? Great! – Matt Noonan – 2015-03-25T12:29:53.270

Answers

2

Python - 414 393 331

This is my entry, which finds the least possible addition/subtraction operations, and subsequently evaluates to a C expression. Not sure it is optimal as multiplication of subexpressions may offer less operations.

import itertools as l
def w(N,s=''):
 for k,v in enumerate(min([t for t in l.product((1,0,-1),repeat=len(bin(N))-1)if sum([v*2**k for k,v in enumerate(t)])==N],key=lambda t:sum(abs(k)for k in t))):
  if v!=0:s+='{:+d}'.format(v)[0];s+='((x)<<{})'.format(k)if k>0 else'(x)'
 return '#define XYZZY{}(x) ({})'.format(N,s.lstrip('+'))

For the cases 2, 10, 100, 14043, 65535:

#define XYZZY2 (((x)<<1))
#define XYZZY10 (((x)<<1)+((x)<<3))
#define XYZZY100 (((x)<<2)+((x)<<5)+((x)<<6))
#define XYZZY14043 (-(x)-((x)<<2)-((x)<<5)-((x)<<8)-((x)<<11)+((x)<<14))
#define XYZZY65535(x) (-(x)+((x)<<16))

The final 65535 took around 11 minutes to compute. Suggestions for improvement (also speedwise) are welcome :-). Thanks to mbomb007 and plg for golfing improvements.

Willem

Posted 2015-03-23T20:05:08.177

Reputation: 1 528

1Change your 2nd line to this: a=lambda t:sum(abs(k)for k in t) to save some bytes. Omit any spaces you can, such as every space before an IF, except where the previous non-whitespace character is a letter. e.g. '+' if -> '+'if and you can do s[0] != '+' else -> s[0]!='+'else – mbomb007 – 2015-04-04T21:32:05.573

1This IS code-golf after all. – mbomb007 – 2015-04-04T21:33:47.617

The return can be written: return '#define XYZZY{}(x) ({})'.format(N,s.lstrip('+')). And abs(v) > 0 is equivalent to v != 0. Calling sorted on g's current definition can also save a char (also you should make g a generator -- replace the angle brackets with parentheses). Also you can define e=enumerate, and inline a into sorted. – ThinkChaos – 2015-04-05T08:50:40.443

Also you can apply what @mbomb007 said to [...]0 else: [...]0else is fine. – ThinkChaos – 2015-04-05T08:51:36.397

r is unused. L and g can be inlined. Your now inlined g can be written in terms of min instead of sorted()[0]. As r is gone, s=''\n can be added as a param: def w(N,s=''):. After all these changes I'm down to 342, and I'm sure there are other things to do. – ThinkChaos – 2015-04-05T08:59:37.030

@plg 0else only works with Python 3. – mbomb007 – 2015-04-06T16:18:20.430

0

Python 508

Slightly more characters than my other answer, but this code runs in roughly one second for all cases 2, 10, 100, 14043, 65535.

def l(n,k=0,o=0):
 while 1:
  if k==0:
   for j in range(1,n):
    yield(j,);yield(-j,)               
   k+=1
  else:
   for j in range(1,n):
    for u in l(j,k-1,1):
     yield(j,)+u;yield(-j,)+u 
   k+=1
  if o:break 
r=lambda t:sum((1,-1)[k<0]*2**(abs(k)-1)for k in t)
def w(N,s=''):
 g = loop(len(bin(N))) 
 while 1:
  a = g.next()
  if r(a)==N:
   for k in a:
    s+='{:+d}'.format(k)[0];s+='((x)<<{})'.format(abs(k-1)) if abs(k)>1 else'(x)'
   return '#define XYZZY{}(x) ({})'.format(N,s.lstrip('+'))""")

Willem

Posted 2015-03-23T20:05:08.177

Reputation: 1 528