x86-16 machine code (BubbleSort int8_t), 20 19 bytes
x86-64/32 machine code (JumpDownSort) 21 19 bytes
Changelog:
Thanks to @ped7g for the lodsb
/cmp [si],al
idea, and putting that together with a pointer increment / reset that I'd been looking at. Not needing al
/ah
lets us use nearly the same code for larger integers.
New (but related) algorithm, many implementation changes: Bubbly SelectionSort allows a smaller x86-64 implementation for bytes or dwords; break-even on x86-16 (bytes or words). Also avoids the bug on size=1 that my BubbleSort has. See below.
It turns out that my Bubbly Selection Sort with swaps every time you find a new min is already a known algorithm, JumpDown Sort. It's mentioned in Bubble Sort: An Archaeological Algorithmic Analysis (i.e. how did Bubble Sort become popular despite sucking).
Sorts 8-bit signed integers in-place. (Unsigned is the same code size, just change the jge
to a jae
). Duplicates are not a problem. We swap using a 16-bit rotate by 8 (with a memory destination).
Bubble Sort sucks for performance, but I've read that it's one of the smallest to implement in machine code. This seems especially true when there are special tricks for swapping adjacent elements. This is pretty much its only advantage, but sometimes (in real life embedded systems) that's enough advantage to use it for very short lists.
I omitted the early termination on no swaps. I used Wikipedia's "optimized" BubbleSort loop which avoids looking at the last n − 1
items when running for the n
-th time, so the outer loop counter is the upper bound for the inner loop.
NASM listing (nasm -l /dev/stdout
), or plain source
2 address 16-bit bubblesort16_v2:
3 machine ;; inputs: pointer in ds:si, size in in cx
4 code ;; requires: DF=0 (cld)
5 bytes ;; clobbers: al, cx=0
6
7 00000000 49 dec cx ; cx = max valid index. (Inner loop stops 1 before cx, because it loads i and i+1).
8 .outer: ; do{
9 00000001 51 push cx ; cx = inner loop counter = i=max_unsorted_idx
10 .inner: ; do{
11 00000002 AC lodsb ; al = *p++
12 00000003 3804 cmp [si],al ; compare with *p (new one)
13 00000005 7D04 jge .noswap
14 00000007 C144FF08 rol word [si-1], 8 ; swap
15 .noswap:
16 0000000B E2F5 loop .inner ; } while(i < size);
17 0000000D 59 pop cx ; cx = outer loop counter
18 0000000E 29CE sub si,cx ; reset pointer to start of array
19 00000010 E2EF loop .outer ; } while(--size);
20 00000012 C3 ret
22 00000013 size = 0x13 = 19 bytes.
push/pop of cx
around the inner loop means it runs with cx
= outer_cx down to 0.
Note that rol r/m16, imm8
is not an 8086 instruction, it was added later (186 or 286), but this isn't trying to be 8086 code, just 16-bit x86. If SSE4.1 phminposuw
would help, I'd use it.
A 32-bit version of this (still operating on 8-bit integers but with 32-bit pointers / counters) is 20 bytes (operand-size prefix on rol word [esi-1], 8
)
Bug: size=1 is treated as size=65536, because nothing stops us from entering the outer do/while with cx=0. (You'd normally use jcxz
for that.) But fortunately the 19-byte JumpDown Sort is 19 bytes and doesn't have that problem.
Original x86-16 20 byte version (without Ped7g's idea). Omitted to save space, see the edit history for it with a description.
Performance
Partially-overlapping store/reload (in memory-destination rotate) causes
a store-forwarding stall on modern x86 CPUs (except in-order Atom). When a high value is bubbling upwards, this extra latency is part of a loop-carried dependency chain. Store/reload sucks in the first place (like 5 cycle store-forwarding latency on Haswell), but a forwarding stall brings it up to more like 13 cycles. Out-of-order execution will have trouble hiding this.
See also: Stack Overflow: bubble sort for sorting string for a version of this with a similar implementation, but with an early-out when no swaps are needed. It uses xchg al, ah
/ mov [si], ax
for swapping, which is 1 byte longer and causes a partial-register stall on some CPUs. (But it may still be better than memory-dst rotate, which needs to load the value again). My comment there has some suggestions...
x86-64 / x86-32 JumpDown Sort, 19 bytes (sorts int32_t)
Callable from C using the x86-64 System V calling convention as
int bubblyselectionsort_int32(int dummy, int *array, int dummy, unsigned long size);
(return value = max(array[])).
This is https://en.wikipedia.org/wiki/Selection_sort, but instead of remembering the position of the min element, swap the current candidate into the array. Once you've found the min(unsorted_region), store it to the end of the sorted region, like normal Selection Sort. This grows the sorted region by one. (In the code, rsi
points to one past the end of the sorted region; lodsd
advances it and mov [rsi-4], eax
stores the min back into it.)
The name Jump Down Sort is used in Bubble Sort: An Archaeological Algorithmic Analysis. I guess my sort is really a Jump Up sort, because high elements jump upwards, leaving the bottom sorted, not the end.
This exchange design leads to the unsorted part of the array ending up in mostly reverse-sorted order, leading to lots of swaps later on. (Because you start with a large candidate, and keep seeing lower and lower candidates, so you keep swapping.) I called it "bubbly" even though it moves elements the other direction. The way it moves elements is also a little bit like a backwards insertion-sort. To watch it in action, use GDB's display (int[12])buf
, set a breakpoint on the inner loop
instruction, and use c
(continue). Press return to repeat. (The "display" command gets GDB to print the whole array state every time we hit the breakpoint).
xchg
with mem has an implicit lock
prefix which makes this extra slow. Probably about an order of magnitude slower than an efficient load/store swap; xchg m,r
is one per 23c throughput on Skylake, but load / store / mov with a tmp reg for an efficient swap(reg, mem) can shift one element per clock. It might be a worse ratio on an AMD CPU where the loop
instruction is fast and wouldn't bottleneck the inner loop as much, but branch misses will still be a big bottleneck because swaps are common (and become more common as the unsorted region gets smaller).
2 Address ;; hybrib Bubble Selection sort
3 machine bubblyselectionsort_int32: ;; working, 19 bytes. Same size for int32 or int8
4 code ;; input: pointer in rsi, count in rcx
5 bytes ;; returns: eax = max
6
7 ;dec ecx ; we avoid this by doing edi=esi *before* lodsb, so we do redundant compares
8 ; This lets us (re)enter the inner loop even for 1 element remaining.
9 .outer:
10 ; rsi pointing at the element that will receive min([rsi]..[rsi+rcx])
11 00000000 56 push rsi
12 00000001 5F pop rdi
13 ;mov edi, esi ; rdi = min-search pointer
14 00000002 AD lodsd
16 00000003 51 push rcx ; rcx = inner counter
17 .inner: ; do {
18 ; rdi points at next element to check
19 ; eax = candidate min
20 00000004 AF scasd ; cmp eax, [rdi++]
21 00000005 7E03 jle .notmin
22 00000007 8747FC xchg [rdi-4], eax ; exchange with new min.
23 .notmin:
24 0000000A E2F8 loop .inner ; } while(--inner);
26 ; swap min-position with sorted position
27 ; eax = min. If it's not [rsi-4], then [rsi-4] was exchanged into the array somewhere
28 0000000C 8946FC mov [rsi-4], eax
29 0000000F 59 pop rcx ; rcx = outer loop counter = unsorted elements left
30 00000010 E2EE loop .outer ; } while(--unsorted);
32 00000012 C3 ret
34 00000013 13 .size: db $ - bubblyselectionsort_int32
0x13 = 19 bytes long
Same code size for int8_t
: use lodsb
/ scasb
, AL
, and change the [rsi/rdi-4]
to -1
. The same machine-code works in 32-bit mode for 8/32-bit elements. 16-bit mode for 8/16-bit elements needs to be re-built with the offsets changed (and 16-bit addressing modes use a different encoding). But still 19 bytes for all.
It avoids the initial dec ecx
by comparing with the element it just loaded before moving on. On the last iteration of the outer loop, it loads the last element, checks if it's less than itself, then is done. This allows it to work with size=1, where my BubbleSort fails (treats it as size = 65536).
I tested this version (in GDB) using the this caller: Try it online!. You can run this it on TIO, but of course no debugger or printing. Still, the _start
that calls it exits with exit-status = largest element = 99, so you can see it works.
8I'm very surprised if this isn't a duplicate, but I don't have the time to check. Anyway, "built-in sorting functions" should be better defined. Can you use a function that indexes all values?
[7 2 4 1] -> [4 2 3 1]
. Also, can the CSV list be inside brackets? Also, the specific input format is very suitable for some languages, and bad for others. This makes input parsing a big part for some submissions, and unnecessary for others. – Stewie Griffin – 2016-04-15T12:52:06.5731@StewieGriffin I've seen many sorting challenges, but none dealing with sorting just a basic integer list. There are many challenges that are easier for some languages, and much more difficult in others. – Michelfrancis Bustillos – 2016-04-15T13:00:35.593
This is very similar, but has a O(Nlog(N)) restriction. – Nathan Merrill – 2016-04-15T14:06:21.807
2
Very closely related to this question, but since some answers here (e.g. Dennis' range filtering) require the input to be integers I won't vote to close as dupe.
– Peter Taylor – 2016-04-15T20:55:09.6501
Relevant: https://www.youtube.com/user/AlgoRythmics/videos — An Youtube channel which teaches sorting algorithms through Hungarian dances!
– sergiol – 2017-04-30T12:05:50.983