Contest: fastest way to sort a big array of Gaussian-distributed data

71

11

Following the interest in this question, I thought it would be interesting to make answers a bit more objective and quantitative by proposing a contest.

The idea is simple: I have generated a binary file containing 50 million gaussian-distributed doubles (average: 0, stdev 1). The goal is to make a program that will sort these in memory as fast as possible. A very simple reference implementation in python takes 1m4s to complete. How low can we go?

The rules are as follows: answer with a program that opens the file "gaussian.dat" and sorts the numbers in memory (no need to output them), and instructions for building and running the program. The program must be able to work on my Arch Linux machine (meaning you can use any programming language or library that is easily installable on this system).

The program must be reasonably readable, so that I can make sure it is safe to launch (no assembler-only solution, please!).

I will run the answers on my machine (quad core, 4 Gigabytes of RAM). The fastest solution will get the accepted answer and a 100 points bounty :)

The program used to generate the numbers:

#!/usr/bin/env python
import random
from array import array
from sys import argv
count=int(argv[1])
a=array('d',(random.gauss(0,1) for x in xrange(count)))
f=open("gaussian.dat","wb")
a.tofile(f)

The simple reference implementation:

#!/usr/bin/env python
from array import array
from sys import argv
count=int(argv[1])
a=array('d')
a.fromfile(open("gaussian.dat"),count)
print "sorting..."
b=sorted(a)

EDIT: only 4 GB of RAM, sorry

EDIT#2: Note that the point of the contest is to see if we can use prior information about the data. it is not supposed to be a pissing match between different programming language implementations!

static_rtti

Posted 2011-06-07T12:58:58.813

Reputation: 713

I'm a bit lost concerning your implementation. I saved the first snippet as generator.py and did python generator.py 1000000. However, as I saved the second snippet as sort.py and entered in the console python sort.py 1000000, I got a EOFError. I know nearly nothing in Python, so if you could help... Edit : The gaussian.dat file is indeed at the right place, but the serialization made it look like garbage in my editor. Normal behaviour with python's tofile()? – None – 2011-06-07T14:14:52.100

@Fezvez: it's supposed to look as garbage since it's stored as raw binary data. However what you did is supposed to work... What version of python did you use? I used python 2.4, which is rather old. Also, how big is the resulting file? – None – 2011-06-07T14:29:24.110

I used python 2.7, and the file size was 8M. If anyone cared to check with this version of python, it'll help. Anyway, that's not the point, I'll generate the data myself, and sort it myself too. It's just that I wondered why I couldn't make the baseline work for me. – None – 2011-06-07T14:42:36.963

@Fezvez: I'll try tonight on a newer machine. In the meantime, can you report the exact size in bytes? If it's exactly 8,000,000 bytes, then the file is probably good. – None – 2011-06-07T14:47:23.797

Hehe : 7,62 MB (8 000 000 bytes)! In the meantime, I'll work on the sorting algorithm =) – None – 2011-06-07T15:24:15.290

@Fezvez: I just tried it again with Python 2.7, and it works fine... – None – 2011-06-07T16:55:17.767

I believe that using the Gaussian distribution needlessly complicates things, if you want to do smart calculations. If we have the uniform distribution, the point (using the knowledge of the distribution) stays the same, and the arithmetics gets much simpler... – None – 2011-06-07T18:06:55.187

@xofon: I picked the Gaussian distribution because it's what is most widely encountered in practice. – None – 2011-06-07T18:49:32.920

@xofon: I would be surprised if the direct indexing method described by Mike didn't hands down win such a method, while in theory with this method a strategy to avoid the cost of the phi method could be created. – None – 2011-06-07T18:52:31.023

1Take each value and move it directly to its "expected" position, repeat for the displaced value. Not sure how to resolve a couple issue with that. When done, bubble sort until complete (a couple passes should do). – None – 2011-06-07T20:29:18.647

1I will post a bucket sort solution tomorrow evening if this hasn't been closed by then :) – None – 2011-06-07T21:22:10.423

1@static_rtti - as a heavy CG user, this is exactly the sort of thing "we" like to hack on at CG.SE. To any reading mods, move this to CG, don't close it. – arrdem – 2011-06-08T00:44:17.967

1Welcome to CodeGolf.SE! I've cleared a lot of the commentary from the SO original concerning where this does or does not belong, and re-tagged to be closer to the CodeGolf.SE mainstream. – dmckee --- ex-moderator kitten – 2011-06-08T13:26:05.390

2The one tricky issue here is that we look for objective winning criteria, and "fastest" introduces platform dependencies...does a O(n^{1.2}) algorithm implemented on the cpython virtual machine beat a O(n^{1.3}) algorithm with a similar constant implemented in c? I generally suggest some discussion on the performance characteristics of each solution, as this may help people to judge what is going on. – dmckee --- ex-moderator kitten – 2011-06-08T13:26:51.837

1Thanks for the welcome! The problem you're referring to is why I run the code on my computer, so we have a single reference. I agree different languages are going to have different performance results, that is why I think discussing the algorithms is as important as looking at the raw numbers. – static_rtti – 2011-06-08T13:31:16.030

Hopefully the winning solution (at least!) will be tested for correctness in addition to speed? I can write sort; exit very, very fast... as far as you know! – Ben Jackson – 2011-06-08T20:22:54.453

@Ben Jackson: of course! This is becomming increasingly needed as solutions become more and more cryptic! – static_rtti – 2011-06-09T09:44:53.043

Note that I'll add a bounty, so the contest should go on for about a week. – static_rtti – 2011-06-09T09:45:14.277

@static_rtti: Would you like to split the contest into two sections: single-threaded (wall-time) vs multi-threaded (real-time)? I have access to a 48 core machine (gcc 4.3.4) which I can use it to time multi-threaded solutions. – Alexandru – 2011-06-09T12:23:44.713

Why don't you give the timings on that machine? I'm not sure it would make sense to start a second question, but the results would definitely be interesting. Maybe increase the problem size as well? – static_rtti – 2011-06-09T12:29:29.393

1

Here are some timings on a 16 core machine: link. The 48 core machine is currently busy, but none of the tested solutions can take advantage of more than 4 cpus. The testing methodology is in the git repository I linked to. Anybody is welcome to add more tests.

– Alexandru – 2011-06-09T14:16:54.297

@Alexandru, the fork/join solution I provided should definitely be able to take advantage of more than 4 cpus. Did you test it? – arjan – 2011-06-09T18:41:42.463

@arjan: Unfortunately, I don't have Java 1.7 on that machine, but 1.6. I expected the TBB solution to use all CPUs, but it wasn't the case. – Alexandru – 2011-06-09T18:50:40.717

I tried to evaluate all new solutions, I hope I didn't miss any! Note that I didn't have much time, so I didn't look at the solutions in depth, but I will try to do so (and also check a few ones for correctness) later. Thanks to everyone for your very nice solutions! – static_rtti – 2011-06-09T19:36:41.260

@Alexandru - and you don't have download rights/capabilities on that 48 core machine? You don't have to install Java 7, but can simply put it in your home dir. I'll see if I can try it tomorrow on a simple (compared to your's ;)) 12 core dev machine. – arjan – 2011-06-09T21:39:18.360

@arjan: Done. Your solution is a second close.

– Alexandru – 2011-06-10T10:14:32.547

@Alexandru: Thanks! :) What are specs of the machine btw? I also executed my solution on our 12 core dev machine, which is ~3.46Ghz dual Intel Nehalem hex core with some 96GB RAM and I think a dual 8x SSD. Hyperthreading was disabled. To get some idea of the scaling factor, I run the code with 1 core till 12 cores and another run with 10 million doubles. See my updated answer for a table with the results of that. – arjan – 2011-06-10T19:43:42.860

OK, if nobody objects, I will retest the fastest solutions tonight, check them for correctness, and award the bounty. – static_rtti – 2011-06-14T09:30:44.853

Nice contest, you should probably point out that a real implementation would have to estimate the mean and standard deviation first or somehow on the fly - might change the algorithm somewhat. – Flash – 2011-06-14T13:07:02.707

@Andrew, I agree, but the idea was to focus on the sorting. Estimation is another problem, and is already well understood. – static_rtti – 2011-06-15T08:09:01.027

Answers

14

Here is a solution in C++ which first partitions the numbers into buckets with same expected number of elements and then sorts each bucket separately. It precomputes a table of the cumulative distribution function based on some formulas from Wikipedia and then interpolates values from this table to get a fast approximation.

Several steps run in multiple threads to make use of the four cores.

#include <cstdlib>
#include <math.h>
#include <stdio.h>
#include <algorithm>

#include <tbb/parallel_for.h>

using namespace std;

typedef unsigned long long ull;

double signum(double x) {
    return (x<0) ? -1 : (x>0) ? 1 : 0;
}

const double fourOverPI = 4 / M_PI;

double erf(double x) {
    double a = 0.147;
    double x2 = x*x;
    double ax2 = a*x2;
    double f1 = -x2 * (fourOverPI + ax2) / (1 + ax2);
    double s1 = sqrt(1 - exp(f1));
    return signum(x) * s1;
}

const double sqrt2 = sqrt(2);

double cdf(double x) {
    return 0.5 + erf(x / sqrt2) / 2;
}

const int cdfTableSize = 200;
const double cdfTableLimit = 5;
double* computeCdfTable(int size) {
    double* res = new double[size];
    for (int i = 0; i < size; ++i) {
        res[i] = cdf(cdfTableLimit * i / (size - 1));
    }
    return res;
}
const double* const cdfTable = computeCdfTable(cdfTableSize);

double cdfApprox(double x) {
    bool negative = (x < 0);
    if (negative) x = -x;
    if (x > cdfTableLimit) return negative ? cdf(-x) : cdf(x);
    double p = (cdfTableSize - 1) * x / cdfTableLimit;
    int below = (int) p;
    if (p == below) return negative ? -cdfTable[below] : cdfTable[below];
    int above = below + 1;
    double ret = cdfTable[below] +
            (cdfTable[above] - cdfTable[below])*(p - below);
    return negative ? 1 - ret : ret;
}

void print(const double* arr, int len) {
    for (int i = 0; i < len; ++i) {
        printf("%e; ", arr[i]);
    }
    puts("");
}

void print(const int* arr, int len) {
    for (int i = 0; i < len; ++i) {
        printf("%d; ", arr[i]);
    }
    puts("");
}

void fillBuckets(int N, int bucketCount,
        double* data, int* partitions,
        double* buckets, int* offsets) {
    for (int i = 0; i < N; ++i) {
        ++offsets[partitions[i]];
    }

    int offset = 0;
    for (int i = 0; i < bucketCount; ++i) {
        int t = offsets[i];
        offsets[i] = offset;
        offset += t;
    }
    offsets[bucketCount] = N;

    int next[bucketCount];
    memset(next, 0, sizeof(next));
    for (int i = 0; i < N; ++i) {
        int p = partitions[i];
        int j = offsets[p] + next[p];
        ++next[p];
        buckets[j] = data[i];
    }
}

class Sorter {
public:
    Sorter(double* data, int* offsets) {
        this->data = data;
        this->offsets = offsets;
    }

    static void radixSort(double* arr, int len) {
        ull* encoded = (ull*)arr;
        for (int i = 0; i < len; ++i) {
            ull n = encoded[i];
            if (n & signBit) {
                n ^= allBits;
            } else {
                n ^= signBit;
            }
            encoded[i] = n;
        }

        const int step = 11;
        const ull mask = (1ull << step) - 1;
        int offsets[8][1ull << step];
        memset(offsets, 0, sizeof(offsets));

        for (int i = 0; i < len; ++i) {
            for (int b = 0, j = 0; b < 64; b += step, ++j) {
                int p = (encoded[i] >> b) & mask;
                ++offsets[j][p];
            }
        }

        int sum[8] = {0};
        for (int i = 0; i <= mask; i++) {
            for (int b = 0, j = 0; b < 64; b += step, ++j) {
                int t = sum[j] + offsets[j][i];
                offsets[j][i] = sum[j];
                sum[j] = t;
            }
        }

        ull* copy = new ull[len];
        ull* current = encoded;
        for (int b = 0, j = 0; b < 64; b += step, ++j) {
            for (int i = 0; i < len; ++i) {
                int p = (current[i] >> b) & mask;
                copy[offsets[j][p]] = current[i];
                ++offsets[j][p];
            }

            ull* t = copy;
            copy = current;
            current = t;
        }

        if (current != encoded) {
            for (int i = 0; i < len; ++i) {
                encoded[i] = current[i];
            }
        }

        for (int i = 0; i < len; ++i) {
            ull n = encoded[i];
            if (n & signBit) {
                n ^= signBit;
            } else {
                n ^= allBits;
            }
            encoded[i] = n;
        }
    }

    void operator() (tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i < range.end(); ++i) {
            double* begin = &data[offsets[i]];
            double* end = &data[offsets[i+1]];
            //std::sort(begin, end);
            radixSort(begin, end-begin);
        }
    }

private:
    double* data;
    int* offsets;
    static const ull signBit = 1ull << 63;
    static const ull allBits = ~0ull;
};

void sortBuckets(int bucketCount, double* data, int* offsets) {
    Sorter sorter(data, offsets);
    tbb::blocked_range<int> range(0, bucketCount);
    tbb::parallel_for(range, sorter);
    //sorter(range);
}

class Partitioner {
public:
    Partitioner(int bucketCount, double* data, int* partitions) {
        this->data = data;
        this->partitions = partitions;
        this->bucketCount = bucketCount;
    }

    void operator() (tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i < range.end(); ++i) {
            double d = data[i];
            int p = (int) (cdfApprox(d) * bucketCount);
            partitions[i] = p;
        }
    }

private:
    double* data;
    int* partitions;
    int bucketCount;
};

const int bucketCount = 512;
int offsets[bucketCount + 1];

int main(int argc, char** argv) {
    if (argc != 2) {
        printf("Usage: %s N\n N = the size of the input\n", argv[0]);
        return 1;
    }

    puts("initializing...");
    int N = atoi(argv[1]);
    double* data = new double[N];
    double* buckets = new double[N];
    memset(offsets, 0, sizeof(offsets));
    int* partitions = new int[N];

    puts("loading data...");
    FILE* fp = fopen("gaussian.dat", "rb");
    if (fp == 0 || fread(data, sizeof(*data), N, fp) != N) {
        puts("Error reading data");
        return 1;
    }
    //print(data, N);

    puts("assigning partitions...");
    tbb::parallel_for(tbb::blocked_range<int>(0, N),
            Partitioner(bucketCount, data, partitions));

    puts("filling buckets...");
    fillBuckets(N, bucketCount, data, partitions, buckets, offsets);
    data = buckets;

    puts("sorting buckets...");
    sortBuckets(bucketCount, data, offsets);

    puts("done.");

    /*
    for (int i = 0; i < N-1; ++i) {
        if (data[i] > data[i+1]) {
            printf("error at %d: %e > %e\n", i, data[i], data[i+1]);
        }
    }
    */

    //print(data, N);

    return 0;
}

To compile and run it, use this command:

g++ -O3 -ltbb -o gsort gsort.cpp && time ./gsort 50000000

EDIT: All buckets are now placed into the same array to remove the need to copy the buckets back into the array. Also the size of the table with precomputed values was reduced, because the values are accurate enough. Still, if I change the number of buckets above 256, the program takes longer to run than with that number of buckets.

EDIT: The same algorithm, different programming language. I used C++ instead of Java and the running time reduced from ~3.2s to ~2.35s on my machine. The optimal number of buckets is still around 256 (again, on my computer).

By the way, tbb really is awesome.

EDIT: I was inspired by Alexandru's great solution and replaced the std::sort in the last phase by a modified version of his radix sort. I did use a different method to deal with the positive/negative numbers, even though it needs more passes through the array. I also decided to sort the array exactly and remove the insertion sort. I will later spend some time testing how do these changes influence performance and possibly revert them. However, by using radix sort, the time decreased from ~2.35s to ~1.63s.

k21

Posted 2011-06-07T12:58:58.813

Reputation: 356

Nice. I got 3.055 on mine. Lowest I was able to get mine to was 6.3. I'm picking through yours to get the stats better. Why did you choose 256 as the number of buckets? I tried 128 and 512, yet 256 worked the best. – Scott – 2011-06-09T14:07:41.500

Why did I choose 256 as the number of buckets? I tried 128 and 512, yet 256 worked the best. :) I found it empirically and I am not sure why increasing the number of buckets slows down the algorithm - memory allocation should not take that long. Maybe something related to cache size? – k21 – 2011-06-09T17:08:59.557

2.725s on my machine. Pretty nice for a java solution, taking into account the loading time of the JVM. – static_rtti – 2011-06-09T19:17:04.950

2

I switched your code over to use the nio packages, per my and Arjan's solution(used his syntax, since it was cleaner than mine) and was able to get it .3 seconds faster. I've got an ssd, I wonder what the implications might be if not. It also gets rid of some of your bit twiddling. Modded sections are here.

– Scott – 2011-06-09T20:06:56.517

@Scott: I agree that my way of loading data is suboptimal. I have not used the java.nio.* classes before, so I have had no idea how to do the loading better. My approach was possibly also slower because I tried to load the file in parts and start partitioning the parts that were already loaded before all the data was read. On my machine (no ssd) the time went from ~3.5s to ~3.2s. Thanks for your suggestion, hope you do not mind it if I use it in my program? – k21 – 2011-06-09T20:26:29.053

Have at. You might be able to merge the two behaviors(use nio and start a thread when you have the data), if you use a different call to do the reading. – Scott – 2011-06-09T21:41:49.483

3

This is the fastest parallel solution on my tests (16core cpu). 1.22s far from 1.94s second place.

– Alexandru – 2011-06-10T11:46:09.423

1.188s on my machine! I think we have a winner! (although Alexandru deserves big kudos for his extremely fast sequential version) – static_rtti – 2011-06-15T22:00:31.223

Thanks. I agree that Alexandru is the one who came with the idea of using radix sort on doubles. My solution uses his approach but makes it faster by: a) parallelization, b) using the cache better by decreasing the size of the arrays on which the radix sort is run. – k21 – 2011-06-16T11:41:04.323

13

Without getting smart, just to provide a much faster naïve sorter, here's one in C which should be pretty much equivalent to your Python one:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int cmp(const void* av, const void* bv) {
    double a = *(const double*)av;
    double b = *(const double*)bv;
    return a < b ? -1 : a > b ? 1 : 0;
}
int main(int argc, char** argv) {
    if (argc <= 1)
        return puts("No argument!");
    unsigned count = atoi(argv[1]);

    double *a = malloc(count * sizeof *a);

    FILE *f = fopen("gaussian.dat", "rb");
    if (fread(a, sizeof *a, count, f) != count)
        return puts("fread failed!");
    fclose(f);

    puts("sorting...");
    double *b = malloc(count * sizeof *b);
    memcpy(b, a, count * sizeof *b);
    qsort(b, count, sizeof *b, cmp);
    return 0;
}

Compiled with gcc -O3, on my machine this takes more than a minute less than the Python: about 11 s compared to 87 s.

Deewiant

Posted 2011-06-07T12:58:58.813

Reputation:

1Took 10.086s on my machine, which makes you the current leader! But I'm pretty sure we can do better :) – None – 2011-06-07T17:17:06.470

1Could you try to remove the second ternary operator and simply return 1 for that case because random doubles are like not equal to each other in these amount of data. – Codism – 2011-06-07T17:39:31.330

@Codism: I would add that we don't care about swapping locations of equivalent data so therefore even if we could get equivalent values it would be an appropriate simplification. – None – 2011-06-07T18:43:30.710

10

I partitioned into segments based on the standard deviation that should best break it down into 4ths. Edit: Rewritten to partition based on x value in http://en.wikipedia.org/wiki/Error_function#Table_of_values

http://www.wolframalpha.com/input/?i=percentages+by++normal+distribution

I tried using smaller buckets, but it seemed to have little effect once 2 * beyond the number of cores available. Without any parallel collections, it would take 37 seconds on my box and 24 with the parallel collections. If partitioning via distribution, you can't just use an array, so there is some more overhead. I'm not clear on when a value would be boxed/unboxed in scala.

I'm using scala 2.9, for the parallel collection. You can just download the tar.gz distribution of it.

To compile: scalac SortFile.scala (I just copied it directly in the scala/bin folder.

To run: JAVA_OPTS="-Xmx4096M" ./scala SortFile (I ran it with 2 gigs of ram and got about the same time)

Edit: Removed allocateDirect, slower than just allocate. Removed priming of initial size for array buffers. Actually made it read the whole 50000000 values. Rewrote to hopefully avoid autoboxing issues (still slower than naive c)

import java.io.FileInputStream;
import java.nio.ByteBuffer
import java.nio.ByteOrder
import scala.collection.mutable.ArrayBuilder


object SortFile {

//used partition numbers from Damascus' solution
val partList = List(0, 0.15731, 0.31864, 0.48878, 0.67449, 0.88715, 1.1503, 1.5341)

val listSize = partList.size * 2;
val posZero = partList.size;
val neg = partList.map( _ * -1).reverse.zipWithIndex
val pos = partList.map( _ * 1).zipWithIndex.reverse

def partition(dbl:Double): Int = { 

//for each partition, i am running through the vals in order
//could make this a binary search to be more performant... but our list size is 4 (per side)

  if(dbl < 0) { return neg.find( dbl < _._1).get._2  }
  if(dbl > 0) { return posZero  + pos.find( dbl > _._1).get._2  }
      return posZero; 

}

  def main(args: Array[String])
    { 

    var l = 0
    val dbls = new Array[Double](50000000)
    val partList = new Array[Int](50000000)
    val pa = Array.fill(listSize){Array.newBuilder[Double]}
    val channel = new FileInputStream("../../gaussian.dat").getChannel()
    val bb = ByteBuffer.allocate(50000000 * 8)
    bb.order(ByteOrder.LITTLE_ENDIAN)
    channel.read(bb)
    bb.rewind
    println("Loaded" + System.currentTimeMillis())
    var dbl = 0.0
    while(bb.hasRemaining)
    { 
      dbl = bb.getDouble
      dbls.update(l,dbl) 

      l+=1
    }
    println("Beyond first load" + System.currentTimeMillis());

    for( i <- (0 to 49999999).par) { partList.update(i, partition(dbls(i)))}

    println("Partition computed" + System.currentTimeMillis() )
    for(i <- (0 to 49999999)) { pa(partList(i)) += dbls(i) }
    println("Partition completed " + System.currentTimeMillis())
    val toSort = for( i <- pa) yield i.result()
    println("Arrays Built" + System.currentTimeMillis());
    toSort.par.foreach{i:Array[Double] =>scala.util.Sorting.quickSort(i)};

    println("Read\t" + System.currentTimeMillis());

  }
}

Scott

Posted 2011-06-07T12:58:58.813

Reputation: 201

18.185s! Nice for a scala solution, I guess... Also, bravo for providing the first solution that actually uses the Gaussian distribution in some way! – None – 2011-06-08T06:12:52.397

1I was only aiming to compete with the c# solution. Didn't figure I'd beat c/c++. Also.. it's behaving much differently for you than for me. I'm using openJDK on my end and it's a lot slower. I wonder if adding more partitions would help in your env. – Scott – 2011-06-08T13:00:55.193

9

Just put this in a cs file and compile it with csc in theory: (Requires mono)

using System;
using System.IO;
using System.Threading;

namespace Sort
{
    class Program
    {
        const int count = 50000000;
        static double[][] doubles;
        static WaitHandle[] waiting = new WaitHandle[4];
        static AutoResetEvent[] events = new AutoResetEvent[4];

        static double[] Merge(double[] left, double[] right)
        {
            double[] result = new double[left.Length + right.Length];
            int l = 0, r = 0, spot = 0;
            while (l < left.Length && r < right.Length)
            {
                if (right[r] < left[l])
                    result[spot++] = right[r++];
                else
                    result[spot++] = left[l++];
            }
            while (l < left.Length) result[spot++] = left[l++];
            while (r < right.Length) result[spot++] = right[r++];
            return result;
        }

        static void ThreadStart(object data)
        {
            int index = (int)data;
            Array.Sort(doubles[index]);
            events[index].Set();
        }

        static void Main(string[] args)
        {
            System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
            watch.Start();
            byte[] bytes = File.ReadAllBytes(@"..\..\..\SortGuassian\Data.dat");
            doubles = new double[][] { new double[count / 4], new double[count / 4], new double[count / 4], new double[count / 4] };
            for (int i = 0; i < 4; i++)
            {
                for (int j = 0; j < count / 4; j++)
                {
                    doubles[i][j] = BitConverter.ToDouble(bytes, i * count/4 + j * 8);
                }
            }
            Thread[] threads = new Thread[4];
            for (int i = 0; i < 4; i++)
            {
                threads[i] = new Thread(ThreadStart);
                waiting[i] = events[i] = new AutoResetEvent(false);
                threads[i].Start(i);
            }
            WaitHandle.WaitAll(waiting);
            double[] left = Merge(doubles[0], doubles[1]);
            double[] right = Merge(doubles[2], doubles[3]);
            double[] result = Merge(left, right);
            watch.Stop();
            Console.WriteLine(watch.Elapsed.ToString());
            Console.ReadKey();
        }
    }
}

Guvante

Posted 2011-06-07T12:58:58.813

Reputation: 191

Can I run your solutions with Mono? How should I do it? – None – 2011-06-07T17:21:08.577

Haven't used Mono, didn't think of that, you should be able to compile the F# and then run it. – None – 2011-06-07T17:38:48.087

1Updated to use four threads to improve performance. Now gives me 6 sec. Note that this could be significantly improved (5 sec likely) if you use only one spare array and avoid initializing a ton of memory to zero, which is done by the CLR, since everything is being written to at least once. – None – 2011-06-07T18:41:01.583

19.598s on my machine! You're the current leader :) – None – 2011-06-07T18:55:43.557

(Note that I patched your code to remove the stopwatch and the console stuff: I measure time with time) – None – 2011-06-07T18:56:10.877

@static_rtti: That's fine I just like excluding bootup of CLR etc, but I can see where you would want an overall time for this kind of thing. – None – 2011-06-07T19:46:24.357

@Guvante: it's mostly because it's simpler to do it that way rather than to request that everyone implements a stopwatch mechanism and ensure that the comparison is fair. – None – 2011-06-07T19:47:41.040

1My mother told me to stay away from guys with Mono! – None – 2011-06-08T01:54:41.643

8

Since you know what the distribution is, you can use a direct indexing O(N) sort. (If you're wondering what that is, suppose you have a deck of 52 cards and you want to sort it. Just have 52 bins and toss each card into it's own bin.)

You have 5e7 doubles. Allocate a result array R of 5e7 doubles. Take each number x and get i = phi(x) * 5e7. Basically do R[i] = x. Have a way to handle collisions, such as moving the number it may be colliding with (as in simple hash coding). Alternatively, you could make R a few times larger, filled with a unique empty value. At the end, you just sweep up the elements of R.

phi is just the gaussian cumulative distribution function. It converts a gaussian distributed number between +/- infinity into a uniform distributed number between 0 and 1. A simple way to calculate it is with table lookup and interpolation.

Mike Dunlavey

Posted 2011-06-07T12:58:58.813

Reputation:

3Be careful: you know the approximate distribution, not the exact distribution. You know the data was generated using a Gaussian law, but since it is finite, it doesn't exactly follow a Gaussian. – None – 2011-06-07T18:48:16.600

@static_rtti: In this case the necessary approximation of phi would create a bigger hassle than any irregularities in the data set IMO. – None – 2011-06-07T18:54:25.733

1@static_rtti: it doesn't have to be exact. It only has to spread out the data so it is approximately uniform, so it doesn't bunch up too much in some places. – None – 2011-06-08T04:46:01.320

Suppose you have 5e7 doubles. Why not just make each entry in R a vector of, say, 5e6 vectors of double. Then, push_back each double in its appropriate vector. Sort the vectors and you're done. This should take linear time in the size of the input. – Neil G – 2011-06-08T07:32:13.103

Actually, I see that mdkess already came up with that solution. – Neil G – 2011-06-08T07:33:30.233

8

Here is another sequential solution:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>
#include <ctime>

typedef unsigned long long ull;

int size;
double *dbuf, *copy;
int cnt[8][1 << 16];

void sort()
{
  const int step = 10;
  const int start = 24;
  ull mask = (1ULL << step) - 1;

  ull *ibuf = (ull *) dbuf;
  for (int i = 0; i < size; i++) {
    for (int w = start, v = 0; w < 64; w += step, v++) {
      int p = (~ibuf[i] >> w) & mask;
      cnt[v][p]++;
    }
  }

  int sum[8] = { 0 };
  for (int i = 0; i <= mask; i++) {
    for (int w = start, v = 0; w < 64; w += step, v++) {
      int tmp = sum[v] + cnt[v][i];
      cnt[v][i] = sum[v];
      sum[v] = tmp;
    }
  }

  for (int w = start, v = 0; w < 64; w += step, v++) {
    ull *ibuf = (ull *) dbuf;
    for (int i = 0; i < size; i++) {
      int p = (~ibuf[i] >> w) & mask;
      copy[cnt[v][p]++] = dbuf[i];
    }

    double *tmp = copy;
    copy = dbuf;
    dbuf = tmp;
  }

  for (int p = 0; p < size; p++)
    if (dbuf[p] >= 0.) {
      std::reverse(dbuf + p, dbuf + size);
      break;
    }

  // Insertion sort
  for (int i = 1; i < size; i++) {
    double value = dbuf[i];
    if (value < dbuf[i - 1]) {
      dbuf[i] = dbuf[i - 1];
      int p = i - 1;
      for (; p > 0 && value < dbuf[p - 1]; p--)
        dbuf[p] = dbuf[p - 1];
      dbuf[p] = value;
    }
  }
}

int main(int argc, char **argv) {
  size = atoi(argv[1]);
  dbuf = new double[size];
  copy = new double[size];

  FILE *f = fopen("gaussian.dat", "r");
  fread(dbuf, size, sizeof(double), f);
  fclose(f);

  clock_t c0 = clock();
  sort();
  printf("Finished after %.3f\n", (double) ((clock() - c0)) / CLOCKS_PER_SEC);
  return 0;
}

I doubt it beats the multi-threaded solution, but the timings on my i7 laptop are (stdsort is the C++ solution provided in another answer):

$ g++ -O3 mysort.cpp -o mysort && ./mysort 50000000
Finished after 2.10
$ g++ -O3 stdsort.cpp -o stdsort && ./stdsort
Finished after 7.12

Note that this solution has linear time complexity (because it uses the special representation of doubles).

EDIT: Fixed the order of elements to be increasing.

EDIT: Improved speed by almost half a second.

EDIT: Improved speed by another 0.7 seconds. Made the algorithm more cache friendly.

EDIT: Improved speed by another 1 second. Since there only 50.000.000 elements I can partially sort the mantissa and use insert sort (which is cache friendly) to fix out-of-place elements. This idea removes around two iterations from the last radix sorting loop.

EDIT: 0.16 less seconds. First std::reverse can be eliminated if the sorting order is reversed.

Alexandru

Posted 2011-06-07T12:58:58.813

Reputation: 5 485

Now that is getting interesting! What kind of sort algorithm is it? – static_rtti – 2011-06-08T11:29:21.317

2Least significant digit radix sort. You can sort the mantissa, then the exponent, then the sign. The algorithm presented here takes this idea one step further. It can be parallelized using a partitioning idea provided in a different answer. – Alexandru – 2011-06-08T11:37:52.660

Pretty fast for a single threaded solution: 2.552s! Do you think you could change your solution to make use of the fact that the data is normally distributed? You could probably do better than the current best multi-threaded solutions. – static_rtti – 2011-06-08T18:13:43.110

1@static_rtti: I see that Damascus Steel already posted a multithreaded version of this implementation. I improved caching behavior of this algorithm, so you should get better timings now. Please test this new version. – Alexandru – 2011-06-09T11:29:27.027

1.525s!! Amazing, you're the leader now. It's nice to see that a sequential solution actually beats the best parallel solution. – static_rtti – 2011-06-09T19:32:20.043

21.459s in my latests tests. While this solution is not the winner per my rules, it really deserves big kudos. Congratulations! – static_rtti – 2011-06-15T22:02:08.757

6

Taking Christian Ammer's solution and parallelizing it with Intel's Threaded Building Blocks

#include <iostream>
#include <fstream>
#include <algorithm>
#include <vector>
#include <ctime>
#include <tbb/parallel_sort.h>

int main(void)
{
    std::ifstream ifs("gaussian.dat", std::ios::binary | std::ios::in);
    std::vector<double> values;
    values.reserve(50000000);
    double d;
    while (ifs.read(reinterpret_cast<char*>(&d), sizeof(double)))
    values.push_back(d);
    clock_t c0 = clock();
    tbb::parallel_sort(values.begin(), values.end());
    std::cout << "Finished after "
              << static_cast<double>((clock() - c0)) / CLOCKS_PER_SEC
              << std::endl;
}

If you have access to Intel's Performance Primitives (IPP) library, you can use its radix sort. Just replace

#include <tbb/parallel_sort.h>

with

#include "ipps.h"

and

tbb::parallel_sort(values.begin(), values.end());

with

std::vector<double> copy(values.size());
ippsSortRadixAscend_64f_I(&values[0], &copy[0], values.size());

On my dual core laptop, the timings are

C               16.4 s
C#              20 s
C++ std::sort   7.2 s
C++ tbb         5 s
C++ ipp         4.5 s
python          too long

Damascus Steel

Posted 2011-06-07T12:58:58.813

Reputation:

12.958s! TBB seems pretty cool and easy to use! – None – 2011-06-08T06:06:16.937

2TBB is absurdly awesome. It's exactly the right level of abstraction for algorithmic work. – drxzcl – 2011-06-08T08:16:28.517

5

How about an implementation of parallel quicksort that chooses its pivot values based on the statistics of the distribution, thereby ensuring equal sized partitions? The first pivot would be at the mean (zero in this case), the next pair would be at the 25th and 75th percentiles (+/- -0.67449 standard deviations), and so on, with each partition halving the remaining data set more or less perfectly.

codegardener

Posted 2011-06-07T12:58:58.813

Reputation: 51

That's effectively what I did on mine.. of course you got this post up before I could finish my writeup. – None – 2011-06-07T21:12:25.933

5

Very ugly (why use arrays when I can use variables ending with numbers), but fast code (my first try to std::threads), whole time (time real) on my system 1,8 s (comparing to std::sort() 4,8 s), compile with g++ -std=c++0x -O3 -march=native -pthread Just pass data through stdin (works only for 50M).

#include <cstdio>
#include <iostream>
#include <algorithm>
#include <thread>
using namespace std;
const size_t size=50000000;

void pivot(double* start,double * end, double middle,size_t& koniec){
    double * beg=start;
    end--;
    while (start!=end){
        if (*start>middle) swap (*start,*end--);
        else start++;
    }
    if (*end<middle) start+=1;
    koniec= start-beg;
}
void s(double * a, double* b){
    sort(a,b);
}
int main(){
    double *data=new double[size];
    FILE *f = fopen("gaussian.dat", "rb");
    fread(data,8,size,f);
    size_t end1,end2,end3,temp;
    pivot(data, data+size,0,end2);
    pivot(data, data+end2,-0.6745,end1);
    pivot(data+end2,data+size,0.6745,end3);
    end3+=end2;
    thread ts1(s,data,data+end1);
    thread ts2(s,data+end1,data+end2);
    thread ts3(s,data+end2,data+end3);
    thread ts4(s,data+end3,data+size);
    ts1.join(),ts2.join(),ts3.join(),ts4.join();
    //for (int i=0; i<size-1; i++){
    //  if (data[i]>data[i+1]) cerr<<"BLAD\n";
    //}
    fclose(f);
    //fwrite(data,8,size,stdout);
}

//Edit changed to read gaussian.dat file.

Zjarek

Posted 2011-06-07T12:58:58.813

Reputation: 71

Could you change it to read gaussian.dat, as the above C++ solutions do? – None – 2011-06-08T06:07:25.127

I'll try later when I come home. – static_rtti – 2011-06-08T08:12:21.777

Very nice solution, you're the current leader (1.949s)! And nice use of the gaussian distribution :) – static_rtti – 2011-06-08T18:11:19.333

4

A C++ solution using std::sort (eventually faster than qsort, regarding Performance of qsort vs std::sort)

#include <iostream>
#include <fstream>
#include <algorithm>
#include <vector>
#include <ctime>

int main(void)
{
    std::ifstream ifs("C:\\Temp\\gaussian.dat", std::ios::binary | std::ios::in);
    std::vector<double> values;
    values.reserve(50000000);
    double d;
    while (ifs.read(reinterpret_cast<char*>(&d), sizeof(double)))
        values.push_back(d);
    clock_t c0 = clock();
    std::sort(values.begin(), values.end());
    std::cout << "Finished after "
              << static_cast<double>((clock() - c0)) / CLOCKS_PER_SEC
              << std::endl;
}

I can't reliable say how long it takes because I only have 1GB on my machine and with the given Python code I could only make a gaussian.dat file with only 25mio doubles (without getting a Memory error). But I'm very interested how long the std::sort algorithm runs.

Christian Ammer

Posted 2011-06-07T12:58:58.813

Reputation: 151

6.425s! As expected, C++ shines :) – None – 2011-06-08T06:04:23.710

@static_rtti: I've tried swensons Timsort algorithm (as suggested from Matthieu M. in your first question). I had to make some changes to the sort.h file to compile it with C++. It was about twice as slow than std::sort. Don't know why, maybe because of compiler optimizations?

– Christian Ammer – 2011-06-08T19:39:46.420

4

Here is a mix of Alexandru's radix sort with Zjarek's threaded smart pivoting. Compile it with

g++ -std=c++0x -pthread -O3 -march=native sorter_gaussian_radix.cxx -o sorter_gaussian_radix

You can change the radix size by defining STEP (e.g. add -DSTEP=11). I found the best for my laptop is 8 (the default).

By default, it splits the problem into 4 pieces and runs that on multiple threads. You can change that by passing a depth parameter to the command line. So if you have two cores, run it as

sorter_gaussian_radix 50000000 1

and if you have 16 cores

sorter_gaussian_radix 50000000 4

The max depth right now is 6 (64 threads). If you put too many levels, you will just slow the code down.

One thing I also tried was the radix sort from the Intel Performance Primitives (IPP) library. Alexandru's implementation soundly trounces IPP, with IPP being about 30% slower. That variation is also included here (commented out).

#include <stdlib.h>
#include <string.h>
#include <algorithm>
#include <ctime>
#include <iostream>
#include <thread>
#include <vector>
#include <boost/cstdint.hpp>
// #include "ipps.h"

#ifndef STEP
#define STEP 8
#endif

const int step = STEP;
const int start_step=24;
const int num_steps=(64-start_step+step-1)/step;
int size;
double *dbuf, *copy;

clock_t c1, c2, c3, c4, c5;

const double distrib[]={-2.15387,
                        -1.86273,
                        -1.67594,
                        -1.53412,
                        -1.4178,
                        -1.31801,
                        -1.22986,
                        -1.15035,
                        -1.07752,
                        -1.00999,
                        -0.946782,
                        -0.887147,
                        -0.830511,
                        -0.776422,
                        -0.724514,
                        -0.67449,
                        -0.626099,
                        -0.579132,
                        -0.53341,
                        -0.488776,
                        -0.445096,
                        -0.40225,
                        -0.36013,
                        -0.318639,
                        -0.27769,
                        -0.237202,
                        -0.197099,
                        -0.157311,
                        -0.11777,
                        -0.0784124,
                        -0.0391761,
                        0,
                        0.0391761,
                        0.0784124,
                        0.11777,
                        0.157311,
                        0.197099,
                        0.237202,
                        0.27769,
                        0.318639,
                        0.36013,
                        0.40225,
                        0.445097,
                        0.488776,
                        0.53341,
                        0.579132,
                        0.626099,
                        0.67449,
                        0.724514,
                        0.776422,
                        0.830511,
                        0.887147,
                        0.946782,
                        1.00999,
                        1.07752,
                        1.15035,
                        1.22986,
                        1.31801,
                        1.4178,
                        1.53412,
                        1.67594,
                        1.86273,
                        2.15387};


class Distrib
{
  const int value;
public:
  Distrib(const double &v): value(v) {}

  bool operator()(double a)
  {
    return a<value;
  }
};


void recursive_sort(const int start, const int end,
                    const int index, const int offset,
                    const int depth, const int max_depth)
{
  if(depth<max_depth)
    {
      Distrib dist(distrib[index]);
      const int middle=std::partition(dbuf+start,dbuf+end,dist) - dbuf;

      // const int middle=
      //   std::partition(dbuf+start,dbuf+end,[&](double a)
      //                  {return a<distrib[index];})
      //   - dbuf;

      std::thread lower(recursive_sort,start,middle,index-offset,offset/2,
                        depth+1,max_depth);
      std::thread upper(recursive_sort,middle,end,index+offset,offset/2,
                        depth+1,max_depth);
      lower.join(), upper.join();
    }
  else
    {
  // ippsSortRadixAscend_64f_I(dbuf+start,copy+start,end-start);

      c1=clock();

      double *dbuf_local(dbuf), *copy_local(copy);
      boost::uint64_t mask = (1 << step) - 1;
      int cnt[num_steps][mask+1];

      boost::uint64_t *ibuf = reinterpret_cast<boost::uint64_t *> (dbuf_local);

      for(int i=0;i<num_steps;++i)
        for(uint j=0;j<mask+1;++j)
          cnt[i][j]=0;

      for (int i = start; i < end; i++)
        {
          for (int w = start_step, v = 0; w < 64; w += step, v++)
            {
              int p = (~ibuf[i] >> w) & mask;
              (cnt[v][p])++;
            }
        }

      c2=clock();

      std::vector<int> sum(num_steps,0);
      for (uint i = 0; i <= mask; i++)
        {
          for (int w = start_step, v = 0; w < 64; w += step, v++)
            {
              int tmp = sum[v] + cnt[v][i];
              cnt[v][i] = sum[v];
              sum[v] = tmp;
            }
        }

      c3=clock();

      for (int w = start_step, v = 0; w < 64; w += step, v++)
        {
          ibuf = reinterpret_cast<boost::uint64_t *>(dbuf_local);

          for (int i = start; i < end; i++)
            {
              int p = (~ibuf[i] >> w) & mask;
              copy_local[start+((cnt[v][p])++)] = dbuf_local[i];
            }
          std::swap(copy_local,dbuf_local);
        }

      // Do the last set of reversals
      for (int p = start; p < end; p++)
        if (dbuf_local[p] >= 0.)
          {
            std::reverse(dbuf_local+p, dbuf_local + end);
            break;
          }

      c4=clock();

      // Insertion sort
      for (int i = start+1; i < end; i++) {
        double value = dbuf_local[i];
        if (value < dbuf_local[i - 1]) {
          dbuf_local[i] = dbuf_local[i - 1];
          int p = i - 1;
          for (; p > 0 && value < dbuf_local[p - 1]; p--)
            dbuf_local[p] = dbuf_local[p - 1];
          dbuf_local[p] = value;
        }
      }
      c5=clock();

    }
}


int main(int argc, char **argv) {
  size = atoi(argv[1]);
  copy = new double[size];

  dbuf = new double[size];
  FILE *f = fopen("gaussian.dat", "r");
  fread(dbuf, size, sizeof(double), f);
  fclose(f);

  clock_t c0 = clock();

  const int max_depth= (argc > 2) ? atoi(argv[2]) : 2;

  // ippsSortRadixAscend_64f_I(dbuf,copy,size);

  recursive_sort(0,size,31,16,0,max_depth);

  if(num_steps%2==1)
    std::swap(dbuf,copy);

  // for (int i=0; i<size-1; i++){
  //   if (dbuf[i]>dbuf[i+1])
  //     std::cout << "BAD "
  //               << i << " "
  //               << dbuf[i] << " "
  //               << dbuf[i+1] << " "
  //               << "\n";
  // }

  std::cout << "Finished after "
            << (double) (c1 - c0) / CLOCKS_PER_SEC << " "
            << (double) (c2 - c1) / CLOCKS_PER_SEC << " "
            << (double) (c3 - c2) / CLOCKS_PER_SEC << " "
            << (double) (c4 - c3) / CLOCKS_PER_SEC << " "
            << (double) (c5 - c4) / CLOCKS_PER_SEC << " "
            << "\n";

  // delete [] dbuf;
  // delete [] copy;
  return 0;
}

EDIT: I implemented Alexandru's cache improvements, and that shaved off about 30% of the time on my machine.

EDIT: This implements a recursive sort, so it should work well on Alexandru's 16 core machine. It also use's Alexandru's last improvement and removes one of the reverse's. For me, this gave a 20% improvement.

EDIT: Fixed a sign bug that caused inefficiency when there are more than 2 cores.

EDIT: Removed the lambda, so it will compile with older versions of gcc. It includes the IPP code variation commented out. I also fixed the documentation for running on 16 cores. As far as I can tell, this is the fastest implementation.

EDIT: Fixed a bug when STEP is not 8. Increased the maximum number of threads to 64. Added some timing info.

Damascus Steel

Posted 2011-06-07T12:58:58.813

Reputation: 143

Nice. Radix sort is very cache unfriendly. See if you can get better results by changing step (11 was optimal on my laptop). – Alexandru – 2011-06-09T09:26:04.090

You have a bug: int cnt[mask] should be int cnt[mask + 1]. For better results use a fixed value int cnt[1 << 16]. – Alexandru – 2011-06-09T09:38:20.407

I'll try all these solutions later today when I get home. – static_rtti – 2011-06-09T09:40:54.353

1.534s!!! I think we have a leader :-D – static_rtti – 2011-06-09T19:27:46.240

@static_rtti: Could you try this again? It has gotten significantly faster than the last time you tried it. On my machine, it is substantially faster than any other solution. – Damascus Steel – 2011-06-16T02:08:48.267

Sorry, I could'nt try it yesterday because for some reason, the code was gone. I'll try again tonight when I come home. – static_rtti – 2011-06-16T06:23:42.477

I get 1.124s on my machine which is actually the fastest solution! I feel bad for not giving you the bounty, but I swear the code wasn't there when I tested a lot of solutions yesterday! Anyways, you have the moral victory, and I will look closely at your code! – static_rtti – 2011-06-16T21:02:32.280

@static_rtti Thanks for trying it out. Do not worry about the bounty. It turns out that in my last revision, I deleted the old version and forgot to include the new version. I do feel that the code can be further optimized, but it would be fairly small gains and specific to a particular machine. – Damascus Steel – 2011-06-16T23:36:09.060

2

I guess this really depends on what you want to do. If you want to sort a bunch of Gaussians, then this won't help you. But if you want a bunch of sorted Gaussians, this will. Even if this misses the problem a bit, I think it will be interesting to compare vs actual sorting routines.

If you want to something to be fast, do less.

Instead of generating a bunch of random samples from the normal distribution and then sorting, you can generate a bunch of samples from the normal distribution in sorted order.

You can use the solution here to generate n uniform random numbers in sorted order. Then you can use the inverse cdf (scipy.stats.norm.ppf) of the normal distribution to turn the uniform random numbers into numbers from the normal distribution via inverse transform sampling.

import scipy.stats
import random

# slightly modified from linked stackoverflow post
def n_random_numbers_increasing(n):
  """Like sorted(random() for i in range(n))),                                
  but faster because we avoid sorting."""
  v = 1.0
  while n:
    v *= random.random() ** (1.0 / n)
    yield 1 - v
    n -= 1

def n_normal_samples_increasing(n):
  return map(scipy.stats.norm.ppf, n_random_numbers_increasing(n))

If you want to get your hands dirtier, I'd guess that you might be able to speed up the many inverse cdf calculations by using some kind of iterative method, and using the previous result as your initial guess. Since the guesses are going to be very close, probably a single iteration will give you great accuracy.

rrenaud

Posted 2011-06-07T12:58:58.813

Reputation: 121

2Nice answer, but that would be cheating :) The idea of my question is that while sorting algorithms have been given enormous attention, there is almost no literature on the use of prior knowledge about the data for sorting, even though the few papers that have addressed the issue have reported nice gains. So let's see what's possible! – None – 2011-06-07T16:50:11.870

2

Try this changing Guvante's solution with this Main(), it starts sorting as soon as 1/4 IO reading is done, it's faster in my test:

    static void Main(string[] args)
    {
        FileStream filestream = new FileStream(@"..\..\..\gaussian.dat", FileMode.Open, FileAccess.Read);
        doubles = new double[][] { new double[count / 4], new double[count / 4], new double[count / 4], new double[count / 4] };
        Thread[] threads = new Thread[4];

        for (int i = 0; i < 4; i++)
        {
            byte[] bytes = new byte[count * 4];
            filestream.Read(bytes, 0, count * 4);

            for (int j = 0; j < count / 4; j++)
            {
                doubles[i][j] = BitConverter.ToDouble(bytes, i * count/4 + j * 8);
            }

            threads[i] = new Thread(ThreadStart);
            waiting[i] = events[i] = new AutoResetEvent(false);
            threads[i].Start(i);    
        }

        WaitHandle.WaitAll(waiting);
        double[] left = Merge(doubles[0], doubles[1]);
        double[] right = Merge(doubles[2], doubles[3]);
        double[] result = Merge(left, right);
        Console.ReadKey();
    }
}

user788194

Posted 2011-06-07T12:58:58.813

Reputation:

8.933s. Slightly faster :) – None – 2011-06-08T06:02:31.447

2

Since you know the distribution, my idea would be to make k buckets, each with the same expected number of elements (since you know the distribution, you can compute this). Then in O(n) time, sweep the array and put elements into their buckets.

Then concurrently sort the buckets. Suppose you have k buckets, and n elements. A bucket will take (n/k) lg (n/k) time to sort. Now suppose that you have p processors that you can use. Since buckets can be sorted independently, you have a multiplier of ceil(k/p) to deal with. This gives a final runtime of n + ceil(k/p)*(n/k) lg (n/k), which should be a good deal faster than n lg n if you choose k well.

mdkess

Posted 2011-06-07T12:58:58.813

Reputation: 121

I think this is the best solution. – Neil G – 2011-06-08T07:33:45.677

You don't exactly know the number of elements who will end up in a bucket, so the math is actually wrong. That being said, this is a good answer I think. – poulejapon – 2011-06-08T19:43:23.407

@pouejapon: You're right. – Neil G – 2011-06-09T06:18:14.050

This answer sounds really nice. The problem is -- it isn't really fast. I implemented this in C99 (see my answer), and it certainly easily beats std::sort(), but it is way slower than Alexandru's radixsort solution. – Sven Marnach – 2011-06-09T17:54:42.087

2

One low-level optimization idea is to fit two doubles in a SSE register, so each thread would work with two items at a time. This might be complicated to do for some algorithms.

Another thing to do is sorting the array in cache-friendly chunks, then merging the results. Two levels should be used: for example first 4 KB for L1 then 64 KB for L2.

This should be very cache-friendly, since the bucket sort will not go outside the cache, and the final merge will walk memory sequentially.

These days computation is much cheaper than memory accesses. However we have a large number of items, so it's hard to tell which is the array size when dumb cache-aware sort is slower than a low-complexity non-cache-aware version.

But I won't provide an implementation of the above since I would do it in Windows (VC++).

Adal

Posted 2011-06-07T12:58:58.813

Reputation:

2

My personal favorite using Intel's Threaded Building Blocks has already been posted, but here's a crude parallel solution using JDK 7 and its new fork/join API:

import java.io.FileInputStream;
import java.nio.channels.FileChannel;
import java.util.Arrays;
import java.util.concurrent.*;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
import static java.nio.ByteOrder.LITTLE_ENDIAN;


/**
 * 
 * Original Quicksort: https://github.com/pmbauer/parallel/tree/master/src/main/java/pmbauer/parallel
 *
 */
public class ForkJoinQuicksortTask extends RecursiveAction {

    public static void main(String[] args) throws Exception {

        double[] array = new double[Integer.valueOf(args[0])];

        FileChannel fileChannel = new FileInputStream("gaussian.dat").getChannel();
        fileChannel.map(READ_ONLY, 0, fileChannel.size()).order(LITTLE_ENDIAN).asDoubleBuffer().get(array);

        ForkJoinPool mainPool = new ForkJoinPool();

        System.out.println("Starting parallel computation");

        mainPool.invoke(new ForkJoinQuicksortTask(array));        
    }

    private static final long serialVersionUID = -642903763239072866L;
    private static final int SERIAL_THRESHOLD = 0x1000;

    private final double a[];
    private final int left, right;

    public ForkJoinQuicksortTask(double[] a) {this(a, 0, a.length - 1);}

    private ForkJoinQuicksortTask(double[] a, int left, int right) {
        this.a = a;
        this.left = left;
        this.right = right;
    }

    @Override
    protected void compute() {
        if (right - left < SERIAL_THRESHOLD) {
            Arrays.sort(a, left, right + 1);
        } else {
            int pivotIndex = partition(a, left, right);
            ForkJoinTask<Void> t1 = null;

            if (left < pivotIndex)
                t1 = new ForkJoinQuicksortTask(a, left, pivotIndex).fork();
            if (pivotIndex + 1 < right)
                new ForkJoinQuicksortTask(a, pivotIndex + 1, right).invoke();

            if (t1 != null)
                t1.join();
        }
    }

    public static int partition(double[] a, int left, int right) {
        // chose middle value of range for our pivot
        double pivotValue = a[left + (right - left) / 2];

        --left;
        ++right;

        while (true) {
            do
                ++left;
            while (a[left] < pivotValue);

            do
                --right;
            while (a[right] > pivotValue);

            if (left < right) {
                double tmp = a[left];
                a[left] = a[right];
                a[right] = tmp;
            } else {
                return right;
            }
        }
    }    
}

Important disclaimer: I took the quick-sort adaption for fork/join from: https://github.com/pmbauer/parallel/tree/master/src/main/java/pmbauer/parallel

To run this you need a beta build of JDK 7 (http://jdk7.java.net/download.html).

On my 2.93Ghz Quad core i7 (OS X):

Python reference

time python sort.py 50000000
sorting...

real    1m13.885s
user    1m11.942s
sys     0m1.935s

Java JDK 7 fork/join

time java ForkJoinQuicksortTask 50000000
Starting parallel computation

real    0m2.404s
user    0m10.195s
sys     0m0.347s

I also tried to do some experimenting with parallel reading and converting the bytes to doubles, but I did not see any difference there.

Update:

If anyone wants to experiment with parallel loading of the data, the parallel loading version is below. In theory this could make it go a little faster still, if your IO device has enough parallel capacity (SSDs usually do). There's also some overhead in creating Doubles from bytes, so that could potentially go faster in parallel as well. On my systems (Ubuntu 10.10/Nehalem Quad/Intel X25M SSD, and OS X 10.6/i7 Quad/Samsung SSD) I did not see any real difference.

import static java.nio.ByteOrder.LITTLE_ENDIAN;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

import java.io.FileInputStream;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;


/**
 *
 * Original Quicksort: https://github.com/pmbauer/parallel/tree/master/src/main/java/pmbauer/parallel
 *
 */
public class ForkJoinQuicksortTask extends RecursiveAction {

   public static void main(String[] args) throws Exception {

       ForkJoinPool mainPool = new ForkJoinPool();

       double[] array = new double[Integer.valueOf(args[0])];
       FileChannel fileChannel = new FileInputStream("gaussian.dat").getChannel();
       DoubleBuffer buffer = fileChannel.map(READ_ONLY, 0, fileChannel.size()).order(LITTLE_ENDIAN).asDoubleBuffer();

       mainPool.invoke(new ReadAction(buffer, array, 0, array.length));
       mainPool.invoke(new ForkJoinQuicksortTask(array));
   }

   private static final long serialVersionUID = -642903763239072866L;
   private static final int SERIAL_THRESHOLD = 0x1000;

   private final double a[];
   private final int left, right;

   public ForkJoinQuicksortTask(double[] a) {this(a, 0, a.length - 1);}

   private ForkJoinQuicksortTask(double[] a, int left, int right) {
       this.a = a;
       this.left = left;
       this.right = right;
   }

   @Override
   protected void compute() {
       if (right - left < SERIAL_THRESHOLD) {
           Arrays.sort(a, left, right + 1);
       } else {
           int pivotIndex = partition(a, left, right);
           ForkJoinTask<Void> t1 = null;

           if (left < pivotIndex)
               t1 = new ForkJoinQuicksortTask(a, left, pivotIndex).fork();
           if (pivotIndex + 1 < right)
               new ForkJoinQuicksortTask(a, pivotIndex + 1, right).invoke();

           if (t1 != null)
               t1.join();
       }
   }

   public static int partition(double[] a, int left, int right) {
       // chose middle value of range for our pivot
       double pivotValue = a[left + (right - left) / 2];

       --left;
       ++right;

       while (true) {
           do
               ++left;
           while (a[left] < pivotValue);

           do
               --right;
           while (a[right] > pivotValue);

           if (left < right) {
               double tmp = a[left];
               a[left] = a[right];
               a[right] = tmp;
           } else {
               return right;
           }
       }
   }

}

class ReadAction extends RecursiveAction {

   private static final long serialVersionUID = -3498527500076085483L;

   private final DoubleBuffer buffer;
   private final double[] array;
   private final int low, high;

   public ReadAction(DoubleBuffer buffer, double[] array, int low, int high) {
       this.buffer = buffer;
       this.array = array;
       this.low = low;
       this.high = high;
   }

   @Override
   protected void compute() {
       if (high - low < 100000) {
           buffer.position(low);
           buffer.get(array, low, high-low);
       } else {
           int middle = (low + high) >>> 1;

           invokeAll(new ReadAction(buffer.slice(), array, low, middle),  new ReadAction(buffer.slice(), array, middle, high));
       }
   }
}

Update2:

I executed the code on one of our 12 core dev machines with a slight modification to set a fixed amount of cores. This gave the following results:

Cores  Time
1      7.568s
2      3.903s
3      3.325s
4      2.388s
5      2.227s
6      1.956s
7      1.856s
8      1.827s
9      1.682s
10     1.698s
11     1.620s
12     1.503s

On this system I also tried the Python version which took 1m2.994s and Zjarek's C++ version which took 1.925s (for some reason Zjarek's C++ version seems to run relatively faster on static_rtti's computer).

I also tried what happened if I doubled the file size to 100,000,000 doubles:

Cores  Time
1      15.056s
2      8.116s
3      5.925s
4      4.802s
5      4.430s
6      3.733s
7      3.540s
8      3.228s
9      3.103s
10     2.827s
11     2.784s
12     2.689s

In this case, Zjarek's C++ version took 3.968s. Python just took too long here.

150,000,000 doubles:

Cores  Time
1      23.295s
2      12.391s
3      8.944s
4      6.990s
5      6.216s
6      6.211s
7      5.446s
8      5.155s
9      4.840s
10     4.435s
11     4.248s
12     4.174s

In this case, Zjarek's C++ version was 6.044s. I didn't even attempt Python.

The C++ version is very consistent with its results, where Java swings a little. First it gets a little more efficient when the problem gets larger, but then less efficient again.

arjan

Posted 2011-06-07T12:58:58.813

Reputation: 121

1This code does not parse the double values correctly for me. Is Java 7 required to correctly parse the values from the file? – jonderry – 2011-06-09T02:55:19.980

1Ah, silly me. I forgot to set the endianness again after I locally refactored the IO code from multiple lines to one. Java 7 would normally be needed, unless you added fork/join separately to Java 6 of course. – arjan – 2011-06-09T17:57:47.303

3.411s on my machine. Not bad, but slower than koumes21's java solution :) – static_rtti – 2011-06-09T19:21:53.327

1I'll try koumes21's solution here too locally to see what the relative differences are on my system. Anyway, no shame in 'loosing' from koumes21 since it's a much more clever solution. This is just an almost standard quick-sort thrown into a fork/join pool ;) – arjan – 2011-06-09T21:52:47.923

2

Here's a linear scan bucket sort implementation. I think it's faster than all current single-threaded implementations except for radix sort. It should have linear expected running time if I'm estimating the cdf accurately enough (I'm using linear interpolation of values I found on the web) and haven't made any mistakes that would cause excessive scanning:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#include <ctime>

using std::fill;

const double q[] = {
  0.0,
  9.865E-10,
  2.8665150000000003E-7,
  3.167E-5,
  0.001349898,
  0.022750132,
  0.158655254,
  0.5,
  0.8413447460000001,
  0.9772498679999999,
  0.998650102,
  0.99996833,
  0.9999997133485,
  0.9999999990134999,
  1.0,
};
int main(int argc, char** argv) {
  if (argc <= 1)
    return puts("No argument!");
  unsigned count = atoi(argv[1]);
  unsigned count2 = 3 * count;

  bool *ba = new bool[count2 + 1000];
  fill(ba, ba + count2 + 1000, false);
  double *a = new double[count];
  double *c = new double[count2 + 1000];

  FILE *f = fopen("gaussian.dat", "rb");
  if (fread(a, 8, count, f) != count)
    return puts("fread failed!");
  fclose(f);

  int i;
  int j;
  bool s;
  int t;
  double z;
  double p;
  double d1;
  double d2;
  for (i = 0; i < count; i++) {
    s = a[i] < 0;
    t = a[i];
    if (s) t--;
    z = a[i] - t;
    t += 7;
    if (t < 0) {
      t = 0;
      z = 0;
    } else if (t >= 14) {
      t = 13;
      z = 1;
    }
    p = q[t] * (1 - z) + q[t + 1] * z;
    j = count2 * p;
    while (ba[j] && c[j] < a[i]) {
      j++;
    }
    if (!ba[j]) {
      ba[j] = true;
      c[j] = a[i];
    } else {
      d1 = c[j];
      c[j] = a[i];
      j++;
      while (ba[j]) {
        d2 = c[j];
        c[j] = d1;
        d1 = d2;
        j++;
      }
      c[j] = d1;
      ba[j] = true;
    }
  }
  i = 0;
  int max = count2 + 1000;
  for (j = 0; j < max; j++) {
    if (ba[j]) {
      a[i++] = c[j];
    }
  }
  // for (i = 0; i < count; i += 1) {
  //   printf("here %f\n", a[i]);
  // }
  return 0;
}

jonderry

Posted 2011-06-07T12:58:58.813

Reputation: 121

1I'll try this later today when I get home. In the meantime, can I say your code is very ugly? :-D – static_rtti – 2011-06-09T09:41:42.653

3.071s! Not bad for a single-threaded solution! – static_rtti – 2011-06-09T19:02:32.647

2

I don't know, why I can't edit my previous post, so here's new version, 0,2 seconds faster (but about 1,5 s faster in CPU time (user)). This solution have 2 programs, first precalculates quantiles for normal distribution for bucket sort, and stores it in table, t[double * scale] = bucket index, where scale is some arbitrary number which makes casting to double possible. Then main program can use this data to put doubles in correct bucket. It has one drawback, if data is not gaussian it will not work correctly (and also there is almost zero chance to work incorrectly for normal distribution), but modification for special case is easy and fast (only number of buckets checks and falling to std::sort()).

Compiling: g++ => http://pastebin.com/WG7pZEzH helper program

g++ -std=c++0x -O3 -march=native -pthread => http://pastebin.com/T3yzViZP main sorting program

Zjarek

Posted 2011-06-07T12:58:58.813

Reputation: 71

1.621s! I think you're the leader, but I'm rapidly losing track with all these answers :) – static_rtti – 2011-06-09T19:08:59.087

2

Here is another sequential solution. This one uses the fact that the elements are normal distributed, and the I think the idea is generally applicable to get sorting close to linear time.

The algorithm is like this:

  • Approximate CDF (see phi() function in the implementation)
  • For all elements compute the approximate position in the sorted array: size * phi(x)
  • Put elements in a new array close to their final position
    • In my implementation destination array has some gaps in it so I don't have to shift too many elements when inserting.
  • Use insertsort to sort the final elements (insertsort is linear if distance to final position is smaller than a constant).

Unfortunately, the hidden constant is quite large and this solution is twice as slow as the radix sort algorithm.

Alexandru

Posted 2011-06-07T12:58:58.813

Reputation: 5 485

12.470s! Very nice ideas. It doesn't matter that the solution isn't the fastest if the ideas are interesting :) – static_rtti – 2011-06-09T19:35:35.580

1This is the same as mine, but grouping the phi computations together and the shifts together for better cache performance, right? – jonderry – 2011-06-10T01:13:36.807

@jonderry: I upvoted your solution, now that I understand what it does. Didn't mean to steal your idea. I included your implementation in my (unofficial) set of tests

– Alexandru – 2011-06-10T10:58:39.653

1

A version using traditional pthreads. Code for merging copied from Guvante's answer. Compile with g++ -O3 -pthread.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <algorithm>

static unsigned int nthreads = 4;
static unsigned int size = 50000000;

typedef struct {
  double *array;
  int size;
} array_t;


void 
merge(double *left, int leftsize,
      double *right, int rightsize,
      double *result)
{
  int l = 0, r = 0, insertat = 0;
  while (l < leftsize && r < rightsize) {
    if (left[l] < right[r])
      result[insertat++] = left[l++];
    else
      result[insertat++] = right[r++];
  }

  while (l < leftsize) result[insertat++] = left[l++];
  while (r < rightsize) result[insertat++] = right[r++];
}


void *
run_thread(void *input)
{
  array_t numbers = *(array_t *)input;
  std::sort(numbers.array, numbers.array+numbers.size); 
  pthread_exit(NULL);
}

int 
main(int argc, char **argv) 
{
  double *numbers = (double *) malloc(size * sizeof(double));

  FILE *f = fopen("gaussian.dat", "rb");
  if (fread(numbers, sizeof(double), size, f) != size)
    return printf("Reading gaussian.dat failed");
  fclose(f);

  array_t worksets[nthreads];
  int worksetsize = size / nthreads;
  for (int i = 0; i < nthreads; i++) {
    worksets[i].array=numbers+(i*worksetsize);
    worksets[i].size=worksetsize;
  }

  pthread_attr_t attributes;
  pthread_attr_init(&attributes);
  pthread_attr_setdetachstate(&attributes, PTHREAD_CREATE_JOINABLE);

  pthread_t threads[nthreads];
  for (int i = 0; i < nthreads; i++) {
    pthread_create(&threads[i], &attributes, &run_thread, &worksets[i]);
  }

  for (int i = 0; i < nthreads; i++) {
    pthread_join(threads[i], NULL);
  }

  double *tmp = (double *) malloc(size * sizeof(double));
  merge(numbers, worksetsize, numbers+worksetsize, worksetsize, tmp);
  merge(numbers+(worksetsize*2), worksetsize, numbers+(worksetsize*3), worksetsize, tmp+(size/2));
  merge(tmp, worksetsize*2, tmp+(size/2), worksetsize*2, numbers);

  /*
  printf("Verifying result..\n");
  for (int i = 0; i < size - 1; i++) {
    if (numbers[i] > numbers[i+1])
      printf("Result is not correct\n");
  }
  */

  pthread_attr_destroy(&attributes);
  return 0;
}  

On my laptop I get the following results:

real    0m6.660s
user    0m9.449s
sys     0m1.160s

Salesman

Posted 2011-06-07T12:58:58.813

Reputation: 121

1

Here is a sequential C99 implementation that tries to really make use of the known distribution. It basically does a single round of bucket sort using the distribution information, then a few rounds of quicksort on each bucket assuming a uniform distribution within the limits of the bucket and finally a modified selection sort to copy the data back to the original buffer. The quicksort memorizes the split points, so selection sort only needs to operate on small cunks. And in spite (because?) of all that complexity, it isn't even really fast.

To make evaluating Φ fast, the values are sampled in a few points and later on only linear interpolation is used. It actually does not matter if Φ is evaluated exactly, as long as the approximation is strictly monotonic.

The bin sizes are chosen such that the chance of an bin overflow is negligible. More precisely, with the current parameters, the chance that a dataset of 50000000 elements will cause a bin overflow is 3.65e-09. (This can be computed using the survival function of the Poisson distribution.)

To compile, please use

gcc -std=c99 -msse3 -O3 -ffinite-math-only

Since there is considerably more computation than in the other solutions, these compiler flags are needed to make it at least reasonably fast. Without -msse3 the conversions from double to int become really slow. If your architecture does not support SSE3, these conversions can also be done using the lrint() function.

The code is rather ugly -- not sure if this meets the requirement of being "reasonably readable"...

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

#define N 50000000
#define BINSIZE 720
#define MAXBINSIZE 880
#define BINCOUNT (N / BINSIZE)
#define SPLITS 64
#define PHI_VALS 513

double phi_vals[PHI_VALS];

int bin_index(double x)
{
    double y = (x + 8.0) * ((PHI_VALS - 1) / 16.0);
    int interval = y;
    y -= interval;
    return (1.0 - y) * phi_vals[interval] + y * phi_vals[interval + 1];
}

double bin_value(int bin)
{
    int left = 0;
    int right = PHI_VALS - 1;
    do
    {
        int centre = (left + right) / 2;
        if (bin < phi_vals[centre])
            right = centre;
        else
            left = centre;
    } while (right - left > 1);
    double frac = (bin - phi_vals[left]) / (phi_vals[right] - phi_vals[left]);
    return (left + frac) * (16.0 / (PHI_VALS - 1)) - 8.0;
}

void gaussian_sort(double *restrict a)
{
    double *b = malloc(BINCOUNT * MAXBINSIZE * sizeof(double));
    double **pos = malloc(BINCOUNT * sizeof(double*));
    for (size_t i = 0; i < BINCOUNT; ++i)
        pos[i] = b + MAXBINSIZE * i;
    for (size_t i = 0; i < N; ++i)
        *pos[bin_index(a[i])]++ = a[i];
    double left_val, right_val = bin_value(0);
    for (size_t bin = 0, i = 0; bin < BINCOUNT; ++bin)
    {
        left_val = right_val;
        right_val = bin_value(bin + 1);
        double *splits[SPLITS + 1];
        splits[0] = b + bin * MAXBINSIZE;
        splits[SPLITS] = pos[bin];
        for (int step = SPLITS; step > 1; step >>= 1)
            for (int left_split = 0; left_split < SPLITS; left_split += step)
            {
                double *left = splits[left_split];
                double *right = splits[left_split + step] - 1;
                double frac = (double)(left_split + (step >> 1)) / SPLITS;
                double pivot = (1.0 - frac) * left_val + frac * right_val;
                while (1)
                {
                    while (*left < pivot && left <= right)
                        ++left;
                    while (*right >= pivot && left < right)
                        --right;
                    if (left >= right)
                        break;
                    double tmp = *left;
                    *left = *right;
                    *right = tmp;
                    ++left;
                    --right;
                }
                splits[left_split + (step >> 1)] = left;
            }
        for (int left_split = 0; left_split < SPLITS; ++left_split)
        {
            double *left = splits[left_split];
            double *right = splits[left_split + 1] - 1;
            while (left <= right)
            {
                double *min = left;
                for (double *tmp = left + 1; tmp <= right; ++tmp)
                    if (*tmp < *min)
                        min = tmp;
                a[i++] = *min;
                *min = *right--;
            }
        }
    }
    free(b);
    free(pos);
}

int main()
{
    double *a = malloc(N * sizeof(double));
    FILE *f = fopen("gaussian.dat", "rb");
    assert(fread(a, sizeof(double), N, f) == N);
    fclose(f);
    for (int i = 0; i < PHI_VALS; ++i)
    {
        double x = (i * (16.0 / PHI_VALS) - 8.0) / sqrt(2.0);
        phi_vals[i] =  (erf(x) + 1.0) * 0.5 * BINCOUNT;
    }
    gaussian_sort(a);
    free(a);
}

Sven Marnach

Posted 2011-06-07T12:58:58.813

Reputation: 236

4.098s! I had to add -lm to compile it (for erf). – static_rtti – 2011-06-09T19:25:49.640

1

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <memory.h>
#include <algorithm>

// maps [-inf,+inf] to (0,1)
double normcdf(double x) {
        return 0.5 * (1 + erf(x * M_SQRT1_2));
}

int calcbin(double x, int bins) {
        return (int)floor(normcdf(x) * bins);
}

int *docensus(int bins, int n, double *arr) {
        int *hist = calloc(bins, sizeof(int));
        int i;
        for(i = 0; i < n; i++) {
                hist[calcbin(arr[i], bins)]++;
        }
        return hist;
}

void partition(int bins, int *orig_counts, double *arr) {
        int *counts = malloc(bins * sizeof(int));
        memcpy(counts, orig_counts, bins*sizeof(int));
        int *starts = malloc(bins * sizeof(int));
        int b, i;
        starts[0] = 0;
        for(i = 1; i < bins; i++) {
                starts[i] = starts[i-1] + counts[i-1];
        }
        for(b = 0; b < bins; b++) {
                while (counts[b] > 0) {
                        double v = arr[starts[b]];
                        int correctbin;
                        do {
                                correctbin = calcbin(v, bins);
                                int swappos = starts[correctbin];
                                double tmp = arr[swappos];
                                arr[swappos] = v;
                                v = tmp;
                                starts[correctbin]++;
                                counts[correctbin]--;
                        } while (correctbin != b);
                }
        }
        free(counts);
        free(starts);
}


void sortbins(int bins, int *counts, double *arr) {
        int start = 0;
        int b;
        for(b = 0; b < bins; b++) {
                std::sort(arr + start, arr + start + counts[b]);
                start += counts[b];
        }
}


void checksorted(double *arr, int n) {
        int i;
        for(i = 1; i < n; i++) {
                if (arr[i-1] > arr[i]) {
                        printf("out of order at %d: %lf %lf\n", i, arr[i-1], arr[i]);
                        exit(1);
                }
        }
}


int main(int argc, char *argv[]) {
        if (argc == 1 || argv[1] == NULL) {
                printf("Expected data size as argument\n");
                exit(1);
        }
        int n = atoi(argv[1]);
        const int cachesize = 128 * 1024; // a guess
        int bins = (int) (1.1 * n * sizeof(double) / cachesize);
        if (argc > 2) {
                bins = atoi(argv[2]);
        }
        printf("Using %d bins\n", bins);
        FILE *f = fopen("gaussian.dat", "rb");
        if (f == NULL) {
                printf("Couldn't open gaussian.dat\n");
                exit(1);
        }
        double *arr = malloc(n * sizeof(double));
        fread(arr, sizeof(double), n, f);
        fclose(f);

        int *counts = docensus(bins, n, arr);
        partition(bins, counts, arr);
        sortbins(bins, counts, arr);
        checksorted(arr, n);

        return 0;
}

This uses erf() to place each element appropriately into a bin, then sorts each bin. It keeps the array entirely in-place.

First pass: docensus() counts the number of elements in each bin.

Second pass: partition() permutes the array, placing each element into its proper bin

Third pass: sortbins() performs a qsort on each bin.

It's kind of naive, and calls the expensive erf() function twice for every value. The first and third passes are potentially parallelizable. The second is highly serial and is probably slowed down by its highly random memory access patterns. It also might be worthwhile to cache each double's bin number, depending on the CPU power to memory speed ratio.

This program lets you choose the number of bins to use. Just add a second number to the command line. I compiled it with gcc -O3, but my machine's so feeble I can't tell you any good performance numbers.

Edit: Poof! My C program has magically transformed into a C++ program using std::sort!

frud

Posted 2011-06-07T12:58:58.813

Reputation: 11

You can use phi for a faster stdnormal_cdf.

– Alexandru – 2011-06-09T18:11:29.830

How many bins should I put, approximately? – static_rtti – 2011-06-09T19:09:40.147

@Alexandru: I added a piecewise linear approximation to normcdf and only gained about 5% speed. – frud – 2011-06-09T19:14:37.033

@static_rtti: You don't have to put any. By default, the code chooses the bins count so the average bin size is 10/11 of 128kb. Too few bins and you don't get the benefit of partitioning. Too many and the partition phase bogs down due to overflowing the cache. – frud – 2011-06-09T19:19:16.347

10.6s! I tried playing a bit with the number of bins, and I got the best results with 5000 (slightly over the default value of 3356). I must say I was expected to see a much better performance for your solution... Maybe it's the fact that you're using qsort instead of the potentially faster std::sort of the C++ solutions? – static_rtti – 2011-06-09T19:43:29.243

1

Have a look at the radix sort implementation by Michael Herf (Radix Tricks). On my machine sorting was 5 times faster compared to the std::sort algorithm in my first answer. The name of the sorting function is RadixSort11.

int main(void)
{
    std::ifstream ifs("C:\\Temp\\gaussian.dat", std::ios::binary | std::ios::in);
    std::vector<float> v;
    v.reserve(50000000);
    double d;
    while (ifs.read(reinterpret_cast<char*>(&d), sizeof(double)))
        v.push_back(static_cast<float>(d));
    std::vector<float> vres(v.size(), 0.0);
    clock_t c0 = clock();
    RadixSort11(&v[0], &vres[0], v.size());
    std::cout << "Finished after: "
              << static_cast<double>(clock() - c0) / CLOCKS_PER_SEC << std::endl;
    return 0;
}

Christian Ammer

Posted 2011-06-07T12:58:58.813

Reputation: 151