Here is a mix of Alexandru's radix sort with Zjarek's threaded smart pivoting. Compile it with
g++ -std=c++0x -pthread -O3 -march=native sorter_gaussian_radix.cxx -o sorter_gaussian_radix
You can change the radix size by defining STEP (e.g. add -DSTEP=11). I found that 8 (the default) works best on my laptop.
By default, it splits the problem into 4 pieces and runs them on multiple threads. You can change that by passing a depth parameter on the command line. So if you have two cores, run it as
sorter_gaussian_radix 50000000 1
and if you have 16 cores
sorter_gaussian_radix 50000000 4
The max depth right now is 6 (64 threads). Specifying too many levels will just slow the code down.
One thing I also tried was the radix sort from the Intel Performance Primitives (IPP) library. Alexandru's implementation soundly trounces IPP, which is about 30% slower. That variation is also included here (commented out).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#include <ctime>
#include <iostream>
#include <thread>
#include <vector>
#include <boost/cstdint.hpp>
// #include "ipps.h"
#ifndef STEP
#define STEP 8
#endif
const int step = STEP;
const int start_step=24;
const int num_steps=(64-start_step+step-1)/step;
int size;
double *dbuf, *copy;
clock_t c1, c2, c3, c4, c5;
const double distrib[]={-2.15387,
-1.86273,
-1.67594,
-1.53412,
-1.4178,
-1.31801,
-1.22986,
-1.15035,
-1.07752,
-1.00999,
-0.946782,
-0.887147,
-0.830511,
-0.776422,
-0.724514,
-0.67449,
-0.626099,
-0.579132,
-0.53341,
-0.488776,
-0.445096,
-0.40225,
-0.36013,
-0.318639,
-0.27769,
-0.237202,
-0.197099,
-0.157311,
-0.11777,
-0.0784124,
-0.0391761,
0,
0.0391761,
0.0784124,
0.11777,
0.157311,
0.197099,
0.237202,
0.27769,
0.318639,
0.36013,
0.40225,
0.445097,
0.488776,
0.53341,
0.579132,
0.626099,
0.67449,
0.724514,
0.776422,
0.830511,
0.887147,
0.946782,
1.00999,
1.07752,
1.15035,
1.22986,
1.31801,
1.4178,
1.53412,
1.67594,
1.86273,
2.15387};
class Distrib
{
const double value; // must be double: the pivots in distrib[] are fractional
public:
Distrib(const double &v): value(v) {}
bool operator()(double a)
{
return a<value;
}
};
void recursive_sort(const int start, const int end,
const int index, const int offset,
const int depth, const int max_depth)
{
if(depth<max_depth)
{
Distrib dist(distrib[index]);
const int middle=std::partition(dbuf+start,dbuf+end,dist) - dbuf;
// const int middle=
// std::partition(dbuf+start,dbuf+end,[&](double a)
// {return a<distrib[index];})
// - dbuf;
std::thread lower(recursive_sort,start,middle,index-offset,offset/2,
depth+1,max_depth);
std::thread upper(recursive_sort,middle,end,index+offset,offset/2,
depth+1,max_depth);
lower.join();
upper.join();
}
else
{
// ippsSortRadixAscend_64f_I(dbuf+start,copy+start,end-start);
c1=clock();
double *dbuf_local(dbuf), *copy_local(copy);
boost::uint64_t mask = (boost::uint64_t(1) << step) - 1;
int cnt[num_steps][1 << step]; // 1 << step == mask+1; a constant bound avoids a VLA
boost::uint64_t *ibuf = reinterpret_cast<boost::uint64_t *> (dbuf_local);
for(int i=0;i<num_steps;++i)
for(unsigned int j=0;j<mask+1;++j)
cnt[i][j]=0;
for (int i = start; i < end; i++)
{
for (int w = start_step, v = 0; w < 64; w += step, v++)
{
int p = (~ibuf[i] >> w) & mask;
(cnt[v][p])++;
}
}
c2=clock();
std::vector<int> sum(num_steps,0);
for (unsigned int i = 0; i <= mask; i++)
{
for (int w = start_step, v = 0; w < 64; w += step, v++)
{
int tmp = sum[v] + cnt[v][i];
cnt[v][i] = sum[v];
sum[v] = tmp;
}
}
c3=clock();
for (int w = start_step, v = 0; w < 64; w += step, v++)
{
ibuf = reinterpret_cast<boost::uint64_t *>(dbuf_local);
for (int i = start; i < end; i++)
{
int p = (~ibuf[i] >> w) & mask;
copy_local[start+((cnt[v][p])++)] = dbuf_local[i];
}
std::swap(copy_local,dbuf_local);
}
// Do the last set of reversals
for (int p = start; p < end; p++)
if (dbuf_local[p] >= 0.)
{
std::reverse(dbuf_local+p, dbuf_local + end);
break;
}
c4=clock();
// Insertion sort
for (int i = start+1; i < end; i++) {
double value = dbuf_local[i];
if (value < dbuf_local[i - 1]) {
dbuf_local[i] = dbuf_local[i - 1];
int p = i - 1;
for (; p > 0 && value < dbuf_local[p - 1]; p--)
dbuf_local[p] = dbuf_local[p - 1];
dbuf_local[p] = value;
}
}
c5=clock();
}
}
int main(int argc, char **argv) {
if (argc < 2)
{
std::cerr << "Usage: " << argv[0] << " size [depth]\n";
return 1;
}
size = atoi(argv[1]);
copy = new double[size];
dbuf = new double[size];
FILE *f = fopen("gaussian.dat", "rb"); // binary mode; the file is raw doubles
fread(dbuf, sizeof(double), size, f); // fread(ptr, element size, count, stream)
fclose(f);
clock_t c0 = clock();
const int max_depth= (argc > 2) ? atoi(argv[2]) : 2;
// ippsSortRadixAscend_64f_I(dbuf,copy,size);
recursive_sort(0,size,31,16,0,max_depth);
if(num_steps%2==1)
std::swap(dbuf,copy);
// for (int i=0; i<size-1; i++){
// if (dbuf[i]>dbuf[i+1])
// std::cout << "BAD "
// << i << " "
// << dbuf[i] << " "
// << dbuf[i+1] << " "
// << "\n";
// }
std::cout << "Finished after "
<< (double) (c1 - c0) / CLOCKS_PER_SEC << " "
<< (double) (c2 - c1) / CLOCKS_PER_SEC << " "
<< (double) (c3 - c2) / CLOCKS_PER_SEC << " "
<< (double) (c4 - c3) / CLOCKS_PER_SEC << " "
<< (double) (c5 - c4) / CLOCKS_PER_SEC << " "
<< "\n";
// delete [] dbuf;
// delete [] copy;
return 0;
}
EDIT: I implemented Alexandru's cache improvements, and that shaved off about 30% of the time on my machine.
EDIT: This implements a recursive sort, so it should work well on Alexandru's 16 core machine. It also uses Alexandru's latest improvement and removes one of the reverses. For me, this gave a 20% improvement.
EDIT: Fixed a sign bug that caused inefficiency when there are more than 2 cores.
EDIT: Removed the lambda, so it will compile with older versions of gcc. It includes the IPP code variation commented out. I also fixed the documentation for running on 16 cores. As far as I can tell, this is the fastest implementation.
EDIT: Fixed a bug when STEP is not 8. Increased the maximum number of threads to 64. Added some timing info.
I'm a bit lost concerning your implementation. I saved the first snippet as generator.py and did python generator.py 1000000. However, when I saved the second snippet as sort.py and entered python sort.py 1000000 in the console, I got an EOFError. I know nearly nothing in Python, so if you could help... Edit: the gaussian.dat file is indeed at the right place, but the serialization made it look like garbage in my editor. Normal behaviour with python's tofile()? – None – 2011-06-07T14:14:52.100
@Fezvez: it's supposed to look like garbage since it's stored as raw binary data. However, what you did is supposed to work... What version of python did you use? I used python 2.4, which is rather old. Also, how big is the resulting file? – None – 2011-06-07T14:29:24.110
I used python 2.7, and the file size was 8M. If anyone cares to check with this version of python, it'll help. Anyway, that's not the point; I'll generate the data myself, and sort it myself too. It's just that I wondered why I couldn't make the baseline work for me. – None – 2011-06-07T14:42:36.963
@Fezvez: I'll try tonight on a newer machine. In the meantime, can you report the exact size in bytes? If it's exactly 8,000,000 bytes, then the file is probably good. – None – 2011-06-07T14:47:23.797
Hehe : 7,62 MB (8 000 000 bytes)! In the meantime, I'll work on the sorting algorithm =) – None – 2011-06-07T15:24:15.290
@Fezvez: I just tried it again with Python 2.7, and it works fine... – None – 2011-06-07T16:55:17.767
I believe that using the Gaussian distribution needlessly complicates things, if you want to do smart calculations. If we have the uniform distribution, the point (using the knowledge of the distribution) stays the same, and the arithmetic gets much simpler... – None – 2011-06-07T18:06:55.187
@xofon: I picked the Gaussian distribution because it's what is most widely encountered in practice. – None – 2011-06-07T18:49:32.920
@xofon: I would be surprised if the direct indexing method described by Mike didn't win hands down over such a method, though in theory a strategy to avoid the cost of the phi method could be created with this approach. – None – 2011-06-07T18:52:31.023
Take each value and move it directly to its "expected" position, repeat for the displaced value. Not sure how to resolve a couple of issues with that. When done, bubble sort until complete (a couple of passes should do). – None – 2011-06-07T20:29:18.647
I will post a bucket sort solution tomorrow evening if this hasn't been closed by then :) – None – 2011-06-07T21:22:10.423
@static_rtti - as a heavy CG user, this is exactly the sort of thing "we" like to hack on at CG.SE. To any reading mods, move this to CG, don't close it. – arrdem – 2011-06-08T00:44:17.967
Welcome to CodeGolf.SE! I've cleared a lot of the commentary from the SO original concerning where this does or does not belong, and re-tagged to be closer to the CodeGolf.SE mainstream. – dmckee --- ex-moderator kitten – 2011-06-08T13:26:05.390
The one tricky issue here is that we look for objective winning criteria, and "fastest" introduces platform dependencies... does an O(n^{1.2}) algorithm implemented on the cpython virtual machine beat an O(n^{1.3}) algorithm with a similar constant implemented in C? I generally suggest some discussion on the performance characteristics of each solution, as this may help people to judge what is going on. – dmckee --- ex-moderator kitten – 2011-06-08T13:26:51.837
Thanks for the welcome! The problem you're referring to is why I run the code on my computer, so we have a single reference. I agree different languages are going to have different performance results, that is why I think discussing the algorithms is as important as looking at the raw numbers. – static_rtti – 2011-06-08T13:31:16.030
Hopefully the winning solution (at least!) will be tested for correctness in addition to speed? I can write "sort; exit" very, very fast... as far as you know! – Ben Jackson – 2011-06-08T20:22:54.453
@Ben Jackson: of course! This is becoming increasingly needed as solutions become more and more cryptic! – static_rtti – 2011-06-09T09:44:53.043
Note that I'll add a bounty, so the contest should go on for about a week. – static_rtti – 2011-06-09T09:45:14.277
@static_rtti: Would you like to split the contest into two sections: single-threaded (wall-time) vs multi-threaded (real-time)? I have access to a 48 core machine (gcc 4.3.4) which I can use it to time multi-threaded solutions. – Alexandru – 2011-06-09T12:23:44.713
Why don't you give the timings on that machine? I'm not sure it would make sense to start a second question, but the results would definitely be interesting. Maybe increase the problem size as well? – static_rtti – 2011-06-09T12:29:29.393
Here are some timings on a 16 core machine: link. The 48 core machine is currently busy, but none of the tested solutions can take advantage of more than 4 cpus. The testing methodology is in the git repository I linked to. Anybody is welcome to add more tests. – Alexandru – 2011-06-09T14:16:54.297
@Alexandru, the fork/join solution I provided should definitely be able to take advantage of more than 4 cpus. Did you test it? – arjan – 2011-06-09T18:41:42.463
@arjan: Unfortunately, I don't have Java 1.7 on that machine, but 1.6. I expected the TBB solution to use all CPUs, but it wasn't the case. – Alexandru – 2011-06-09T18:50:40.717
I tried to evaluate all new solutions, I hope I didn't miss any! Note that I didn't have much time, so I didn't look at the solutions in depth, but I will try to do so (and also check a few ones for correctness) later. Thanks to everyone for your very nice solutions! – static_rtti – 2011-06-09T19:36:41.260
@Alexandru - and you don't have download rights/capabilities on that 48 core machine? You don't have to install Java 7; you can simply put it in your home dir. I'll see if I can try it tomorrow on a simple (compared to yours ;)) 12 core dev machine. – arjan – 2011-06-09T21:39:18.360
@arjan: Done. Your solution is a close second. – Alexandru – 2011-06-10T10:14:32.547
@Alexandru: Thanks! :) What are the specs of the machine btw? I also executed my solution on our 12 core dev machine, which is a ~3.46GHz dual Intel Nehalem hex core with some 96GB RAM and, I think, a dual 8x SSD. Hyperthreading was disabled. To get some idea of the scaling factor, I ran the code with 1 core up to 12 cores, and did another run with 10 million doubles. See my updated answer for a table with the results. – arjan – 2011-06-10T19:43:42.860
OK, if nobody objects, I will retest the fastest solutions tonight, check them for correctness, and award the bounty. – static_rtti – 2011-06-14T09:30:44.853
Nice contest. You should probably point out that a real implementation would have to estimate the mean and standard deviation first, or somehow on the fly - that might change the algorithm somewhat. – Flash – 2011-06-14T13:07:02.707
@Andrew, I agree, but the idea was to focus on the sorting. Estimation is another problem, and is already well understood. – static_rtti – 2011-06-15T08:09:01.027