π(n) is the number of primes less than or equal to n.

Input: a natural number, n.

Output: π(n).

Scoring: This is a fastest-code challenge. Score will be the sum of times for the score cases. I will time each entry on my computer.

Rules and Details

Your code should work for n up to 2 billion ( 2,000,000,000 ).
Built-ins that trivialize this are not allowed. This includes built in π functions or lists of values for π(n).
Built-ins that test primality or generate primes are not allowed. This includes lists of primes, which may not be looked up externally or hardcoded locally, except with regards to the next bullet point.
You may hardcode primes up to and including 19 and no higher.
your implementation of π should be deterministic. This means that given a specific n, your code should run in (approximately) the same amount of time.
Languages used must be freely available on Linux (Centos 7). Instructions should be included on how to run your code. Include compiler/interpreter details if necessary.
Official times will be from my computer.
When posting, please include a self-measured time on some/all of the test/score cases, just to give me an estimate of how fast your code is running.
Submissions must fit in an answer post to this question.
I am running 64bit centos7. I only have 8GB of RAM and 1GB swap. The cpu model is: AMD FX(tm)-6300 Six-Core Processor.

Test cases(source):

Input        Output
90           24
3000         430
9000         1117
4000000      283146           <--- input = 4*10^6
800000000    41146179         <--- input = 9*10^8
1100000000   55662470         <--- input = 1.1*10^9

Score Cases (same source)

As usual, these cases are subject to change. Optimizing for the scoring cases is not permitted. I may also change the number of cases in an effort to balance reasonable run times and accurate results.

Input        Output
1907000000   93875448         <--- input = 1.907*10^9
1337000000   66990613         <--- input = 1.337*10^9
1240000000   62366021         <--- input = 1.24*10^9
660000000    34286170         <--- input = 6.6*10^8
99820000     5751639          <--- input = 9.982*10^7
40550000     2465109          <--- input = 4.055*10^7
24850000     1557132          <--- input = 2.485*10^7
41500        4339

Duration

Since this is a fastest-code challenge and entries are to be run on my computer, I reserve the right to stop timing entries after 2 weeks. After this point, entries are still accepted, but there is no guarantee that they are officially timed.

Having said this, I don't expect too many answers to this challenge and I will likely continue to time new answers indefinitely.

Scoring Particulars

I timed the faster entries with the following script:

#!/bin/bash

a=(1907000000 1337000000 1240000000 660000000 99820000 40550000 24850000 41500)

echo DennisC
exec 2>> times/dennisc.txt
time for j in ${a[@]}; do ./dennisc $j; done >> /dev/null;

echo DennisPy
exec 2>> times/dennispy.txt
time for j in ${a[@]}; do pypy dennispy.py <<< $j; done >> /dev/null;

echo arjandelumens
exec 2>> times/arjandelumens.txt
time for j in ${a[@]}; do ./arjandelumens $j; done >> /dev/null;

echo orlp
exec 2>> times/orlp.txt
time for j in ${a[@]}; do ./orlp $j; done >> /dev/null;

# echo mwr247
# time node-v4.3.1-linux-x64/bin/node mwr247.js

# mwr247 using js seems a bit longer, so I am going to run the fastest
# and then come back to his. 

# mwr247 provided a function, so I appended
# console.log( F( <argument> ) )
# to his code, for each argument.

time writes to stderr, so I sent stderr to a log file using exec 2 >> <filename>. You may notice that stdout is sent to /dev/null. This isn't a problem, because I already verified that the programs were producing the correct output.

I ran the above timeall.sh script 10 times using for i in {1..10}; do ./timeall.sh; done;

I then averaged the real time score for each entry.

Note that no other programs were running on my computer while timing.

Also, the official times have been appended to each entry. Please double check your own average.

Liam

Posted 2016-02-26T19:08:02.277

Reputation: 3 035

What prevents us from using a lookup table with the first 2e9 values of pi(n)? Would that be acceptable? (Not sure how fast it would be, though, because it would be a large table) – Luis Mendo – 2016-02-26T19:15:41.440

@DonMuesli That wouldn't be acceptable (goes against the spirit of the challenge), I've edited to make it expressly forbidden as well now. – Liam – 2016-02-26T19:18:45.937

8It's dangerous to refer to the "spirit" of the challenge. Your "against the spirit" may be someone else's "great trick" :-) It's better that you made it explicit – Luis Mendo – 2016-02-26T19:20:51.020

@DonMuesli Yeah. Honestly I think that is actually a standard loophole so it was closed by default. I would consider it Fetching the desired output from an external source unless the language for some reason has a built-in for it. In which case, "built-ins that trivialize" were already disallowed.

– Liam – 2016-02-26T19:24:18.480

1What's a built-in? I have a list-of-primes function in a library. May I use it? If not, can I copy the source code of the library in my program and use that? – nimi – 2016-02-26T19:26:01.300

@nimi "Built-ins that test primality or generate primes are not allowed. This includes lists of primes." also "you may hardcode primes up to and including 19 and no higher." – Liam – 2016-02-26T19:27:19.570

1@Liam: Yes, I know, but what counts as a built-in? Is copying source code from a library a built-in? – nimi – 2016-02-26T19:28:05.163

@nimi a function in your language. If it's not a function you wrote, then it is a built-in. I've added a bit to the challenge to hopefully clear things up. – Liam – 2016-02-26T19:28:58.173

@nimi if the definition is in your submission, then I suppose it would not be a built-in, it is now hard-coded. Just make sure to give credit. – Liam – 2016-02-26T19:33:59.713

@DonMuesli You could hardcode it and call it with reflection? That'd be pretty quick, provided the lookup was. – Addison Crump – 2016-02-26T23:24:58.470

@VoteToClose Anyway, the challenge now excludes that: may not be looked up externally or hardcoded locally – Luis Mendo – 2016-02-26T23:26:50.333

Do you have a 64-bit system? – LegionMammal978 – 2016-02-27T02:45:21.113

Is there any limit on compile time/executable size because I may be able to use a preprocessor-macro to generate all the primes up to 2000000000 into a table and then during the execution it would just be to look up in table. – 永劫回帰 – 2016-02-28T08:40:18.147

@変幻出没 That's an interesting question. I haven't considered compile time with the other entries, so I'm leaning toward that it's not expressly forbidden, but is still kind of questionable. Having said that, it's a bit late to change the rules now, so go ahead. – Liam – 2016-02-29T03:05:19.240

@Liam I've been meaning to refamiliarize myself with simd instructions and this seems like a good opportunity. Even if it's non-competing, could you give your cpu type so I can attempt an entry with simd stuff? – Michael Klein – 2016-02-29T13:05:04.733

@MichaelKlein AMD FX(tm)-6300 Six-Core Processor. If you need more info, please let me know. – Liam – 2016-02-29T16:37:13.553

@Liam, should be enough, it has AVX (advanced vector extensions, 128 bit) and fused-multiply add. – Michael Klein – 2016-02-29T16:59:22.723

What, no answers in machine code, or even assembler? – Cary Swoveland – 2016-03-01T07:40:50.730

@CarySwoveland There is time yet! – Liam – 2016-03-01T07:41:43.060

launching a new process for each test seems rather unfair to interpreted languages. should entries not just be able to include a test harness main() function? – Sparr – 2016-03-07T14:31:00.513

@Liam, Hi.. I know its too late.. I'm planning to generate primes with Sieve of Eratosthenes (ofcourse it's not a plain Sieve, I'm gng to heavily customise it) to generate primes upto given number, and print the number of primes upto given number.. I assume my program will be provided with many test cases. One thing I want to confirm is, whether all the numbers will be given in an one run, or does one number will be given for every exection upto n test cases..? – The Coder – 2016-03-24T16:46:52.477

@thecoder inputs are given one at a time. – Liam – 2016-03-24T17:51:39.453

What are the specs on your computer's GPU? – Mego – 2016-06-11T02:15:14.097

@mego Graphics Card Manufacturer - Powered by AMD Graphics Chipset - AMD Radeon R9 200 Series Device ID - 6810 Vendor ID - 1002 SubSystem ID - E271 SubSystem Vendor ID - 174B Memory Size - 2048 MB Memory Type - GDDR5 Memory Clock - 1400 MHz Core Clock - 1070 MHz Total Memory Bandwidth - 179 GByte/s – Liam – 2016-06-11T19:43:13.503

@Liam Thanks! I might be able to compete with Dennis! – Mego – 2016-06-11T19:45:57.887

@Mego I'm looking forward to it! If you need more details, let me know – Liam – 2016-06-11T19:46:55.983

Answers

123

C, 0.026119s (Mar 12 2016)

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define cache_size 16384
#define Phi_prec_max (47 * a)

#define bit(k) (1ULL << ((k) & 63))
#define word(k) sieve[(k) >> 6]
#define sbit(k) ((word(k >> 1) >> (k >> 1)) & 1)
#define ones(k) (~0ULL >> (64 - (k)))
#define m2(k) ((k + 1) / 2)
#define max(a, b) ((a) > (b) ? (a) : (b))
#define min(a, b) ((a) < (b) ? (a) : (b))
#define ns(t) (1000000000 * t.tv_sec + t.tv_nsec)
#define popcnt __builtin_popcountll

#define mask_build(i, p, o, m) mask |= m << i, i += o, i -= p * (i >= p)
#define Phi_prec_bytes ((m2(Phi_prec_max) + 1) * sizeof(int16_t))
#define Phi_prec(i, j) Phi_prec_pointer[(j) * (m2(Phi_prec_max) + 1) + (i)]
#define Phi_6_next ((i / 1155) * 480 + Phi_5[i % 1155] - Phi_5[(i + 6) / 13])
#define Phi_6_upd_1() t = Phi_6_next, i += 1, *(l++) = t
#define Phi_6_upd_2() t = Phi_6_next, i += 2, *(l++) = t, *(l++) = t
#define Phi_6_upd_3() t = Phi_6_next, i += 3, *(l++) = t, *(l++) = t, *(l++) = t

typedef unsigned __int128 uint128_t;
struct timespec then, now;
uint64_t a, primes[4648] = { 2, 3, 5, 7, 11, 13, 17, 19 }, *primes_fastdiv;
uint16_t *Phi_6, *Phi_prec_pointer;

static inline uint64_t Phi_6_mod(uint64_t y)
{
    if (y < 30030)
        return Phi_6[m2(y)];
    else
        return (y / 30030) * 5760 + Phi_6[m2(y % 30030)];
}

static inline uint64_t fastdiv(uint64_t dividend, uint64_t fast_divisor)
{
    return ((uint128_t) dividend * fast_divisor) >> 64;
}

uint64_t Phi(uint64_t y, uint64_t c)
{
    uint64_t *d = primes_fastdiv, i = 0, r = Phi_6_mod(y), t = y / 17;

    r -= Phi_6_mod(t), t = y / 19;

    while (i < c && t > Phi_prec_max) r -= Phi(t, i++), t = fastdiv(y, *(d++));

    while (i < c && t) r -= Phi_prec(m2(t), i++), t = fastdiv(y, *(d++));

    return r;
}

uint64_t Phi_small(uint64_t y, uint64_t c)
{
    if (!c--) return y;

    return Phi_small(y, c) - Phi_small(y / primes[c], c);
}

uint64_t pi_small(uint64_t y)
{
    uint64_t i, r = 0;

    for (i = 0; i < 8; i++) r += (primes[i] <= y);

    for (i = 21; i <= y; i += 2)
        r += i % 3 && i % 5 && i % 7 && i % 11 && i % 13 && i % 17 && i % 19;

    return r;
}

int output(int result)
{
    clock_gettime(CLOCK_REALTIME, &now);
    printf("pi(x) = %9d    real time:%9ld ns\n", result , ns(now) - ns(then));

    return 0;
}

int main(int argc, char *argv[])
{
    uint64_t b, i, j, k, limit, mask, P2, *p, start, t = 8, x = atoi(argv[1]);
    uint64_t root2 = sqrt(x), root3 = pow(x, 1./3), top = x / root3 + 1;
    uint64_t halftop = m2(top), *sieve, sieve_length = (halftop + 63) / 64;
    uint64_t i3 = 1, i5 = 2, i7 = 3, i11 = 5, i13 = 6, i17 = 8, i19 = 9;
    uint16_t Phi_3[] = { 0, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 7, 7, 8 };
    uint16_t *l, *m, Phi_4[106], Phi_5[1156];

    clock_gettime(CLOCK_REALTIME, &then);

    sieve = malloc(sieve_length * sizeof(int64_t));

    if (x < 529) return output(pi_small(x));

    for (i = 0; i < sieve_length; i++)
    {
        mask  = 0;

        mask_build( i3,  3,  2, 0x9249249249249249ULL);
        mask_build( i5,  5,  1, 0x1084210842108421ULL);
        mask_build( i7,  7,  6, 0x8102040810204081ULL);
        mask_build(i11, 11,  2, 0x0080100200400801ULL);
        mask_build(i13, 13,  1, 0x0010008004002001ULL);
        mask_build(i17, 17,  4, 0x0008000400020001ULL);
        mask_build(i19, 19, 12, 0x0200004000080001ULL);

        sieve[i] = ~mask;
    }

    limit = min(halftop, 8 * cache_size);

    for (i = 21; i < root3 + 1; i += 2)
        if (sbit(i))
            for (primes[t++] = i, j = i * i / 2; j < limit; j += i)
                word(j) &= ~bit(j);

    a = t;

    for (i = (root3 + 1) | 1; i < root2 + 1; i += 2)
        if (sbit(i)) primes[t++] = i;

    b = t;

    while (limit < halftop)
    {
        start = 2 * limit + 1, limit = min(halftop, limit + 8 * cache_size);

        for (p = &primes[8]; p < &primes[a]; p++)
            for (j = max(start / *p | 1, *p) * *p / 2; j < limit; j += *p)
                word(j) &= ~bit(j);
    }

    P2 = (a - b) * (a + b - 1) / 2;

    for (i = m2(root2); b --> a; P2 += t, i = limit)
    {
        limit = m2(x / primes[b]), j = limit & ~63;

        if (i < j)
        {
            t += popcnt((word(i)) >> (i & 63)), i = (i | 63) + 1;

            while (i < j) t += popcnt(word(i)), i += 64;

            if (i < limit) t += popcnt(word(i) & ones(limit - i));
        }
        else if (i < limit) t += popcnt((word(i) >> (i & 63)) & ones(limit - i));
    }

    if (a < 7) return output(Phi_small(x, a) + a - 1 - P2);

    a -= 7, Phi_6 = malloc(a * Phi_prec_bytes + 15016 * sizeof(int16_t));
    Phi_prec_pointer = &Phi_6[15016];

    for (i = 0; i <= 105; i++)
        Phi_4[i] = (i / 15) * 8 + Phi_3[i % 15] - Phi_3[(i + 3) / 7];

    for (i = 0; i <= 1155; i++)
        Phi_5[i] = (i / 105) * 48 + Phi_4[i % 105] - Phi_4[(i + 5) / 11];

    for (i = 1, l = Phi_6, *l++ = 0; i <= 15015; )
    {
        Phi_6_upd_3(); Phi_6_upd_2(); Phi_6_upd_1(); Phi_6_upd_2();
        Phi_6_upd_1(); Phi_6_upd_2(); Phi_6_upd_3(); Phi_6_upd_1();
    }

    for (i = 0; i <= m2(Phi_prec_max); i++)
        Phi_prec(i, 0) = Phi_6[i] - Phi_6[(i + 8) / 17];

    for (j = 1, p = &primes[7]; j < a; j++, p++)
    {
        i = 1, memcpy(&Phi_prec(0, j), &Phi_prec(0, j - 1), Phi_prec_bytes);
        l = &Phi_prec(*p / 2 + 1, j), m = &Phi_prec(m2(Phi_prec_max), j) - *p;

        while (l <= m)
            for (k = 0, t = Phi_prec(i++, j - 1); k < *p; k++) *(l++) -= t;

        t = Phi_prec(i++, j - 1);

        while (l <= m + *p) *(l++) -= t;
    }

    primes_fastdiv = malloc(a * sizeof(int64_t));

    for (i = 0, p = &primes[8]; i < a; i++, p++)
    {
        t = 96 - __builtin_clzll(*p);
        primes_fastdiv[i] = (bit(t) / *p + 1) << (64 - t);
    }

    return output(Phi(x, a) + a + 6 - P2);
}

This uses the Meissel-Lehmer method.

Timings

On my machine, I'm getting roughly 5.7 milliseconds for the combined test cases. This is on an Intel Core i7-3770 with DDR3 RAM at 1867 MHz, running openSUSE 13.2.

$ ./timepi '-march=native -O3' pi 1000
pi(x) =  93875448    real time:  2774958 ns
pi(x) =  66990613    real time:  2158491 ns
pi(x) =  62366021    real time:  2023441 ns
pi(x) =  34286170    real time:  1233158 ns
pi(x) =   5751639    real time:   384284 ns
pi(x) =   2465109    real time:   239783 ns
pi(x) =   1557132    real time:   196248 ns
pi(x) =      4339    real time:    60597 ns

0.00572879 s

Because the variance got too high, I'm using timings from within the program for the unofficial run times. This is the script that computed the average of the combined run times.

#!/bin/bash

all() { for j in ${a[@]}; do ./$1 $j; done; }

gcc -Wall $1 -lm -o $2 $2.c

a=(1907000000 1337000000 1240000000 660000000 99820000 40550000 24850000 41500)

all $2

r=$(seq 1 $3)

for i in $r; do all $2; done > times

awk -v it=$3 '{ sum += $6 } END { print "\n" sum / (1e9 * it) " s" }' times

rm times

Official times

This time is for doing the score cases 1000 times.

real    0m28.006s
user    0m15.703s
sys 0m14.319s

How it works

Formula

Let \$x\$ be a positive integer.

Each positive integer \$n \le x\$ satisfies exactly one of the following conditions.

\$n = 1\$
\$n\$ is divisible by a prime number \$p\$ in \$[1, \sqrt[3]{x}]\$.
\$n = pq\$, where \$p\$ and \$q\$ are (not necessarily distinct) prime numbers in \$(\sqrt[3]{x}, \sqrt[3]{x^2})\$.
\$n\$ is prime and \$n > \sqrt[3]{x}\$

Let \$\pi(y)\$ denote the number of primes \$p\$ such that \$p \le y\$. There are \$\pi(x) - \pi(\sqrt[3]{x})\$ numbers that fall in the fourth category.

Let \$P_k(y, c)\$ denote the amount of positive integers \$m \le y\$ that are a product of exactly \$k\$ prime numbers not among the first \$c\$ prime numbers. There are \$P_2(x, \pi(\sqrt[3]{x}))\$ numbers that fall in the third category.

Finally, let \$\phi(y, c)\$ denote the amount of positive integers \$k \le y\$ that are coprime to the first \$c\$ prime numbers. There are \$x - \phi(x, \pi(\sqrt[3]{x}))\$ numbers that fall into the second category.

Since there are \$x\$ numbers in all categories,

$$1 + x - \phi(x, \pi(\sqrt[3]{x})) + P_2(x, \pi(\sqrt[3]{x})) + \pi(x) - \pi(\sqrt[3]{x}) = x$$

and, therefore,

$$\pi(x) = \phi(x, \pi(\sqrt[3]{x})) + \pi(\sqrt[3]{x}) - 1 - P_2(x, \pi(\sqrt[3]{x}))$$

The numbers in the third category have a unique representation if we require that \$p \le q\$ and, therefore \$p \le \sqrt{x}\$. This way, the product of the primes \$p\$ and \$q\$ is in the third category if and only if \$\sqrt[3]{x} < p \le q \le \frac{x}{p}\$, so there are \$\pi(\frac{x}{p}) - \pi(p) + 1\$ possible values for \$q\$ for a fixed value of \$p\$, and \$P_2(x, \pi(\sqrt[3]{x})) = \sum_{\pi(\sqrt[3]{x}) < k \le \pi(\sqrt{x})} (\pi(\frac{x}{p_k}) - \pi(p_k) + 1)\$, where \$p_k\$ denotes the \$k^{\text{th}}\$ prime number.

Finally, every positive integer \$n \le y\$ that is not coprime to the first \$c\$ prime numbers can be expressed in unique fashion as \$n = p_kf\$, where \$p_k\$ is the lowest prime factor of \$n\$. This way, \$k \le c\$, and \$f\$ is coprime to the first \$k - 1\$ primes numbers.

This leads to the recursive formula \$\phi(y, c) = y - \sum_{1 \le k \le c} \phi(\frac{y}{p_k}, k - 1)\$. In particular, the sum is empty if \$c = 0\$, so \$\phi(y, 0) = y\$.

We now have a formula that allows us to compute \$\pi(x)\$ by generating only the first \$\pi(\sqrt[3]{x^2})\$ prime numbers (millions vs billions).

Algorithm

We'll need to compute \$\pi(\frac{x}{p})\$, where \$p\$ can get as low as \$\sqrt[3]{x}\$. While there are other ways to do this (like applying our formula recursively), the fastest way seems to be to enumerate all primes up to \$\sqrt[3]{x^2}\$, which can be done with the sieve of Eratosthenes.

First, we identify and store all prime numbers in \$[1, \sqrt{x}]\$, and compute \$\pi(\sqrt[3]{x})\$ and \$\pi(\sqrt{x})\$ at the same time. Then, we compute \$\frac{x}{p_k}\$ for all \$k\$ in \$(\pi(\sqrt[3]{x}), \pi(\sqrt{x})]\$, and count the primes up to each successive quotient.

Also, \$\sum_{\pi(\sqrt[3]{x}) < k \le \pi(\sqrt{x})} (-\pi(p_k) + 1)\$ has the closed form \$\frac{\pi(\sqrt[3]{x}) - \pi(\sqrt{x}))(\pi(\sqrt[3]x) + \pi(\sqrt{x}) - 1}{2}\$, which allows us to complete the computation of \$P_2(x, \pi(\sqrt[3]{x}))\$.

That leaves the computation of \$\phi\$, which is the most expensive part of the algorithm. Simply using the recursive formula would require \$2^c\$ function calls to compute \$\phi(y, c)\$.

First of all, \$\phi(0, c) = 0\$ for all values of \$c\$, so \$\phi(y, c) = y - \sum_{1 \le k \le c, p_k \le y} \phi(\frac{y}{p_k}, k - 1)\$. By itself, this observation is already enough to make the computation feasible. This is because any number below \$2 \cdot 10^9\$ is smaller than the product of any ten distinct primes, so the overwhelming majority of summands vanishes.

Also, by grouping \$y\$ and the first \$c'\$ summands of the definition of \$\phi\$, we get the alternative formula \$\phi(y, c) = \phi(y, c') - \sum_{c' < k \le c, p_k \le y} \phi(\frac{y}{p_k}, k - 1)\$. Thus, precomputing \$\phi(y, c')\$ for a fixed \$c'\$ and appropriate values of \$y\$ saves most of the remaining function calls and the associated computations.

If \$m_c = \prod_{1 \le k \le c} p_k\$, then \$\phi(m_c, c) = \varphi(m_c)\$, since the integers in \$[1, m_c]\$ that are divisible by none of \$p_1, \cdots, p_c\$ are precisely those that are coprime to \$m_c\$. Also, since \$\gcd(z + m_c, m_c) = \gcd(z, m_c)\$, we have that \$\phi(y, c) = \phi(\lfloor \frac{y}{m_c} \rfloor m_c, c) + \phi(y % m_c, c) = \lfloor \frac{y}{m_c} \rfloor \varphi(m_c) + \phi(y % m_c, c)\$.

Since Euler's totient function is multiplicative, \$\varphi(m_c) = \prod_{1 \le k \le c} \varphi(p_k) = \prod_{1 \le k \le c} (p_k - 1)\$, and we have an easy way to derive \$\phi(y, c)\$ for all \$y\$ by precomputing the values for only those \$y\$ in \$[0, m_c)\$.

Also, if we set \$c' = c - 1\$, we obtain \$\phi(y, c) = \phi(y, c - 1) - \phi(\frac{y}{p_c}, c - 1)\$, the original definition from Lehmer's paper. This gives us a simple way to precompute \$\phi(y, c)\$ for increasing values of \$c\$.

In addition for precomputing \$\phi(y, c)\$ for a certain, low value of \$c\$, we'll also precompute it for low values of \$y\$, cutting the recursion short after falling below a certain threshold.

Implementation

The previous section covers most parts of the code. One remaining, important detail is how the divisions in the function Phi are performed.

Since computing \$\phi\$ only requires dividing by the first \$\pi(\sqrt[3]{x})\$ prime numbers, we can use the fastdiv function instead. Rather than simply dividing an \$y\$ by a prime \$p\$, we multiply \$y\$ by \$d_p \approx \frac{2^{64}}{p}\$ instead and recover \$\frac{y}{p}\$ as \$\frac{d_py}{2^{64}}\$. Because of how integer multiplication is implemented on x64, dividing by \$2^{64}\$ is not required; the higher 64 bits of \$d_py\$ are stored in their own register.

Note that this method requires precomputing \$d_p\$, which isn't faster than computing \$\frac{y}{p}\$ directly. However, since we have to divide by the same primes over and over again and division is so much slower than multiplication, this results in an important speed-up. More details on this algorithm, as well as a formal proof, can be found in Division by Invariant Integers using Multiplication.

Dennis

Posted 2016-02-26T19:08:02.277

Reputation: 196 637

22One does not simply outpace Dennis? – Addison Crump – 2016-02-28T21:24:39.943

8Honestly I just can't believe how fast this is. I haven't had the time to go through and understand what's going on but I really need to. – Liam – 2016-02-28T22:14:53.470

27@Liam I fully intend to explain how this works, but I'm still trying to speed it up. Right now, I really wish PPCG had LaTeX... – Dennis – 2016-02-29T05:05:36.413

1I understand that MathJax was causing some problems or something but we really need to get it back – Liam – 2016-02-29T05:17:23.953

15Fun note: (On my machine) This is currently beating both Mathematica's builtin and kimwalisch on github's primecount C++ library, however, it is currently the only entry doing so. – Michael Klein – 2016-03-01T00:15:44.777

6I see you are a fan of the 'goes to' operator: while (b --> a) – neocpp – 2016-03-01T04:04:24.947

– cat – 2016-03-01T22:46:18.977

You can speed up the Phi function considerably with the The Meissel, Lehmer, Lagarias, Miller, Odlyzko Method

– TheNumberOne – 2016-03-02T13:42:21.103

10@TheNumberOne shh don't tell him about it... other people might need that to beat him – Liam – 2016-03-02T23:59:16.637

This code is buggy. (26071) is 2866 and not 2865 which this code returns.

– Tamas – 2019-11-04T09:52:01.653

1@Tamas I seem to have introduced an off-by-one error in revision 45. I've tested the fixed code for all integers between 1 and 10,000,000. – Dennis – 2019-11-04T13:00:02.863

C99/C++, 8.9208s (28 Feb 2016)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

uint64_t popcount( uint64_t v )
    {
    v = (v & 0x5555555555555555ULL) + ((v>>1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v>>2) & 0x3333333333333333ULL);
    v = (v & 0x0F0F0F0F0F0F0F0FULL) + ((v>>4) & 0x0F0F0F0F0F0F0F0FULL);
    v *= 0x0101010101010101ULL;
    return v >> 56;
    }

#define PPROD  3*5*7

int primecount( int limit )
    {
    int i,j;
    int reps = (limit-1)/(64*PPROD) + 1;
    int mod_limit = reps * (64*PPROD);
    int seek_limit = (int)ceil( sqrt(limit) );
    int primecount = 0;
    int slice_count = limit/250000 + 1;

    uint8_t *buf = (uint8_t *)malloc( mod_limit/8 + seek_limit);
    int *primes = (int *)malloc(seek_limit*sizeof(int));

    // initialize a repeating bit-pattern to fill our sieve-memory with
    uint64_t v[PPROD];
    memset(v, 0, sizeof(v) );
    for(i=0;i<(64*PPROD);i++)
        for(j=2;j<=7;j++)
            if( i % j == 0 )
                v[ i >> 6 ] |= 1ULL << (i & 0x3F);

    for(i=0; i<reps; i++)
        memcpy( buf + 8*PPROD*i, v, 8*PPROD );

    // use naive E-sieve to get hold of all primes to test for
    for(i=11;i<seek_limit;i+=2)
        {
        if( (buf[i >> 3] & (1 << (i & 7)) ) == 0 )
            {
            primes[primecount++] = i;
            for(j=3*i;j<seek_limit;j += 2*i )
                buf[j >> 3] |= (1 << (j&7) );
            }
        }

    // fill up whole E-sieve. Use chunks of about 30 Kbytes
    // so that the chunk of E-sieve we're working on
    // can fit into the L1-cache.
    for(j=0;j<slice_count;j++)
        {
        int low_bound = ((uint64_t)limit * j) / slice_count;
        int high_bound = ((uint64_t)limit * (j+1)) / slice_count - 1;

        for(i=0;i<primecount;i++)
            {
            int pm = primes[i];
            // compute the first odd multiple of pm that is larger than or equal
            // to the lower bound.
            uint32_t lb2 = (low_bound + pm - 1) / pm;
            lb2 |= 1;
            if( lb2 < 3 ) lb2 = 3;
            lb2 *= pm;
            uint32_t hb2 = (high_bound / pm) * pm;

            uint32_t kt1 = ((lb2 + 2*pm) >> 3) - (lb2 >> 3);
            uint32_t kt2 = ((lb2 + 4*pm) >> 3) - (lb2 >> 3);
            uint32_t kt3 = ((lb2 + 6*pm) >> 3) - (lb2 >> 3);

            uint32_t kx0 = 1 << (lb2 & 7);
            uint32_t kx1 = 1 << ((lb2 + 2*pm) & 7);
            uint32_t kx2 = 1 << ((lb2 + 4*pm) & 7);
            uint32_t kx3 = 1 << ((lb2 + 6*pm) & 7);

            uint8_t *lb3 = buf + (lb2 >> 3);
            uint8_t *hb3 = buf + (hb2 >> 3);

            uint8_t *kp;
            for(kp=lb3; kp<=hb3; kp+=pm)
                {
                kp[0]   |= kx0;
                kp[kt1] |= kx1;
                kp[kt2] |= kx2;
                kp[kt3] |= kx3;
                }
            }
        }

    // flag tail elements to exclude them from prime-counting.
    for(i=limit;i<mod_limit;i++)
        buf[i >> 3] |= 1 << (i&7);

    int sum = 0;
    uint64_t *bufx = (uint64_t *)buf;

    for(i=0;i<mod_limit>>6;i++)
        sum += popcount( bufx[i] );

    free(buf);
    free(primes);

    return mod_limit - sum + 3;
    }


int main( int argc, char **argv)
    {
    if( argc != 2 )
        {
        printf("Please provide an argument\n");
        exit(1);
        }

    int limit = atoi( argv[1] );
    if( limit < 3 || limit > 2000000000 )
        {
        printf("Argument %d out of range\n", limit );
        exit(1);
        }

    printf("%d\n", primecount(limit) );
    }

A bitmap-based sieve-of-erastothenes implementation. It performs the following steps:

First, generate a repeating bit-pattern to fill the sieve with, that covers multiples of 2,3,5,7
Next, use the sieve method to generate an array of all primes smaller than sqrt(n)
Next, use the prime list from the previous step to write into the sieve. This is done on chunks of the sieve that are approximately L1-cache-sized, so that the sieve processing doesn't constantly thrash the L1 cache; this seems to bring about a 5x speedup over not chunking.
Finally, perform a bit-count.

Compiled with gcc primecount.c -O3 -lm -Wall and run on ubuntu 15.10 (64-bit) on an i7-4970k, it takes about 2.2 seconds for the full set of score cases. The running time is dominated by step 3; this could be multithreaded if desired, since the chunks are independent; this would require some care to ensure that chunk boundaries are suitably aligned.

It allocates slightly more memory than strictly needed for the sieve; this makes room for some end-of-buffer overrun, which is necessary for the loop unrolling in step 3 to work properly.

Official times

real    0m8.934s
user    0m8.795s
sys 0m0.150s

real    0m8.956s
user    0m8.818s
sys 0m0.150s

real    0m8.907s
user    0m8.775s
sys 0m0.145s

real    0m8.904s
user    0m8.775s
sys 0m0.141s

real    0m8.902s
user    0m8.783s
sys 0m0.132s

real    0m9.087s
user    0m8.923s
sys 0m0.176s

real    0m8.905s
user    0m8.778s
sys 0m0.140s

real    0m9.005s
user    0m8.859s
sys 0m0.158s

real    0m8.911s
user    0m8.789s
sys 0m0.135s

real    0m8.907s
user    0m8.781s
sys 0m0.138s

arjan de lumens

Posted 2016-02-26T19:08:02.277

Reputation: 351

8Welcome to Programming Puzzles & Code Golf, and congratulations on a stellar first post! – Dennis – 2016-02-27T19:34:13.670

Consider using -O3 -march=native. Your CPU supports the popcnt instruction, and compilers can sometimes recognize some pure C implementations of it and compile to the single instruction. (Or better, use __builtin_popcountll on GNU C, like Dennis's answer).

– Peter Cordes – 2016-12-17T19:07:36.150

-march=native on your Haswell CPU will also enable BMI2 for more efficient variable-count shift instructions. (SHLX instead of the legacy SHL that needs count in cl.) The OP's AMD Piledriver CPU doesn't have BMI2, but it does have popcnt. But AMD CPUs run variable-count SHL faster than Intel CPUs, so compiling with BMI2 while tuning might still be appropriate. Piledriver is pretty different from Haswell as far as micro-optimizations go, but asking for -march=native is good – Peter Cordes – 2016-12-17T19:11:37.567

Java, 25,725.315 seconds on this machine

This is not going to win, I just wanted to post an answer that does not use any sieves.

UPDATE: This is currently ranked at about 150,440.4386 times slower than the leading score. Go up-vote them, their answer is awesome.

Byte code:

0000000: cafe babe 0000 0034 0030 0a00 0900 1709  .......4.0......
0000010: 0018 0019 0a00 1a00 1b0a 0008 001c 0a00  ................
0000020: 1d00 1e0a 0008 001f 0a00 2000 2107 0022  .......... .!.."
0000030: 0700 2301 0006 3c69 6e69 743e 0100 0328  ..#...<init>...(
0000040: 2956 0100 0443 6f64 6501 000f 4c69 6e65  )V...Code...Line
0000050: 4e75 6d62 6572 5461 626c 6501 0004 6d61  NumberTable...ma
0000060: 696e 0100 1628 5b4c 6a61 7661 2f6c 616e  in...([Ljava/lan
0000070: 672f 5374 7269 6e67 3b29 5601 0008 6e75  g/String;)V...nu
0000080: 6d50 7269 6d65 0100 0428 4929 4901 000d  mPrime...(I)I...
0000090: 5374 6163 6b4d 6170 5461 626c 6501 0007  StackMapTable...
00000a0: 6973 5072 696d 6501 0004 2849 295a 0100  isPrime...(I)Z..
00000b0: 0a53 6f75 7263 6546 696c 6501 0006 452e  .SourceFile...E.
00000c0: 6a61 7661 0c00 0a00 0b07 0024 0c00 2500  java.......$..%.
00000d0: 2607 0027 0c00 2800 290c 0010 0011 0700  &..'..(.).......
00000e0: 2a0c 002b 002c 0c00 1300 1407 002d 0c00  *..+.,.......-..
00000f0: 2e00 2f01 0001 4501 0010 6a61 7661 2f6c  ../...E...java/l
0000100: 616e 672f 4f62 6a65 6374 0100 106a 6176  ang/Object...jav
0000110: 612f 6c61 6e67 2f53 7973 7465 6d01 0003  a/lang/System...
0000120: 6f75 7401 0015 4c6a 6176 612f 696f 2f50  out...Ljava/io/P
0000130: 7269 6e74 5374 7265 616d 3b01 0011 6a61  rintStream;...ja
0000140: 7661 2f6c 616e 672f 496e 7465 6765 7201  va/lang/Integer.
0000150: 0008 7061 7273 6549 6e74 0100 1528 4c6a  ..parseInt...(Lj
0000160: 6176 612f 6c61 6e67 2f53 7472 696e 673b  ava/lang/String;
0000170: 2949 0100 136a 6176 612f 696f 2f50 7269  )I...java/io/Pri
0000180: 6e74 5374 7265 616d 0100 0770 7269 6e74  ntStream...print
0000190: 6c6e 0100 0428 4929 5601 000e 6a61 7661  ln...(I)V...java
00001a0: 2f6c 616e 672f 4d61 7468 0100 0473 7172  /lang/Math...sqr
00001b0: 7401 0004 2844 2944 0021 0008 0009 0000  t...(D)D.!......
00001c0: 0000 0004 0001 000a 000b 0001 000c 0000  ................
00001d0: 001d 0001 0001 0000 0005 2ab7 0001 b100  ..........*.....
00001e0: 0000 0100 0d00 0000 0600 0100 0000 0100  ................
00001f0: 0900 0e00 0f00 0100 0c00 0000 2c00 0300  ............,...
0000200: 0100 0000 10b2 0002 2a03 32b8 0003 b800  ........*.2.....
0000210: 04b6 0005 b100 0000 0100 0d00 0000 0a00  ................
0000220: 0200 0000 0300 0f00 0400 0a00 1000 1100  ................
0000230: 0100 0c00 0000 6600 0200 0300 0000 2003  ......f....... .
0000240: 3c03 3d1c 1aa2 0018 1b1c b800 0699 0007  <.=.............
0000250: 04a7 0004 0360 3c84 0201 a7ff e91b ac00  .....`<.........
0000260: 0000 0200 0d00 0000 1600 0500 0000 0600  ................
0000270: 0200 0700 0900 0800 1800 0700 1e00 0900  ................
0000280: 1200 0000 1800 04fd 0004 0101 5001 ff00  ............P...
0000290: 0000 0301 0101 0002 0101 fa00 0700 0a00  ................
00002a0: 1300 1400 0100 0c00 0000 9700 0300 0300  ................
00002b0: 0000 4c1a 05a2 0005 03ac 1a05 9f00 081a  ..L.............
00002c0: 06a0 0005 04ac 1a05 7099 0009 1a06 709a  ........p.....p.
00002d0: 0005 03ac 1a87 b800 078e 0460 3c10 063d  ...........`<..=
00002e0: 1c1b a300 1b1a 1c04 6470 9900 0b1a 1c04  ........dp......
00002f0: 6070 9a00 0503 ac84 0206 a7ff e604 ac00  `p..............
0000300: 0000 0200 0d00 0000 2200 0800 0000 0c00  ........".......
0000310: 0700 0d00 1300 0e00 2100 0f00 2a00 1000  ........!...*...
0000320: 3200 1100 4400 1000 4a00 1200 1200 0000  2...D...J.......
0000330: 1100 0907 0901 0b01 fd00 0b01 0114 01fa  ................
0000340: 0005 0001 0015 0000 0002 0016            ............

Source Code:

public class E {
    public static void main(String[]args){
        System.out.println(numPrime(Integer.parseInt(args[0])));
    }
    private static int numPrime(int max) {
        int toReturn = 0;
        for (int i = 0; i < max; i++)
            toReturn += (isPrime(i))?1:0;
        return toReturn;
    }
    private static boolean isPrime(int n) {
            if(n < 2) return false;
            if(n == 2 || n == 3) return true;
            if(n%2 == 0 || n%3 == 0) return false;
            int sqrtN = (int)Math.sqrt(n)+1;
            for(int i = 6; i <= sqrtN; i += 6)
                if(n%(i-1) == 0 || n%(i+1) == 0) return false;
            return true;
    }
}

Turns out the optimizer was, in fact, increasing the time taken. >.> Dammit.

Input below 1000 appears to take .157s average time on my computer (likely due to class loading ಠ_ಠ), but past about 1e7 it gets fussy.

Timing list:

> time java E 41500;time java E 24850000;time java E 40550000;time java E 99820000;time java E 660000000;time java E 1240000000;time java E 1337000000;time java E 1907000000
4339

real    0m0.236s
user    0m0.112s
sys     0m0.024s
1557132

real    0m8.842s
user    0m8.784s
sys     0m0.060s
2465109

real    0m18.442s
user    0m18.348s
sys     0m0.116s
5751639

real    1m15.642s
user    1m8.772s
sys     0m0.252s
34286170

real    40m35.810s
user    16m5.240s
sys     0m5.820s
62366021

real    104m12.628s
user    39m32.348s
sys     0m13.584s
66990613

real    110m22.064s
user    42m28.092s
sys     0m11.320s
93875448

real    171m51.650s
user    68m39.968s
sys     0m14.916s

Addison Crump

Posted 2016-02-26T19:08:02.277

Reputation: 10 763

11Java is currently running at a consistent 100% CPU right now. This is totally efficient, what are you talking about? – Addison Crump – 2016-02-27T00:32:57.630

can you give me a quit tutorial on how to java ( because C/C++ > java). I compile with javac voteToClose.java (I renamed the class) and then what? – Liam – 2016-02-27T02:04:08.763

@Liam java voteToClose <input> – Addison Crump – 2016-02-27T02:07:48.390

It outputs voteToClose.class – Liam – 2016-02-27T02:10:00.457

@Liam Make sure you rename the class in the file as well. – Addison Crump – 2016-02-27T02:11:45.227

Nice. Code should work for n up to 2 billion, so you could use int instead of long, I think that would improve speed. – sergioFC – 2016-02-27T13:49:22.650

@sergioFC Yes, that made it slightly faster. :D Thanks. – Addison Crump – 2016-02-27T15:25:20.347

Is it okay if I don't run this one? I can if you really want me to, but I kinda don't want to force my computer to do this for 2 hours. – Liam – 2016-02-27T18:41:32.107

@Liam I'm not going to win anyways. :P Don't worry about it. I'll post the times that I get. – Addison Crump – 2016-02-27T18:42:20.657

1Wait... Why does the byte code say cafe babe? – Cyoce – 2016-02-27T19:43:44.167

12@Cyoce All Java class files are headed with 0xCAFEBABE. – Addison Crump – 2016-02-27T19:44:35.340

Python 2 (PyPy 4.0), 2.36961s (Feb 29 2016)

def Phi(m, b):
    if not b:
        return m
    if not m:
        return 0
    if m >= 800:
        return Phi(m, b - 1) - Phi(m // primes[b - 1], b - 1)
    t = b * 800 + m
    if not Phi_memo[t]:
        Phi_memo[t] =  Phi(m, b - 1) - Phi(m // primes[b - 1], b - 1)
    return Phi_memo[t]

x = int(input())

if x < 6:
    print [0, 0, 1, 2, 2, 3][x]
    exit()

root2 = int(x ** (1./2))
root3 = int(x ** (1./3))
top = x // root3 + 1
sieve = [0, 0] + [1] * (top - 2)
pi = [0, 0]
primes = []
t = 0

for i in range(2, top):
    if sieve[i] == 1:
        t += 1
        primes.append(i)
        sieve[i::i] = [0] * len(sieve[i::i])
    pi.append(t)

a, b = pi[root3 + 1], pi[root2 + 1]
Phi_memo = [0] * ((a + 1) * 800)

print Phi(x, a) + a - 1 - sum(pi[x // p] - pi[p] + 1 for p in primes[a:b])

This uses the Meissel-Lehmer method.

Timings

$ time for i in 1.907e9 1.337e9 1.24e9 6.6e8 9.982e7 4.055e7 2.485e7 41500
> do pypy pi.py <<< $i; done
93875448
66990613
62366021
34286170
5751639
2465109
1557132
4339

real    0m1.696s
user    0m1.360s
sys     0m0.332s

Official times

Since there was another answer with a similar time, I opted to get more precise results. I timed this 100 times. The score is the following time divided by 100.

real    3m56.961s
user    3m38.802s
sys 0m18.512s

Dennis

Posted 2016-02-26T19:08:02.277

Reputation: 196 637

5Also, just for note: this code is 15,102.4 times faster than mine. +1 – Addison Crump – 2016-02-28T02:19:02.050

Rust, 0.37001 sec (12 June 2016)

About 10 times slower than slower than Dennis' C answer, but 10 times faster than his Python entry. This answer is made possible by @Shepmaster and @Veedrac who helped improve it on Code Review. It is taken verbatim from @Veedrac's post.

use std::env;

const EMPTY: usize = std::usize::MAX;
const MAX_X: usize = 800;

fn main() {
    let args: Vec<_> = env::args().collect();
    let x: usize = args[1].trim().parse().expect("expected a number");

    let root = (x as f64).sqrt() as usize;
    let y = (x as f64).powf(0.3333333333333) as usize + 1;

    let sieve_size = x / y + 2;
    let mut sieve = vec![true; sieve_size];
    let mut primes = vec![0; sieve_size];
    sieve[0] = false;
    sieve[1] = false;

    let mut a = 0;
    let mut num_primes = 1;

    let mut num_primes_smaller_root = 0;

    // find all primes up to x/y ~ x^2/3 aka sieve_size
    for i in 2..sieve_size {
        if sieve[i] {
            if i <= root {
                if i <= y {
                    a += 1;
                }
                num_primes_smaller_root += 1;
            }

            primes[num_primes] = i;
            num_primes += 1;
            let mut multiples = i;
            while multiples < sieve_size {
                sieve[multiples] = false;
                multiples += i;
            }
        }
    }

    let interesting_primes = primes[a + 1..num_primes_smaller_root + 1].iter();

    let p_2 =
        interesting_primes
        .map(|ip| primes.iter().take_while(|&&p| p <= x / ip).count())
        .enumerate()
        .map(|(i, v)| v - 1 - i - a)
        .fold(0, |acc, v| acc + v);

    let mut phi_results = vec![EMPTY; (a + 1) * MAX_X];
    println!("pi({}) = {}", x, phi(x, a, &primes, &mut phi_results) + a - 1 - p_2);
}

fn phi(x: usize, b: usize, primes: &[usize], phi_results: &mut [usize]) -> usize {
    if b == 0 {
        return x;
    }

    if x < MAX_X && phi_results[x + b * MAX_X] != EMPTY {
        return phi_results[x + b * MAX_X];
    }

    let value = phi(x, b - 1, primes, phi_results) - phi(x / primes[b], b - 1, primes, phi_results);
    if x < MAX_X {
        phi_results[x + b * MAX_X] = value;
    }
    value
}

Timed with: time ./time.sh where time.sh looks like:

#!/bin/bash

a=(1907000000 1337000000 1240000000 660000000 99820000 40550000 24850000 41500)

for i in {0..100}; do
    for j in ${a[@]}; do
        ./target/release/pi_n $j  > /dev/null;
    done;
done;

Here is the output.

[me@localhost pi_n]$ time ./time.sh 

real    0m37.011s
user    0m34.752s
sys 0m2.410s

Liam

Posted 2016-02-26T19:08:02.277

Reputation: 3 035

Node.js (JavaScript/ES6), 83.549s (11 Nov 2016)

var n=process.argv[2]*1,r=new Uint8Array(n),p=0,i=1,j
while(++i<=n){
  if(r[i]===0){
    for(j=i*i;j<=n;j+=i){r[j]=1}
    p+=1
  }
}
console.log(p)

Finally got around to remaking this, and it's both smaller/simpler and MUCH faster than before. Rather than a slower brute force method, it utilizes the Sieve of Eratosthenes alongside more efficient data structures, so that it is now able to actually finish in a respectable time (as far as I can find on the internet, it's the fastest JS prime count function out there).

Some demo times (i7-3770k):

10^4 (10,000) => 0.001 seconds
10^5 (100,000) => 0.003 seconds
10^6 (1,000,000) => 0.009 seconds
10^7 (10,000,000) => 0.074 seconds
10^8 (100,000,000) => 1.193 seconds
10^9 (1,000,000,000) => 14.415 seconds

Mwr247

Posted 2016-02-26T19:08:02.277

Reputation: 3 494

Why +=1 and not ++? – ETHproductions – 2016-02-27T01:43:58.020

@ETHproductions Depends on if you mean pre or post-increment. i++ has to hold the value change for another op, which at this scale leads to a small yet noticeable performance hit. I didn't test pre-increment, but I suspect it will be about the same as +=1. – Mwr247 – 2016-02-27T01:52:53.680

But +=1 needs to allocate 1 into memory. I think. If I were you, I would use ++i. I think that there is a single instruction to increment a value, so, I'm not sure. – Ismael Miguel – 2016-02-27T16:47:20.950

Why is it so condensed? This isn't [tag:code-golf], and this is really hard to read. – Cyoce – 2016-02-27T19:29:18.253

Also, it might help to change (...)|0;i=0 to (...)|(i=0) – Cyoce – 2016-02-27T19:34:59.463

@IsmaelMiguel I assume that that is one of the optimizations that JavaScript would perform pre-execution, so they probably are about the same. Different from post-increment because the behavior is different. – Mwr247 – 2016-02-27T21:11:41.393

I wouldn't believe on that. That 1 must be in memory somewhere. And on an architecture with an instruction to increment a value, you dont need to load 2 and sum them. just 1 and increment a bit. I would be surprised if it did a huge difference. – Ismael Miguel – 2016-02-27T21:14:21.437

@IsmaelMiguel I'm fairly certain that most JavaScript engines make the initial optimization to increment before execution, in which case the end result should be the same on the low level. I tested in JSPerf and it showed negligible difference, but I'll go ahead change it anyways. – Mwr247 – 2016-02-28T03:21:35.560

And those tiny changes gave you -2 seconds on 10^8? WOW! – Ismael Miguel – 2016-02-28T03:31:56.730

@IsmaelMiguel Actually, only one thing changed the time, and it was stupid for me not to notice: p.length... Simply having to calculate the list size at the end was adding a few seconds. By moving it to l (which tracks it for me), it saved a significant amount =) Thanks for helping me scrutinize and perf it better ^_^ – Mwr247 – 2016-02-28T03:33:26.617

@Cyoce Didn't help, but it looks better, so I changed to that. As to why it's so condensed... I kinda like how it looks. Spread out programs look kinda ugly to me, and I like the conciseness of it haha. Stupid I know, but meh. – Mwr247 – 2016-02-28T03:35:36.190

You're welcome. I'm trying to get rid of that ugly continue. I came up with function F(n){var i=0,a=3,b=1,p=[2],l=0,s=2;while(a<=n)if(a%(s=p[i]))b<s||i===l?(p[++l]=a)===(b=Math.sqrt(a+=2)|(i=0)):++i;else b=Math.sqrt(a+=2)|(i=0);return l+1-(n<2)}, which is ugly... To say the least... – Ismael Miguel – 2016-02-28T04:04:40.607

Is there a way in particular that you want me to run this? I'm getting times that are a lot longer than those you provided, I'll edit the question in a bit to include the script I'm using to time things. It took 1min to run the shortest 4 test cases on my computer. – Liam – 2016-02-28T18:23:08.137

So basically, take a look at the script. If that's an acceptable way to run it, just let me know and I'll time it that way – Liam – 2016-02-28T19:24:02.450

@Liam That is an acceptable method to run it. I'm curious what times you're getting compared to the ones I've provided, but it may simply be a matter of machine power. As far as it not completing, that's sorta expected... It's not the most efficient implementation in not the most efficient language, and by my estimates, calculating up through 10^9 should take nearly 10.5 minutes on my machine. – Mwr247 – 2016-02-28T20:14:07.630

As a side note, I've tried preallocating the array's length using n/(Math.log(n)-1)|0 to approximate the number of primes, as well as using special array types like Uint32Array, and everything slowed it down relative to the current implementation =( – Mwr247 – 2016-02-28T20:15:40.433

C++11, 22.6503s (28 Feb 2016)

Compile with g++ -O2 -m64 -march=native -ftree-vectorize -std=c++11 numprimes.cpp. These options are important. You also need to have Boost installed. On Ubuntu this is available by installing libboost-all-dev.

If you are on Windows I can recommend installing g++ and Boost through MSYS2. I have written a nice tutorial on how to install MSYS2. After following the tutorial you can install Boost using pacman -Sy `pacman -Ssq boost`.

#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>
#include <boost/dynamic_bitset.hpp>

uint64_t num_primes(uint64_t n) {
    // http://stackoverflow.com/questions/4643647/fast-prime-factorization-module
    uint64_t pi = (n >= 2) + (n >= 3);
    if (n < 5) return pi;

    n += 1;
    uint64_t correction = n % 6 > 1;
    uint64_t wheels[6] = { n, n - 1, n + 4, n + 3, n + 2, n + 1 };
    uint64_t limit = wheels[n % 6];

    boost::dynamic_bitset<> sieve(limit / 3);
    sieve.set();
    sieve[0] = false;

    for (uint64_t i = 0, upper = uint64_t(std::sqrt(limit))/3; i <= upper; ++i) {
        if (sieve[i]) {
            uint64_t k = (3*i + 1) | 1;
            for (uint64_t j = (k*k) / 3;                   j < limit/3; j += 2*k) sieve[j] = false;
            for (uint64_t j = (k*k + 4*k - 2*k*(i & 1))/3; j < limit/3; j += 2*k) sieve[j] = false;
        }
    }

    pi += sieve.count();
    for (uint64_t i = limit / 3 - correction; i < limit / 3; ++i) pi -= sieve[i];

    return pi;
}


int main(int argc, char** argv) {
    if (argc <= 1) {
        std::cout << "Usage: " << argv[0] << " n\n";
        return 0;
    }

    std::cout << num_primes(std::stoi(argv[1])) << "\n";
    return 0;
}

On my machine this runs in 4.8 seconds for 1907000000 (1.9e9).

^{The code above was repurposed from my personal C++ library, so I had a head start.}

Official times

real    0m22.760s
user    0m22.704s
sys 0m0.080s

real    0m22.854s
user    0m22.800s
sys 0m0.077s

real    0m22.742s
user    0m22.700s
sys 0m0.066s

real    0m22.484s
user    0m22.450s
sys 0m0.059s

real    0m22.653s
user    0m22.597s
sys 0m0.080s

real    0m22.665s
user    0m22.602s
sys 0m0.088s

real    0m22.528s
user    0m22.489s
sys 0m0.062s

real    0m22.510s
user    0m22.474s
sys 0m0.060s

real    0m22.819s
user    0m22.759s
sys 0m0.084s

real    0m22.488s
user    0m22.459s
sys 0m0.053s

orlp

Posted 2016-02-26T19:08:02.277

Reputation: 37 067

:o Dayyyum. That is fast. What's your machine? – Addison Crump – 2016-02-26T23:19:41.653

@VoteToClose Intel i5-4670k running 64-bit Windows 7. – orlp – 2016-02-26T23:21:40.093

care to add an explanation? – Liam – 2016-02-27T17:32:23.477

@Liam It's just a sieve that has any number which is a multiple of 2 and 3 left out of the sieve. – orlp – 2016-02-27T17:37:24.077

C++, 2.47215s (29 Feb 2016)

This is a (sloppy) multi-threaded version of my other answer.

#include <cstdint>
#include <vector>
#include <iostream>
#include <limits>
#include <cmath>
#include <array>
// uses posix ffsll
#include <string.h>
#include <algorithm>
#include <thread>

constexpr uint64_t wheel_width = 2;
constexpr uint64_t buf_size = 1<<(10+6);
constexpr uint64_t dtype_width = 6;
constexpr uint64_t dtype_mask = 63;
constexpr uint64_t buf_len = ((buf_size*wheel_width)>>dtype_width);
constexpr uint64_t seg_len = 6*buf_size;
constexpr uint64_t limit_i_max = 0xfffffffe00000001ULL;

typedef std::vector<uint64_t> buf_type;

void mark_composite(buf_type& buf, uint64_t prime,
                    std::array<uint64_t, 2>& poff,
                    uint64_t seg_start, uint64_t max_j)
{
  const auto p = 2*prime;
  for(uint64_t k = 0; k < wheel_width; ++k)
  {
    for(uint64_t j = 2*poff[k]+(k==0); j < max_j; j += p)
    {
      buf[(j-seg_start)>>dtype_width] |= 1ULL << (j & dtype_mask);
      poff[k] += prime;
    }
  }
}

struct prime_counter
{
  buf_type buf;
  uint64_t n;
  uint64_t seg_a, seg_b;
  uint64_t nj;
  uint64_t store_max;
  uint64_t& store_res;

  prime_counter(uint64_t n, uint64_t seg_a, uint64_t seg_b, uint64_t nj, uint64_t store_max,
                uint64_t& store_res) :
    buf(buf_len), n(n), nj(nj), seg_a(seg_a), seg_b(seg_b),
    store_max(store_max), store_res(store_res)
  {}

  prime_counter(const prime_counter&) = default;
  prime_counter(prime_counter&&) = default;

  prime_counter& operator =(const prime_counter&) = default;
  prime_counter& operator =(prime_counter&&) = default;

  void operator()(uint64_t nsmall_segs,
                  const std::vector<uint64_t>& primes,
                  std::vector<std::array<uint64_t, 2> > poffs)
  {
    uint64_t res = 0;
    // no new prime added portion
    uint64_t seg_start = buf_size*wheel_width*seg_a;
    uint64_t seg_min = seg_len*seg_a+5;

    if(seg_a > nsmall_segs)
    {
      uint64_t max_j = buf_size*wheel_width*nsmall_segs+(seg_a-nsmall_segs)*(buf_len<<dtype_width);
      for(size_t k = 0; k < wheel_width; ++k)
      {
        for(uint64_t i = 0; i < poffs.size() && max_j >= (2*poffs[i][k]+(k==0)); ++i)
        {
          // adjust poffs
          // TODO: might be a more efficient way
          auto w = (max_j-(2*poffs[i][k]+(k==0)));
          poffs[i][k] += primes[i]*(w/(2*primes[i]));
          if(w % (2*primes[i]) != 0)
          {
            poffs[i][k]+=primes[i];// += primes[i]*(w/(2*primes[i])+1);
          }
          /*else
          {

          }*/
        }
      }
    }

    for(uint64_t seg = seg_a; seg < seg_b; ++seg)
    {
      std::fill(buf.begin(), buf.end(), 0);
      const uint64_t limit_i = std::min<uint64_t>((((seg_len+seg_min) >= limit_i_max) ?
                                                   std::numeric_limits<uint32_t>::max() :
                                                   ceil(sqrt(seg_len+seg_min))),
                                                  store_max);
      uint64_t max_j = std::min(seg_start+(buf_len<<dtype_width), nj);
      for(uint64_t i = 0; i < primes.size() && primes[i] <= limit_i; ++i)
      {
        mark_composite(buf, primes[i], poffs[i], seg_start, max_j);
      }
      // sieve
      uint64_t val;
      const uint64_t stop = std::min(seg_min+seg_len, n);
      for(uint64_t i = ffsll(~(buf[0]))-((~buf[0]) != 0)+64*((~buf[0]) == 0);
          (val = 6ULL*(i>>1)+seg_min+2ULL*(i&1ULL)) < stop;)
      {
        if(!(buf[i>>dtype_width] & (1ULL << (i & dtype_mask))))
        {
          ++res;
          ++i;
        }
        else
        {
          uint64_t mask = buf[i>>dtype_width]>>(i&dtype_mask);
          const int64_t inc = ffsll(~mask)-((~mask) != 0)+64*((~mask) == 0);
          i += inc;
        }
      }
      seg_min += seg_len;
      seg_start += buf_size*wheel_width;
    }
    store_res = res;
  }
};

uint64_t num_primes(uint64_t n)
{
  uint64_t res = (n >= 2) + (n >= 3);
  if(n >= 5)
  {
    buf_type buf(buf_len);
    // compute and store primes < sqrt(n)
    const uint64_t store_max = ceil(sqrt(n));

    // only primes >= 5
    std::vector<uint64_t> primes;
    std::vector<std::array<uint64_t, 2> > poffs;
    primes.reserve(ceil(1.25506*store_max/log(store_max)));
    poffs.reserve(ceil(1.25506*store_max/log(store_max)));
    uint64_t seg_start = 0;
    uint64_t seg_min = 5;
    const uint64_t num_segs = 1+(n-seg_min)/seg_len;
    const uint64_t nj = (n-seg_min)/3+1;
    // compute how many small segments there are
    const uint64_t nsmall_segs = 1+(store_max-seg_min)/seg_len;
    for(uint64_t seg = 0; seg < nsmall_segs; ++seg)
    {
      std::fill(buf.begin(), buf.end(), 0);
      // mark off small primes
      const uint64_t limit_i = std::min<uint64_t>((((seg_len+seg_min) >= limit_i_max) ?
                                                   std::numeric_limits<uint32_t>::max() :
                                                   ceil(sqrt(seg_len+seg_min))),
                                                  store_max);
      uint64_t max_j = std::min(seg_start+(buf_len<<dtype_width), nj);
      for(uint64_t i = 0; i < primes.size() && primes[i] <= limit_i; ++i)
      {
        mark_composite(buf, primes[i], poffs[i], seg_start, max_j);
      }
      // sieve
      uint64_t val;
      const uint64_t stop = std::min(seg_min+seg_len, n);
      for(uint64_t i = ffsll(~(buf[0]))-((~buf[0]) != 0)+64*((~buf[0]) == 0);
            (val = 6ULL*(i>>1)+seg_min+2ULL*(i&1ULL)) < stop;)
      {
        if(!(buf[i>>dtype_width] & (1ULL << (i & dtype_mask))))
        {
          if(val <= store_max)
          {
            // add prime and poffs
            primes.push_back(val);
            poffs.emplace_back();
            poffs.back()[0] = (val*val-1)/6-1;
            if(i&1)
            {
              // 6n+1 prime
              poffs.back()[1] = (val*val+4*val-5)/6;
            }
            else
            {
              // 6n+5 prime
              poffs.back()[1] = (val*val+2*val-5)/6;
            }
            // mark-off multiples
            mark_composite(buf, val, poffs.back(), seg_start, max_j);
          }
          ++res;
          ++i;
        }
        else
        {
          uint64_t mask = buf[i>>dtype_width]>>(i&dtype_mask);
          const int64_t inc = ffsll(~mask)-((~mask) != 0)+64*((~mask) == 0);
          i += inc;
        }
      }
      seg_min += seg_len;
      seg_start += buf_size*wheel_width;
    }
    // multi-threaded sieving for remaining segments
    std::vector<std::thread> workers;
    auto num_workers = std::min<uint64_t>(num_segs-nsmall_segs, std::thread::hardware_concurrency());
    std::vector<uint64_t> store_reses(num_workers);

    workers.reserve(num_workers);
    auto num_segs_pw = ceil((num_segs-nsmall_segs)/static_cast<double>(num_workers));
    for(size_t i = 0; i < num_workers; ++i)
    {
      workers.emplace_back(prime_counter(n, nsmall_segs+i*num_segs_pw,
                                         std::min<uint64_t>(nsmall_segs+(i+1)*num_segs_pw,
                                                            num_segs),
                                         nj, store_max, store_reses[i]),
                           nsmall_segs, primes, poffs);
    }
    for(size_t i = 0; i < num_workers; ++i)
    {
      workers[i].join();
      res += store_reses[i];
    }
  }
  return res;
}

int main(int argc, char** argv)
{
  if(argc <= 1)
  {
    std::cout << "usage: " << argv[0] << " n\n";
    return -1;
  }
  std::cout << num_primes(std::stoll(argv[1])) << '\n';
}

Uses a segmented sieve of Eratosthenes with a wheel factorization of 6 to skip all multiples of 2/3. Makes use of the POSIX ffsll to skip consecutive composite values.

To compile:

g++ -std=c++11 -o sieve_mt -O3 -march=native -pthread sieve_mt.cpp

unofficial timings

Timed with an Intel i5-6600k on Ubuntu 15.10, the 1907000000 case took 0.817s.

Official times

To get more precise times, I timed this 100 times, then divided the time by 100.

real    4m7.215s
user    23m54.086s
sys 0m1.239s

helloworld922

Posted 2016-02-26T19:08:02.277

Reputation: 2 503

Since this and @Dennis 's python answer are so close, I may retime them for more accurate results. – Liam – 2016-02-29T16:56:47.380

Wow wow wow. This makes even less sense to me than CJam or Pyth. I'll call it the bit-shift monster! +1 – Tamoghna Chowdhury – 2016-03-03T10:17:26.530

As an aside, could you try CUDA/OpenCL for a GPU speedup? If I knew more C, I might have. – Tamoghna Chowdhury – 2016-03-03T10:18:43.867

Yeah, I guess I was a bit excessive with bitshifting/masking :P I don't know if GPGPU would be helpful here or not; the only area I can see it possibly helping is in pre-sieving small primes, and even then the data transfer speeds might be enough to kill that. What still bums me is that I'm still off by a factor of 10 or so from the fastest sieve implementation I've ever seen

– helloworld922 – 2016-03-03T11:00:06.420

C, 2m42.7254s (Feb 28 2016)

Save as pi.c, compile as gcc -o pi pi.c, run as ./pi <arg>:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

unsigned char p[2000000001];

int main(int argc, char **argv)
{
        unsigned int n, c, i, j;

        n = atoi(argv[1]);
        memset(p, 1, n + 1);

        p[1] = p[0] = 0;

        for (i = 2, c = 0; i <= n; i++)
        {
                if (p[i])
                {
                        c++;
                        for (j = i + i; j <= n; j += i)
                                p[j] = 0;
                }
        }

        printf("%d: %d\n", n, c);

        return 0;
}

Needs a lot of memory to run! If your hardware can't spare up to two gigabytes of real memory, then the program will either crash or run very slowly because of VMM and HD thrashing.

Approximate timing on my hardware is 1.239×10^-8·n^1.065 s. For example, an input of n = 2×10⁹ takes about 100 s to run.

Official times

real    2m42.657s
user    2m42.065s
sys 0m0.757s

real    2m42.947s
user    2m42.400s
sys 0m0.708s

real    2m42.827s
user    2m42.282s
sys 0m0.703s

real    2m42.800s
user    2m42.300s
sys 0m0.665s

real    2m42.562s
user    2m42.050s
sys 0m0.675s

real    2m42.788s
user    2m42.192s
sys 0m0.756s

real    2m42.631s
user    2m42.074s
sys 0m0.720s

real    2m42.658s
user    2m42.115s
sys 0m0.707s

real    2m42.710s
user    2m42.219s
sys 0m0.657s

real    2m42.674s
user    2m42.110s
sys 0m0.730s

user15259

Posted 2016-02-26T19:08:02.277

Reputation:

This works using the sieve of eratosthenes? I'll time it when I get home – Liam – 2016-02-26T21:48:29.657

I'm segfaulting on the first case (others run fine). It happens after ~1 minute of runtime. I added if (p==NULL) {exit(1);} line to the code, so I don't believe the malloc is failing (also it would fail at the beginning, not 1 minute in). Ideas on what is happening? – Liam – 2016-02-27T01:39:23.243

Many systems, including Linux, do optimistic allocation. E.g. if you ask for 1 Gb it will "give" it to you, but when you actually go to use it, and if the system can't find it, it will crash. If this were the case, it would likely be crashing at the memset. The minute it is taking is the time spending trying to coalesce the heap into a contiguous block. Also check on your system is sizeof(bool) == 1. If it's == 4, then I can rewrite this to use char. – None – 2016-02-27T01:47:49.833

I already checked. Bool is 1 byte. Is it possible to just ask form 2*10^9 bytes of memory in the stack? I.e declare a global variable which (on gcc) I believe will be initiated to 0. This would require using char instead though I think. – Liam – 2016-02-27T01:51:17.383

Rewritten to use global array. Good luck! – None – 2016-02-27T09:25:31.283

I think my computer just straight up doesn't have enough RAM to run this as is. Please post the time from your own computer. – Liam – 2016-02-27T18:53:56.097

This segfaults because i + i does not fit in an int for large values of i. – Dennis – 2016-02-27T18:55:37.120

Hooray for @Dennis. I changed the int declaration line to long and now it works. I'll time it shortly. – Liam – 2016-02-28T19:28:33.630

out of curiosity though, why wasn't that segfaulting on his computer too? – Liam – 2016-02-28T19:29:00.617

1@Liam Hard to say. Signed integer overflow is undefined behavior, so without looking at the generated assembly, it's difficult to predict what the compiler did. – Dennis – 2016-02-28T20:31:22.077

Is i + i really faster than i * 2? – Cyoce – 2016-02-29T18:15:08.257

@Liam - default on mine was 64-bit integers - have changed source to work with 32-bit, might run faster, might not – None – 2016-02-29T18:35:35.860

@Cyoce - it might be faster, don't know - i wrote for clarity - these days most compilers will treat i + i ⇔ i * 2 – None – 2016-02-29T18:39:09.413

You could also start i to be an odd number (such as 19) and then increment by 2, since any even number is divisible by 2. – DDPWNAGE – 2016-03-13T05:23:44.243

Python 3

import sys

sys.setrecursionlimit(sys.maxsize)

n = int(sys.argv[-1])

if n < 4:
    print(0 if n < 2 else n-1)
    exit()

p = [0, 0] + [True] * n

i = 0
while i < pow(n, 0.5):
    if p[i]:
        j = pow(i, 2)
        while j < n:
            p[j] = False
            j += i
    i += 1

print(sum(p) - 2)

Uses the Sieve of Eratosthenes. Runs at an average of 8.775s where n = 10^7. To time, I used the builtin time command. For example:

$ time python3 test.py 90
24

real    0m0.045s
user    0m0.031s
 sys    0m0.010s

Zach Gates

Posted 2016-02-26T19:08:02.277

Reputation: 6 152

It's the sieve! I couldn't use this in Java because it didn't like how much memory a boolean array used. D: – Addison Crump – 2016-02-26T23:22:06.963

memory error on the bigger cases. – Liam – 2016-02-27T01:57:22.913

Which cases? I believe I've fixed it. @Liam – Zach Gates – 2016-02-27T04:43:54.363

2@VoteToClose Then don't use a Boolean array. Use an integer array and bit shifting/masking, with each bit representing a Boolean value. – mbomb007 – 2016-02-29T17:35:33.753

AttributeError: 'module' object has no attribute 'maxint' – Dennis – 2016-03-10T20:00:53.010

Good catch! sys.maxsize in Python 3. Fixed it. @Dennis – Zach Gates – 2016-03-10T20:18:20.633

OverflowError: signed integer is greater than maximum – Dennis – 2016-03-10T20:26:01.203

I tried with sys.setrecursionlimit(2**31 - 1) (the highest possible value), but that's still a memory error. I don't think that's related to recursion anyway. A list of length 1,907,000,000 simply does not fit into my RAM (16 GiB). – Dennis – 2016-03-10T20:30:51.627

@CoolestVeto, Ofcourse boolean[] is useless in sieve of Eratosthenes coz of the practical 1 byte (aka 8 bits) for a boolean value rather than the theoritical 1 bit memory. To overcome this, you've to use java.util.BitSet – The Coder – 2016-03-24T16:22:04.210

Julia, 1m 21.1329s

I'd like to come up with something a little faster, but for now here's a rather naïve implementation of the Sieve of Eratosthenes.

function eratos(n::Int64)
    sieve = trues(n)
    sieve[1] = false
    for p = 2:isqrt(n)
        @inbounds sieve[p] || continue
        for i = 2:n÷p
            @inbounds sieve[p*i] = false
        end
    end
    return sum(sieve)
end

const x = parse(Int64, ARGS[1])

println(eratos(x))

Get the latest version of Julia for your system here. Make sure the Julia executable is in your path. Save the code as sieve.jl and run from the command line like julia sieve.jl N, where N is the input.

Official times

real    1m21.227s
user    1m20.755s
sys 0m0.576s

real    1m20.944s
user    1m20.426s
sys 0m0.640s

real    1m21.052s
user    1m20.581s
sys 0m0.573s

real    1m21.328s
user    1m20.862s
sys 0m0.570s

real    1m21.253s
user    1m20.780s
sys 0m0.588s

real    1m20.925s
user    1m20.460s
sys 0m0.576s

real    1m21.011s
user    1m20.512s
sys 0m0.601s

real    1m21.011s
user    1m20.550s
sys 0m0.564s

real    1m20.875s
user    1m20.409s
sys 0m0.569s

real    1m21.703s
user    1m21.088s
sys 0m0.701s

Alex A.

Posted 2016-02-26T19:08:02.277

Reputation: 23 761

1I implemented the Sieve of Atkin and my implementation for that is slower. >:U – Alex A. – 2016-02-29T03:25:27.573

@Liam Whoa. I wonder why the official times are wayyyyy longer than my unofficial ones. The official times are pretty awful. – Alex A. – 2016-02-29T17:49:54.773

Well the official times are for all the score cases together. The unofficial ones go number by number. Also, my computer probably isn't as fast as yours. – Liam – 2016-02-29T17:50:58.530

@Liam Oh, that makes more sense. Dang, I thought this was decent. Oh well, back to the drawing board. – Alex A. – 2016-02-29T17:52:12.937

I'm about to steal Dennis' algorithm... just so I can understand how fast it is. – Liam – 2016-02-29T17:53:38.060

@Liam For which language? I started a port to Julia but only got halfway through before forgetting what I was doing. :| – Alex A. – 2016-02-29T20:29:46.890

I kinda want to see what will happen with rust – Liam – 2016-02-29T20:30:55.363

Java, 42.663122s* (Mar 3 2016)

*this was timed internally by the program (on the OP's computer though)

public class PrimeCounter
{
public static final String START_CODE="=",
TEST_FORMAT="Input = %d , Output = %d , calculated in %f seconds%n",
PROMPT="Enter numbers to compute pi(x) for (Type \""+START_CODE+"\" to start):%n",
WAIT="Calculating, please wait...%n",
WARNING="Probably won't work with values close to or more than 2^31%n",
TOTAL_OUTPUT_FORMAT="Total time for all inputs is %f seconds%n";
public static final int NUM_THREADS=16,LOW_LIM=1,HIGH_LIM=1<<28;
private static final Object LOCK=new Lock();
private static final class Lock{}
/**
 * Generates and counts primes using an optimized but naive iterative algorithm.
 * Uses MultiThreading for arguments above LOW_LIM
 * @param MAX : argument x for pi(x), the limit to which to generate numbers.
 */
public static long primeCount(long MAX){
    long ctr=1;
    if(MAX<1<<7){
        for(long i=3;i<=MAX;i+=2){
            if(isPrime(i))++ctr;
        }
    }else{
        long[] counts=new long[NUM_THREADS];
        for(int i=0;i<NUM_THREADS;++i){
            counts[i]=-1;
        }
        long range=Math.round((double)MAX/NUM_THREADS);
        for(int i=0;i<NUM_THREADS;++i){
            long start=(i==0)?3:i*range+1,end=(i==NUM_THREADS-1)?MAX:(i+1)*range;
            final int idx=i;
            new Thread(new Runnable(){
                    public void run(){
                        for(long j=start;j<=end;j+=2){
                            if(isPrime(j))++counts[idx];
                        }
                    }
                }).start();
        }
        synchronized(LOCK){
            while(!completed(counts)){
                try{
                    LOCK.wait(300);}catch(InterruptedException ie){}
            }
            LOCK.notifyAll();
        }
        for(long count:counts){
            ctr+=count;
        }
        ctr+=NUM_THREADS;
    }
    return ctr;
}

/**
 * Checks for completion of threads
 * @param array : The array containing the completion data
 */
private static boolean completed(long[] array){
    for(long i:array){
        if(i<0)return false;
    }return true;
}

/**
 * Checks if the parameter is prime or not.
 * 2,3,5,7 are hardcoded as factors.
 * @param n : the number to check for primality
 */
private static boolean isPrime(long n){
    if(n==2||n==3||n==5||n==7)return true;
    else if(n%2==0||n%3==0||n%5==0||n%7==0)return false;
    else{
        for(long i=11;i<n;i+=2){
            if(n%i==0)return false;
        }
        return true;
    }
}

/**
 * Calculates primes using the atandard Sieve of Eratosthenes.
 * Uses 2,3,5,7 wheel factorization for elimination (hardcoded for performance reasons)
 * @param MAX : argument x for pi(x)
 * Will delegate to <code>primeCount(long)</code> for MAX<LOW_LIM and to <code>bitPrimeSieve(long)</code>
 * for MAX>HIGH_LIM, for performance reasons.
 */
public static long primeSieve(long MAX){
    if(MAX<=1)return 0;
    else if(LOW_LIM>0&&MAX<LOW_LIM){return primeCount(MAX);}
    else if(HIGH_LIM>0&&MAX>HIGH_LIM){return bitPrimeSieve(MAX);}
    int n=(int)MAX;
    int sn=(int)Math.sqrt(n),ctr=2;
    if(sn%2==0)--sn;
    boolean[]ps=new boolean[n+1];
    for(int i=2;i<=n;++i){
        if(i==2||i==3||i==5||i==7)ps[i]=true;
        else if(i%2!=0&&i%3!=0&&i%5!=0&&i%7!=0)ps[i]=true;
        else ++ctr;
    }
    for(int i=(n>10)?11:3;i<=sn;i+=2){
        if(ps[i]){
            for(int j=i*i;j<=n;j+=i){
                if(ps[j]){ ps[j]=false;++ctr;}
            }
        }
    }
    return (n+1-ctr);
}
/**
 * Calculates primes using bitmasked Sieve of Eratosthenes.
 * @param MAX : argument x for pi(x)
 */
public static long bitPrimeSieve(long MAX) {
    long SQRT_MAX = (long) Math.sqrt(MAX);
    if(SQRT_MAX%2==0)--SQRT_MAX;
    int MEMORY_SIZE = (int) ((MAX+1) >> 4);
    byte[] array = new byte[MEMORY_SIZE];
    for (long i = 3; i <= SQRT_MAX; i += 2) {
        if ((array[(int) (i >> 4)] & (byte) (1 << ((i >> 1) & 7))) == 0) {
            for(long j=i*i;j<=MAX;j+=i<<1) {
                if((array[(int) (j >> 4)] & (byte) (1 << ((j >> 1) & 7))) == 0){
                    array[(int) (j >> 4)] |= (byte) (1 << ((j >> 1) & 7));
                }
            }
        }
    }
    long pi = 1;
    for (long i = 3; i <= MAX; i += 2) {
        if ((array[(int) (i >> 4)] & (byte) (1 << ((i >> 1) & 7))) == 0) {
            ++pi;
        }
    }
    return pi;
}
/**
 * Private testing and timer function
 * @param MAX : input to be passed on to <code>primeSieve(long)</code>
 */
private static long sieveTest(long MAX){
    long start=System.nanoTime();
    long ps=primeSieve(MAX);
    long end=System.nanoTime();
    System.out.format(TEST_FORMAT,MAX,ps,((end-start)/1E9));
    return end-start;
}
/**
 * Main method: accepts user input and shows total execution time taken
 * @param args : The command-line arguments
 */
public static void main(String[]args){
    double total_time=0;
    java.util.Scanner sc=new java.util.Scanner(System.in);
    java.util.ArrayList<Long> numbers=new java.util.ArrayList<>();
    System.out.format(PROMPT+WARNING);
    String line=sc.nextLine();
    while(!line.equals(START_CODE)/*sc.hasNextLine()&&Character.isDigit(line.charAt(0))*/){
        numbers.add(Long.valueOf(line));
        line=sc.nextLine();
    }
    System.out.format(WAIT);
    for(long num:numbers){
        total_time+=sieveTest(num);
    }
    System.out.format(TOTAL_OUTPUT_FORMAT,total_time/1e9);
}
}

Follows in the great PPCG tradition of self-documenting code (although not in the literal sense :p).

This is to prove the point that Java can be fast enough to be competitive with other VM languages when using similar algorithms.

Run information

Run it as you would have @CoolestVeto's answer, but mine doesn't need command-line arguments, it can get them from STDIN.

Tweak the NUM_THREADS constant to set it to 2x your native core count for maximum performance (as I observed - In my case I have 8 virtual cores so it is set to 16, the OP might want 12 for his hexa-core processor).

When I ran these tests I used JDK 1.7.0.45 with BlueJ 3.1.6 (IntelliJ was updating) on Windows 10 Enterpise x64 on a ASUS K55VM laptop (Core i7 3610QM, 8GB RAM). Google Chrome 49.0 64-bit with 1 tab (PPCG) open and QBittorrent downloading 1 file were running in the background, 60% RAM usage at start of run.

Basically,

javac PrimeCounter.java
java PrimeCounter

The program will walk you through the rest.

Timing is done by Java's inbuilt System.nanoTime().

Algorithm details:

Has 3 variants for different use cases - a naive version like @CoolestVeto's (but multithreaded) for inputs below 2^15, and a bitmasked sieve of Eratosthenes with odd-elimination for inputs above 2^28, and a normal sieve of Eratosthenes with a 2/3/5/7 wheel factorization for pre-elimination of multiples.

I use the bitmasked sieve to avoid special JVM arguments for the largest test cases. If that can be done, the overhead for the calculation of the count in the bitmasked version can be eliminated.

Here's the output:

Enter numbers to compute pi(x) for (Type "=" to start):
Probably won't work with values close to or more than 2^31
41500
24850000
40550000
99820000
660000000
1240000000
1337000000
1907000000
=
Calculating, please wait...
Input = 41500 , Output = 4339 , calculated in 0.002712 seconds
Input = 24850000 , Output = 1557132 , calculated in 0.304792 seconds
Input = 40550000 , Output = 2465109 , calculated in 0.523999 seconds
Input = 99820000 , Output = 5751639 , calculated in 1.326542 seconds
Input = 660000000 , Output = 34286170 , calculated in 4.750049 seconds
Input = 1240000000 , Output = 62366021 , calculated in 9.160406 seconds
Input = 1337000000 , Output = 66990613 , calculated in 9.989093 seconds
Input = 1907000000 , Output = 93875448 , calculated in 14.832107 seconds
Total time for all inputs is 40.889700 seconds

Tamoghna Chowdhury

Posted 2016-02-26T19:08:02.277

Reputation: 373

Outputting only the result of pi(n) (with no prompts) may save some time, because STDOUT is... well, let's say it could just be a bit faster. – user48538 – 2016-03-02T11:39:57.697

@zyabin101 , if anybody had the patience to go through the code, he/she would understand that STDOUT latency has been accounted for. – Tamoghna Chowdhury – 2016-03-02T11:41:19.937

Also for timing, I've been sending stdout to /dev/null – Liam – 2016-03-02T23:13:08.463

@Liam I guess you'll have to make an exception in my case, then. You could tweak the main method for command line arguments, but the program is self-timing anyway. Check it out anyway. Please? – Tamoghna Chowdhury – 2016-03-03T03:51:11.563

Of course I will. I'll be doing it tomorrow. If I have trouble I'll ping you in chat – Liam – 2016-03-03T03:53:12.300

@Liam thanks. BTW, this is the only (partially) multithreaded answer. I expect you understand Java natively? – Tamoghna Chowdhury – 2016-03-03T03:55:19.623

@TamoghnaChowdhury Isn't nanoTime() seriously inaccurate? – Addison Crump – 2016-03-03T09:32:36.127

@CoolestVeto in my experience, if you are calculating a time difference, it's fine. I have timed many of my programs using nanoTime, and if not accurate, it's at least consistent. Also, currentTimeMillis() I tried, but it didn't give me the desired resolution for some of the faster cases. What do you suggest? – Tamoghna Chowdhury – 2016-03-03T10:19:29.337

@TamoghnaChowdhury You are restricted to the language builtins. :P As far as I can see, nanoTime(), though occasionally inaccurate, is the best solution you can have. – Addison Crump – 2016-03-03T20:15:47.700

C++, 9.3221s (29 Feb 2016)

#include <cstdint>
#include <vector>
#include <iostream>
#include <limits>
#include <cmath>
#include <array>
// uses posix ffsll
#include <string.h>
#include <algorithm>

constexpr uint64_t wheel_width = 2;
constexpr uint64_t buf_size = 1<<(10+6);
constexpr uint64_t dtype_width = 6;
constexpr uint64_t dtype_mask = 63;
constexpr uint64_t buf_len = ((buf_size*wheel_width)>>dtype_width);

typedef std::vector<uint64_t> buf_type;

void mark_composite(buf_type& buf, uint64_t prime,
                    std::array<uint64_t, 2>& poff,
                    uint64_t seg_start, uint64_t max_j)
{
  const auto p = 2*prime;
  for(uint64_t k = 0; k < wheel_width; ++k)
  {
    for(uint64_t j = 2*poff[k]+(k==0); j < max_j; j += p)
    {
      buf[(j-seg_start)>>dtype_width] |= 1ULL << (j & dtype_mask);
      poff[k] += prime;
    }
  }
}

uint64_t num_primes(uint64_t n)
{
  uint64_t res = (n >= 2) + (n >= 3);
  if(n >= 5)
  {
    buf_type buf(buf_len);
    // compute and store primes < sqrt(n)
    const uint64_t store_max = ceil(sqrt(n));

    // only primes >= 5
    std::vector<uint64_t> primes; // 5,7,11
    std::vector<std::array<uint64_t, 2> > poffs;// {{3,0},{0,5},{8,1}};
    primes.reserve(ceil(1.25506*store_max/log(store_max)));
    poffs.reserve(ceil(1.25506*store_max/log(store_max)));
    uint64_t seg_start = 0;
    uint64_t seg_min = 5;
    constexpr uint64_t seg_len = 6*buf_size;///wheel_width;
    constexpr uint64_t limit_i_max = 0xfffffffe00000001ULL;
    const uint64_t num_segs = 1+(n-seg_min)/seg_len;
    const uint64_t nj = (n-seg_min)/3+1;
    for(uint64_t seg = 0; seg < num_segs; ++seg)
    {
      std::fill(buf.begin(), buf.end(), 0);
      // mark off small primes
      const uint64_t limit_i = std::min<uint64_t>((((seg_len+seg_min) >= limit_i_max) ?
                                                   std::numeric_limits<uint32_t>::max() :
                                                   ceil(sqrt(seg_len+seg_min))),
                                                  store_max);
      uint64_t max_j = std::min(seg_start+(buf_len<<dtype_width), nj);
      for(uint64_t i = 0; i < primes.size() && primes[i] <= limit_i; ++i)
      {
        mark_composite(buf, primes[i], poffs[i], seg_start, max_j);
      }
      // sieve
      uint64_t val;
      const uint64_t stop = std::min(seg_min+seg_len, n);
      for(uint64_t i = ffsll(~(buf[0]))-((~buf[0]) != 0)+64*((~buf[0]) == 0);
            (val = 6ULL*(i>>1)+seg_min+2ULL*(i&1ULL)) < stop;)
      {
        if(!(buf[i>>dtype_width] & (1ULL << (i & dtype_mask))))
        {
          if(val <= store_max)
          {
            // add prime and poffs
            primes.push_back(val);
            poffs.emplace_back();
            poffs.back()[0] = (val*val-1)/6-1;
            if(i&1)
            {
              // 6n+1 prime
              poffs.back()[1] = (val*val+4*val-5)/6;
            }
            else
            {
              // 6n+5 prime
              poffs.back()[1] = (val*val+2*val-5)/6;
            }
            // mark-off multiples
            mark_composite(buf, val, poffs.back(), seg_start, max_j);
          }
          ++res;
          ++i;
        }
        else
        {
          uint64_t mask = buf[i>>dtype_width]>>(i&dtype_mask);
          const int64_t inc = ffsll(~mask)-((~mask) != 0)+64*((~mask) == 0);
          i += inc;
        }
      }
      seg_min += seg_len;
      seg_start += buf_size*wheel_width;
    }
  }
  return res;
}

int main(int argc, char** argv)
{
  if(argc <= 1)
  {
    std::cout << "usage: " << argv[0] << " n\n";
    return -1;
  }
  std::cout << num_primes(std::stoll(argv[1])) << '\n';
}

Uses a segmented sieve of Eratosthenes with a wheel factorization of 6 to skip all multiples of 2/3. Makes use of the POSIX ffsll to skip consecutive composite values.

Could potentially be sped up by making the segmented sieve work in parallel.

To compile:

g++ -std=c++11 -o sieve -O3 -march=native sieve.cpp

unofficial timings

Timed with an Intel i5-6600k on Ubuntu 15.10, the 1907000000 case took 2.363s.

41500
4339

real    0m0.001s
user    0m0.000s
sys     0m0.000s

24850000
1557132

real    0m0.036s
user    0m0.032s
sys     0m0.000s

40550000
2465109

real    0m0.056s
user    0m0.052s
sys     0m0.000s

99820000
5751639

real    0m0.149s
user    0m0.144s
sys     0m0.000s

660000000
34286170

real    0m0.795s
user    0m0.788s
sys     0m0.000s

1240000000
62366021

real    0m1.468s
user    0m1.464s
sys     0m0.000s

1337000000
66990613

real    0m1.583s
user    0m1.576s
sys     0m0.004s

1907000000
93875448

real    0m2.363s
user    0m2.356s
sys     0m0.000s

Official Times

real    0m9.415s
user    0m9.414s
sys 0m0.014s

real    0m9.315s
user    0m9.315s
sys 0m0.013s

real    0m9.307s
user    0m9.309s
sys 0m0.012s

real    0m9.333s
user    0m9.330s
sys 0m0.017s

real    0m9.288s
user    0m9.289s
sys 0m0.012s

real    0m9.319s
user    0m9.318s
sys 0m0.015s

real    0m9.285s
user    0m9.284s
sys 0m0.015s

real    0m9.342s
user    0m9.342s
sys 0m0.014s

real    0m9.305s
user    0m9.305s
sys 0m0.014s

real    0m9.312s
user    0m9.313s
sys 0m0.012s

helloworld922

Posted 2016-02-26T19:08:02.277

Reputation: 2 503

Calculate the number of primes up to n

Answers

C, 0.026119s (Mar 12 2016)

Timings

Official times

How it works

Formula

Algorithm

Implementation

C99/C++, 8.9208s (28 Feb 2016)

Java, 25,725.315 seconds on this machine

Python 2 (PyPy 4.0), 2.36961s (Feb 29 2016)

Timings

Rust, 0.37001 sec (12 June 2016)

Node.js (JavaScript/ES6), 83.549s (11 Nov 2016)

C++11, 22.6503s (28 Feb 2016)

C++, 2.47215s (29 Feb 2016)

unofficial timings

C, 2m42.7254s (Feb 28 2016)

Python 3

Julia, 1m 21.1329s

Java, 42.663122s* (Mar 3 2016)

Run information

Algorithm details:

C++, 9.3221s (29 Feb 2016)

unofficial timings