Recently I was playing a little with prime numbers and I needed a prime number generator, so I wrote a Sieve of Eratosthenes in Java. Since I wanted all primes up to 10^9, I became a little impatient with the time it took to generate them, so I started to optimize the implementation incrementally. Along the way I also found a fast version which uses parallelization. This is the story.
Sieve of Eratosthenes
The Sieve of Eratosthenes is an ancient algorithm which finds all prime numbers up to a given limit. It is an iterative algorithm: in each iteration it finds the next prime number in increasing order and eliminates all the numbers which are divisible by it.
This is a well-known algorithm, and I expect that most people interested in computer science or math have played with it and admired its simplicity and elegance. Here is a direct, straightforward version, as it is taught in schools.
int[] sieveClassic(int n) {
    int len = 0;
    int[] numbers = new int[n + 1];
    // initialize with consecutive numbers; 0 will later mean "not prime"
    for (int i = 0; i < n; i++) {
        numbers[i] = i;
    }
    for (int p = 2; p < n; p++) {
        if (numbers[p] == 0) {
            // already marked as composite
            continue;
        }
        // p is prime: compact it into the front of the same array
        numbers[len++] = p;
        // mark all multiples of p as composite
        for (int j = p; j < n; j += p) {
            numbers[j] = 0;
        }
    }
    return Arrays.copyOf(numbers, len);
}
There is nothing special about the implementation. It initializes an array of consecutive numbers and iterates through them starting with 2. When a number is found to be prime (its value is different from \(0\)), the number is moved to the first part of the array, and all numbers divisible by it are marked as non-prime, i.e. their value is set to \(0\). Reusing the same array is safe because the write position always trails behind the read position: the number of primes found so far is always smaller than the current candidate.
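As a quick sanity check, here is a minimal usage sketch (it assumes the method lives in an enclosing class, with java.util.Arrays imported):

// prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
int[] primes = sieveClassic(30);
System.out.println(java.util.Arrays.toString(primes));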
The performance for generating all prime numbers less than 10^9 is around 40 seconds. That is a lot of time, and one can only imagine the eternity awaiting anyone who wants more than that. An immediate natural question is whether we can do something to improve the situation.
First improvement: eliminating even numbers
There are two simple observations which we can make from the beginning. The first one is that, other than \(2\), no even number is prime. Half of the numbers are even, so plenty of computation is spent on them, and we can avoid it with ease.
A second observation is the following. Suppose we are at an iteration and we found a prime number \(p\). We want to mark as non-prime all numbers divisible by \(p\). Those numbers are \(\{2p, 3p, 4p, \dots, p^2, p(p+1), \dots\}\). The numbers up to \(p^2\) are already eliminated: in each of them, \(p\) is multiplied by a number smaller than \(p\), so each has a prime factor smaller than \(p\) and was already marked as non-prime by an earlier iteration. Therefore we can start the inner loop from \(p^2\). Another small trick to shave off even more computation is to step through the inner loop by \(2p\): since \(p^2\) is odd, adding \(p\) gives an even number, which is not what we search for anyway.
Also, because we know we can start from \(p^2\), all primes greater than the square root of \(n\) will not mark anything. Using that limit we can skip the inner loop entirely for such primes.
int[] sieveSkip2(int n) {
    int[] numbers = new int[n + 1];
    // only odd positions are candidates
    for (int i = 1; i < n; i += 2) {
        numbers[i] = i;
    }
    numbers[0] = 2;
    int len = 1;
    int lim = (int) Math.floor(Math.sqrt(n));
    for (int p = 3; p < n; p += 2) {
        if (numbers[p] == 0) {
            continue;
        }
        numbers[len++] = p;
        if (p <= lim) {  // <= so that lim * lim is not missed when lim itself is prime
            int step = 2 * p;  // skip the even multiples
            for (int j = p * p; j < n; j += step) {
                numbers[j] = 0;
            }
        }
    }
    return Arrays.copyOf(numbers, len);
}
Those small adjustments brought the total running time for all primes smaller than 10^9 down to about 12 seconds. This is good progress, but other tricks can be applied.
Second improvement: working with booleans
We are working with an array of integers because we have to return an array of prime numbers anyway, and we reuse the array for that purpose. However, large jumps in memory can hurt the computation. At the price of an additional boolean array, we can speed things up through better memory caching, which results in fewer cache misses.
The next improvement is also straightforward: we simply keep the markings in the boolean array. Remember that a boolean element takes the same amount of memory as a byte. We could work with bytes as well, but I preferred booleans because they are much cleaner in conditional statements. When a number is marked as non-prime, its flag (the value at its index) is set to true.
int[] sieveSkip2Flags(int n) {
    int[] numbers = new int[n + 1];
    boolean[] flags = new boolean[n + 1];  // flags[i] == true means i is composite
    numbers[0] = 2;
    int len = 1;
    int lim = (int) Math.floor(Math.sqrt(n));
    for (int p = 3; p < n; p += 2) {
        if (flags[p]) {
            continue;
        }
        numbers[len++] = p;
        if (p <= lim) {  // <= so that lim * lim is not missed when lim itself is prime
            int step = 2 * p;
            for (int j = p * p; j < n; j += step) {
                flags[j] = true;
            }
        }
    }
    return Arrays.copyOf(numbers, len);
}
This improvement, although small, reduced the computation time to about 6.6 seconds. It is a good lesson for anybody who ignores cache and memory characteristics in real life. And there are so many who do, unfortunately.
Going further: eliminating numbers divisible by 2 and 3
The next improvement is similar in spirit to skipping the even numbers, but it requires more complicated code. It is not hard, but we have to do some bookkeeping which can be confusing.
It is clear that eliminating from the computation all the candidates divisible by \(2\) or \(3\) will reduce the work further. The number of even numbers is \(n/2\). The number of candidates divisible by \(3\) is \(n/3\), but half of them are even, so they are already counted. Still, this gives a reduction of \(n/2 + n/6 = 2n/3\) candidate numbers, which is quite good. For \(n = 30\), for example, that removes the \(15\) even numbers plus the \(5\) odd multiples of \(3\), leaving only \(10\) candidates. To make the computation even faster, we should also shrink the memory accordingly.
Let's look at the numbers after marking with angle brackets those which survive sieving by \(2\) and \(3\) (the prime candidates):
0 1 2 3 4 <5> 6 <7> 8 9 10 <11> 12 <13> 14 15 16 <17> 18 <19> 20 21 22 <23> 24 <25> 26 27 28 <29> ...
What you can notice is a pattern of length \(6\): within each group of \(6\) consecutive numbers, the positions of the candidates are fixed. This tells us that taking a value modulo \(6\) is enough to know whether that number is a prime candidate or not. This happens because \(p \bmod 6 \in \{1, 5\}\), or in other words \(p = 6k \pm 1\).
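A tiny throwaway loop makes the pattern easy to see (this is just an illustration, not part of the sieve):

// print the prime candidates below 50 using the mod 6 rule;
// 25 and 35 survive because only 2 and 3 have been sieved so far
for (int i = 5; i < 50; i++) {
    if (i % 6 == 1 || i % 6 == 5) {
        System.out.print(i + " ");
    }
}
// output: 5 7 11 13 17 19 23 25 29 31 35 37 41 43 47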
This suggests a new improvement. What if we store only the values which remain prime candidates after the removal of all numbers divisible by \(2\) or \(3\)? In other words, we keep only the two columns which contain proper prime candidates, and we waste no space or computation on the other numbers. For that purpose we add two helper functions to convert a position into a number and vice versa.
The code with this new improvement is listed below:
int[] sieveSkip23Flags(int n) {
    // about a third of the numbers are of the form 6k +/- 1;
    // the +2 guards small n, where the prime count slightly exceeds n / 3
    int size = n / 3 + 2;
    int[] numbers = new int[size];
    boolean[] flags = new boolean[size];
    numbers[0] = 2;
    numbers[1] = 3;
    int len = 2;
    int lim = (int) Math.floor(Math.sqrt(n));
    int pos = 0;
    while (true) {
        // advance to the next unflagged candidate
        while (pos < size && flags[pos]) {
            pos++;
        }
        if (pos == size) {
            break;
        }
        int p = posToPrime(pos);
        if (p >= n) {
            break;
        }
        numbers[len++] = p;
        if (p <= lim) {  // <= so that lim * lim is not missed when lim itself is prime
            int step = 2 * p;
            for (int j = p * p; j < n; j += step) {
                int rest = j % 6;
                if (rest == 1 || rest == 5) {  // only candidates have a stored flag
                    flags[primeToPos(j)] = true;
                }
            }
        }
        pos++;
    }
    return Arrays.copyOf(numbers, len);
}

private int posToPrime(int pos) {
    // even positions map to numbers of the form 6k + 5, odd positions to 6k + 7
    return ((pos & 1) == 0) ? 3 * pos + 5 : 3 * pos + 4;
}

private int primeToPos(int p) {
    // inverse of posToPrime; the comparison corrects the rounding of the division
    int pp = (p + 1) / 3 - 2;
    return (3 * pp + 5 < p) ? pp + 1 : pp;
}
As one can see, there is some magical arithmetic in the helper functions. You can trust the code, it works. It looks ugly because I tried to put it in a form which involves fewer operations. An even uglier but faster formula may exist, but I was not able to find it.
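If you do not want to take my word for it, a small brute-force check verifies that the two helpers are mutually inverse and agree with the mod 6 rule (a test sketch, not part of the final code):

// candidates 5, 7, 11, 13, ... must map to positions 0, 1, 2, 3, ... and back
int pos = 0;
for (int p = 5; p < 1_000_000; p++) {
    if (p % 6 == 1 || p % 6 == 5) {
        if (posToPrime(pos) != p || primeToPos(p) != pos) {
            throw new AssertionError("mismatch at " + p);
        }
        pos++;
    }
}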
The time reduction is interesting: it came down to a little over 4 seconds, a significant reduction from the previous variant.
At this point I was fed up with searching for better variants. I know we could go further and remove other candidates, but I expected the ugliness of the code to explode while the gains would not be significant. I spent a few dozen minutes trying for something better and was ready to give up.
Can we go in parallel?
Does the Sieve of Eratosthenes allow parallel computation? Honestly, I had not considered it a viable option, since I had not read about parallel methods. I am not saying that such versions do not exist, only that I did not know about them. (Later edit: after writing this I read some papers; there are parallel implementations, but not like the version I found. It looks like a new approach and I have to research the topic more.)
The idea of parallelism in the sieve algorithm is based on the following observation. Suppose we are iterating through the numbers and we found the prime number \(p\). After adding it to the list, we start to eliminate multiples of \(p\), beginning with \(p^2\). The point is that between \(p\) and \(p^2\) there are also other prime numbers. For example, between \(5\) and \(25=5^2\) there are also \(7, 11, 13, 17, 19, 23\). In fact, every unmarked number below \(p^2\) must be prime, since any composite below \(p^2\) has a prime factor smaller than \(p\) and was already marked. So instead of taking the single prime \(p\), we can collect all the primes greater than or equal to \(p\) and less than \(p^2\), add them all to the list of primes, and set their flags in parallel.
Using parallel computation usually raises concerns about concurrent updates. This is not a concern here, however. Reading the flags is done only before finding the next chunk of prime numbers, while updating the flags happens only inside the parallel loops. Additionally, all those updates do a single thing: they flip a value from false to true. If this happens concurrently, it really does not matter who wins, since the result is invariant to the order of writes. As such, the algorithm allows brutal concurrent processing with no thread-safety mechanisms.
Additionally, we optimize a little so that we do not launch new tasks when there is nothing to do (when the squared prime number is greater than \(n\)).
In the code below I used a preview of the structured concurrency constructs from Java 19. It can also be implemented in the classic way, using plain threads and an executor, but I like to take a look at the new features, and the code using StructuredTaskScope is much cleaner anyway.
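For readers who have not seen it, the minimal shape of the API is roughly the following (a sketch against the JDK 19 incubator API in jdk.incubator.concurrent, where fork returns a Future; expensiveWork is a hypothetical placeholder, and details may differ in later JDKs):

Object runBoth() throws InterruptedException {
    try (var scope = new StructuredTaskScope<Object>()) {
        var first = scope.fork(() -> expensiveWork(1));   // each task starts in its own thread
        var second = scope.fork(() -> expensiveWork(2));  // runs concurrently with the first
        scope.join();                                     // wait for both tasks to finish
        return first.resultNow();                         // results are available after join()
    }
}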
// Assumes: import java.util.*; import java.util.concurrent.Callable;
// import jdk.incubator.concurrent.StructuredTaskScope; (JDK 19 incubator module)
int[] sieveSkip32FlagsParallel(int n) {
    int size = n / 3 + 2;  // +2 guards small n, as in the sequential variant
    int[] numbers = new int[size];
    boolean[] flags = new boolean[size];
    numbers[0] = 2;
    numbers[1] = 3;
    int len = 2;
    int lim = (int) Math.floor(Math.sqrt(n));
    int pos = 0;
    try (StructuredTaskScope<Object> s = new StructuredTaskScope<>()) {
        while (true) {
            while (pos < size && flags[pos]) {
                pos++;
            }
            if (pos == size) {
                break;
            }
            // collect the whole chunk of primes in [p, p^2)
            NextPrimes ps = addNextPrimes(numbers, len, flags, pos, n, size);
            if (ps.step() == 0) {
                break;
            }
            List<Callable<Object>> tasks = new LinkedList<>();
            for (int i = len; i < ps.len(); i++) {
                int pp = numbers[i];
                if (pp <= lim) {  // primes above sqrt(n) have nothing left to mark
                    tasks.add(() -> {
                        int step = 2 * pp;
                        for (int j = pp * pp; j < n; j += step) {
                            int rest = j % 6;
                            if (rest == 1 || rest == 5) {
                                // idempotent write: concurrent threads can only agree
                                flags[primeToPos(j)] = true;
                            }
                        }
                        return null;
                    });
                }
            }
            len = ps.len();
            if (!tasks.isEmpty()) {
                for (var task : tasks) {
                    s.fork(task);
                }
                s.join();  // wait for the chunk; also makes the flag writes visible
            }
            pos += ps.step();
        }
        s.shutdown();
    } catch (InterruptedException ignored) {
    }
    return Arrays.copyOf(numbers, len);
}

record NextPrimes(int step, int len) {
}

private NextPrimes addNextPrimes(int[] numbers, int len, boolean[] flags, int pos, int n, int size) {
    int p = posToPrime(pos);
    // long arithmetic: p * p overflows int once p exceeds 46340
    int square = (int) Math.min((long) p * p, n);
    int i = pos;
    while (i < size) {
        if (flags[i]) {
            i++;
            continue;
        }
        int pp = posToPrime(i);
        if (pp >= square) {
            break;
        }
        // every unflagged candidate below square is guaranteed prime
        numbers[len++] = pp;
        i++;
    }
    return new NextPrimes(i - pos, len);
}

// posToPrime(int) and primeToPos(int) are the same helpers as in the previous section
There is a little more code than in the initial version, but it is still pretty clean. Honestly, I did not expect a large improvement, considering the amount of work which has to be done in parallel. Despite my expectations, the implementation performed very well, pushing the execution time down to 2.2 seconds for generating all prime numbers less than 10^9. It almost halved the previous time!
Benchmarks
To have a proper assessment of the performance I implemented some JMH (Java Microbenchmark Harness) tests and measured each variant. I used various prime number limits, starting with \(100\) and increasing by one order of magnitude up to 10^9. The average time for each variant is listed below.
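For reference, each benchmark followed the usual JMH pattern, roughly like the sketch below (a reconstruction; the class name Eratostenes and the parameter values match the results table, while the exact annotation setup is an assumption):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class Eratostenes {

    @Param({"100", "1000", "10000", "100000", "1000000",
            "10000000", "100000000", "1000000000"})
    private int n;

    @Benchmark
    public int[] testSieveClassic() {
        return sieveClassic(n);  // returning the result keeps dead code elimination away
    }

    // ... one such method per sieve variant
}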
Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
Eratostenes.testSieveClassic                       100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveClassic                      1000  avgt    3      0.004 ±     0.001  ms/op
Eratostenes.testSieveClassic                     10000  avgt    3      0.030 ±     0.110  ms/op
Eratostenes.testSieveClassic                    100000  avgt    3      0.321 ±     0.156  ms/op
Eratostenes.testSieveClassic                   1000000  avgt    3      5.886 ±     7.290  ms/op
Eratostenes.testSieveClassic                  10000000  avgt    3    164.316 ±    25.162  ms/op
Eratostenes.testSieveClassic                 100000000  avgt    3   1967.791 ±   673.718  ms/op
Eratostenes.testSieveClassic                1000000000  avgt    3  38775.991 ± 30531.627  ms/op
Eratostenes.testSieveSkip2                         100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveSkip2                        1000  avgt    3      0.002 ±     0.001  ms/op
Eratostenes.testSieveSkip2                       10000  avgt    3      0.016 ±     0.040  ms/op
Eratostenes.testSieveSkip2                      100000  avgt    3      0.207 ±     0.010  ms/op
Eratostenes.testSieveSkip2                     1000000  avgt    3      3.123 ±     3.989  ms/op
Eratostenes.testSieveSkip2                    10000000  avgt    3     81.881 ±   227.557  ms/op
Eratostenes.testSieveSkip2                   100000000  avgt    3    805.019 ±   176.453  ms/op
Eratostenes.testSieveSkip2                  1000000000  avgt    3  11274.829 ±   851.475  ms/op
Eratostenes.testSieveSkip2Flags                    100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveSkip2Flags                   1000  avgt    3      0.003 ±     0.001  ms/op
Eratostenes.testSieveSkip2Flags                  10000  avgt    3      0.020 ±     0.005  ms/op
Eratostenes.testSieveSkip2Flags                 100000  avgt    3      0.181 ±     0.055  ms/op
Eratostenes.testSieveSkip2Flags                1000000  avgt    3      1.825 ±     0.319  ms/op
Eratostenes.testSieveSkip2Flags               10000000  avgt    3     42.074 ±     6.276  ms/op
Eratostenes.testSieveSkip2Flags              100000000  avgt    3    558.695 ±   154.560  ms/op
Eratostenes.testSieveSkip2Flags             1000000000  avgt    3   6673.653 ±  3237.412  ms/op
Eratostenes.testSieveSkip23Flags                   100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveSkip23Flags                  1000  avgt    3      0.002 ±     0.001  ms/op
Eratostenes.testSieveSkip23Flags                 10000  avgt    3      0.016 ±     0.012  ms/op
Eratostenes.testSieveSkip23Flags                100000  avgt    3      0.196 ±     0.013  ms/op
Eratostenes.testSieveSkip23Flags               1000000  avgt    3      2.124 ±     0.211  ms/op
Eratostenes.testSieveSkip23Flags              10000000  avgt    3     21.587 ±     4.011  ms/op
Eratostenes.testSieveSkip23Flags             100000000  avgt    3    356.884 ±    35.051  ms/op
Eratostenes.testSieveSkip23Flags            1000000000  avgt    3   4114.460 ±  2876.299  ms/op
Eratostenes.testSieveSkip32FlagsParallel           100  avgt    3      0.032 ±     0.010  ms/op
Eratostenes.testSieveSkip32FlagsParallel          1000  avgt    3      0.074 ±     0.062  ms/op
Eratostenes.testSieveSkip32FlagsParallel         10000  avgt    3      0.138 ±     0.061  ms/op
Eratostenes.testSieveSkip32FlagsParallel        100000  avgt    3      0.754 ±     0.188  ms/op
Eratostenes.testSieveSkip32FlagsParallel       1000000  avgt    3      4.284 ±     1.251  ms/op
Eratostenes.testSieveSkip32FlagsParallel      10000000  avgt    3     26.336 ±    12.007  ms/op
Eratostenes.testSieveSkip32FlagsParallel     100000000  avgt    3    202.568 ±   103.288  ms/op
Eratostenes.testSieveSkip32FlagsParallel    1000000000  avgt    3   2258.194 ±   635.403  ms/op
Conclusion
Even if the Sieve of Eratosthenes is an ancient and well-known algorithm, implemented by most of us back in school, it still offers opportunities for improvement. I am not aware of this way of exploiting parallel computation for this algorithm; it might be a new idea or not. In any case, the approach is not hard or inaccessible: implementing all the variants, with testing, took me about 4 hours. It can easily be adapted to long integer values, and the same idea can readily be applied to a segmented version.
I hope you enjoyed this small journey around the Sieve of Eratosthenes.