Recently I was playing a little with prime numbers and I needed a prime number generator, so I wrote a Sieve of Eratosthenes in Java. Since I wanted all primes up to 10^9, I became a little impatient with the time it took to generate them, so I started to optimize the implementation incrementally. Along the way I also found a fast version which uses parallelization. This is the story.
Sieve of Eratosthenes
The Sieve of Eratosthenes is an ancient algorithm which finds all prime numbers up to a given limit. It is an iterative algorithm: in each iteration it finds the next prime number in increasing order and eliminates all the numbers which are divisible by it.
This is a well-known algorithm, and I expect that most people interested in computer science or math have played with it and admired its simplicity and elegance. Here is a direct, straightforward version, as it is taught in schools.
int[] sieveClassic(int n) {
    int len = 0;
    int[] numbers = new int[n + 1];
    // initialize with consecutive numbers; 0 will later mean "not prime"
    for (int i = 0; i < n; i++) {
        numbers[i] = i;
    }
    for (int p = 2; p < n; p++) {
        if (numbers[p] == 0) {
            // already marked as composite
            continue;
        }
        // p is prime: compact it into the front of the same array
        numbers[len++] = p;
        // mark all multiples of p as composite
        for (int j = p; j < n; j += p) {
            numbers[j] = 0;
        }
    }
    return Arrays.copyOf(numbers, len);
}
There is nothing special about the implementation. It initializes an array of consecutive numbers and iterates through them starting with 2. When a number is found to be prime (its value is different from \(0\)), the number is moved to the first part of the array, and all numbers divisible by it are marked as non-prime, i.e. their value is set to \(0\). Reusing the same array is safe because the write position always trails behind the read position: the number of primes found so far is always smaller than the current candidate.
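As a quick sanity check, here is a minimal usage sketch (it assumes the method lives in an enclosing class, with java.util.Arrays imported):

// prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
int[] primes = sieveClassic(30);
System.out.println(java.util.Arrays.toString(primes));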
The performance for generating all prime numbers less than 10^9 is around 40 seconds. That is a lot of time, and one can only imagine the eternity awaiting anyone who wants more than that. An immediate natural question is whether we can do something to improve the situation.
First improvement: eliminating even numbers
There are two simple observations which we can make from the beginning. The first one is that, other than \(2\), no even number is prime. Half of the numbers are even, so plenty of computation is spent on them, and we can avoid it with ease.
A second observation is the following. Suppose we are at an iteration and we found a prime number \(p\). We want to mark as non-prime all numbers divisible by \(p\). Those numbers are \(\{2p, 3p, 4p, \dots, p^2, p(p+1), \dots\}\). The numbers up to \(p^2\) are already eliminated: in each of them, \(p\) is multiplied by a number smaller than \(p\), so each has a prime factor smaller than \(p\) and was already marked as non-prime by an earlier iteration. Therefore we can start the inner loop from \(p^2\). Another small trick to shave off even more computation is to step through the inner loop by \(2p\): since \(p^2\) is odd, adding \(p\) gives an even number, which is not what we search for anyway.
Also, because we know we can start from \(p^2\), all primes greater than the square root of \(n\) will not mark anything. Using that limit we can skip the inner loop entirely for such primes.
int[] sieveSkip2(int n) {
    int[] numbers = new int[n + 1];
    // only odd positions are candidates
    for (int i = 1; i < n; i += 2) {
        numbers[i] = i;
    }
    numbers[0] = 2;
    int len = 1;
    int lim = (int) Math.floor(Math.sqrt(n));
    for (int p = 3; p < n; p += 2) {
        if (numbers[p] == 0) {
            continue;
        }
        numbers[len++] = p;
        if (p <= lim) {  // <= so that lim * lim is not missed when lim itself is prime
            int step = 2 * p;  // skip the even multiples
            for (int j = p * p; j < n; j += step) {
                numbers[j] = 0;
            }
        }
    }
    return Arrays.copyOf(numbers, len);
}
Those small adjustments brought the total running time for all primes smaller than 10^9 down to about 12 seconds. This is good progress, but other tricks can be applied.
Second improvement: working with booleans
We are working with an array of integers because we have to return an array of prime numbers anyway, and we reuse the array for that purpose. However, large jumps in memory can hurt the computation. At the price of an additional boolean array, we can speed things up through better memory caching, which results in fewer cache misses.
The next improvement is also straightforward: we simply keep the markings in the boolean array. Remember that a boolean element takes the same amount of memory as a byte. We could work with bytes as well, but I preferred booleans because they are much cleaner in conditional statements. When a number is marked as non-prime, its flag (the value at its index) is set to true.
int[] sieveSkip2Flags(int n) {
    int[] numbers = new int[n + 1];
    boolean[] flags = new boolean[n + 1];  // flags[i] == true means i is composite
    numbers[0] = 2;
    int len = 1;
    int lim = (int) Math.floor(Math.sqrt(n));
    for (int p = 3; p < n; p += 2) {
        if (flags[p]) {
            continue;
        }
        numbers[len++] = p;
        if (p <= lim) {  // <= so that lim * lim is not missed when lim itself is prime
            int step = 2 * p;
            for (int j = p * p; j < n; j += step) {
                flags[j] = true;
            }
        }
    }
    return Arrays.copyOf(numbers, len);
}
This improvement, although small, reduced the computation time to about 6.6 seconds. It is a good lesson for anybody who ignores cache and memory characteristics in real life. And there are so many who do, unfortunately.
Going further: eliminating numbers divisible by 2 and 3
The next improvement is similar in spirit to skipping the even numbers, but it requires more complicated code. It is not hard, but we have to do some bookkeeping which can be confusing.
It is clear that eliminating from the computation all the candidates divisible by \(2\) or \(3\) will reduce the work further. The number of even numbers is \(n/2\). The number of candidates divisible by \(3\) is \(n/3\), but half of them are even, so they are already counted. Still, this gives a reduction of \(n/2 + n/6 = 2n/3\) candidate numbers, which is quite good. For \(n = 30\), for example, that removes the \(15\) even numbers plus the \(5\) odd multiples of \(3\), leaving only \(10\) candidates. To make the computation even faster, we should also shrink the memory accordingly.
Let's look at the numbers after marking with angle brackets those which survive sieving by \(2\) and \(3\) (the prime candidates):
0 1 2 3 4 <5> 6 <7> 8 9 10 <11> 12 <13> 14 15 16 <17> 18 <19> 20 21 22 <23> 24 <25> 26 27 28 <29> ...
What you can notice is a pattern of length \(6\): within each group of \(6\) consecutive numbers, the positions of the candidates are fixed. This tells us that taking a value modulo \(6\) is enough to know whether that number is a prime candidate or not. This happens because \(p \bmod 6 \in \{1, 5\}\), or in other words \(p = 6k \pm 1\).
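A tiny throwaway loop makes the pattern easy to see (this is just an illustration, not part of the sieve):

// print the prime candidates below 50 using the mod 6 rule;
// 25 and 35 survive because only 2 and 3 have been sieved so far
for (int i = 5; i < 50; i++) {
    if (i % 6 == 1 || i % 6 == 5) {
        System.out.print(i + " ");
    }
}
// output: 5 7 11 13 17 19 23 25 29 31 35 37 41 43 47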
This suggests a new improvement. What if we store only the values which remain prime candidates after the removal of all numbers divisible by \(2\) or \(3\)? In other words, we keep only the two columns which contain proper prime candidates, and we waste no space or computation on the other numbers. For that purpose we add two helper functions to convert a position into a number and vice versa.
The code with this new improvement is listed below:
int[] sieveSkip23Flags(int n) {
    // about a third of the numbers are of the form 6k +/- 1;
    // the +2 guards small n, where the prime count slightly exceeds n / 3
    int size = n / 3 + 2;
    int[] numbers = new int[size];
    boolean[] flags = new boolean[size];
    numbers[0] = 2;
    numbers[1] = 3;
    int len = 2;
    int lim = (int) Math.floor(Math.sqrt(n));
    int pos = 0;
    while (true) {
        // advance to the next unflagged candidate
        while (pos < size && flags[pos]) {
            pos++;
        }
        if (pos == size) {
            break;
        }
        int p = posToPrime(pos);
        if (p >= n) {
            break;
        }
        numbers[len++] = p;
        if (p <= lim) {  // <= so that lim * lim is not missed when lim itself is prime
            int step = 2 * p;
            for (int j = p * p; j < n; j += step) {
                int rest = j % 6;
                if (rest == 1 || rest == 5) {  // only candidates have a stored flag
                    flags[primeToPos(j)] = true;
                }
            }
        }
        pos++;
    }
    return Arrays.copyOf(numbers, len);
}

private int posToPrime(int pos) {
    // even positions map to numbers of the form 6k + 5, odd positions to 6k + 7
    return ((pos & 1) == 0) ? 3 * pos + 5 : 3 * pos + 4;
}

private int primeToPos(int p) {
    // inverse of posToPrime; the comparison corrects the rounding of the division
    int pp = (p + 1) / 3 - 2;
    return (3 * pp + 5 < p) ? pp + 1 : pp;
}
As one can see, there is some magical arithmetic in the helper functions. You can trust the code, it works. It looks ugly because I tried to put it in a form which involves fewer operations. An even uglier but faster formula may exist, but I was not able to find it.
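If you do not want to take my word for it, a small brute-force check verifies that the two helpers are mutually inverse and agree with the mod 6 rule (a test sketch, not part of the final code):

// candidates 5, 7, 11, 13, ... must map to positions 0, 1, 2, 3, ... and back
int pos = 0;
for (int p = 5; p < 1_000_000; p++) {
    if (p % 6 == 1 || p % 6 == 5) {
        if (posToPrime(pos) != p || primeToPos(p) != pos) {
            throw new AssertionError("mismatch at " + p);
        }
        pos++;
    }
}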
The time reduction is interesting: it came down to a little over 4 seconds, a significant reduction from the previous variant.
At this point I was fed up with searching for better variants. I know we could go further and remove other candidates, but I expected the ugliness of the code to explode while the gains would not be significant. I spent a few dozen minutes trying for something better and was ready to give up.
Can we go in parallel?
Does the Sieve of Eratosthenes allow parallel computation? Honestly, I had not considered it a viable option, since I had not read about parallel methods. I am not saying that such versions do not exist, only that I did not know about them. (Later edit: after writing this I read some papers; there are parallel implementations, but not like the version I found. It looks like a new approach and I have to research the topic more.)
The idea of parallelism in the sieve algorithm is based on the following observation. Suppose we are iterating through the numbers and we found the prime number \(p\). After adding it to the list, we start to eliminate multiples of \(p\), beginning with \(p^2\). The point is that between \(p\) and \(p^2\) there are also other prime numbers. For example, between \(5\) and \(25=5^2\) there are also \(7, 11, 13, 17, 19, 23\). In fact, every unmarked number below \(p^2\) must be prime, since any composite below \(p^2\) has a prime factor smaller than \(p\) and was already marked. So instead of taking the single prime \(p\), we can collect all the primes greater than or equal to \(p\) and less than \(p^2\), add them all to the list of primes, and set their flags in parallel.
Using parallel computation usually raises concerns about concurrent updates. This is not a concern here, however. Reading the flags is done only before finding the next chunk of prime numbers, while updating the flags happens only inside the parallel loops. Additionally, all those updates do a single thing: they flip a value from false to true. If this happens concurrently, it really does not matter who wins, since the result is invariant to the order of writes. As such, the algorithm allows brutal concurrent processing with no thread-safety mechanisms.
Additionally, we optimize a little so that we do not launch new tasks when there is nothing to do (when the squared prime number is greater than \(n\)).
In the code below I used a preview of the structured concurrency constructs from Java 19. It can also be implemented in the classic way, using plain threads and an executor, but I like to take a look at the new features, and the code using StructuredTaskScope is much cleaner anyway.
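For readers who have not seen it, the minimal shape of the API is roughly the following (a sketch against the JDK 19 incubator API in jdk.incubator.concurrent, where fork returns a Future; expensiveWork is a hypothetical placeholder, and details may differ in later JDKs):

Object runBoth() throws InterruptedException {
    try (var scope = new StructuredTaskScope<Object>()) {
        var first = scope.fork(() -> expensiveWork(1));   // each task starts in its own thread
        var second = scope.fork(() -> expensiveWork(2));  // runs concurrently with the first
        scope.join();                                     // wait for both tasks to finish
        return first.resultNow();                         // results are available after join()
    }
}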
// Assumes: import java.util.*; import java.util.concurrent.Callable;
// import jdk.incubator.concurrent.StructuredTaskScope; (JDK 19 incubator module)
int[] sieveSkip32FlagsParallel(int n) {
    int size = n / 3 + 2;  // +2 guards small n, as in the sequential variant
    int[] numbers = new int[size];
    boolean[] flags = new boolean[size];
    numbers[0] = 2;
    numbers[1] = 3;
    int len = 2;
    int lim = (int) Math.floor(Math.sqrt(n));
    int pos = 0;
    try (StructuredTaskScope<Object> s = new StructuredTaskScope<>()) {
        while (true) {
            while (pos < size && flags[pos]) {
                pos++;
            }
            if (pos == size) {
                break;
            }
            // collect the whole chunk of primes in [p, p^2)
            NextPrimes ps = addNextPrimes(numbers, len, flags, pos, n, size);
            if (ps.step() == 0) {
                break;
            }
            List<Callable<Object>> tasks = new LinkedList<>();
            for (int i = len; i < ps.len(); i++) {
                int pp = numbers[i];
                if (pp <= lim) {  // primes above sqrt(n) have nothing left to mark
                    tasks.add(() -> {
                        int step = 2 * pp;
                        for (int j = pp * pp; j < n; j += step) {
                            int rest = j % 6;
                            if (rest == 1 || rest == 5) {
                                // idempotent write: concurrent threads can only agree
                                flags[primeToPos(j)] = true;
                            }
                        }
                        return null;
                    });
                }
            }
            len = ps.len();
            if (!tasks.isEmpty()) {
                for (var task : tasks) {
                    s.fork(task);
                }
                s.join();  // wait for the chunk; also makes the flag writes visible
            }
            pos += ps.step();
        }
        s.shutdown();
    } catch (InterruptedException ignored) {
    }
    return Arrays.copyOf(numbers, len);
}

record NextPrimes(int step, int len) {
}

private NextPrimes addNextPrimes(int[] numbers, int len, boolean[] flags, int pos, int n, int size) {
    int p = posToPrime(pos);
    // long arithmetic: p * p overflows int once p exceeds 46340
    int square = (int) Math.min((long) p * p, n);
    int i = pos;
    while (i < size) {
        if (flags[i]) {
            i++;
            continue;
        }
        int pp = posToPrime(i);
        if (pp >= square) {
            break;
        }
        // every unflagged candidate below square is guaranteed prime
        numbers[len++] = pp;
        i++;
    }
    return new NextPrimes(i - pos, len);
}

// posToPrime(int) and primeToPos(int) are the same helpers as in the previous section
There is a little more code than in the initial version, but it is still pretty clean. Honestly, I did not expect a large improvement, considering the amount of work which has to be done in parallel. Despite my expectations, the implementation performed very well, pushing the execution time down to 2.2 seconds for generating all prime numbers less than 10^9. It almost halved the previous time!
Benchmarks
To have a proper assessment of the performance I implemented some JMH (Java Microbenchmark Harness) tests and measured each variant. I used various prime number limits, starting with \(100\) and increasing by one order of magnitude up to 10^9. The average time for each variant is listed below.
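For reference, each benchmark followed the usual JMH pattern, roughly like the sketch below (a reconstruction; the class name Eratostenes and the parameter values match the results table, while the exact annotation setup is an assumption):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class Eratostenes {

    @Param({"100", "1000", "10000", "100000", "1000000",
            "10000000", "100000000", "1000000000"})
    private int n;

    @Benchmark
    public int[] testSieveClassic() {
        return sieveClassic(n);  // returning the result keeps dead code elimination away
    }

    // ... one such method per sieve variant
}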
Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
Eratostenes.testSieveClassic                       100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveClassic                      1000  avgt    3      0.004 ±     0.001  ms/op
Eratostenes.testSieveClassic                     10000  avgt    3      0.030 ±     0.110  ms/op
Eratostenes.testSieveClassic                    100000  avgt    3      0.321 ±     0.156  ms/op
Eratostenes.testSieveClassic                   1000000  avgt    3      5.886 ±     7.290  ms/op
Eratostenes.testSieveClassic                  10000000  avgt    3    164.316 ±    25.162  ms/op
Eratostenes.testSieveClassic                 100000000  avgt    3   1967.791 ±   673.718  ms/op
Eratostenes.testSieveClassic                1000000000  avgt    3  38775.991 ± 30531.627  ms/op
Eratostenes.testSieveSkip2                         100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveSkip2                        1000  avgt    3      0.002 ±     0.001  ms/op
Eratostenes.testSieveSkip2                       10000  avgt    3      0.016 ±     0.040  ms/op
Eratostenes.testSieveSkip2                      100000  avgt    3      0.207 ±     0.010  ms/op
Eratostenes.testSieveSkip2                     1000000  avgt    3      3.123 ±     3.989  ms/op
Eratostenes.testSieveSkip2                    10000000  avgt    3     81.881 ±   227.557  ms/op
Eratostenes.testSieveSkip2                   100000000  avgt    3    805.019 ±   176.453  ms/op
Eratostenes.testSieveSkip2                  1000000000  avgt    3  11274.829 ±   851.475  ms/op
Eratostenes.testSieveSkip2Flags                    100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveSkip2Flags                   1000  avgt    3      0.003 ±     0.001  ms/op
Eratostenes.testSieveSkip2Flags                  10000  avgt    3      0.020 ±     0.005  ms/op
Eratostenes.testSieveSkip2Flags                 100000  avgt    3      0.181 ±     0.055  ms/op
Eratostenes.testSieveSkip2Flags                1000000  avgt    3      1.825 ±     0.319  ms/op
Eratostenes.testSieveSkip2Flags               10000000  avgt    3     42.074 ±     6.276  ms/op
Eratostenes.testSieveSkip2Flags              100000000  avgt    3    558.695 ±   154.560  ms/op
Eratostenes.testSieveSkip2Flags             1000000000  avgt    3   6673.653 ±  3237.412  ms/op
Eratostenes.testSieveSkip23Flags                   100  avgt    3      0.001 ±     0.001  ms/op
Eratostenes.testSieveSkip23Flags                  1000  avgt    3      0.002 ±     0.001  ms/op
Eratostenes.testSieveSkip23Flags                 10000  avgt    3      0.016 ±     0.012  ms/op
Eratostenes.testSieveSkip23Flags                100000  avgt    3      0.196 ±     0.013  ms/op
Eratostenes.testSieveSkip23Flags               1000000  avgt    3      2.124 ±     0.211  ms/op
Eratostenes.testSieveSkip23Flags              10000000  avgt    3     21.587 ±     4.011  ms/op
Eratostenes.testSieveSkip23Flags             100000000  avgt    3    356.884 ±    35.051  ms/op
Eratostenes.testSieveSkip23Flags            1000000000  avgt    3   4114.460 ±  2876.299  ms/op
Eratostenes.testSieveSkip32FlagsParallel           100  avgt    3      0.032 ±     0.010  ms/op
Eratostenes.testSieveSkip32FlagsParallel          1000  avgt    3      0.074 ±     0.062  ms/op
Eratostenes.testSieveSkip32FlagsParallel         10000  avgt    3      0.138 ±     0.061  ms/op
Eratostenes.testSieveSkip32FlagsParallel        100000  avgt    3      0.754 ±     0.188  ms/op
Eratostenes.testSieveSkip32FlagsParallel       1000000  avgt    3      4.284 ±     1.251  ms/op
Eratostenes.testSieveSkip32FlagsParallel      10000000  avgt    3     26.336 ±    12.007  ms/op
Eratostenes.testSieveSkip32FlagsParallel     100000000  avgt    3    202.568 ±   103.288  ms/op
Eratostenes.testSieveSkip32FlagsParallel    1000000000  avgt    3   2258.194 ±   635.403  ms/op
Conclusion
Even if the Sieve of Eratosthenes is an ancient and well-known algorithm, implemented by most of us back in school, it still offers opportunities for improvement. I am not aware of this way of exploiting parallel computation for this algorithm; it might be a new idea or not. In any case, the approach is not hard or inaccessible: implementing all the variants, with testing, took me about 4 hours. It can easily be adapted to long integer values, and the same idea can readily be applied to a segmented version.
I hope you enjoyed this small journey around the Sieve of Eratosthenes.