Deviant/Abstraction

A PoC of a JVM autoparallelizer

Computation is a commodity—and a very cheap one. Yet it’s the cornerstone of our world, and anything that makes it faster or more efficient can have massive impact. For example, shaving just 5% off compute time can be akin to jumping a full generation in CPU performance—worth hundreds of billions of dollars. That’s the kind of leap we’re after.

I’ve been experimenting with a new approach to automatically parallelize loops in compiled Java bytecode. In a simple test, it delivered a 9× speedup (JMH benchmarked) on a large loop. Intrigued? Let’s dive in.


The Big Picture

My proof of concept (PoC) detects which loops are safe to run on different cores, rewrites the bytecode to execute them in parallel, and automatically decides when to switch between sequential and parallel versions. The end result? Some parts of your code get spread across multiple CPU cores, speeding things up—without you having to touch the source code. It’s all done post-compilation.

Here’s the high-level workflow:

  1. Analyze the compiled Java bytecode.
  2. Identify loops.
  3. Prove that they are thread-safe if parallelized (that’s the hard part; a small illustration follows this list).
  4. Rewrite the code to include a parallel version.
  5. Run the parallel version automatically for large loops—but revert to sequential for small loops (where thread overhead outweighs benefits).
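
To make step 3 concrete: a loop body like sum += (a + b + i), the shape used in the PoC below, is a pure reduction. Integer addition in Java is associative and commutative (overflow wraps deterministically), so partial sums computed over disjoint index ranges merge to the same result in any order. Here is a tiny self-contained check, my own illustration rather than part of the tool:

public class ReductionCheck {
    public static void main(String[] args) {
        int a = 3, b = 7, n = 1_000_000;

        // Sequential reference result.
        int sequential = 0;
        for (int i = 0; i < n; i++) sequential += a + b + i;

        // Two disjoint halves computed independently, then merged.
        int lo = 0, hi = 0;
        for (int i = 0; i < n / 2; i++) lo += a + b + i;
        for (int i = n / 2; i < n; i++) hi += a + b + i;

        System.out.println(sequential == lo + hi); // prints true
    }
}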

My PoC: Summing Integers in a Loop

To illustrate the idea, I started with a straightforward numeric loop: one that sums two integers a and b plus the loop index i over a variable number of iterations (potentially random or large).

public int loopedSum(int a, int b, int iteration) {
    long startTime = System.nanoTime();
    int sum = 0;
    for (int i = 0; i < iteration; i++) {
        sum += (a + b + i); // i adds variability to each iteration
    }
    long endTime = System.nanoTime();
    long duration = (endTime - startTime) / 1_000_000;
    System.out.println("Time taken: " + String.format("%,d", duration) + " ms");
    return sum;
}

Performance

This is the performance of the original code on my machine:

  • Large loop (1 billion iterations): ~247 ms.
  • Tiny loop (10 iterations): too fast to measure precisely (effectively 0–1 ms).

After running my “fastizer” tool (which scans and rewrites the program), the same code ran in ~86 ms for the large loop—a 2.8× speedup, all without touching the source code.

For the tiny loop, it remains at near 0 ms—the tool automatically chose not to parallelize it, as overhead would have canceled out the benefit.

             Original code   Fasterized code
Large loop   ~247 ms         ~86 ms
Tiny loop    0–1 ms          0–1 ms

JMH Run

Original Code

Benchmark                   Mode  Cnt    Score    Error  Units
SummerBenchmark.bigLoop     avgt    5  245.986 ±  5.068  ms/op
SummerBenchmark.randomLoop  avgt    5  384.023 ± 84.664  ms/op
SummerBenchmark.smallLoop   avgt    5   ≈ 10⁻⁶          ms/op

Fasterized Code

Benchmark                   Mode  Cnt    Score    Error  Units
SummerBenchmark.bigLoop     avgt    5   38.963 ± 10.641  ms/op
SummerBenchmark.randomLoop  avgt    5   56.230 ±  2.425  ms/op
SummerBenchmark.smallLoop   avgt    5   ≈ 10⁻⁵          ms/op

The speedup here is even bigger: up to 9x. That's because JMH measures the fully warmed-up, optimized code, without the hand-rolled timing in my test harness.
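
For reference, the harness looks roughly like this. The actual SummerBenchmark class isn't shown in this post, so treat this as a minimal sketch with assumed iteration counts and a hypothetical Summer instance holding loopedSum:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class SummerBenchmark {
    private final Summer summer = new Summer(); // class under test
    private int randomIterations;

    @Setup(Level.Iteration)
    public void setup() {
        // Random bound so the loop count can't be constant-folded by the JIT.
        randomIterations = ThreadLocalRandom.current().nextInt(1, 1_500_000_000);
    }

    @Benchmark
    public int bigLoop() { return summer.loopedSum(3, 7, 1_000_000_000); }

    @Benchmark
    public int smallLoop() { return summer.loopedSum(3, 7, 10); }

    @Benchmark
    public int randomLoop() { return summer.loopedSum(3, 7, randomIterations); }
}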


How It Works (Briefly)

Unlike classic HPC compilers that focus on static arrays and strictly known loop bounds, this method operates at the bytecode level:

  1. Bytecode Scan
    • We look for cycles in the control-flow graph. The analysis is inherently interprocedural, so in principle even recursion could be detected and parallelized. (A minimal detection sketch follows this list.)
  2. Parallelization analysis
    • The tool checks which instructions can be parallelized—focusing on the loop structure and register usage.
    • In loopedSum, it identifies the loop-control variable (i), sees it’s incremented in a regular pattern, and recognizes the accumulator (sum) as a candidate for safe parallel reduction.
  3. Generate Two Versions
    • Parallel Version for large loops (splitting across multiple threads).
    • Sequential Version for smaller loops (no overhead).
  4. Re-Emit the Bytecode
    • We produce a new JAR with the transformed bytecode. Run that JAR, and the code automatically parallelizes when it hits big loops.
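
To give a flavor of step 1, here is a minimal back-edge scan using the ASM tree API. This is my own illustration of the idea, not the tool's actual analysis, and it assumes the target class (here Summer) is on the classpath:

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.tree.*;

public class LoopFinder {
    public static void main(String[] args) throws Exception {
        ClassReader reader = new ClassReader("Summer"); // class to scan
        ClassNode cls = new ClassNode();
        reader.accept(cls, 0);

        for (MethodNode method : cls.methods) {
            InsnList insns = method.instructions;
            for (AbstractInsnNode insn = insns.getFirst(); insn != null; insn = insn.getNext()) {
                // A jump whose target precedes it is a back-edge, i.e. a loop.
                if (insn instanceof JumpInsnNode) {
                    JumpInsnNode jump = (JumpInsnNode) insn;
                    if (insns.indexOf(jump.label) < insns.indexOf(jump)) {
                        System.out.printf("loop in %s.%s at instruction %d%n",
                                cls.name, method.name, insns.indexOf(jump));
                    }
                }
            }
        }
    }
}

A real analysis would also have to handle tableswitch/lookupswitch targets and irreducible control flow, but backward conditional jumps cover the common for/while loops javac emits.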

Why This Is Different

Classic auto-parallelization tools usually assume neat, static loops—common in HPC (High Performance Computing). Real Java code, however, is full of dynamic inputs, random values, and often method calls or recursion that HPC compilers struggle with.

My approach:

  • Scans and rewrites your code for you—no special hints, no rewriting your loops manually.
  • Works post-compilation and reads actual bytecode, instead of needing source code or HPC-friendly patterns.
  • Dynamically decides at runtime whether parallelization is worthwhile based on loop size.


The (Decompiled) Parallelized Code

Below is an example of the final code after transformation. If the loop size exceeds a threshold (here, 10 million), it creates multiple LoopRunner tasks and sums partial results. Otherwise, it runs sequentially.

public int loopedSum(int a, int b, int iteration) {
        long startTime = System.nanoTime();
        int sum = 0;
        int endTime = 0;
        if (10000000 < iteration) {
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 0));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 1));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 2));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 3));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 4));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 5));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 6));
            threadpool.submit(new LoopRunner(endTime, iteration, sum, a, b, 7));
            sum = 0;
            Future var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
            var11 = threadpool.take();
            sum = (Integer)var11.get() + sum;
        } else {
            while(endTime < iteration) {
                sum += a + b + endTime;
                ++endTime;
            }
        }

        long stopTime = System.nanoTime(); // renamed: the decompiler reused the name endTime from the loop counter above
        long duration = (stopTime - startTime) / 1000000L;
        System.out.println("Time taken: " + String.format("%,d", duration) + " ms");
        return sum;
    }

public class LoopRunner implements Callable {
    private int param7;
    private int param3;
    private int param6;
    private int param1;
    private int param2;
    private int start;

    public LoopRunner(int param7, int param3, int param6, int param1, int param2, int start) {
        this.param7 = param7;
        this.param3 = param3;
        this.param6 = param6;
        this.param1 = param1;
        this.param2 = param2;
        this.start = start;
    }

    public Object call() {
        int var7 = this.param7;
        int var3 = this.param3;
        int var6 = this.param6;
        int var1 = this.param1;
        int var2 = this.param2;

        for(int var8 = this.start; var8 < var3; var8 += 8) {
            var6 += var1 + var2 + var8;
        }

        return var6;
    }
}

It’s not the prettiest code (it’s generated at the bytecode level and decompiled), but it shows the under-the-hood logic: break the loop among threads, compute partial sums, merge them back.
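
For comparison, here is the same strategy hand-written (my sketch, not the generator's output). It uses an ExecutorCompletionService over a fixed thread pool, which matches what the tool injects at startup, and strides the index space across the workers before merging the partial sums:

import java.util.concurrent.*;

public class StridedSum {
    static final int THREADS = Runtime.getRuntime().availableProcessors();
    static final ExecutorCompletionService<Integer> pool =
            new ExecutorCompletionService<>(Executors.newFixedThreadPool(THREADS));

    public static int loopedSum(int a, int b, int iteration) throws Exception {
        for (int t = 0; t < THREADS; t++) {
            final int start = t;
            pool.submit(() -> {
                int partial = 0;
                // Worker t handles indices start, start + THREADS, start + 2*THREADS, ...
                for (int i = start; i < iteration; i += THREADS) {
                    partial += a + b + i;
                }
                return partial;
            });
        }
        int sum = 0;
        for (int t = 0; t < THREADS; t++) {
            sum += pool.take().get(); // merge partial results as they finish
        }
        return sum;
    }
}

The stride-by-THREADS split mirrors the generated code's var8 += 8 loop; for a reduction like this, contiguous chunks would work equally well.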


Conclusion & Next Steps

This is still early-stage and works for a small subset of loop patterns. But it took serious work—this is not trivial “show and tell.” The tool must verify parallel safety, manage concurrency, and handle dynamic scenarios. My next steps:

  • Extend these transformations to all JVM workloads—i.e., handle more complex loops with branching, exceptions, or object references.
  • Eventually, push the technique to native machine code, so any compiled language could benefit.

Join the Adventure

I’m a small operation (who also writes poems!), and I’m looking for brilliant engineers who love compilers, concurrency, bytecode, and post-compilation hacking. If that’s you—or if you’re curious about applying this to real-world apps—reach out!

Email: ntoper@gmail.com



3 responses to “A PoC of a JVM autoparallelizer”

  1. Eduardo Ramírez

    Is this PoC using Virtual Threads?

    1. No, I’m injecting at app startup something like this: threadpool = new ExecutorCompletionService(Summer.executorService = Executors.newFixedThreadPool(8));

      8 is the number of cores on my machine, and it’s determined dynamically.

      I’d love to use virtual threads, actually!

  2. […] auto-parallelizer PoC used a fun trick to parallelize a loop “for cheap”. I want to explain it […]
