Proof Efficiency in Aleo: Optimizations for Finite Fields and Witness Construction
Aleo is a privacy-first blockchain that uses zk-SNARKs (via the Marlin protocol) to prove general-purpose program execution. In Aleo’s architecture, user programs written in the high-level Leo language are compiled by the SnarkVM synthesizer into R1CS circuits. The prover then computes a witness (all intermediate values) and generates a succinct proof attesting to correct execution, all while keeping inputs private. In practice, proof generation is compute-intensive: Aleo’s documentation notes that “proof generation on Aleo is a compute-intensive process” that even client software often offloads to GPUs to speed things up. This is because the core proving algorithms involve massive amounts of finite-field arithmetic. Aleo’s provers predominantly perform two heavy tasks: Number-Theoretic Transforms (NTTs) for fast polynomial arithmetic and Multi-Scalar Multiplications (MSMs) in elliptic-curve groups. In Aleo’s consensus (“Proving”) work, provers earn rewards by generating SNARK proofs that require many FFT/NTT operations and MSMs.
Fast Polynomial Arithmetic via NTT/FFT
At the heart of most modern SNARKs is a polynomial commitment scheme, which relies on fast polynomial multiplication and evaluation. Directly multiplying two degree-$n$ polynomials takes $O(n^2)$ operations (naïve convolution), but using a Fast Fourier Transform (FFT) over a finite field (an NTT) drops this to $O(n\log n)$. In Aleo’s proving field (the scalar field of the BLS12-377 curve), one can choose a power-of-two domain size with a primitive root of unity and run a radix-2 FFT. Mathematically, an NTT evaluates a polynomial $f(x)=\sum_{i=0}^{n-1} a_i x^i$ at $n$ equally spaced roots of unity in the field, transforming convolution into pointwise multiplication. For example:
// Pseudocode (radix-2 decimation-in-time NTT); all arithmetic is in the field
function NTT(a[0..n-1], ω):
    if n == 1: return
    (even, odd) = split a into even- and odd-indexed halves
    NTT(even, ω^2)
    NTT(odd, ω^2)
    x = 1
    for i in [0..n/2-1]:
        t = x * odd[i]
        a[i]       = even[i] + t
        a[i + n/2] = even[i] - t
        x = x * ω
Here $ω$ is a primitive $n$-th root of unity in the field. The inverse NTT recovers the coefficients from the evaluations. In Aleo, these NTTs are used whenever polynomials are multiplied or when converting between coefficient and evaluation forms.
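To make the recursion concrete, here is a minimal runnable sketch in Rust over the small NTT-friendly prime $p = 998244353$ (a toy stand-in for Aleo’s much larger SNARK field); production code such as the FFT routines in Arkworks or SnarkVM uses iterative, cache-friendly variants of the same idea.

// Illustrative sketch: the radix-2 NTT above in runnable Rust over the small
// NTT-friendly prime p = 998244353 (119 * 2^23 + 1).
const P: u64 = 998_244_353;

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

// In-place recursive radix-2 NTT; `omega` must be a primitive n-th root of
// unity mod P and a.len() a power of two.
fn ntt(a: &mut [u64], omega: u64) {
    let n = a.len();
    if n == 1 {
        return;
    }
    let mut even: Vec<u64> = a.iter().step_by(2).copied().collect();
    let mut odd: Vec<u64> = a.iter().skip(1).step_by(2).copied().collect();
    let omega_sq = omega * omega % P;
    ntt(&mut even, omega_sq);
    ntt(&mut odd, omega_sq);
    let mut x = 1u64;
    for i in 0..n / 2 {
        let t = x * odd[i] % P;
        a[i] = (even[i] + t) % P;
        a[i + n / 2] = (even[i] + P - t) % P;
        x = x * omega % P;
    }
}

fn main() {
    // Evaluate f(x) = 1 + 2x + 3x^2 + 4x^3 at the four 4th roots of unity mod P.
    let omega = pow_mod(3, (P - 1) / 4); // 3 generates the multiplicative group mod P
    let mut coeffs = vec![1u64, 2, 3, 4];
    ntt(&mut coeffs, omega);
    println!("{:?}", coeffs);
}

With this, multiplying two polynomials reduces to one NTT of each input, a pointwise product of the evaluations, and a single inverse NTT.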
The Marlin SNARK in Aleo performs dozens of such transforms. For instance, on Aleo’s Testnet2, each proof generation involved 11 iterations of FFT followed by MSM steps. (This aligns with the general observation that proof generation is dominated by FFT and MSM workloads.) Concretely, for a large circuit with $T$ constraints, each column of the R1CS might require FFTs over a domain of size $L\approx32T$, costing roughly $2L\log_2 L$ field multiplications per transform. In that example, with $T=2^{20}$ steps, an FFT costs on the order of $1600\cdot T$ field operations. Thus even medium-sized proofs involve many millions of multiplications.
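Spelling out that estimate: $L = 32T = 2^{25}$, so $2L\log_2 L = 64T \cdot \log_2(2^{25}) = 64T \cdot 25 = 1600\,T \approx 1.7\times10^{9}$ field multiplications for a single transform over the expanded domain.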
Aleo’s implementation uses optimized NTT routines (e.g. in Arkworks or SnarkVM) to handle these efficiently. In practice, many NTT implementations break the transform into “butterfly” operations, and within each stage of the transform the butterflies are fully data-parallel, so they can be multithreaded across CPU cores or vectorized for significant speedups. Hardware acceleration is also possible: recent efforts like Tachyon (from the ZKAccelerate initiative) report ~1.4× speedup on NTT operations compared to software libraries.
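As an illustration of that data-parallelism, the following Rust sketch (assuming the rayon crate and the toy prime from the earlier example, not SnarkVM’s actual internals) runs all butterfly blocks of one iterative NTT stage on separate threads:

use rayon::prelude::*;

const P: u64 = 998_244_353;

// One stage of an iterative radix-2 NTT over a power-of-two-sized array that is
// already in bit-reversed order. `twiddles[i]` holds ω^(i·n/(2·half)) for this
// stage. Each block of size 2*half is independent, so blocks can be processed
// by different threads.
fn ntt_stage(a: &mut [u64], half: usize, twiddles: &[u64]) {
    a.par_chunks_mut(2 * half).for_each(|block| {
        for i in 0..half {
            let t = twiddles[i] * block[half + i] % P;
            let u = block[i];
            block[i] = (u + t) % P;
            block[half + i] = (u + P - t) % P;
        }
    });
}

Running log2(n) such stages (with the appropriate twiddle tables) completes the full transform; GPU implementations apply the same decomposition with one thread per butterfly.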
Parallel Witness Construction and Proving
Beyond polynomial math, witness construction (a.k.a. synthesis) – executing the Leo program to compute all private- and public-input-dependent values – can also be parallelized to some extent. The SnarkVM synthesizer compiles control flow and arithmetic into a flat R1CS. Independent portions of a circuit can be evaluated in parallel; for example, if the program branches or processes arrays, each branch or array chunk’s witness values can be filled by separate threads. Additionally, low-level primitives (like range proofs, hash functions, etc.) often consist of many independent gate evaluations. In short, while some dependence remains (each operation may feed into the next), many constraint groups can be worked on concurrently.
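A schematic of this idea in Rust (using rayon again; evaluate_gadget and the FieldElem type are hypothetical placeholders for illustration, not SnarkVM APIs):

use rayon::prelude::*;

type FieldElem = u64; // stand-in for a real field element type

// Hypothetical gadget evaluator: computes the intermediate wire values for one
// self-contained sub-circuit (e.g. one array element or one hash invocation).
fn evaluate_gadget(input: &FieldElem) -> Vec<FieldElem> {
    vec![*input, input.wrapping_mul(*input)]
}

// Independent constraint groups are filled concurrently and concatenated;
// sequential dependencies (one gadget feeding the next) still force ordering.
fn build_witness(independent_inputs: &[FieldElem]) -> Vec<FieldElem> {
    let parts: Vec<Vec<FieldElem>> = independent_inputs
        .par_iter()
        .map(evaluate_gadget)
        .collect();
    parts.concat()
}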
Once the witness is computed, the prover algorithm follows. Modern SNARK backends like Marlin use highly parallel approaches. In particular, Multi-Scalar Multiplication (MSM) – the elliptic-curve multi-exponentiation step in polynomial commitments (similar to KZG or inner-product arguments) – is embarrassingly parallel. MSM is typically implemented with Pippenger’s bucket algorithm, whose windows and bucket sums can be sharded across threads or GPU cores. Aleo’s own documentation notes that FFTs run on CPU while MSMs consume “CPU and GPU resources”, reflecting this division of labor. Indeed, GPU acceleration is well-suited to the large batch of elliptic-curve operations in an MSM. A recent industry report observes that GPU hardware can significantly speed up MSM: for example, Tachyon’s GPU-optimized MSM was reported as 1.8×–10× faster than standard libraries. The same report notes parallel speedups for FFT: Tachyon’s NTT was ~1.4× faster than Arkworks’ default FFT.
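To show the structure that Pippenger-style provers exploit, here is a simplified, single-threaded bucket-method MSM in Rust over a toy additive group (plain modular addition standing in for BLS12-377 point addition); a real implementation would use actual curve points and distribute the windows and bucket sums across threads or GPU cores.

const P: u64 = 998_244_353;
type Point = u64;                               // toy stand-in for a curve point
fn add(a: Point, b: Point) -> Point { (a + b) % P }

// Computes sum_i scalars[i] * points[i] with c-bit windows (Pippenger's bucket
// method); c is typically chosen around log2(number of terms).
fn msm(scalars: &[u64], points: &[Point], c: usize) -> Point {
    let windows = (64 + c - 1) / c;             // c-bit windows in a 64-bit scalar
    let mut result: Point = 0;
    for w in (0..windows).rev() {               // most significant window first
        for _ in 0..c {
            result = add(result, result);       // "double" the running result c times
        }
        // One bucket per non-zero window value.
        let mut buckets: Vec<Point> = vec![0; (1usize << c) - 1];
        for (s, p) in scalars.iter().zip(points) {
            let idx = ((*s >> (w * c)) & ((1u64 << c) - 1)) as usize;
            if idx > 0 {
                buckets[idx - 1] = add(buckets[idx - 1], *p);
            }
        }
        // Running-sum trick: sum over j of (j+1) * buckets[j] using only additions.
        let (mut running, mut window_sum) = (0 as Point, 0 as Point);
        for b in buckets.iter().rev() {
            running = add(running, *b);
            window_sum = add(window_sum, running);
        }
        result = add(result, window_sum);
    }
    result
}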
Aleo is also developing batch-proving optimizations. Marlin supports batching multiple proofs together so that common polynomial FFTs can be shared. The Aleo team plans to add “Marlin Batch Proving” to SnarkVM, which will amortize the cryptographic work across many proofs and greatly increase throughput for high transaction loads.
Performance: Throughput and Benchmarks
In aggregate, these optimizations pay off in dramatic performance gains. During its Testnet3 Phase II, Aleo reported a roughly 37,000× increase in proofs-per-second (PPS) over Testnet2. (Testnet2 averaged ~20,000 PPS network-wide; Testnet3 saw several orders of magnitude more on similar hardware.) This was achieved by both hardware advances and algorithmic improvements. For example, Aleo adjusted its mining puzzle to focus on witness generation rather than final proof output, deliberately removing redundant FFT/MSM work and effectively shifting the “mining” workload to proof preparation.
On the user side, proof times vary with circuit size. Simple operations (e.g. adding a few field elements) yield tiny circuits and proofs that can be generated in milliseconds on a modern CPU. Complex circuits (like big loops or cryptographic hashes) can have hundreds of thousands or millions of constraints and take seconds to minutes. For instance, before hardware acceleration, a large circuit took on the order of 6 hours to prove – but with ASIC/GPU acceleration this dropped to about 10 minutes. (This example comes from the “zkEmail” circuit at ZKAccelerate, but it illustrates the scale: a complex privacy app can be proven in under 10 minutes with optimized tooling.) Even without custom hardware, a heavily optimized CPU prover might generate a mid-sized proof in a matter of seconds to a minute.
Theory helps set expectations: if $n$ is the number of R1CS constraints, a Marlin proving run does roughly $O(n \log n)$ field operations for the FFTs plus $O(n)$ elliptic-curve operations for the MSMs. Thus doubling the constraint count more than doubles proving time. In practice, the constant factors are large: by the earlier estimate, the FFT work for a circuit with roughly $10^6$ constraints amounts to on the order of $10^9$ field multiplications. However, modern implementations exploit every parallel resource. Aleo users often run SnarkVM with multi-threading, and projects like LeoWallet even experiment with browser-based GPU acceleration to further cut times.
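A quick back-of-the-envelope check of that scaling in Rust (the 32× domain blowup and the two-multiplications-per-point constant are illustrative assumptions carried over from the earlier estimate, not measured SnarkVM figures):

// Rough FFT cost model: domain = next power of two above 32 * constraints,
// and ~2 * domain * log2(domain) field multiplications per transform.
fn fft_mults(constraints: u64) -> u64 {
    let domain = (constraints * 32).next_power_of_two();
    let log2 = 63 - domain.leading_zeros() as u64;
    2 * domain * log2
}

fn main() {
    for log_n in [19u32, 20, 21] {
        // Doubling the constraint count slightly more than doubles the FFT work.
        println!("n = 2^{}: ~{:.2e} field mults", log_n, fft_mults(1u64 << log_n) as f64);
    }
}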
Comparing ZK Systems: Cost, Usability, Scalability
Aleo’s approach has trade-offs relative to other ZK platforms. Like many SNARK systems (Groth16, Plonk, etc.), Aleo’s proofs are small (on the order of 1–2 kilobytes) and fast to verify, thanks to elliptic-curve cryptography. By contrast, STARK-based systems have much larger proofs (tens of kilobytes) but require no trusted setup. In Aleo, the use of Marlin means a universal SRS is needed (one ceremony covers all programs). This avoids per-program setup, but requires an initial generation of parameters (Aleo’s universal setup took ~36 hours of computation). In return, Aleo gets quick proof times and the ability to prove arbitrary circuits without new ceremonies.
On the usability front, Aleo offers a high-level language (Leo) and tooling designed for developers. Leo is a familiar imperative/functional language, as opposed to lower-level DSLs like Circom or Cairo. Aleo’s documentation emphasizes that developers “can build any kind of application on Aleo… irrespective of how much computing power it needs,” combining privacy and expressiveness. The SnarkVM stack hides much of the cryptographic complexity, automatically optimizing and parallelizing the heavy math under the hood. In contrast, some ZK environments (e.g. Circom or Noir) require more manual optimization or management of constraints.
Scalability is another dimension. Aleo decouples proof generation from consensus: users submit proofs as transactions, and a distributed network of provers (incentivized hardware rigs) races to compute them. This “proof-of-succinct-work” model means Aleo can scale ZK computation by adding more provers (GPUs, ASICs, etc.), much like adding miners to a blockchain. By contrast, zk-rollups (including STARK-based systems such as StarkNet) typically rely on a small set of dedicated provers that batch proofs for posting to a single L1. Aleo’s specialized prover network and upcoming batch-proving feature should further boost throughput.
Finally, it’s worth noting that Aleo’s cryptographic costs align with industry norms: as a universal SNARK, it still “spends” most of its proving time in FFTs and MSMs, just like Plonk or Groth16. Hardware acceleration is being pursued across the board. Aleo’s elegance lies in integrating these known optimizations into a full-stack privacy platform. The SnarkVM compiler applies loop unrolling, constraint folding, and other circuit optimizations to Leo code to keep R1CS size down. Its record-based state model naturally limits the amount of state (and thus the number of constraints) touched per transaction. Combined with Marlin’s fast prover and small proofs, this makes Aleo competitive: developers get privacy “for free” at verification time, paying mostly in prover compute.