Memory consumption in folded Honk

High level: Protostar/Protogalaxy folding with Honk will increase Prover RAM consumption by over 2x vs non-folded Honk. What to do?

Overview: Protostar/Protogalaxy allows us to fold a Honk instance into an “accumulator” instance.
The accumulator instance represents a relaxed, randomised representation of Honk.

A consequence of this is that the accumulator instance polynomials are all random. This risks blowing up memory consumption vs regular instances.

RAM costs for regular Honk

Many precomputed Honk selector polynomials are structured, i.e. their values are consistently small.

e.g. the selectors q_arith, q_sort, q_plookup, q_ecc, q_aux all take Boolean values. The 8 permutation selectors (S_1, ..., S_4, \sigma_1, ..., \sigma_4) all take 32-bit values.

The above 13 selectors can be stored in 33n bytes of RAM, where n is the circuit size: the 5 Boolean selectors can be bit-packed into 1 byte per row, and the 8 32-bit permutation selectors take 4 bytes each, i.e. 1n + 32n = 33n.
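As a rough illustration, here is a minimal C++ sketch (an assumed layout, not Barretenberg's actual data structures) of how those structured selectors can be packed, and where the 33n figure comes from:

```cpp
// Minimal sketch (assumed layout, not Barretenberg's actual data structures)
// of compact storage for the 13 structured selectors of a size-n circuit:
// the 5 Boolean gate selectors bit-packed into one byte per row and the
// 8 permutation selectors stored as 32-bit integers, 1 + 8*4 = 33 bytes/row.
#include <cstddef>
#include <cstdint>

struct CompactSelectorRow {
    uint8_t gate_bits;        // q_arith, q_sort, q_plookup, q_ecc, q_aux as 5 packed bits
    uint32_t permutation[8];  // S_1..S_4, sigma_1..sigma_4, each a 32-bit value
};

// Logical payload per row is 33 bytes, so n rows need ~33n bytes of RAM.
constexpr size_t structured_selector_bytes(size_t n) { return 33 * n; }
```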

RAM problems with the randomised accumulator

In a randomised accumulator relation, the above polynomials all take full 32-byte field values, requiring 416n bytes (13 × 32 = 416): roughly a 12x increase.
For a circuit of size 2^20 this is 416MB.

Combined with the requirement to hold both the randomised accumulator instance and the instance being folded in RAM at the same time, using a folding scheme increases the RAM requirements of our Prover by over 2x.
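To make the comparison concrete, here is an illustrative back-of-the-envelope estimator. It only counts the 13 selector polynomials discussed above (not the full witness), and the constants mirror the figures in this post:

```cpp
// Back-of-the-envelope estimate (illustrative only; counts just the 13
// selector polynomials discussed above) of selector RAM for a regular Honk
// instance vs a randomised accumulator, where every selector holds full
// 32-byte field elements, plus the cost of holding both at once while folding.
#include <cstddef>
#include <cstdio>

int main() {
    const size_t n = 1ULL << 20;                         // circuit size 2^20
    const size_t regular = 33 * n;                       // structured storage: 33n bytes
    const size_t accumulator = 13 * 32 * n;              // 13 selectors * 32 bytes: 416n bytes
    const size_t while_folding = regular + accumulator;  // both instances resident at once
    std::printf("regular instance:       %zu MB\n", regular >> 20);
    std::printf("randomised accumulator: %zu MB\n", accumulator >> 20);
    std::printf("during folding:         %zu MB\n", while_folding >> 20);
    return 0;
}
```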

Potential Improvements/Solutions

We can define a (quite large) circuit of a fixed depth where all gate selectors are the same for every folded instance. For example, the first 512k gates have q_arith = 1, the next 256k gates have q_sort = 1, the next 128k gates have q_aux = 1, and so on.

The actual circuit description for a given instance is defined entirely via the copy constraints.

This structure means we would not need to store the full representation of the 5 arithmetic gate selectors in memory.
It would also reduce the Prover time of the folding IOP, by restricting the number of rows over which each relation is active for the randomised accumulator instance (this deserves its own thread; there is a lot of detail to go into).
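A sketch of the idea, with hypothetical block sizes and names (this is not the actual trace layout): because a row's gate type is determined by its index alone, the gate selectors never need to be materialised as stored polynomials.

```cpp
// Hypothetical fixed trace layout shared by every folded instance: contiguous
// blocks of rows per gate type, so a gate selector's value at any row follows
// from the row index alone and the selector polynomial need not be stored.
#include <cstddef>

enum class GateBlock { Arithmetic, Sort, Aux };

constexpr size_t ARITH_END = 512 * 1024;              // rows [0, 512k): q_arith = 1
constexpr size_t SORT_END  = ARITH_END + 256 * 1024;  // next 256k rows: q_sort = 1
constexpr size_t AUX_END   = SORT_END + 128 * 1024;   // next 128k rows: q_aux = 1

GateBlock block_of_row(size_t row) {
    if (row < ARITH_END) return GateBlock::Arithmetic;
    if (row < SORT_END) return GateBlock::Sort;
    return GateBlock::Aux;  // rows up to AUX_END; further blocks omitted for brevity
}

// Example: q_arith recovered on the fly instead of being read from a stored polynomial.
bool q_arith(size_t row) { return block_of_row(row) == GateBlock::Arithmetic; }
```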

There is also the painful option of not using folding at all (instead using full recursion + Goblin Plonk). But this would greatly increase the number of gates in our recursive verifier, and also greatly increase Prover costs: with folding, the Prover does not need to produce an opening proof at each fold, which costs ~2.5n scalar muls with ZeroMorph.
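For a sense of scale, an illustrative calculation of the opening-proof work that folding avoids; the fold count here is a made-up parameter, not a figure from this post:

```cpp
// Illustrative only: scalar multiplications saved by skipping the ZeroMorph
// opening proof (~2.5n scalar muls) at each fold. The number of folds is a
// made-up parameter, not a figure from the post.
#include <cstddef>
#include <cstdio>

int main() {
    const size_t n = 1ULL << 20;   // circuit size
    const size_t num_folds = 8;    // hypothetical number of instances folded
    const double saved = 2.5 * static_cast<double>(n) * static_cast<double>(num_folds);
    std::printf("scalar muls avoided: %.0f\n", saved);
    return 0;
}
```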

69 Likes

Can we strategically write and read polynomial data to/from disk during proof construction? I don’t have a feel for how much of a slow-down reading and writing would cost vs the speed-up that protogalaxy gives.

65 Likes

I think we can; @charlie is the expert on this.

I think disk caching slows down the Prover substantially though. Ideally RAM consumption is low enough in the general case to not need caching :confused:

61 Likes

The trade-off is somewhat complex:

Using protogalaxy gives us the following:

  • A small constant speedup factor for the Prover (e.g. proof construction time is 0.8x the non-folded time, maybe 0.7x in the best case)
  • A larger constant speedup factor for recursive verification (which in turn affects Prover times, because the verifier is itself a circuit that is being proved)

Memory consumption is a lot more binary. Either the system has enough RAM to run the Prover or it doesn’t. As you mentioned, disk caching helps but ideally we only rely on it for edge-cases because it substantially slows down the Prover.
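For what it's worth, one possible shape of the disk-caching idea is to back a polynomial's coefficient storage with a memory-mapped file, so the OS pages cold coefficients out to disk instead of keeping everything resident. A hypothetical POSIX sketch (not an existing Barretenberg feature):

```cpp
// Hypothetical sketch of disk-backed polynomial storage via mmap (POSIX only,
// terse error handling). The OS pages coefficients in and out on demand, at
// the cost of I/O latency whenever cold data is touched.
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file big enough to hold n 32-byte field elements; returns nullptr on failure.
uint8_t* map_polynomial_backing(const char* path, size_t n) {
    const size_t bytes = n * 32;
    int fd = ::open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (::ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        ::close(fd);
        return nullptr;
    }
    void* ptr = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    ::close(fd);  // the mapping stays valid after closing the descriptor
    return ptr == MAP_FAILED ? nullptr : static_cast<uint8_t*>(ptr);
}
```

The trade-off you describe shows up directly here: every touch of a cold page becomes disk I/O, so this only makes sense when RAM is the binding constraint.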

57 Likes

Want to learn more about this. Is there a related post somewhere?

14 Likes