Storing data of arbitrary length in the public data tree

Today the public data tree stores one field element per leaf. While the storageWrite oracle call accepts multiple elements, under the hood it gets converted into multiple update requests against sequential storage slots. Updates can be batched together, but reading data that spans multiple fields requires one merkle membership proof per element.
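A minimal sketch of today's behaviour, with hypothetical names (`PublicDataUpdateRequest`, `storageWrite` here are illustrative stand-ins, not the actual protocol types): a multi-field write fans out into one update request per field at sequential slots, so each field later needs its own membership proof.

```ts
type Field = bigint;

interface PublicDataUpdateRequest {
  leafSlot: Field;
  newValue: Field;
}

// A write of N values becomes N update requests at sequential storage slots.
// Reading the data back later costs one merkle membership proof per leaf.
function storageWrite(startSlot: Field, values: Field[]): PublicDataUpdateRequest[] {
  return values.map((value, i) => ({
    leafSlot: startSlot + BigInt(i),
    newValue: value,
  }));
}

// Writing a 4-field struct produces 4 update requests (and later 4 read proofs).
const requests = storageWrite(100n, [1n, 2n, 3n, 4n]);
```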

One option to optimize this is to “merge” the merkle membership proofs for reading. Since the elements of a new array (or struct) are inserted one after another, their corresponding leaves in the indexed merkle tree should land close together, so most of the merkle path would be shared. This means a merged membership proof can provide the longest common subpath to the root only once (i.e. the path of the smallest subtree that contains all elements), plus the individual paths from each element up to that subtree root.
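A rough sketch of the potential saving, not an actual proof-system API: given the leaf indices and the tree depth, we can estimate how many sibling nodes a merged proof needs compared to independent proofs.

```ts
// Depth (counted from the leaves) at which all indices collapse to the same
// node: that node is the root of the smallest subtree containing every leaf.
function mergedProofSiblingCount(leafIndices: bigint[], treeDepth: number): number {
  const collapse = (d: number) => new Set(leafIndices.map(i => i >> BigInt(d)));
  let sharedDepth = 0;
  while (sharedDepth < treeDepth && collapse(sharedDepth).size > 1) {
    sharedDepth++;
  }
  // Per-leaf path up to the subtree root, plus one shared path from there to the tree root.
  const perLeaf = leafIndices.length * sharedDepth;
  const shared = treeDepth - sharedDepth;
  return perLeaf + shared;
}

// Independent proofs need leafIndices.length * treeDepth siblings. For adjacent
// leaves the merged count is much smaller; in the worst case (leaves straddling
// the tree's midpoint) sharedDepth == treeDepth and nothing is saved.
```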

However, it’s unclear how easy this optimization is to implement. Also, in the worst case it offers no gains: the elements could straddle the middle of the tree, in which case the smallest shared subtree is the entire tree.

An easier solution is to allow storing data of arbitrary length in each leaf of the tree. Instead of storing the actual value, we store a hash of the value in each leaf. Public update requests could push the hash of the data, and emit its preimage as an event-like. All in all, we replicate the same structure as in private state: the private state tree stores commitments to data, and the actual data (notes) gets emitted as events.
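A minimal sketch of the proposed scheme, with hypothetical names throughout (`hashFields` is a stand-in for whatever circuit-friendly hash the protocol actually uses): a multi-field value occupies a single leaf that stores its hash, and the preimage is emitted as an event-like side effect.

```ts
import { createHash } from 'crypto';

type Field = bigint;

// Stand-in for the protocol's actual leaf hash function.
function hashFields(values: Field[]): Field {
  const h = createHash('sha256');
  for (const v of values) h.update(v.toString(16).padStart(64, '0'), 'hex');
  return BigInt('0x' + h.digest('hex')) % (2n ** 254n);
}

interface HashedUpdateRequest {
  leafSlot: Field;
  newValue: Field;   // hash of the data, stored in a single leaf
  preimage: Field[]; // emitted as an event-like, for data availability
}

// One leaf (and one membership proof) regardless of how many fields the value spans.
function storageWriteHashed(slot: Field, values: Field[]): HashedUpdateRequest {
  return { leafSlot: slot, newValue: hashFields(values), preimage: values };
}
```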

Thoughts? Does the increased complexity pay off, given the savings in proving time from fewer hashes when reading or writing data that spans multiple fields?

> Instead of storing the actual value, we store a hash of the value in each leaf. Public update requests could push the hash of the data, and emit its preimage as an event-like.

This can get funky with large structs where an update touches only a subset of the value. Say you have something like `{a: Field, b: Field, c: Field, d: Field}`: when updating just `a`, you would either need to know which part of the preimage you are changing and emit only that, or emit the entire struct again. If the fields are stored individually, mutations that land at the same value as before are just transient and don’t need to make their way onto the data layer, so updating `a` would emit only `a` and not the entire thing.
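A small sketch of that trade-off, with assumed types: because the new leaf hash depends on the full preimage, a partial update must either re-emit the whole struct or emit only the changed field and rely on the rest of the preimage being known elsewhere.

```ts
type Field = bigint;

interface StructUpdate {
  slot: Field;
  changedOffsets: number[]; // which fields of the preimage changed
  emitted: Field[];         // data actually published to the data layer
}

function updateField(
  slot: Field,
  current: Field[],      // existing preimage {a, b, c, d}
  offset: number,        // e.g. 0 to update `a`
  newValue: Field,
  emitWholeStruct: boolean,
): StructUpdate {
  const next = [...current];
  next[offset] = newValue;
  return {
    slot,
    changedOffsets: [offset],
    // Either the whole struct goes to the data layer again, or just the changed field.
    emitted: emitWholeStruct ? next : [newValue],
  };
}
```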


Agree. I was working on the “public keys registry” use case, where most updates are all-or-nothing for any given entry.

Still, in your example, if you updated only `a` and emitted just that, the update request would only need to include the offset within the slot being updated; it would then be up to the sequencer to “load” the rest of the data to reconstruct the full preimage (and we could design the gas cost of an sstore to take this into account).
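A sketch of what that could look like, with assumed names (`loadPreimage` and `hashFields` stand in for however the node stores preimages and hashes leaves): the request carries only the changed field and its offset, and the sequencer patches the preimage it already holds.

```ts
type Field = bigint;

interface PartialUpdateRequest {
  slot: Field;
  offset: number;  // position of the changed field within the preimage
  newValue: Field;
}

function applyPartialUpdate(
  req: PartialUpdateRequest,
  loadPreimage: (slot: Field) => Field[],
  hashFields: (values: Field[]) => Field,
): { slot: Field; newLeafValue: Field; emitted: Field[] } {
  // Sequencer reconstructs the full preimage from its own store, patches the
  // changed field, and recomputes the leaf hash. Gas for the sstore could be
  // priced to account for this extra load.
  const preimage = loadPreimage(req.slot);
  preimage[req.offset] = req.newValue;
  return {
    slot: req.slot,
    newLeafValue: hashFields(preimage),
    emitted: [req.newValue], // only the changed field needs to hit the data layer
  };
}
```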

EDIT: Thinking more about it, I think this is a matter of “it depends on the context”, as usual. I’m wondering whether this is a choice we should offer to the dev, as in allowing them to flag a storage definition as “atomic” or not.
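Purely as illustration of what that dev-facing choice could look like (hypothetical API, not an existing one): an “atomic” storage definition is hashed into one leaf and always emitted whole, while a non-atomic one keeps one leaf per field so partial updates stay cheap.

```ts
interface StorageDefinition {
  name: string;
  fields: string[];
  atomic: boolean; // hash whole value into one leaf vs. one leaf per field
}

const keysRegistryEntry: StorageDefinition = {
  name: 'PublicKeysRegistryEntry',
  fields: ['key0', 'key1', 'key2', 'key3'],
  atomic: true, // updates are all-or-nothing, so one hashed leaf is fine
};
```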
