Implementing contract upgrades

TLDR: Assuming no enshrined slow updates tree, we can separate classes and instances to mitigate the footguns of delegatecall, and have proxy-based upgrades. This requires implementing at the protocol level the separation between classes and instances, a new delegatecall opcode that targets classes (not instances like in the EVM), and fallback functions (to implement proxies).


Assuming we are not enshrining the slow updates tree for the time being, let’s explore some ideas when it comes to implementing smart contract upgrades. Of course, “no upgrades” is always an option.

Note that, if we don’t have an enshrined slow updates tree, we have no enshrined way to store the current implementation in a way that’s mutable, accessible from private, and doesn’t cause contention issues, but we can still implement one at the application layer. This means that we need application-level code to “load” the current implementation in whatever scheme we implement.

The EVM way: delegatecall and fallback functions

We can always implement upgrades using the same building blocks as the EVM: delegatecall opcode and fallback (aka method_missing, doesNotUnderstand) functions. To build an upgradeable contract, users would have to write a proxy-like contract with a fallback function that loads the current implementation from the slow updates tree, and then does a delegatecall to it. Note that two different fallback functions would be needed: one for private, and one for public.

To support this, we’d need to add 1) delegatecall support in the protocol and language, and 2) fallback functions in the protocol and language. The latter requires the kernel circuit to loop over all functions in a contract when it’s resolving what bytecode to run, prove that none of them matches the requested one, and only then load the fallback function.
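As a rough illustration of that resolution rule, here is a minimal Python sketch. It is not kernel circuit code: the `Function` type, the `is_fallback` flag and the `resolve` helper are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Function:
    selector: int              # hypothetical function selector
    bytecode: str              # stand-in for the function's bytecode or vk hash
    is_fallback: bool = False  # hypothetical marker for the fallback function


def resolve(functions: list[Function], selector: int) -> Function:
    """Return the function matching `selector`, or the fallback if none matches."""
    fallback: Optional[Function] = None
    for fn in functions:
        if fn.is_fallback:
            fallback = fn
        elif fn.selector == selector:
            # A circuit would still have to constrain every entry; returning
            # early is only a convenience of this Python model.
            return fn
    # Reaching this point means every selector was compared and none matched.
    if fallback is None:
        raise LookupError(f"no function {selector:#x} and no fallback")
    return fallback
```

The cost that matters is the loop: the fallback may only be chosen after every selector in the contract has been compared against the requested one.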

Now, as @rahul-kothari perceptively noticed some time ago, I’m not a fan of this solution.

delegatecall has been behind numerous security incidents, from the infamous Parity multisig wallet hacks of 2017, to the recent Raft hack in which the hacker lost their own loot due to misuse of this opcode. It also makes it difficult to write reusable code: library authors need to maintain forks of their code with the necessary hacks to support upgradeability (e.g. no constructors, no default initializers, etc).

I’d argue that the reason behind delegatecall becoming such a footgun is that it overloads what a contract is, making it hard to reason about your own code. A contract can either be an instance, in that it’s a contract you interact with and has its own state, or a class, in that it’s a bunch of stateless code that gets executed from an instance, or anything in between. But to the EVM and the language, it’s the same construct.

So why don’t we enshrine this difference?

Contract instances with immutable class

We can split what we currently call a contract into a contract instance and a class, like Starknet does. An address would refer to a contract instance, that has its own state as well as a pointer to a class. The class is just a bunch of immutable bytecode. When a user calls into an instance, the code from the instance’s class is loaded. Instances do not have any code of their own, they just point to a class. This makes it easier to reason about what a contract is, and gives us the gas-savings of minimal proxies (aka clones) for free.
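To make the split concrete, here is a small Python model of it. Every name in it (`ContractClass`, `ContractInstance`, `CLASSES`, `INSTANCES`, `call`) is hypothetical and only meant to illustrate the idea, not how the protocol would represent it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContractClass:
    # Immutable code: function name -> callable(storage, *args)
    functions: dict


@dataclass
class ContractInstance:
    class_id: str  # pointer to a class; the instance has no code of its own
    storage: dict = field(default_factory=dict)


CLASSES: dict[str, ContractClass] = {}        # class id -> class
INSTANCES: dict[str, ContractInstance] = {}   # address -> instance


def call(address: str, fn_name: str, *args):
    """Load the code from the instance's class and run it on the instance's state."""
    instance = INSTANCES[address]
    contract_class = CLASSES[instance.class_id]
    return contract_class.functions[fn_name](instance.storage, *args)


def increment(storage: dict, amount: int) -> int:
    storage["count"] = storage.get("count", 0) + amount
    return storage["count"]


# One class registered once, two instances pointing at it.
CLASSES["counter"] = ContractClass({"increment": increment})
INSTANCES["0xA"] = ContractInstance("counter")
INSTANCES["0xB"] = ContractInstance("counter")
```

Both instances run the same code but keep separate state, which is the same saving that minimal proxies give on the EVM.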

Our original proposal for classes and instances suggested including a CODEREPLACE opcode that would change the class of an instance, like RSK or Starknet. However, this would require storing the pointer to the class in slow-updates storage, and since it’s not enshrined, this is no longer an option. So we’ll assume that the class of a contract instance cannot be changed.

Nevertheless, we can still use the classes-instances pattern to mitigate the issues with delegatecall. We can keep the delegatecall opcode, only that instead of calling into an instance, we call into a different class. This keeps the healthy separation between instances and code, while still allowing different code to be loaded dynamically.

In this scenario, an upgradeable contract would be an instance with a proxy as contract class. This class would delegatecall into an “implementation” contract class loaded from an app-layer slow-updates tree. Note that, in contracts where contention is not an issue (such as account contracts), the implementation class can be stored in a regular private note. And in public-land, it can just be stored in public storage.
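A rough Python sketch of that proxy flow follows. It assumes the simplest case mentioned above, where the implementation class id lives in the instance's own storage. The `"__fallback__"` key, the `"implementation"` slot and all function names are hypothetical, and `upgrade_to` deliberately has no access control to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    class_id: str
    storage: dict = field(default_factory=dict)


# class id -> {function name -> callable(storage, *args)}
CLASSES: dict = {}


def delegatecall(class_id: str, storage: dict, fn_name: str, *args):
    """Run code from the class `class_id` against the caller's storage."""
    functions = CLASSES[class_id]
    handler = functions.get(fn_name)
    if handler is None:
        # No matching function: hand the call to the class's fallback.
        return functions["__fallback__"](storage, fn_name, *args)
    return handler(storage, *args)


def call(instance: Instance, fn_name: str, *args):
    return delegatecall(instance.class_id, instance.storage, fn_name, *args)


def proxy_fallback(storage: dict, fn_name: str, *args):
    # Load the current implementation class, then delegatecall into it.
    return delegatecall(storage["implementation"], storage, fn_name, *args)


def upgrade_to(storage: dict, new_class_id: str) -> None:
    storage["implementation"] = new_class_id  # a real proxy would check the caller


def set_x(storage: dict, value: int) -> None:
    storage["x"] = value


CLASSES["proxy"] = {"__fallback__": proxy_fallback, "upgrade_to": upgrade_to}
CLASSES["v1"] = {"set": set_x, "get": lambda storage: storage.get("x", 0)}
CLASSES["v2"] = {"set": set_x, "get": lambda storage: storage.get("x", 0) * 2}
```

The target of `delegatecall` is always a class id, never an address, so the implementation has no state of its own that could be touched by mistake. The proxy and the implementation still share the instance's storage, so slot collisions remain something to watch for.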

Supporting this would require introducing a new call type in the protocol, both for private and public, as well as enshrining the separation of instances and classes.

As a side question: should we still call this delegatecall?

Script execution: A use case for delegatecall, aside from upgradeability, was executing scripts. An example raised by the community was encoding a set of actions for a governance contract to execute, and then delegatecalling into it to carry out those actions impersonated as governance. Keeping delegatecall with the new semantics of calling into a contract class would still allow this use case.

Enshrined proxies vs fallback functions

Assuming we go with the design above, let’s discuss how we want to implement proxies:

  1. Add support for “fallback” functions in the protocol, and implement the proxy as a class with just a fallback function that loads the implementation class and delegatecalls into it. This would be the most similar to the EVM way.
  2. Do not support “fallback” functions. Instead, enshrine proxies in the protocol as a type of contract class that has only a single function, that gets called regardless of the selector used. This is far more restrictive than (1), but may be more efficient since it does not need to prove that no function selector matches the requested one during a call, and it’s not clear if there are any use cases for fallback functions that are not proxies.
  3. Support “fallback” functions in the protocol, but do not expose them at the language level. Instead, add a “proxy” keyword that emulates the behaviour of option (2) by using option (1).

Given the performance improvement of (2) would most likely be negligible, I’d favour option (1) since it’s the most flexible, and doesn’t seem to lead to any security footgun.


For the sake of it, let’s change the initial assumption and pretend we do have an enshrined slow updates tree. In this scenario, we can implement contract classes and instances, remove delegatecalls, remove fallback functions, and just add a codereplace opcode for upgrades.

With this scheme, calling into an upgradeable contract requires logic in the kernel circuit to retrieve the current implementation from the enshrined slow updates tree (or fall back to a default embedded in the address preimage), and then execute that bytecode.

In contrast, the proxy-based scheme above requires 1) a call to the proxy contract, 2) a call to the slow updates global contract to retrieve the implementation, 3) a delegatecall into the loaded implementation class, 4) a public call to assert_current_root in the slow updates global contract. This means that calling into an upgradeable contract using the proxy pattern with app-level slow updates tree requires 4x the number of calls (one of them a public one, which requires spinning up a public vm).

Note that adding a maximum valid block number to a transaction at the protocol level would allow us to remove the public call (4), lowering the total to 3x the number of calls.
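The arithmetic can be tallied with a tiny Python sketch. The step descriptions paraphrase the list above, and the private/public labels are my reading of it (only step 4 is stated to be public); all names are hypothetical.

```python
# Calls needed to reach an upgradeable contract through an app-level proxy.
PROXY_WITH_APP_LEVEL_TREE = [
    ("call into the proxy contract", "private"),
    ("call the slow updates contract to read the implementation", "private"),
    ("delegatecall into the implementation class", "private"),
    ("assert_current_root on the slow updates contract", "public"),
]

# With an enshrined tree, the kernel resolves the implementation itself.
ENSHRINED_TREE = [("call into the contract", "private")]


def overhead(calls, baseline=ENSHRINED_TREE):
    """How many times more calls than the enshrined baseline."""
    return len(calls) / len(baseline)


def with_max_block_number(calls):
    """Drop the public root check, which is the only public call in the list."""
    return [step for step in calls if step[1] != "public"]
```

Here `overhead(PROXY_WITH_APP_LEVEL_TREE)` gives the 4x figure and `overhead(with_max_block_number(PROXY_WITH_APP_LEVEL_TREE))` the 3x one.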

Thanks for articulating this so clearly, and for the comparison of the number of calls required with/without an enshrined slow updates tree. This post can serve as a nice reference for anyone who wishes to experiment with implementing (and benchmarking the costs of) an upgrade pattern.

An idea for another upgrade pattern, which builds upon Executing scripts in account contracts, is to write a proxy function which can verify an arbitrary proof. The contract ‘admin’ can then assert which verification keys are acceptable for modifying the state of the contract.

The upgradeable contract looks something like this (in pseudocode which looks nothing like aztec.nr - sorry!):

contract MyUpgradeableContract {
  
  const upgrade_admin: AztecAddress; // someone with the power to upgrade the vk_hash
  
  let currently_accepted_vk_hash: Field; // lazy storage declaration
  
  let x: Field; // some state variable that we'll edit, for example's sake.
  
  fn do_something_to_x(proof, public_inputs, purported_currently_accepted_vk) {
      // Validate the correctness of the vk:
      let purported_currently_accepted_vk_hash = hash(purported_currently_accepted_vk);
      assert(purported_currently_accepted_vk_hash == currently_accepted_vk_hash);
  
      // Verify the proof as correct:
      const result = verify_proof(proof, public_inputs, purported_currently_accepted_vk);
      assert(result == true);
  
      // Extract the new value of x, which has been proven to be updated in line with the currently-accepted vk
      const {new_x} = public_inputs;
      x = new_x;
  }

  fn upgrade_vk(new_vk_hash: Field) {
      assert(context.msg_sender == upgrade_admin);
      currently_accepted_vk_hash = new_vk_hash;
  }
 
}

Does this work?

I guess it’s not a very legible contract anymore. And it’s no longer making a call to a pretty contract, so in a way it’s undoing a lot of the effort we’ve put into making contracts feel like existing smart contract architectures…

Edit: we could probably build a library into aztec.nr that abstracts all of the ‘proof, vk, vkhash’ stuff into reading function selectors from classes, if we wanted to make it feel more familiar…

Thanks for going through my ramblings, Mike!

Seems like the main benefit of this approach would be that we don’t need delegatecall nor fallback functions, and can instead just use the verify_proof primitive which is pretty handy. Still, currently_accepted_vk_hash would need to live in a slow updates tree, much like the current_implementation in the approach above.

It definitely works for pure functions, but I’m not so clear it does for side-effects. For starters, you’d need to capture the new nullifiers, commitments, etc from the proof, and merge them into your current context. And I wouldn’t know how you’d handle oracle calls, like querying private notes or emitting events.

Perhaps there’s a missing step in this approach: do_something_to_x explicitly expects the proof and vk in its interface, meaning the caller needs to be aware that it’s calling into an upgradeable function. Instead, it’d be nicer to make it transparent to the caller, by running the code as part of the called function rather than making it the caller’s responsibility. Maybe this could help with handling side effects as well.

  fn do_something_to_x(some_arg) {
      // Load current code and execute it
      let code_hash, vk = oracle.get_current_code_for("do_something_to_x");
      let proof, public_inputs = oracle.call_and_prove_private_function(code_hash, [some_arg]);

      // Validate the correctness of the vk returned by the oracle:
      let vk_hash = hash(vk);
      assert(vk_hash == currently_accepted_vk_hash);
  
      // Verify the proof as correct:
      const result = verify_proof(proof, public_inputs, vk);
      assert(result == true);
  
      // Extract the new value of x, which has been proven to be updated in line with the currently-accepted vk
      const {new_x} = public_inputs;
      x = new_x;
  }

Aside from the above, this approach has a limitation in that you cannot add new functions to a contract. You wouldn’t be able to add a do_something_to_y to the contract in the example.

This approach is also more expensive in terms of storage use (since you need to store one vk per function in the contract) and in terms of recursions (since you need to jump to the slow updates tree to “load” the implementation once per function instead of once per contract).

For posterity, we might be able to do it in 2 calls: 1) a call to the proxy contract (which can read “slow update” state from the archive tree, and constrain the recency of the lookup to be close to an exposed max_block_number), 2) a delegatecall into the loaded implementation class.

To enable a private fallback, we could make the function tree (which exists within a contract class) an indexed merkle tree, so that it supports non-membership proofs. Each leaf would contain a pointer to the value of the next-highest function selector. The correctness of these pointers would need to be validated at deployment.
If a non-existent function selector is called, the pxe would need to feed two function tree membership witnesses into the kernel circuit: one for the “pointer” leaf (which “jumps over” the called function’s selector), and one for the fallback function. The kernel would then realise the called function doesn’t exist, and fall back to accepting the vk hash (and hence accepting the proof) of the fallback function instead.
The function tree might be relatively small, say 64 leaves, so this adds about 6 extra hashes (one per tree level), plus a couple more hashes to hash the larger leaf values.
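
A rough Python sketch of the non-membership check described above, under several assumptions of mine: SHA-256 stands in for whatever hash the protocol would use, a sentinel leaf with selector 0 starts the chain (so real selectors are assumed non-zero), the fallback sits under a reserved `FALLBACK_SELECTOR`, and the class has at most 64 functions. All names are hypothetical.

```python
import hashlib
from dataclasses import dataclass

FALLBACK_SELECTOR = 0xFFFFFFFF  # hypothetical reserved selector for the fallback
TREE_DEPTH = 6                  # 2**6 = 64 leaves


@dataclass(frozen=True)
class Leaf:
    selector: int
    next_selector: int  # pointer to the next-highest selector; 0 means none
    vk_hash: str

    def digest(self) -> bytes:
        data = f"{self.selector}:{self.next_selector}:{self.vk_hash}"
        return hashlib.sha256(data.encode()).digest()


def hash_pair(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()


def build_leaves(functions: dict) -> list:
    """Sort selectors and link each leaf to the next-highest selector."""
    chain = [0] + sorted(functions)  # sentinel leaf first
    nexts = chain[1:] + [0]
    return [Leaf(s, n, functions.get(s, "")) for s, n in zip(chain, nexts)]


def build_layers(leaves: list, depth: int = TREE_DEPTH) -> list:
    layer = [leaf.digest() for leaf in leaves]
    layer += [bytes(32)] * (2 ** depth - len(layer))  # pad with empty leaves
    layers = [layer]
    while len(layer) > 1:
        layer = [hash_pair(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
        layers.append(layer)
    return layers


def sibling_path(layers: list, index: int) -> list:
    path = []
    for layer in layers[:-1]:
        path.append(layer[index ^ 1])
        index //= 2
    return path


def is_member(root: bytes, leaf: Leaf, index: int, path: list) -> bool:
    node = leaf.digest()
    for sibling in path:  # one hash per tree level
        node = hash_pair(node, sibling) if index % 2 == 0 else hash_pair(sibling, node)
        index //= 2
    return node == root


def jumps_over(low: Leaf, selector: int) -> bool:
    """True if `low` and its pointer bracket `selector`, proving it is absent."""
    return low.selector < selector and (
        low.next_selector == 0 or selector < low.next_selector
    )


def resolve_fallback(root, selector, low_witness, fallback_witness) -> str:
    low, low_index, low_path = low_witness
    fb, fb_index, fb_path = fallback_witness
    if not (is_member(root, low, low_index, low_path) and jumps_over(low, selector)):
        raise ValueError("selector not proven absent")
    if fb.selector != FALLBACK_SELECTOR or not is_member(root, fb, fb_index, fb_path):
        raise ValueError("invalid fallback witness")
    return fb.vk_hash
```

Each witness carries a sibling path of `TREE_DEPTH` hashes, which is where the figure of 6 hashes for a 64-leaf tree comes from.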

To enable a public fallback, the public vm will need to take an approach similar to the evm.


I think most contracts will either have no fallback, or just a fallback, so maybe it makes more sense to optimize for these scenarios and go with a regular tree, where we prove fallback-ness by exhaustively iterating through the leaves? Still, this is just an optimization.