Making Code Generators Pleasant to Use
When I work on CRUD services, one of the things I try to ensure is that there’s a single canonical type for every scalar (timestamps, values with units, etc.1). While relational data can be split and reshaped depending on your needs and on performance concerns, scalars rarely need that. Having one canonical representation of a scalar makes life much easier for everyone.
It is of course impossible to have a single representation for the entire system. A minimal system usually looks like this:
Typically, the database, backend and frontend are in different languages, and the communication between them is in different formats as well. That means we in practice have at least 5 representations for a single scalar.
However, within each of the sub-components and communication protocols, I try to have only one representation per type, and I try to make it distinct/unique.
Protocol and data format designers make this really hard for some reason. Assuming I’m designing some middleware, it’s pretty common to see two or three types per scalar:
What tends to happen is that some popular serialisation format/schema definition (Protocol Buffers, OpenAPI, GraphQL, you name it) will have a tool that generates types for you that you’re supposed to use, forcing you to convert to and from them, as they’re neither replaceable nor extensible. That usually means you have to write some glue code for those one or two conversions, a rather boring and mundane task.
Fortunately, there’s no law saying that serialisation formats must generate the types you want to serialise. Perhaps the best example of this is Go’s standard library: if you want to serialise your type to or from a specific encoding, you can implement

- xml.MarshalXML and xml.UnmarshalXML for XML support
- json.MarshalJSON and json.UnmarshalJSON for JSON support
- sql.Scanner and driver.Valuer for generic SQL support
and so on. This sort of API has spread outside of the standard library. For example, the de facto standard Go YAML library also has this API, which means that one type can directly be converted to and from different encodings.
An Example
I think it’s nice to show it in action, so imagine I’d like to represent speed in some way. In Go, I’d probably do something like this:
// Kmh is a float representing a speed in km/h.
type Kmh float64
Here, I’ve decided to pick km/h as the canonical speed unit, and I want all pieces of the system to use that internally. That’s usually okay: if the user wants to see speed in miles per hour instead, we can add a SpeedUnit type that the user can set in their settings, and have the client convert to and from it.
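As a sketch of that idea (SpeedUnit, its values and the Display method are hypothetical names, not part of the post’s system):

```go
package main

import "fmt"

// Kmh is a float representing a speed in km/h (the canonical unit).
type Kmh float64

// SpeedUnit is a hypothetical display preference stored in the
// user's settings; the client converts to and from it.
type SpeedUnit string

const (
	UnitKmh SpeedUnit = "kmh"
	UnitMph SpeedUnit = "mph"
)

// Display converts the canonical km/h value into the user's
// preferred unit. Only the client-facing edge does this; the rest
// of the system keeps working in km/h.
func (kmh Kmh) Display(unit SpeedUnit) float64 {
	switch unit {
	case UnitMph:
		return float64(kmh) / 1.609344 // 1 mile = 1.609344 km
	default:
		return float64(kmh)
	}
}

func main() {
	fmt.Println(Kmh(16.09344).Display(UnitMph))
}
```

The point of the design is that conversion lives in exactly one place, so no other component ever has to guess which unit a value is in.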
Now, assume I want every speed value to be represented in JSON as follows2:
{
  "type": "kmh",
  "value": 10.0
}
The rationale for sending over an object rather than just 10.0 is that it makes it obvious what unit this is in, and it also forces my clients to “unwrap” the value into a speed value in some way – preferably in a type-safe manner.
If I had to manually convert those from maps to speed values I’d be quite angry, so to (de)serialise the Kmh type, I implement the json.Marshaler and json.Unmarshaler interfaces:
type wrapper[T any] struct {
	Type  string `json:"type"`
	Value T      `json:"value"`
}

// MarshalJSON implements the json.Marshaler interface
func (kmh Kmh) MarshalJSON() ([]byte, error) {
	return json.Marshal(wrapper[float64]{
		Type:  "kmh",
		Value: float64(kmh),
	})
}

// UnmarshalJSON implements the json.Unmarshaler interface
func (kmh *Kmh) UnmarshalJSON(input []byte) error {
	var val wrapper[float64]
	err := json.Unmarshal(input, &val)
	if err != nil {
		return err
	}
	if val.Type != "kmh" {
		return fmt.Errorf(`expected type "kmh", got %q`, val.Type)
	}
	*kmh = Kmh(val.Value)
	return nil
}
It’s possible to factor out most of the contents of MarshalJSON and UnmarshalJSON to make them more reusable, which is probably what you want if you have other units you want to transmit over the wire.
This way of dealing with encoding is pretty neat, as it also composes “out of the box”. If I were to make a struct like so
type RunningSession struct {
	Duration Seconds `json:"duration"`
	Distance Km      `json:"distance"`
	MaxPace  Kmh     `json:"max-pace"`
}
then it’ll automatically be sent over as
{
  "duration": {"type": "seconds", "value": 7527.28},
  "distance": {"type": "km", "value": 25.04},
  "max-pace": {"type": "kmh", "value": 14.0}
}
but if I wanted to customise it somehow, I could implement MarshalJSON/UnmarshalJSON to change the output and input to my liking.
The same applies to SQL, where I can implement sql.Scanner and driver.Valuer and get canonical serialisation there as well. This makes sense for the duration type listed above, as Postgres has an internal interval type. I don’t usually do this in general though, as it messes with my ability to take sums and averages and whatnot in raw SQL.
Since I’m usually using Postgres, I instead tend to make custom domain types for all the datatypes I want to be stored, and store them that way:
CREATE DOMAIN _km AS DOUBLE PRECISION;
CREATE DOMAIN _kmh AS DOUBLE PRECISION;

CREATE TABLE running_session (
	-- some ID here
	duration INTERVAL NOT NULL,
	distance _km NOT NULL,
	max_pace _kmh NOT NULL
);
This isn’t really that different from storing it as a DOUBLE PRECISION, unfortunately, but when I work with the database, the field type makes it more obvious what kind of unit I’m working with. (The underscore prefix is there just to make it clear that this is a custom type.)
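To connect the domain back to the Go side, here is a minimal sketch of what sql.Scanner and driver.Valuer could look like for Km. The implementations are mine, and they assume the driver hands a _km column over as a plain float64 (since the domain is just a DOUBLE PRECISION underneath):

```go
package main

import (
	"database/sql/driver"
	"fmt"
)

type Km float64

// Value implements driver.Valuer: the database just sees a float64,
// so sums and averages in raw SQL keep working.
func (km Km) Value() (driver.Value, error) {
	return float64(km), nil
}

// Scan implements sql.Scanner, so rows.Scan(&distance) works
// directly on a _km column.
func (km *Km) Scan(src any) error {
	switch v := src.(type) {
	case float64:
		*km = Km(v)
		return nil
	default:
		return fmt.Errorf("cannot scan %T into Km", src)
	}
}

func main() {
	var distance Km
	// What database/sql would do with a _km column:
	_ = distance.Scan(25.04)
	fmt.Println(distance)
}
```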
And with that, I have defined the way to transmit all speed scalars all the way from the database to clients of my service. Obviously, if the client isn’t a Go program but rather some other language – say, JavaScript – I’m forced to implement the JSON serialisation and a Kmh class there as well. But only once, because I’ve ensured there’s only one way to transmit speed.
Serialisation Must Be Extendable
For this to work, the libraries that transmit data over the wire must be able to take in user-defined types. One way is to do it the way Go does: have some sensible serialisation defaults for common types, then allow the user to override them by implementing MarshalX and UnmarshalX. I’ve seen similar things in both OCaml and Haskell, though only Rust’s serde has generalised this concept from what I can tell. Serde’s pretty neat that way, as you only have to implement a single serialisation and deserialisation function to support a lot of different encoding formats.
Unfortunately, here we hit upon something akin to the expression problem: I don’t really want to make a new Duration type when it is basically a time.Duration wrapper that only adds serialisation methods. The same applies to many other common types, like UUIDs or intervals.
In some languages, there is an alternative: add a set of encoder and decoder functions that’ll be used whenever a specific type shows up. For example, we may want to change the way we transmit time.Duration to
{
  "type": "seconds",
  "value": 7527.28
}
instead of the integer nanosecond count Go’s encoding/json gives out of the box. One solution is to provide a set of functions that adhere to the shape func([]byte) (*A, error), where A is the type you want to convert into:
// imagine json.DecoderFuncs and DecoderFuncs are defined
// in some way
func decodeTime(data []byte) (*time.Duration, error) {
	var val wrapper[float64]
	err := decoderFuncs.Decode(data, &val)
	if err != nil {
		return nil, err
	}
	if val.Type != "seconds" {
		return nil, fmt.Errorf(`expected type "seconds", got %q`, val.Type)
	}
	t := time.Microsecond * time.Duration(val.Value*1_000_000)
	return &t, nil
}

var decoderFuncs = json.DecoderFuncs{
	decodeTime,
}

func doDecode() {
	dec := json.NewDecoder(...)
	err := dec.WithDecoderFuncs(decoderFuncs)
	if err != nil {
		// handle registration error
	}
	var value myType
	err = dec.Decode(&value)
	// ...
}
This isn’t a great solution though: It would rely on bypassing the type system and heavy use of reflection. And since reflection really only works in languages where the type isn’t erased at runtime, it can’t be used in languages like Rust, Haskell or OCaml.
As an aside, the solution in dynamic languages can be better than this, though ironically you need to make “types” to get that benefit. For example, assume you’re using Clojure and want to transmit data. You can make something that looks a bit like spec for every encoding scheme you want to support:
(json/deftype ::duration
  {:encode (fn [v] {:type "seconds", :value (to-seconds v)})
   :decode (fn [data]
             ;; decode logic here
             )})

(json/defobject ::running-session
  {:duration {:type ::duration}
   :distance {:type ::distance}
   :max-pace {:type ::max-pace}})
Then, to encode or decode something, you’d do
(json/decode ::running-session in-stream)
(json/encode data ::running-session out-stream)
Going Back to Code Generation
As mentioned previously, working with generated code always feels like a chore. Maybe that has less to do with the fact that the code is generated and more to do with what it generates. Consider Protocol Buffers (protobuf) – its code generation works like this:
However, there’s nothing stopping us from sending types into a protobuf code generator instead.
This is harder than just dumping out some types that don’t attempt to integrate with your code though. With protoc, we can only make relationships between the protobuf specification and our internal types by translating to and from the generated types in code:
Now, however, we have to somehow pass in the relationship between our types and the protobuf specification to the improved protobuf code generator instead:
I guess the reason protoc doesn’t do that is that it’s effort: no two languages are alike, and by emitting code you let the consumers of protoc take that hit instead.
Let’s take a quick stab at it anyway. For Go, we need to encode two things:
- Which protobuf message translates to which struct type we have, and which protobuf field maps to which Go field
- How, if it’s not obvious, we translate a protobuf type to and from a Go type
Let’s follow on with our running example:
type Km float64
type Kmh float64

type RunningSession struct {
	Duration time.Duration `json:"duration"`
	Distance Km            `json:"distance"`
	MaxPace  Kmh           `json:"max-pace"`
}
The corresponding protobuf message could then look like this
message RunningSession {
  double duration_seconds = 1;
  double distance_km = 2;
  double max_pace_kmh = 3;
}
and our mapping file could look something like this:
lang = go

import "github.com/hypirion/myapp/running"
import "github.com/hypirion/myapp/util"
import "time"

struct proto.RunningSession = go.running.RunningSession {
	duration_seconds = Duration
	distance_km = Distance
	max_pace_kmh = MaxPace
}

custom proto.double = go.time.Duration {
	encodeFunc = go.util.SecondsToFloat64
	decodeFunc = go.util.Float64ToSeconds
}
As Distance and MaxPace are float64s under the covers, our improved protoc tool is smart enough to know how to convert those. And whenever it sees a double where our Go field is a time.Duration, it calls the encoder and decoder functions inside the package "github.com/hypirion/myapp/util".
Consistency and Performance
Apart from the effort required to implement a new protobuf compiler, there seem to be reasons for protoc emitting new types for you: it adds an internal field storing unknown decoded fields, and it attaches ways to do generic operations through “reflection”. This new compiler cannot add an identical reflection API, but I also really struggle to see the point of the reflection capabilities added on the types themselves – I’ve genuinely tried to find an example use case, but to no avail.
It’s a little tempting to say that we can throw it away, though G.K. Chesterton would probably not approve of that. So let me just say that it’s surely valuable for someone, and I imagine we can make that piece into a new tool instead of bundling it into protoc itself.
There are other concerns, of course: It will abstract away the underlying format, which some people don’t like. It can also technically impact performance if you’re doing something silly within the (de)serialisation functions – such as doing network calls or writing stuff to the database.
However, in the base case, I’d be surprised if this doesn’t have roughly the same performance as the current tool – maybe even a bit better memory-wise, as you inline the (de)serialisation step. I mean, there are probably situations where this could interfere with the network, but I’d be surprised if this could hit harder than what we use today.
On Tractability
The true problem, of course, is the effort involved. Instead of just generating functions, types and values, you need to understand how to inject the deserialisation logic in every language. My proposed solution requires parsing Go code to infer types and to generate code, but that is a lot of effort if you want to support a lot of languages, especially those with parametric polymorphism/generics. In fact, I’m not sure this will even work for some languages.
You could in theory add your constructors and functions (verbosely) in the input file for every language, and while this technically will work, I’m afraid it will feel a little like manually converting e.g. protobuf types to your own. It will be less effort than doing it in code, but it increases the effort to get started with the protocol itself. As a sales pitch, it may not be the best, though I think it will be better long term than the status quo.
And perhaps that’s the problem: We focus more on the protocol itself and being able to quickly get something running with it, rather than thinking about how we integrate a protocol into our code base for readability and maintainability. That isn’t weird though, as people haven’t really seen code generators that do this. Perhaps we can move more people into that mindset if they’re able to see it as a viable alternative.
1. I think a better phrase here is “non-relational data”: I do think you should signal that the stuff you carry is a set or a bag, and if you have some compound value without any relational data in it, then that also counts as a non-relational value. But “scalars” gets the point across, so I’m using that. ↩
2. Personally, I prefer ["kmh", 10.0] or [10.0, "kmh"] as it’s easier to read, but it’s much easier to make a custom (de)serialiser for objects than for a heterogeneous list, which is why I’m doing that in this post. The point here is that the value isn’t just 10.0; the client must do some work to “unwrap” the underlying value. And yes, I know the code doesn’t check whether the value is set or not. In the real world, you should probably do that. ↩