Implementing System-Versioned Tables in Postgres

I like Postgres, but there are things I really wish they would implement. For example, the SQL:2011 spec adds support for system-versioned tables. Unfortunately, Postgres and SQLite are basically the only SQL databases that don’t yet support it. While it has been discussed at length on the pgsql-hackers mailing list, the discussion and implementation efforts seem to have gone silent throughout 2023.

Now, there are extensions out there that implement versioning – temporal_tables being the most popular I think – but none are supported for managed Postgres instances on e.g. Azure or AWS. This means that if we want system-versioned tables, we’re forced to make it ourselves.

I’m not going to spend paragraph after paragraph explaining why I want temporal tables before we get to the actual implementation, but I’d like to note that I want to use these tables primarily for user data, generated by users when they click/type something on a webpage. They can be used for other things, but if you have time series data, want to do event sourcing, or need online analytical processing, you should probably pick a technology suited for that task instead of throwing temporal tables at it.

If you’re only after the end result, feel free to go to my GitHub repository time-travelling-todo-lists-in-postgres. It is a todo list app with time-travelling capabilities, using the implementation described here. There you’ll find information on how to use it yourself, how to query the past, as well as gotchas and common pitfalls.

The Implementation

The implementation I’ve decided on consists of two tables for every table I want version controlled: A snapshot table for the current state, and a history table.

(Diagram: inserts, updates and deletes on the snapshot table are mirrored into the history table.)

The snapshot table holds the current state of the world and is the one you’ll usually work with: it behaves exactly like any other mutable, in-place table you’re familiar with. The history table is the one you’ll use when you want to query through history.

First off, we define the tables:

CREATE TABLE mytable (
  mytable_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  data TEXT NOT NULL
);

CREATE TABLE mytable_history (
  -- copy these fields, always keep them first
  history_id UUID PRIMARY KEY,
  systime TSTZRANGE NOT NULL CHECK (NOT ISEMPTY(systime)),

  -- table fields, in the exact same order as in mytable
  mytable_id UUID NOT NULL,
  data TEXT NOT NULL
);

-- btree_gist is needed for uuid equality inside a GiST exclusion constraint
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE mytable_history
  ADD CONSTRAINT mytable_history_overlapping_excl
  EXCLUDE USING GIST (mytable_id WITH =, systime WITH &&);

The history table contains the exact same fields as the snapshot table, except that the first two columns are a history_id field and the system time range (systime) during which that version was active. lower(systime) is when this version of the record was stored, and upper(systime) is when it stopped being valid. When upper(systime) is infinity, the record has not yet been updated or deleted. The interval obviously can’t be empty, as an empty interval wouldn’t match any point in time.

Next, we have a GiST-backed exclusion constraint, which prevents overlapping time intervals for the same primary key; the index behind it also speeds up queries on the history. You may want to add more indices on the history table, depending on what kind of queries you want to run against it.
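For example, if you often ask “what changed after time T?”, an expression index on the lower bound might help. This is just a sketch of mine, not something from the repository:

-- optional extra index (a sketch): speeds up "what changed after time T?" queries
CREATE INDEX mytable_history_lower_systime_idx
  ON mytable_history (lower(systime));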

Then we set up triggers for insert, update and delete:

--
-- insert:
--

CREATE OR REPLACE FUNCTION copy_mytable_inserts_into_history()
          RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO mytable_history
    SELECT gen_random_uuid(), tstzrange(NOW(), NULL), NEW.*;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_history_insert_trigger
AFTER INSERT ON mytable
    FOR EACH ROW
    EXECUTE PROCEDURE copy_mytable_inserts_into_history();

--
-- update:
--

CREATE OR REPLACE FUNCTION copy_mytable_updates_into_history()
          RETURNS TRIGGER AS $$
BEGIN
  -- ignore changes inside the same tx
  DELETE FROM mytable_history
    WHERE mytable_id = NEW.mytable_id
      AND lower(systime) = NOW()
      AND upper_inf(systime);
  -- close current row
  -- (if any, may be deleted by previous line)
  UPDATE mytable_history
    SET systime = tstzrange(lower(systime), NOW())
    WHERE mytable_id = NEW.mytable_id
      AND systime @> NOW();
  -- insert new row
  INSERT INTO mytable_history
    SELECT gen_random_uuid(), tstzrange(NOW(), NULL), NEW.*;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_history_update_trigger
AFTER UPDATE ON mytable
    FOR EACH ROW
    WHEN (OLD.* IS DISTINCT FROM NEW.*)
    -- ^ to avoid updates on "noop calls", as best as possible
    EXECUTE PROCEDURE copy_mytable_updates_into_history();

--
-- delete:
--

CREATE OR REPLACE FUNCTION copy_mytable_deletes_into_history()
          RETURNS TRIGGER AS $$
BEGIN
  -- close current row
  -- note: updates and then deletes for same id
  -- in same tx will fail
  UPDATE mytable_history
    SET systime = tstzrange(lower(systime), NOW())
    WHERE mytable_id = OLD.mytable_id
      AND systime @> NOW();
  RETURN OLD;
END;
$$ LANGUAGE plpgsql;


CREATE TRIGGER mytable_history_delete_trigger
AFTER DELETE ON mytable
    FOR EACH ROW
    EXECUTE PROCEDURE copy_mytable_deletes_into_history();

This is… long, to put it mildly, so let’s go through some common tricks all three triggers (ab)use, and then go over some design decisions I’ve made for updates and deletes. We’ll get around to shortening it at the end.

Column Order Matters

Whenever we insert new data into the history table, it will always be in this form:

INSERT INTO mytable_history
  SELECT gen_random_uuid(), tstzrange(NOW(), NULL), NEW.*;

There are two parts of Postgres we abuse here. If we look at the docs for INSERT, we’ll see the following paragraph:

The target column names can be listed in any order. If no list of column names is given at all, the default is all the columns of the table in their declared order; or the first N column names, if there are only N columns supplied by the VALUES clause or query. The values supplied by the VALUES clause or query are associated with the explicit or implicit column list left-to-right.

The other is that NEW.* expands its columns in the exact same declared order. In our case, NEW.* will expand into NEW.mytable_id, NEW.data, which means it lines up perfectly with the columns of mytable_history:

INSERT INTO mytable_history
  SELECT gen_random_uuid(), tstzrange(NOW(), NULL),
         NEW.mytable_id, NEW.data;

This is pretty clever, but it’s also hacky. For this to work, there are two things you have to get right:

  • Column order MUST be identical after the history_id and systime columns.
  • Whenever you add new columns, you MUST add them in the same order to both tables.

If you don’t do so, you will end up with broken inserts in the history table. Those will at best give you an error, but they can also silently corrupt your data.

I don’t like that, but there’s a reason why I don’t do

INSERT INTO mytable_history
  (history_id, systime, mytable_id, data)
SELECT gen_random_uuid(), tstzrange(NOW(), NULL),
       NEW.mytable_id, NEW.data;

and that’s because it creates a lot of extra work when you want to remove, rename, or add a column to the table. By omitting the list of target column names and expanding NEW.*, the query will automatically keep working whenever I decide to add, rename, or remove columns – provided I do it for both tables.
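For example, adding a (hypothetical) note column would have to be done to both tables, appended last in both:

-- hypothetical example: the new column goes into both tables,
-- in the same position (here: appended last)
ALTER TABLE mytable ADD COLUMN note TEXT;
ALTER TABLE mytable_history ADD COLUMN note TEXT;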

On System Time

If you haven’t worked with Postgres ranges before, the range queries may look a bit cryptic. The Postgres docs have an excellent description of range types and range functions, which should cover essentially everything related to them. I don’t use much of it though, and since I guess you don’t want to go through it all, I’ll just list what the different parts do.

The expression

tstzrange(a, b)

creates the timestamptz range [a, b): from a, up to but excluding b. If either a or b is NULL, that bound is infinitely far in the past or future, respectively.

lower(x) returns the lower bound of the range, and upper_inf(x) returns true if the upper bound is infinite.

Finally, systime @> NOW() checks if systime contains NOW() – you can think of it as a <= NOW() AND NOW() < b.

The reason for using ranges isn’t primarily that I want to use these functions, but rather that the exclusion constraint ensures no overlap, and its GiST index makes me less worried that my queries will be super slow by accident.
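To give a feel for how this is used, an “as of” query against the history table might look something like this – the id and timestamp here are made up:

-- what did this row look like at a given point in the past?
SELECT mytable_id, data
FROM mytable_history
WHERE mytable_id = 'f47ac10b-58cc-4372-a567-0e02b2c3d479'
  AND systime @> '2023-06-01 12:00:00+00'::timestamptz;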

Inserts

With those two things covered, inserts are more or less explained in their entirety: we insert the row into the history table, saying it’s valid from the transaction start time until forever.

Updates

The update trigger looks like this:

-- ignore changes inside the same tx
DELETE FROM mytable_history
  WHERE mytable_id = NEW.mytable_id
    AND lower(systime) = NOW()
    AND upper_inf(systime);
-- close current row
UPDATE mytable_history
  SET systime = tstzrange(lower(systime), NOW())
  WHERE mytable_id = NEW.mytable_id
    AND systime @> NOW();
-- insert new row
INSERT INTO mytable_history
  SELECT gen_random_uuid(), tstzrange(NOW(), null), NEW.*;

It is set up so that multiple updates in the same transaction will only produce a single history row. That’s intentional, because otherwise ordering would be harder to implement.
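To make that concrete, here’s a hypothetical walkthrough – ids are omitted, and the timestamps are made up and abbreviated:

-- tx 1, at 10:00
INSERT INTO mytable (data) VALUES ('v1');

-- tx 2, at 11:00: two updates in the same transaction
BEGIN;
UPDATE mytable SET data = 'v2' WHERE data = 'v1';
UPDATE mytable SET data = 'v3' WHERE data = 'v2';
COMMIT;

SELECT systime, data FROM mytable_history ORDER BY lower(systime);
--       systime       | data
-- --------------------+------
--  ["10:00","11:00")  | v1
--  ["11:00",)         | v3

The intermediate 'v2' version never shows up in the history – only the state each transaction left behind does.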

If you need multiple versions in the same transaction, you’ll have to replace the half-open intervals (the default [)) with closed ones ([]); otherwise you’d end up with empty intervals, which don’t contain any timestamps at all. But now you have to handle identical states at a single timestamp, as multiple records can get the [NOW(), NOW()] interval. If the ordering matters in that case, you’ll have to add another ordering ID to get rows back in the inserted order.
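For reference, the three-argument form of tstzrange lets you pick the bounds, so that variant would build intervals along these lines – just a sketch, and as noted, you’d also have to rethink the exclusion constraint:

-- closed interval [NOW(), NOW()]: non-empty even when both bounds are equal
SELECT tstzrange(NOW(), NOW(), '[]');
-- the default half-open interval is empty in that case
SELECT isempty(tstzrange(NOW(), NOW()));  -- true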

I don’t need that, it’s complicated, and I don’t think it makes sense to have multiple states at a single point in time anyway. If you need that, you’re probably interested in an event table or something else instead.

Deletions

-- close current row
-- note: updates and then deletes for same id
-- in same tx will fail
UPDATE mytable_history
  SET systime = tstzrange(lower(systime), NOW())
  WHERE mytable_id = OLD.mytable_id
    AND systime @> NOW();

When it comes to atomicity, deletions are somewhat of an outlier. If you update and then delete something in the same transaction, what should you do? And uh… why would you do that in the first place?

“Update and delete” seems quite handy if you want to write down who deleted the object, in which case the deletion will look like this:

UPDATE mytable
  SET deleted_by = ${deleted_by}
WHERE mytable_id = ${mytable_id};

DELETE FROM mytable
WHERE mytable_id = ${mytable_id};

… however, it leaves you with an annoying record: The “delete” record, which isn’t really part of the history, but only contains some information about the deletion itself.

I think it’s better to put in a little bit of effort for these, even if you don’t have an immediate need for a changelog/audit log: store the deletion information in a table. You could create one per table, or just a single one for everything, depending on how lazy you are:

CREATE TABLE delete_log (
  delete_log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  table_name TEXT NOT NULL,
  resource_id UUID NOT NULL,
  deleted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  deleted_by UUID NOT NULL REFERENCES users(user_id)
);

-- then delete as follows

INSERT INTO delete_log (table_name, resource_id, deleted_by)
VALUES ('mytable', ${mytable_id}, ${deleted_by});

DELETE FROM mytable
WHERE mytable_id = ${mytable_id};

I can’t make a trigger out of this easily (or maybe not at all), because deleted_by is information we’ll have to pass down into the database somehow.

Annoying Conflicts

While the triggers themselves don’t impact what you can and cannot do, the GiST index will blow up most attempts at concurrently updating the same item:

ALTER TABLE mytable_history
  ADD CONSTRAINT mytable_history_overlapping_excl
  EXCLUDE USING GIST (mytable_id WITH =, systime WITH &&);

This will even fail if you do SELECT .. FOR UPDATE. When two transactions A and B come in, and the transaction with the most recent NOW() timestamp gets the lock first, the second transaction will fail – either on the exclusion constraint or because systime ends up as an empty interval.

It’s annoying, but not a big deal for me at least. The tables where I need history rarely need write concurrency on the same ID. And if they do, it’s not hard to add a retry loop around the writes… though of course, it adds a pitfall you need to be aware of.

Trying to fix this by using CLOCK_TIMESTAMP() instead of NOW() will only make things worse in my opinion. If you insert multiple items in the same transaction, they will usually not share the same timestamp. That makes certain queries on the history table technically wrong:

Imagine you have a list table and a list element table, and a user creates a new list with 10 elements. With CLOCK_TIMESTAMP(), you now end up with 11 different timestamps: one for the list root, and one for each of the 10 list elements. If you wondered what the list looked like when it was first created, you’d first have to fetch the list root. Then you’d have to fetch all elements that were in the list, say, as of 1 second after it was created, to be relatively sure all the list elements are also returned. But any attempt at recreating the state at some arbitrary time in the past may end up with partial/corrupt results. For that reason, I highly recommend avoiding this, and instead using some other method to persist history if you really need it.

Alternatives

A trigger-based model seems to me like the best one if you have to use Postgres, but there are some alternatives you may want to consider.

Do You Need The History?

There is added complexity here, and there is an increased chance of transactions failing because of overlapping timestamps that you can’t really prevent. That raises the question: do you really need the history?

My “controversial” opinion is that, in this day and age, the real question is whether you can afford it performance-wise, not whether it’s too complex. This particular implementation of persisting history is probably not very fast, but it is conceptually simple: whenever you do something to a table, that action is stored in a history table. The original table is intentionally identical to what you’re used to. If you really don’t like the potential GiST conflicts, you can relax the constraints to get behaviour identical to what you’re used to. In that case, the only difference is the insert/update/delete performance.
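Relaxing it could be as simple as dropping the exclusion constraint and keeping a plain index instead – a sketch, assuming you accept the (small) risk of overlapping intervals in the history:

-- drop the overlap protection...
ALTER TABLE mytable_history
  DROP CONSTRAINT mytable_history_overlapping_excl;
-- ...but keep an index so history queries stay fast
-- (btree_gist is still required for the uuid column)
CREATE INDEX mytable_history_id_systime_idx
  ON mytable_history USING GIST (mytable_id, systime);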

Yes, there are many cases where it’s clearly bonkers to retain the history, but I also think many people assume storing the history is costlier than it actually is¹. If you’re not sure, try it and see how it goes. Since your original table will be untouched, it’s not hard to roll back.

For me, there are two reasons I think you should consider it, even if you have no immediate plans to expose this to your users:

First, there will be a time when a specific customer needs their data rolled back to an earlier version. That could either be because of accidental deletes, rogue actors, or even a bad deployment that corrupted data for certain customers. While this doesn’t replace backups by a long shot, it’s a lot faster and easier to do

BEGIN;

DELETE FROM mytable
WHERE company_id = ${company_id};

INSERT INTO mytable (your, fields, here)
SELECT h.your, h.fields, h.here
FROM mytable_history h
WHERE h.company_id = ${company_id}
  AND h.systime @> ${backup_time};

COMMIT;

It will of course end up being a little more complex than that in practice, depending on the data model and how fine-grained you want to be. But it’s less effort than setting up the backup server, bridging it with a foreign data wrapper, and then finally doing the inserts in exactly the same manner as above, before cleaning up the backup server afterwards.

There are other cases too, like debugging a transient issue one of your users had. I could go on, but it feels like repeating the sales pitch for Datomic.

The second reason relates to the gut reaction of saying “We don’t need it”: If you don’t have time travel abilities readily available, you can’t explore and play with them. This, in turn, means you won’t really think of them when developing new functionality for your customers. At the risk of sounding like a techbro: I think it stifles innovation. Do you really think there’s no value for it somewhere in your system in the future?

Yeah, yeah, this sounds a lot like future-proofing. To be clear, YAGNI isn’t wrong, but I think it only applies when there’s either

  • significant overhead (either in terms of implementation complexity or system performance)
  • something that can be retroactively implemented (i.e. no data loss)

and system-versioned tables are neither.

… well, uh, let me clarify here. The current implementation I’ve shown you requires a lot of work when you make new tables. Which is a good transition to my next point:

Use a Database Suited for the Task

Writing all of these long triggers yourself just because Postgres doesn’t support it is a bit stupid. If you want an open-source alternative, perhaps MariaDB isn’t too bad? It supports SQL:2011 and even has support for bitemporality if you want to go even further.

Of course, I’d be remiss if I didn’t mention Datomic. It is a great database, although not open-source, and language support is rather limited (only Java/Clojure officially). For those in Microsoft land, SQL Server has great support for SQL:2011 as well, from what I gather.

As mentioned, this is aimed at user data generated by someone clicking or typing around on a webpage, mutating some content of theirs. But for IoT data, you’d want some tool other than temporal tables, and if you’re big enough, you probably want to use an OLAP database for OLAP workloads.

Sloppy-Paste or Eldritch Horrors

I am stuck with Postgres, both because that’s what I’ve already got and because I heavily use PostGIS. Migrating away is simply not an option, so I have to make do with what I have.

What I like about the trigger solution is that it is – aside from the column expansion hack – not terribly hard to understand. The problem is that it’s very verbose, and after being burnt by a couple of copy-paste mistakes in the past, I’d rather share a single set of trigger functions across all tables if possible.

It is possible… though it does not look pretty. Here’s what the update trigger looks like after we make the history table and ID field into input parameters:

CREATE FUNCTION copy_updates_into_history() RETURNS TRIGGER AS $$
DECLARE
  history_table TEXT := quote_ident(tg_argv[0]);
  id_field TEXT := quote_ident(tg_argv[1]);
BEGIN
  -- ignore changes inside the same tx
  EXECUTE 'DELETE FROM ' || history_table ||
    ' WHERE ' || id_field || ' = $1.' || id_field ||
    ' AND lower(systime) = NOW()' ||
    ' AND upper_inf(systime)' USING NEW;
  -- close current row
  -- (if any, may be deleted by previous line)
  EXECUTE 'UPDATE ' || history_table ||
    ' SET systime = tstzrange(lower(systime), NOW())'
    ' WHERE ' || id_field || ' = $1.' || id_field ||
    ' AND systime @> NOW()' USING NEW;
  -- insert new row
  EXECUTE 'INSERT INTO ' || history_table ||
    ' SELECT gen_random_uuid(), tstzrange(NOW(), null), $1.*'
    USING NEW;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

(The insert and delete triggers are similar, so I won’t go over them here.)
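For completeness, here’s a rough sketch of what the generic delete variant could look like with the same approach – the exact versions live in the repository:

CREATE FUNCTION copy_deletes_into_history() RETURNS TRIGGER AS $$
DECLARE
  history_table TEXT := quote_ident(tg_argv[0]);
  id_field TEXT := quote_ident(tg_argv[1]);
BEGIN
  -- close current row
  EXECUTE 'UPDATE ' || history_table ||
    ' SET systime = tstzrange(lower(systime), NOW())'
    ' WHERE ' || id_field || ' = $1.' || id_field ||
    ' AND systime @> NOW()' USING OLD;
  RETURN OLD;
END;
$$ LANGUAGE plpgsql;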

To factor out the history table and ID field, we have to use EXECUTE, which makes the entire thing feel even more hacky. It looks ugly, and if you’re not familiar with EXECUTE or the previously mentioned column expansion magic, it may feel outright eldritch. However, I think this is better than the first attempt at the trigger implementation for one big reason. And that is best shown by using it in practice:

CREATE TRIGGER mytable_history_update_trigger
AFTER UPDATE ON mytable
    FOR EACH ROW
    WHEN (OLD.* IS DISTINCT FROM NEW.*)
    -- ^ to avoid updates on "noop calls", as best as possible
    EXECUTE PROCEDURE
      copy_updates_into_history('mytable_history', 'mytable_id');

This is much less verbose to use if you want to retain history for multiple tables!

While it’s true that the dynamic query can’t be type-checked ahead of time, the original triggers weren’t either: we have to run them to verify that they don’t refer to the wrong tables or IDs. Given that, we should pick the alternative that reduces the chance of a sloppy search-and-replace.

And this new implementation does that! This trigger has the history table and the ID field in only two places – the input arguments – whereas the original trigger implementation had them spread across 7 different locations. Also, let’s face it: copy-pasting some big triggers for every single table we want system-versioned just wouldn’t feel right.

Summary

The full implementation is over in the time-travelling-todo-lists-in-postgres repository, and the triggers are in the migration file migrations/001_history_triggers.up.sql.

As I mentioned, this isn’t perfect: You have some big pitfalls related to table modifications, it adds a risk of having concurrent updates break, and other databases should have a much better implementation than my three triggers. It’s probably not suitable for large databases either.

But if you’re stuck with Postgres, I think this is fine. However, you should be aware of how the implementation works in detail, as there are a couple of ways to shoot yourself in the foot. And if you’ve read this far down, you hopefully do!

  1. Without going too much on a tangent here: I think too many small companies use systems that are made by and for gigantic companies. So much so that “scale” has been ingrained and is almost an implicit requirement for anything we evaluate.