AS OF SQL in Dolt
Dolt is a unique product that can version your data. Among its many features is the 'AS OF' query, which lets you look at the data at any past point in time, with unlimited time travel. (Other databases support this as well, in particular MariaDB and SQL Server; a good survey is found here: illuminatedcomputing.com/posts/2019/08/sql2..)
Solving the AS OF problem gets us closer to building a temporal database.
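Before the example, here is the general shape of the feature in Dolt's syntax. This is a sketch with placeholder table and branch names; the revision after AS OF can be a branch name, a commit hash, or a timestamp (which Dolt resolves to the nearest commit):

```sql
-- Query a table as it existed at an earlier point.
-- 'my_table' and 'my_branch' are placeholders for illustration.
SELECT *
FROM my_table AS OF 'my_branch';

SELECT *
FROM my_table AS OF TIMESTAMP('2020-01-01 00:00:00');
```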
Let's see an example:
Case 1: A toy payroll example (do not model your payroll like this)
DROP TABLE IF EXISTS logged_hours;
DROP TABLE IF EXISTS employee_hourly;
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255),
PRIMARY KEY (`id`)
);
INSERT INTO employees (name) VALUES
('Jane'), ('Jeff'), ('Jill'), ('Joe');
CREATE TABLE employee_hourly (
`id` int(11) NOT NULL AUTO_INCREMENT,
employee_id int(11),
hourly_rate_in_dollars int(11),
PRIMARY KEY (`id`),
FOREIGN KEY (employee_id) REFERENCES employees(id)
);
INSERT INTO employee_hourly (employee_id, hourly_rate_in_dollars)
VALUES (1, 105), (2, 75), (3, 85), (4, 100);
CREATE TABLE logged_hours (
id int(11) NOT NULL AUTO_INCREMENT,
employee_id int(11),
work_date date,
logged_date date,
hours_logged int(11),
PRIMARY KEY (`id`),
FOREIGN KEY (employee_id) REFERENCES employees(id)
);
INSERT INTO logged_hours (employee_id, work_date,
logged_date, hours_logged)
VALUES (1, '2000-01-01', '2000-01-02', 4),
(2, '2000-01-01', '2000-01-02', 2),
(3, '2000-01-07', '2000-01-07', 3),
(4, '2000-01-01', '2000-01-01', 2),
(4, '2000-01-01', '2000-01-01', 1);
Now we may want to run payroll as follows:
SELECT employee_hourly.employee_id,
SUM(hours_logged) as total_hours,
hourly_rate_in_dollars,
sum(hours_logged) * hourly_rate_in_dollars as pay_for_period,
'2000-01-01' as period
FROM logged_hours
JOIN employee_hourly ON logged_hours.employee_id = employee_hourly.employee_id
WHERE logged_date >= '2000-01-01' AND logged_date <= '2000-01-31'
GROUP BY employee_hourly.employee_id;
+-------------+-------------+------------------------+----------------+------------+
| employee_id | total_hours | hourly_rate_in_dollars | pay_for_period | period |
+-------------+-------------+------------------------+----------------+------------+
| 1 | 4 | 105 | 420 | 2000-01-01 |
| 2 | 2 | 75 | 150 | 2000-01-01 |
| 3 | 3 | 85 | 255 | 2000-01-01 |
| 4 | 3 | 100 | 300 | 2000-01-01 |
+-------------+-------------+------------------------+----------------+------------+
4 rows in set (0.00 sec)
This is all well and good until we realize that the hourly rate was supposed to be doubled:
UPDATE employee_hourly SET hourly_rate_in_dollars = hourly_rate_in_dollars * 2;
We can run payroll again:
SELECT employee_hourly.employee_id,
SUM(hours_logged) as total_hours,
hourly_rate_in_dollars,
sum(hours_logged) * hourly_rate_in_dollars as pay_for_period,
'2000-01-01' as period
FROM logged_hours
JOIN employee_hourly ON logged_hours.employee_id = employee_hourly.employee_id
WHERE logged_date >= '2000-01-01' AND logged_date <= '2000-01-31'
GROUP BY employee_hourly.employee_id;
+-------------+-------------+------------------------+----------------+------------+
| employee_id | total_hours | hourly_rate_in_dollars | pay_for_period | period |
+-------------+-------------+------------------------+----------------+------------+
| 1 | 4 | 210 | 840 | 2000-01-01 |
| 2 | 2 | 150 | 300 | 2000-01-01 |
| 3 | 3 | 170 | 510 | 2000-01-01 |
| 4 | 3 | 200 | 600 | 2000-01-01 |
+-------------+-------------+------------------------+----------------+------------+
4 rows in set (0.00 sec)
This only captures the current state of the data; we can no longer reproduce the historical, flawed report, which is exactly what a "SELECT ... AS OF" would give us.
To do that in standard SQL, we'd have to modify our schema to treat hourly rates as immutable events, each with a valid-from timestamp:
CREATE TABLE employee_hourly_v2 (
`id` int(11) NOT NULL AUTO_INCREMENT,
employee_id int(11),
hourly_rate_in_dollars int(11),
date_inputted date,
PRIMARY KEY (`id`),
FOREIGN KEY (employee_id) REFERENCES employees(id)
);
INSERT INTO employee_hourly_v2 (employee_id, hourly_rate_in_dollars, date_inputted)
VALUES (1, 105, '1970-01-01'), (2, 75, '1970-01-01'), (3, 85, '1970-01-01'), (4, 100, '1970-01-01');
Now the update clause would be:
INSERT INTO employee_hourly_v2 (employee_id, hourly_rate_in_dollars, date_inputted)
VALUES (1, 210, '2000-02-15'), (2, 150, '2000-02-15'), (3, 170, '2000-02-15'), (4, 200, '2000-02-15');
This makes the operational system a bit more cluttered: to meet the strict reporting requirements we no longer mutate data, we only append. In exchange, we can answer the AS OF question with a window function.
To run payroll today, but as it was seen on '2000-01-31', we can use the following SQL (note the line that says date_inputted <= '2000-01-31'):
WITH latest_hourly as (
SELECT * FROM (
SELECT employee_id, hourly_rate_in_dollars, date_inputted,
ROW_NUMBER() OVER w as 'row_number'
FROM employee_hourly_v2
WHERE date_inputted <= '2000-01-31'
WINDOW w AS (
PARTITION BY employee_id
ORDER BY employee_id, date_inputted DESC
)
) T WHERE T.row_number = 1
)
SELECT latest_hourly.employee_id,
SUM(hours_logged) as total_hours,
hourly_rate_in_dollars,
sum(hours_logged) * hourly_rate_in_dollars as pay_for_period,
'2000-01-01' as period
FROM logged_hours
JOIN latest_hourly ON logged_hours.employee_id = latest_hourly.employee_id
WHERE logged_date >= '2000-01-01' AND logged_date <= '2000-01-31'
GROUP BY latest_hourly.employee_id;
+-------------+-------------+------------------------+----------------+------------+
| employee_id | total_hours | hourly_rate_in_dollars | pay_for_period | period |
+-------------+-------------+------------------------+----------------+------------+
| 1 | 4 | 105 | 420 | 2000-01-01 |
| 2 | 2 | 75 | 150 | 2000-01-01 |
| 3 | 3 | 85 | 255 | 2000-01-01 |
| 4 | 3 | 100 | 300 | 2000-01-01 |
+-------------+-------------+------------------------+----------------+------------+
4 rows in set (0.01 sec)
To run the payroll as of 2000-02-15 (note the line now says date_inputted <= '2000-02-15'):
WITH latest_hourly as (
SELECT * FROM (
SELECT employee_id, hourly_rate_in_dollars, date_inputted,
ROW_NUMBER() OVER w as 'row_number'
FROM employee_hourly_v2
WHERE date_inputted <= '2000-02-15'
WINDOW w AS (PARTITION BY employee_id ORDER BY employee_id, date_inputted DESC)
) T WHERE T.row_number = 1
)
SELECT latest_hourly.employee_id,
SUM(hours_logged) as total_hours,
hourly_rate_in_dollars,
sum(hours_logged) * hourly_rate_in_dollars as pay_for_period,
'2000-01-01' as period
FROM logged_hours
JOIN latest_hourly ON logged_hours.employee_id = latest_hourly.employee_id
WHERE logged_date >= '2000-01-01' AND logged_date <= '2000-01-31'
GROUP BY latest_hourly.employee_id;
+-------------+-------------+------------------------+----------------+------------+
| employee_id | total_hours | hourly_rate_in_dollars | pay_for_period | period |
+-------------+-------------+------------------------+----------------+------------+
| 1 | 4 | 210 | 840 | 2000-01-01 |
| 2 | 2 | 150 | 300 | 2000-01-01 |
| 3 | 3 | 170 | 510 | 2000-01-01 |
| 4 | 3 | 200 | 600 | 2000-01-01 |
+-------------+-------------+------------------------+----------------+------------+
4 rows in set (0.00 sec)
I think the overhead of this window function is one reason most databases avoid supporting time travel. It gets even more complex if every single table has to be built on this append-only model, which is not how a traditional relational database is usually designed.
Here's a key feature of Dolt: we need not make any of these changes, since AS OF is supported natively. Here's the database you can fork, built with the statements above: dolthub.com/repositories/av16/demo/query/main
First, notice that you can see the history of all changes: who made them and when.
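That same history is also queryable from SQL through Dolt's system tables. A sketch, assuming the dolt_log and dolt_history_<table> conventions (the exact column set may vary by Dolt version):

```sql
-- Commit log: who made each change and when.
SELECT commit_hash, committer, date, message
FROM dolt_log;

-- Row-level history of employee_hourly across commits.
SELECT employee_id, hourly_rate_in_dollars, commit_date
FROM dolt_history_employee_hourly
ORDER BY employee_id, commit_date;
```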
Now we just run the following query:
SELECT employee_hourly.employee_id,
SUM(hours_logged) as total_hours,
hourly_rate_in_dollars,
sum(hours_logged) * hourly_rate_in_dollars as pay_for_period,
'2000-01-01' as period
FROM logged_hours
JOIN employee_hourly AS OF TIMESTAMP('2022-01-16 07:15:30')
ON logged_hours.employee_id = employee_hourly.employee_id
WHERE logged_date >= '2000-01-01' AND logged_date <= '2000-01-31'
GROUP BY employee_hourly.employee_id;
We can also attach AS OF to the other table in the join, to see logged_hours as of a point in the past:
SELECT employee_hourly.employee_id,
SUM(hours_logged) as total_hours,
hourly_rate_in_dollars,
sum(hours_logged) * hourly_rate_in_dollars as pay_for_period,
'2000-01-01' as period
FROM logged_hours AS OF TIMESTAMP('2022-01-16 07:15:30')
JOIN employee_hourly
ON logged_hours.employee_id = employee_hourly.employee_id
WHERE logged_date >= '2000-01-01' AND logged_date <= '2000-01-31'
GROUP BY employee_hourly.employee_id;
This is just scratching the surface of Dolt. If you simply want time travel, check out MariaDB as well; it actually has more functionality for temporal-only use cases, like application-time periods (mariadb.com/kb/en/application-time-periods).
Dolt, however, innovates on collaborative use cases, in which many people can clone the same database and then merge their changes back together.
A good example is OpenStreetMap-style geo tables, where you may want a custom, geolocalized map for a specific area (say you map Los Angeles uniquely) but still be able to merge and diff against the mainline periodically. Other databases don't offer this. You can also see who committed each change, in a git-blame style.
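The clone-and-branch-and-merge workflow can be driven from SQL as well. A sketch using Dolt's stored procedures; the branch name and commit message here are made up for illustration:

```sql
-- Work on an isolated branch of the shared database.
CALL DOLT_CHECKOUT('-b', 'los-angeles-edits');

-- ... make local edits to the geo tables here ...

-- Record the work as a commit, then fold it back into main.
CALL DOLT_COMMIT('-a', '-m', 'Custom Los Angeles mapping');
CALL DOLT_CHECKOUT('main');
CALL DOLT_MERGE('los-angeles-edits');
```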
Another good example is switch configurations. If you've ever used network hardware, you know you can configure settings in a staging area: arista.com/en/um-eos/eos-configure-session
Once reviewed, you can commit the changes (and, in the case of an error, do a git-style revert): github.com/dolthub/dolt/issues/1969
And one last example is simulation modeling: you can keep multiple branches with different values and run queries against each of them in parallel, to understand how a forecasting model would differ before committing the idealized one.
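Since AS OF also accepts a branch name, the parallel-scenario idea can be sketched like this (the branch names, table, and column here are hypothetical):

```sql
-- Compare the same forecast query across two scenario branches.
SELECT 'optimistic' AS scenario, SUM(projected_revenue) AS total
FROM forecasts AS OF 'scenario-optimistic'
UNION ALL
SELECT 'pessimistic', SUM(projected_revenue)
FROM forecasts AS OF 'scenario-pessimistic';
```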
Those are all topics for another day, but hopefully this has piqued your interest. Database engineering is alive and well.