On June 30, 2015, a leap second will occur to synchronize atomic clocks with the Earth's slowing rotation. The International Earth Rotation and Reference Systems Service (IERS) has used this method 26 times since 1972. The leap second of 2012 caused brief disruptions in computer networks at large sites including LinkedIn, Gawker, Mozilla and Reddit. More seriously, critical systems were disrupted, including flight check-ins for Qantas.
On October 1, 2015, Nasdaq will increase the granularity of timestamps on some of its data feeds to reflect nanosecond-level timing increments as part of the rollout of Version 2.1 of each feed. According to NasdaqTrader.com's Data Technical News #2015-11, a planned parallel period for versions 2.0 and 2.1 will run from October 1, 2015, through January 29, 2016 for the feeds receiving new versions, including Nasdaq Best Bid and Offer (QBBO), Nasdaq Last Sale (NLS), and NLS Plus.
It would seem that the world is getting ever more precise at measuring time, but there's something fundamentally flawed in the rationale behind these efforts.
You think you know time — but you really don’t
The crux of the problem with timestamps is that people mistake measurements of time (i.e., a timestamp) for the flow of time we experience in the real world. The time we experience in the real world is immutable, irrevocable, and infinitely divisible. It flows forever onward in a single direction as a series of events. You can never undo something (as much as we might sometimes wish we could!).
You can’t actually pin down real-world time, because you can’t measure an actual instant in time. We like to think we’re talking about an instant of time, but what we’re really talking about is a duration. 6:14 PM isn’t an instant; it’s the duration between 6:14 PM and 6:15 PM. Increasing your precision doesn’t change this. A second is always a second long. A nanosecond is still just a duration. No matter how finely you slice it, multiple events can happen within a single second, nanosecond, femtosecond, and so forth.
This means you cannot use time to determine the order of events. No matter how finely you slice time, multiple things can happen within a single slice.
We can’t really measure time either
Another serious problem with using timestamps is that we can’t really measure time. The problems here are almost too numerous to mention. Clocks can run backwards. Timezone definitions change frequently (an average of 9.6 changes per year from 2000 through 2009). It’s not possible to perfectly synchronize clocks between different systems. A clock on top of a mountain (such as all the computers at a telescope) will run faster than a clock at sea level. Daylight saving time can cause problems if you’re not careful. Many systems don’t understand leap seconds or how to handle them.
Then there are calendar complexities, like Feb. 30th. Less esoteric are problems with concepts like “1 month”. Exactly how long is a month? 30 days? 31? Likewise, you never want to deal with a time without a date, because then you can’t handle daylight saving time or any other timezone changes.
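To make the “1 month” problem concrete, here’s what PostgreSQL’s interval arithmetic does (a sketch; any database with interval types has similar quirks):

-- How long is "1 month"? It depends on where you start:
SELECT date '2015-01-31' + interval '1 month'; -- 2015-02-28 00:00:00
SELECT date '2015-03-31' + interval '1 month'; -- 2015-04-30 00:00:00

-- Adding "1 month" twice is not the same as adding "2 months":
SELECT date '2015-01-31' + interval '1 month' + interval '1 month'; -- 2015-03-28
SELECT date '2015-01-31' + interval '2 months';                     -- 2015-03-31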
We definitely can’t measure time in a database
Database transactions take time. They don’t happen instantly, but RDBMSes try hard to make it look as if they do. So what time should you record in your database? When the transaction started? When it finished? When it committed?
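PostgreSQL, for one, gives you all three answers, and they can disagree within a single transaction (a sketch; pg_sleep just exaggerates the gap):

BEGIN;
SELECT transaction_timestamp(); -- frozen at the start of the transaction
SELECT statement_timestamp();   -- frozen at the start of the current statement
SELECT clock_timestamp();       -- the wall clock right now; advances mid-statement
SELECT pg_sleep(2);
SELECT transaction_timestamp(), clock_timestamp(); -- now roughly 2 seconds apart
COMMIT;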
Where things really fall apart is when you try to sequence events in a database, because the data from a transaction isn’t visible to other transactions until it commits. If transaction A runs from 12:00 to 12:03 and transaction B runs from 12:01 to 12:02, which transaction happened first? You might want to say A, because it started first. But A can potentially see data from B after B commits. Conversely, you could say B happened first because it committed first, but clearly some events in transaction A happened before B even started.
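Laid out as two database sessions, the race looks like this (a sketch; the table t and the 12:0x times are illustrative). Setup, once: CREATE TABLE t (i int);

-- Session A                          -- Session B
BEGIN;                     -- 12:00
INSERT INTO t VALUES (1);  -- 12:00
                                      BEGIN;                     -- 12:01
                                      INSERT INTO t VALUES (2);  -- 12:01
                                      COMMIT;                    -- 12:02
SELECT * FROM t;  -- under READ COMMITTED, A now sees B's row
COMMIT;                    -- 12:03

A started first, B committed first, and A read data that didn’t exist when A began. There is no single correct ordering.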
Isn’t this all hypothetical?
Not really, though sometimes data quality isn’t important enough to worry about these problems.
A simple example is the “midnight problem.” Say you’re tracking visitors to your website, and you want to know how many people visit per day. Someone who first hits your site at 11:59 PM and stays for a while will end up counted twice if you’re not careful: once at 11:59 and then a second time after midnight. Technically all those visitors did visit on both days, but you may not have wanted to double-count them. People also like to be lazy and say timestamp_a BETWEEN '2015-01-01' AND '2015-01-02', but that’s going to double-count everything that happened at “exactly” midnight.
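The usual fix is a half-open range: inclusive at the start, exclusive at the end (the visits table here is illustrative):

-- BETWEEN is inclusive on both ends, so midnight lands in both days.
-- A half-open range counts every timestamp exactly once:
SELECT count(*)
  FROM visits
 WHERE timestamp_a >= '2015-01-01'
   AND timestamp_a <  '2015-01-02';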
Here’s a scenario where you would have a big problem with double-counting: finance. If a transaction is delayed it can mess up all kinds of financial calculations, most notably interest. So now you have to build some buffer into interest calculations, and then you have to hope that nothing comes in after the buffer period.
Another major problem is trying to correlate data. If you try to do something simple like date_a = date_b, it’s going to screw up data around midnight (and the same is true no matter what precision you use). So now you start adding buffers; you end up with expressions like date_a >= date_b AND date_a < date_b + '15 minutes'.
There’s a simple fix to all these problems!
When you need to know “when” something happened, you almost always want to know what the sequence of events was. So use a sequence! Or, even better, actually link records to each other. For example, instead of this:
CREATE TABLE transaction_journal(
  transaction_id
  , customer_id
  , transaction_date
  , transaction_amount
);
…do this:

CREATE TABLE transaction_journal(
  transaction_hid -- I'll explain the name change later
  , previous_transaction_hid -- Previous transaction for this customer
  , customer_id
  , transaction_date
  , transaction_amount
);

-- Ensure each customer has only one "first" transaction
CREATE UNIQUE INDEX ... ON transaction_journal( customer_id )
  WHERE previous_transaction_hid IS NULL;

-- Ensure there can only be one "next" transaction for each transaction
CREATE UNIQUE INDEX ... ON transaction_journal( customer_id, previous_transaction_hid )
  WHERE previous_transaction_hid IS NOT NULL;
Now there can never be a doubt as to what order the transactions for a customer took place in. You can always “walk the chain” of transactions to find out, starting where previous_transaction_hid IS NULL (the first transaction in the chain).
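In PostgreSQL you can walk the chain with a recursive CTE (a sketch; :customer_id is a bind-parameter placeholder):

WITH RECURSIVE chain AS (
  -- Anchor: the customer's first transaction
  SELECT j.*, 1 AS position
    FROM transaction_journal j
    WHERE j.customer_id = :customer_id
      AND j.previous_transaction_hid IS NULL
  UNION ALL
  -- Step: the one transaction that points at the current one
  SELECT j.*, c.position + 1
    FROM transaction_journal j
    JOIN chain c ON j.previous_transaction_hid = c.transaction_hid
)
SELECT * FROM chain ORDER BY position;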
Note that I prefer recording the previous transaction ID because then you can enforce that a transaction never changes once it’s written. That does have a drawback though: it makes it expensive to find the last transaction. You could instead use a next_transaction_hid field, but I prefer the safer route of having a transaction_journal__latest_transaction table that stores the last transaction ID for each customer.
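A sketch of that side table (the column types and the ON CONFLICT upsert are my additions; ON CONFLICT needs PostgreSQL 9.5+):

CREATE TABLE transaction_journal__latest_transaction(
  customer_id bigint PRIMARY KEY
  , transaction_hid bigint NOT NULL
);

-- Update the pointer in the same database transaction that inserts the
-- new journal row, so the two tables can never disagree:
INSERT INTO transaction_journal__latest_transaction( customer_id, transaction_hid )
  VALUES ( :customer_id, :new_transaction_hid )
  ON CONFLICT ( customer_id )
    DO UPDATE SET transaction_hid = EXCLUDED.transaction_hid;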
There are some other ways you could accomplish this, but the reason I like this approach is that it’s basically fool-proof. Even if you somehow get your previous_transaction_hid wrong, the unique indexes will protect you from almost all bugs, because you’d eventually end up trying to create a duplicate.
Another option is to use an ‘ordinal’ field that starts counting at 1 (or 0) and increases by 1. I’m not as big a fan of that approach though, because if you get a gap in the numbering you’re in trouble. Of course, if you’re really paranoid, do both things. That way you can tell if something went wrong.
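If you do both, the two orderings cross-check each other (a sketch; the column and index names are illustrative):

ALTER TABLE transaction_journal ADD COLUMN ordinal int NOT NULL;

-- Each customer's ordinals must be unique...
CREATE UNIQUE INDEX transaction_journal__customer_ordinal
  ON transaction_journal( customer_id, ordinal );

-- ...and any gap in the ordinals, or any disagreement between ordinal
-- order and chain order, tells you something went wrong.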
Another, stronger variation is to create a cryptographic hash for each record that includes the hash of the previous record. That makes it impossible for old data to be erased or changed without detection.
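A sketch of that using PostgreSQL’s pgcrypto extension (the record_hash column and the exact inputs to the hash are my additions):

CREATE EXTENSION IF NOT EXISTS pgcrypto; -- provides digest()

ALTER TABLE transaction_journal ADD COLUMN record_hash bytea;

-- When inserting a transaction, fold the previous record's hash into
-- the new record's hash. Tampering with any old row then breaks the
-- hash of every row after it. (The first record in a chain would hash
-- its own fields only.)
INSERT INTO transaction_journal
    ( previous_transaction_hid, customer_id, transaction_date
    , transaction_amount, record_hash )
  SELECT p.transaction_hid, p.customer_id, now(), :amount
      , digest( encode( p.record_hash, 'hex' )
                || p.customer_id::text
                || :amount::text
              , 'sha256' )
    FROM transaction_journal p
    WHERE p.transaction_hid = :previous_hid;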
But wait, there’s more!
A big benefit of chaining records like this is that you can now point to an exact point in time without any ambiguity. If you’re running an interest calculation, for example, you can record the most recent transaction_hid you used for that calculation. The next time you calculate, you start with the transaction where previous_transaction_hid = last_transaction_hid_processed. No guessing at dates or times. No problems if a transaction shows up late.
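As a sketch (the interest_bookmark table is illustrative), the bookkeeping is just:

-- When a calculation finishes, remember the last transaction it used:
UPDATE interest_bookmark
  SET last_transaction_hid_processed = :last_hid_used
  WHERE customer_id = :customer_id;

-- Next run: resume exactly one transaction past the bookmark, then keep
-- walking the chain as shown earlier. No date math, no buffer periods.
SELECT j.*
  FROM interest_bookmark b
  JOIN transaction_journal j
    ON j.previous_transaction_hid = b.last_transaction_hid_processed
  WHERE b.customer_id = :customer_id;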
Want to know the exact address associated with a customer’s order, despite any changes? Just use address_hid and you’re good.
Those use cases are why I use “_hid” instead of just “_id”: it makes it crystal clear that the key is a specific point in a history chain, not just a generic surrogate key.