Journey to TimescaleDB [Mike Freedman]

A short history of time series databases, and why TimescaleDB is a PostgreSQL extension rather than a new DB.

Listen to the full interview on SEDaily: https://softwareengineeringdaily.com/2021/06/28/timescale-time-series-databases-with-mike-freedman/


We originally created Timescale really from our own need. Around that
time, 2014-2015, my co-founder Ajay Kulkarni and I, who go back many years, resynced
and started thinking that it was a good time for both of us to figure out what the next
challenges were that we wanted to tackle. It seemed to us that there was this emerging trend
that people now talk about as digitization, or digital transformation. It feels like somewhat of an
analyst term, but I think it's really representative of what's happening, in that if you think about the
big IT revolution, it was about changing the back office. What used to be on paper
was now in computers.


What we saw was somewhat the same thing happening to basically every industry, from heavy
industry, to shipping, to logistics, to manufacturing, both discrete and continuous, and home IoT.
Sometimes this gets blurred under IoT, but we also think about it more broadly as operational
technology, those which are not necessarily bits, but atoms. A big part of that was actually
collecting data about what those systems were doing. It's about sensors and data and whatnot.
When we initially looked at this problem, we were thinking about a type of data platform we
would want to build, to make it easy to collect and store and analyze that type of data. I think
that's why what we ultimately built as our database
ended up being fairly different from a lot of other so-called time series databases. That's
because many of them arose out of IT monitoring, where they were trying to collect metrics from
servers, whereas we were originally thinking about collecting data more broadly from all these types
of applications and devices around your world.
When we started building it, we were originally focusing mostly on IoT. We quickly ran into this
problem that the existing databases and the time series databases out there were not
really designed for our problems. They were often much more limited, because they were
focusing on this narrow infrastructure monitoring problem, where the data maybe wasn't as
important. It was only a very specific type. Let's say they stored only floats. They didn't
have the extra metadata you'd want to enrich your data with, like through joins, to better
understand what was going on.
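
To make the point about metadata and joins concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module purely as a stand-in for a SQL database; it is not TimescaleDB or PostgreSQL. The table names (devices, readings), the columns, and all values are invented for illustration and do not come from the interview.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE devices (
        device_id INTEGER PRIMARY KEY,
        site      TEXT NOT NULL,
        kind      TEXT NOT NULL
    );
    CREATE TABLE readings (
        ts        TEXT    NOT NULL,
        device_id INTEGER NOT NULL REFERENCES devices(device_id),
        value     REAL    NOT NULL
    );
""")

# Hypothetical metadata: which site each device belongs to.
conn.executemany("INSERT INTO devices VALUES (?, ?, ?)", [
    (1, "plant-a", "thermometer"),
    (2, "plant-b", "thermometer"),
])
# Hypothetical time series rows: timestamp, device, measured value.
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("2021-06-28T10:00:00", 1, 20.0),
    ("2021-06-28T10:01:00", 1, 22.0),
    ("2021-06-28T10:00:00", 2, 30.0),
])


def average_by_site(connection):
    """Join raw readings against device metadata and average per site."""
    rows = connection.execute("""
        SELECT d.site, AVG(r.value)
        FROM readings AS r
        JOIN devices AS d ON d.device_id = r.device_id
        GROUP BY d.site
        ORDER BY d.site
    """).fetchall()
    return dict(rows)


averages = average_by_site(conn)
```

A store that holds only bare floats keyed by metric name has no equivalent of the devices table to join against, which is the limitation described above.
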
After basically working on this platform for about a year, we came to the conclusion
that we actually needed to build our own time series database that focused on
this broader type of problem, and so that's what we did. That's what led to the development of
what became Timescale.

JM: Today, what are the most common applications of a time series database?

I can speak mostly about TimescaleDB, obviously, because, as I was
alluding to before, a lot of the other time series databases are much more narrowly focused on
IT monitoring, or observability. We really see our use cases across the field. We certainly see
cases of observability. In fact, we have subsequently built a separate product on top of
Timescale called Promscale, which was initially used for Prometheus metrics but, more
broadly, makes it easier to store observability data with TimescaleDB.
We see still a lot of IoT. We see a lot of logistics. We see financial data and crypto data. We see
event sourcing. We see product and user analytics. We see people collecting data about how
users are using their SaaS platforms. We see gaming analytics, where companies are collecting
information about how people's virtual avatars are actually playing within the games. We see
music analytics. We like to say that in the old way, to find the pop stars, you went down to the
smoky club. Now you collect SoundCloud and Spotify streams, and you use that to identify who
the next breakout artist is going to be.
All of these are examples of time series data. What's really so exciting to us is that it's such a
broad use case, so horizontal, because basically, it's all about collecting data at the finest
granularity you can.

JM: Tell me about the initial architecture for TimescaleDB. You’re based off of
PostgreSQL. What was the reasoning around that decision?

I think, as you point out, Timescale is actually implemented as an extension on
PostgreSQL. Starting maybe 10 or 15 years ago, PostgreSQL started exposing low-level
hooks throughout its code base. This is not a plugin where you're running a little JavaScript
code. We get function hooks into the C code. PostgreSQL is written
in C, and so TimescaleDB is, for the most part, written in C. We have hooks throughout the code
base: at the planner, sometimes in the storage, at the execution nodes. We are able to insert
ourselves and do a lot of optimizations as part of the same process.

You could ask the question: why not just implement a new database from scratch? Why build
it on top of PostgreSQL? I think this really gets to how we always viewed ourselves, and we
hear this from our users and community all the time: they are storing critical data
inside TimescaleDB, and they need it to work and be reliable. They also
have a lot of use case requirements. It's not this, again, narrow thing where you're collecting
one metric, and all you're asking to do is figure out the min, max, and average of that metric.
You want to do fancy analysis. You want to do joins. You want to do subqueries. You want to do
correlations. You want to have views. You want the operational maturity of a database. You want
transactions, backup and restore, replication, and all of the above. Some people
say it takes at least 10 years to build a reliable database. We thought this was a great
way to immediately gain that level of reliability. We ourselves are huge fans of
PostgreSQL. It has such a great community. It also has such a large ecosystem.
The idea is that effectively, that entire ecosystem would work for us on day one. That means
all of the tooling, all of the ORMs, all of your libraries would just work. We support full SQL, not
SQL-ish. If you know how to use SQL, you could start using it, and if your tools speak SQL, if
you're running Tableau, if you're running Power BI, if you're running Grafana, if you're running
Superset, those all just start working on day one.
Now, the second part of it is, well, what does it mean to build a time series database on top of
PostgreSQL, which clearly was designed more as a traditional transactional database, an OLTP
engine? Sometimes we talk about thinking about this architecturally. What I mean by that is
you think about what your workloads look like and what that would mean for the
software architecture. Maybe I'll give you a very concrete example. Starting maybe 10 or 15
years ago, if you look at traditional databases, you started seeing the growth of what people
now commonly call log-structured merge trees, LSMs.
This is a data structure that goes back to the mid-90s, but I think you first saw it when Jeff
Dean and Sanjay Ghemawat at Google built something called LevelDB. The whole idea of an LSM tree
was, if you look at a workload that has a lot of updates, as with a lot of e-commerce
applications or social networks, you're constantly updating things. With a traditional
database, if you think about a disk, if you're doing a lot of in-place updates, and these updates
are randomly distributed across all of your user IDs, you're going to cause your
disk to do a lot of random writes. On hard drives, that's particularly bad, because you need to
move the disk head.
Even on SSDs, it doesn't do great, because SSDs still do a lot better with sequential writes
than random writes, given the way the internals of SSDs work. You started seeing this new type of
database architecture built on LSM trees emerge, because people wanted to build databases that
had a lot faster updates. Time series databases, on the other hand, don't typically have this
type of workload.
If you think of a stream of new observations, each with a timestamp, these are typically about
what's happening now. It's typically a stream of inserts that are about this stock price now,
this stock price now, this stock price now, or 100, or 1,000, or 100,000 different
sensors all reporting what they're recording right now.
If you think about how you would then design the internals of your database and the data
structures, when most of your writes are inserts, and particularly inserts about the latest time
interval, then what that would mean is the internal structure of your data should
reflect that. You should optimize your insert path to make it super-efficient to perform inserts on
the latest time interval. It doesn't have to be perfectly in order, but it mostly is about what's
happening recently, as opposed to what happened a year ago.
That said, Timescale absolutely allows you to backfill data and perform updates or deletes on
older data. It's just that, from a performance perspective, it keeps all the recent stuff in memory and
builds more efficient data structures to allow you to insert at much higher rates. For example, on
a single machine, if you're collecting a stream of records, each with, let's say, 10 metrics,
you'll be able to collect even 1 to 2 million metrics per second on a single, pretty standard
machine.
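
One way to picture an insert path organized around the latest time interval is a toy model in Python that routes each row to a fixed-width time bucket. This is a sketch under stated assumptions, not TimescaleDB's implementation: the class name ChunkedSeries, the one-hour CHUNK_WIDTH, and the in-memory dictionary are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

CHUNK_WIDTH = timedelta(hours=1)  # assumed bucket size, chosen for illustration
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class ChunkedSeries:
    """Toy store that groups rows into fixed-width time chunks."""

    def __init__(self):
        # Maps chunk start time -> list of (timestamp, value) rows.
        self.chunks = defaultdict(list)

    @staticmethod
    def chunk_start(ts):
        """Round a timestamp down to the start of its chunk."""
        return ts - (ts - EPOCH) % CHUNK_WIDTH

    def insert(self, ts, value):
        # Recent rows land in the newest chunk; backfilled rows are
        # routed to the older chunk that covers their timestamp.
        self.chunks[self.chunk_start(ts)].append((ts, value))

    def latest_chunk(self):
        """Return the start time of the most recent chunk."""
        return max(self.chunks)


series = ChunkedSeries()
now = datetime(2021, 6, 28, 10, 30, tzinfo=timezone.utc)
series.insert(now, 101.5)                        # "this stock price now"
series.insert(now + timedelta(minutes=5), 101.7)
series.insert(now - timedelta(days=365), 88.0)   # backfill of year-old data
```

In this model nearly all inserts touch only the newest chunk, so that is the one worth keeping in memory, while the occasional backfill still works by landing in an older chunk.
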

We see this again and again: the way we think about architecting Timescale is really about
thinking about what the workload looks like. People often care about recent data. The way
they want to manage their data changes as that data ages. They might want to optimize for
even faster queries on the recent stuff. They might want to start reorganizing their data as it ages.
They might want to start automatically aggregating the data as it ages, and dropping
the raw data for the very old stuff to save space. All of these things are what you'd want in a
good time series database, but not what you'd want from a traditional OLTP database,
or from a traditional data warehouse or analytical database, which doesn't treat
this operational view of time series as central.
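
The idea of aggregating data as it ages and dropping old raw rows can be sketched in a few lines of Python. This is only an illustration of the lifecycle described above, not how TimescaleDB implements it: the function name age_out, the seven-day retention window, and the hourly average are all assumptions picked for the example.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def age_out(rows, now, raw_retention=timedelta(days=7)):
    """Split rows into recent raw rows and hourly averages of older rows.

    rows: iterable of (timestamp, value) tuples.
    Returns (recent_rows, hourly_averages). Raw rows older than
    raw_retention are not returned; only their hourly averages survive.
    """
    cutoff = now - raw_retention
    recent = []
    old_by_hour = {}
    for ts, value in rows:
        if ts >= cutoff:
            recent.append((ts, value))
        else:
            # Truncate the timestamp to the start of its hour.
            hour = ts.replace(minute=0, second=0, microsecond=0)
            old_by_hour.setdefault(hour, []).append(value)
    hourly = {hour: mean(values) for hour, values in sorted(old_by_hour.items())}
    return recent, hourly


now = datetime(2021, 6, 28, 12, 0, tzinfo=timezone.utc)
old = now - timedelta(days=30)
rows = [
    (old, 10.0),
    (old + timedelta(minutes=20), 14.0),
    (now - timedelta(hours=1), 50.0),
]
recent, hourly = age_out(rows, now)
```

Here the two month-old readings collapse into a single hourly average, while the reading from an hour ago is kept at full resolution.
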
2021 Swyx