InfluxDB in IoT world: Introduction (Part 1)
Time series data
InfluxDB is an open-source time series database, that is, a database optimized for handling time series data. But what is time series data anyway? Essentially, it’s a sequence of values indexed by time. Load times of a website, temperature measurements of a smart fridge, and daily closing values of the Dow Jones Industrial Average are all examples of time series data. As Baron Schwartz wrote, some typical characteristics of a time series database workload are:
- More than 90% of the database’s workload is a high volume of high-frequency writes
- Writes are typically appends to existing measurements over time
- These writes are typically done in a sequential order, for example: every second or every minute
- If a time-series database gets constrained for resources, it is typically because it is I/O bound
- Updates to correct or modify individual values already written are rare
- Deleting data is almost always done across large time ranges (days, months or years), rarely if ever to a specific point
- Queries issued to the database are typically sequential per-series, in some sort of sort order with perhaps a time-based operator or function applied
InfluxDB introduction
First released in 2013, InfluxDB is a fairly young database. The time series database is not a new idea, though. For example, Kdb, a commercial high-performance time series DBMS, was first released in 2000. Only recently did the category become mainstream, partly due to the rise of the Internet of Things, partly because of the NoSQL and NewSQL movements, and partly because of ever-increasing volumes of data. InfluxDB leads the DB-Engines popularity ranking of time series databases, with a large gap to second place. It is used by eBay, Cisco, IBM and other big companies.
Why use InfluxDB in IoT world
So why did we at Airly decide to use InfluxDB? We receive a high volume of data from air pollution sensors, and InfluxDB turns out to be extremely fast at ingesting data (thanks to an LSM tree storage engine optimized for time series data). Just how fast? I ran a performance test on a single-node InfluxDB installation on a c4.8xlarge Amazon Web Services instance (36 vCPUs, 60GB of RAM, not too shabby), pushing data from another instance in the AWS network using the InfluxDB benchmark project.
Using --batch-size=5000 --workers=32, the result is more than decent:
loaded 3888000 items in 9.311872sec with 32 workers (mean rate 417531.503762/sec, 180.18MB/sec from stdin)
Every item is around 10 values, a mix of integers, floats and short strings. 400k writes per second, that’s decent 🚀. The load isn’t exactly IoT traffic, though, unless you batch your writes on the service layer before sending them to InfluxDB, which you probably should do. Batch writes are more than recommended [1].
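To make the batching advice concrete, here is a minimal sketch of a service layer grouping points into payloads of at most 5000 lines (the batch size used in the test above) in InfluxDB's line protocol text format. The measurement name `air`, the `sensor` tag and the field names are made up for illustration, the formatter skips the escaping rules for special characters, and actually sending the payloads to InfluxDB is left out.

```python
from itertools import islice


def to_line(measurement, tags, fields, ts_ns):
    """Format one point as a line protocol line (simplified: no escaping)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_items = []
    for key, value in fields.items():
        if isinstance(value, bool):
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            field_items.append(f"{key}={value}i")    # integers carry an 'i' suffix
        elif isinstance(value, float):
            field_items.append(f"{key}={value}")
        else:
            field_items.append(f'{key}="{value}"')   # strings are double-quoted
    return f"{measurement}{tag_part} {','.join(field_items)} {ts_ns}"


def batches(lines, size=5000):
    """Yield newline-joined payloads of at most `size` lines each."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield "\n".join(chunk)


# Hypothetical sensor readings: 12,000 points become 3 write payloads.
base_ts = 1_500_000_000_000_000_000  # nanosecond timestamp, arbitrary
points = (
    to_line("air", {"sensor": f"s{i % 3}"}, {"pm25": 12.5, "count": i}, base_ts + i)
    for i in range(12_000)
)
payloads = list(batches(points))
print(len(payloads), "payloads")
```

Each payload would then go out as a single write request instead of thousands of single-point requests.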
The above isn’t a real benchmark of course, but rather a glimpse of what a single InfluxDB instance can do on the ingress side. For some real benchmarks I would encourage reading InfluxDB comparisons to other DBs.
What’s more, all that data (around 4 million measurements) fits into 78MB of disk space. When we migrated our data from PostgreSQL to InfluxDB, we saw disk space usage drop by a factor of around 19. More on our migration to InfluxDB below.
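As a quick sanity check on those numbers, the arithmetic can be sketched as below. It assumes the 78MB figure covers exactly the 3,888,000 items from the log line, that 1MB means 10^6 bytes, and that every item holds 10 values; all three are approximations.

```python
items = 3_888_000        # from the benchmark log line
elapsed_s = 9.311872     # from the benchmark log line
disk_mb = 78             # reported on-disk size
values_per_item = 10     # "around 10 values" per item (approximation)

rate = items / elapsed_s                        # items ingested per second
bytes_per_item = disk_mb * 1_000_000 / items    # assumes 1 MB = 10^6 bytes
bytes_per_value = bytes_per_item / values_per_item

print(f"{rate:,.0f} items/s")
print(f"{bytes_per_item:.1f} bytes per item, {bytes_per_value:.1f} bytes per value")
```

That works out to roughly 20 bytes per item, or about 2 bytes per stored value, which gives a feel for how well the storage engine compresses this kind of data.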
Other than data ingest speed, InfluxDB:
- allows series to be indexed
- has an SQL-like query language
- provides advanced time aggregation features
- provides built-in linear interpolation for missing data
- supports automatic data down-sampling
- supports continuous queries to compute aggregates
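To illustrate what time aggregation and down-sampling mean in practice, here is a plain Python sketch that averages raw points into fixed time windows. It only shows the idea and is not how InfluxDB implements it; the function name `downsample` and the sample data are made up.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_s):
    """Average (timestamp_s, value) points into fixed windows of `window_s` seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)   # align to the window start
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical raw readings: (seconds since start, value)
raw = [(0, 10.0), (20, 14.0), (65, 30.0), (70, 34.0), (130, 8.0)]
per_minute = downsample(raw, 60)
print(per_minute)
```

A continuous query does conceptually the same thing on a schedule, writing the aggregated points into another series so that the raw data can be dropped later.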
Query speed comparison between InfluxDB and PostgreSQL
At Airly, we produce and store air quality sensor data and need to analyze it. Our most common load is lots of writes to the DB (pre-batched by a back-end service) plus analytical queries.
Below is a comparison of two queries run on InfluxDB and PostgreSQL, both on the same AWS instance type (m4.4xlarge). I’ll call those queries:
- Aggregation:
SELECT avg(value), stddev(value) FROM measurements WHERE type = 'PM25' AND time BETWEEN 'XXX' AND 'YYY';
- Count:
SELECT count(*) FROM measurements WHERE type = 'PM25' AND time BETWEEN 'XXX' AND 'YYY';
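For readers less familiar with SQL, this is roughly what the two queries compute, sketched in plain Python over a handful of made-up rows. The sample data and time range are hypothetical, and the standard deviation here is the sample standard deviation, which is what PostgreSQL's stddev returns.

```python
from statistics import mean, stdev

# Hypothetical rows: (time as ISO-8601 string, type, value)
rows = [
    ("2017-01-01T00:00:00Z", "PM25", 10.0),
    ("2017-01-01T01:00:00Z", "PM25", 14.0),
    ("2017-01-01T02:00:00Z", "PM10", 40.0),
    ("2017-01-01T03:00:00Z", "PM25", 18.0),
    ("2017-01-02T00:00:00Z", "PM25", 99.0),
]
start, end = "2017-01-01T00:00:00Z", "2017-01-01T23:59:59Z"

# WHERE type = 'PM25' AND time BETWEEN start AND end (BETWEEN is inclusive).
# Same-format ISO-8601 strings compare correctly as plain strings.
selected = [v for t, kind, v in rows if kind == "PM25" and start <= t <= end]

aggregation = (mean(selected), stdev(selected))   # avg(value), stddev(value)
count = len(selected)                             # count(*)
print(aggregation, count)
```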
The execution time results:
Query | PostgreSQL | InfluxDB |
---|---|---|
Aggregation | 215 seconds | 149 seconds |
Count | 77 seconds | 1 second |
The aggregation query is more CPU-bound, which could explain the similar results. It’s still a decrease of about 30%.
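The relative differences can be worked out from the table with a few lines of Python. The timings are the ones reported above; nothing else is assumed.

```python
# (PostgreSQL seconds, InfluxDB seconds), taken from the results table
timings_s = {
    "Aggregation": (215, 149),
    "Count": (77, 1),
}

# For each query: (percentage decrease in time, speedup factor)
results = {
    name: ((pg - influx) / pg * 100, pg / influx)
    for name, (pg, influx) in timings_s.items()
}

for name, (decrease_pct, speedup) in results.items():
    print(f"{name}: {decrease_pct:.1f}% less time, {speedup:.1f}x faster")
```

The aggregation query comes out at about a 31% decrease (around 1.4 times faster), while the count query is 77 times faster.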
Coming up next…
In the next part of this series we’re going to see how to easily host InfluxDB on Amazon Web Services, how it scales, and some performance tips for AWS.