Issue
I am running Grafana on an EC2 instance in AWS to visualize data stored in a Timescale database that is also hosted in the cloud.
My Grafana dashboard has 14 panels (8 time series, 3 bar charts, 1 pie chart, 2 stats) and altogether performs 69 SQL queries (including 11 queries issued by the defined Grafana variables). These queries run partly against the raw data table and partly against hourly aggregate views. The dashboard also contains 17 data transformations. Most of the tests were done using a time range of 1 day (144 data points when using the raw data table, 24 data points when using the hourly aggregate view).
An example query:
SELECT mtimestamp, mvalue AS "internal temperature" FROM periodic_measurements WHERE ($__timeFilter(mtimestamp) AND apartmentid = $apartment AND roomid = $room_id AND metric = 1)
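For reference, Grafana's $__timeFilter() macro expands to a plain time-range predicate on the given column, so the SQL that actually reaches the database looks roughly like this (the timestamps and id values below are illustrative placeholders):
-- Illustrative macro expansion for a 1-day time range; the real bounds
-- come from the dashboard time picker, and 12/3 stand in for the
-- $apartment and $room_id variables.
SELECT mtimestamp, mvalue AS "internal temperature"
FROM periodic_measurements
WHERE mtimestamp BETWEEN '2023-11-01T00:00:00Z' AND '2023-11-02T00:00:00Z'
  AND apartmentid = 12 AND roomid = 3 AND metric = 1;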
The problem
When I connect the dashboard to my alpha database (whose main table has about 16 million entries), the dashboard loads in a few seconds.
However, when I use the production database (which has over 1,000 million entries, i.e. about a billion) as the data source, the dashboard takes about 40 minutes to load.
The source data tables
The columns used in the Grafana queries are indexed, since they are all part of a composite primary key.
For example, the raw data table is defined as follows:
CREATE TABLE periodic_measurements (
    mtimestamp  TIMESTAMP WITH TIME ZONE NOT NULL,
    apartmentid INTEGER NOT NULL,
    roomid      INTEGER NOT NULL,
    sensorid    INTEGER NOT NULL,
    metric      SMALLINT NOT NULL,
    mvalue      NUMERIC NULL,
    PRIMARY KEY (mtimestamp, apartmentid, roomid, sensorid, metric)
);
production_measurements=> \d+ periodic_measurements;
Table "public.periodic_measurements"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------+--------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
mtimestamp | timestamp with time zone | | not null | | plain | | |
originaldb | smallint | | | | plain | | |
originalid | bigint | | | | plain | | |
apartmentid | integer | | not null | | plain | | |
roomid | integer | | not null | | plain | | |
sensorid | integer | | not null | | plain | | |
metric | smallint | | not null | | plain | | |
mvalue | numeric | | | | main | | |
Indexes:
"periodic_measurements_pkey" PRIMARY KEY, btree (mtimestamp, apartmentid, roomid, sensorid, metric)
"periodic_measurements_mtimestamp_idx" btree (mtimestamp DESC)
Triggers:
ts_cagg_invalidation_trigger AFTER INSERT OR DELETE OR UPDATE ON periodic_measurements FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.continuous_agg_invalidation_trigger('1')
ts_insert_blocker BEFORE INSERT ON periodic_measurements FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_1422_chunk,
_timescaledb_internal._hyper_1_1423_chunk,
_timescaledb_internal._hyper_1_1424_chunk,
_timescaledb_internal._hyper_1_1425_chunk,
_timescaledb_internal._hyper_1_1942_chunk,
_timescaledb_internal._hyper_1_1943_chunk,
_timescaledb_internal._hyper_1_1944_chunk,
_timescaledb_internal._hyper_1_1945_chunk,
_timescaledb_internal._hyper_1_1946_chunk,
_timescaledb_internal._hyper_1_1947_chunk,
_timescaledb_internal._hyper_1_1948_chunk,
_timescaledb_internal._hyper_1_1949_chunk,
_timescaledb_internal._hyper_1_1950_chunk,
_timescaledb_internal._hyper_1_1951_chunk,
_timescaledb_internal._hyper_1_1952_chunk,
_timescaledb_internal._hyper_1_1953_chunk,
_timescaledb_internal._hyper_1_1954_chunk
Access method: heap
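Note that the primary-key index leads with mtimestamp, so a b-tree lookup cannot start from apartmentid or roomid alone. Whether a given panel query benefits from the index and from TimescaleDB's chunk exclusion can be checked with EXPLAIN; a sketch with placeholder literals in place of the dashboard variables:
-- Placeholder literals (12, 3) stand in for $apartment and $room_id.
-- The plan shows which chunks are scanned and whether
-- periodic_measurements_pkey (or the mtimestamp index) is used.
EXPLAIN (ANALYZE, BUFFERS)
SELECT mtimestamp, mvalue
FROM periodic_measurements
WHERE mtimestamp >= now() - interval '1 day'
  AND apartmentid = 12 AND roomid = 3 AND metric = 1;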
When a continuous aggregate is created in Timescale, an index is automatically created for each GROUP BY column. The index is a composite index, combining the GROUP BY column with the time_bucket column. Here is an example:
CREATE MATERIALIZED VIEW max_hour_aggregates
(mtimestamp, apartmentid, roomid, sensorid, metric, mvalue)
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', mtimestamp, 'Europe/Helsinki'),
apartmentid,
roomid,
sensorid,
metric,
MAX(mvalue)
FROM public.periodic_measurements
GROUP BY time_bucket('1 hour', mtimestamp, 'Europe/Helsinki'),
apartmentid,
roomid,
sensorid,
metric
WITH NO DATA;
testdb=> \d+ _timescaledb_internal._materialized_hypertable_205
Table "_timescaledb_internal._materialized_hypertable_205"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------+--------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
mtimestamp | timestamp with time zone | | not null | | plain | | |
apartmentid | integer | | | | plain | | |
roomid | integer | | | | plain | | |
sensorid | integer | | | | plain | | |
metric | smallint | | | | plain | | |
mvalue | numeric | | | | main | | |
Indexes:
"_materialized_hypertable_205_apartmentid_mtimestamp_idx" btree (apartmentid, mtimestamp DESC)
"_materialized_hypertable_205_metric_mtimestamp_idx" btree (metric, mtimestamp DESC)
"_materialized_hypertable_205_mtimestamp_idx" btree (mtimestamp DESC)
"_materialized_hypertable_205_roomid_mtimestamp_idx" btree (roomid, mtimestamp DESC)
"_materialized_hypertable_205_sensorid_mtimestamp_idx" btree (sensorid, mtimestamp DESC)
Triggers:
ts_insert_blocker BEFORE INSERT ON _timescaledb_internal._materialized_hypertable_205 FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Access method: heap
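A panel query against this continuous aggregate filters on the bucketed time column plus the GROUP BY columns, so it can be served by the auto-created indexes listed above. A sketch reusing the same dashboard variables (the dashboard's actual aggregate-view queries are not shown here):
SELECT mtimestamp, mvalue AS "max internal temperature"
FROM max_hour_aggregates
WHERE $__timeFilter(mtimestamp)
  AND apartmentid = $apartment AND roomid = $room_id AND metric = 1;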
HW and SW configuration
The source database is a Timescale database (PostgreSQL-based) hosted on the Aiven cloud service. It has 8 CPUs, 16 GB RAM, 512 GB of disk space, and network bandwidth up to 10 Gbps. The server runs PostgreSQL 14.10, and both databases use TimescaleDB 2.11.2.
The server where Grafana runs is an m5a.xlarge EC2 instance in AWS running Ubuntu 22.04, with 4 vCPUs, 16 GB RAM, and network bandwidth up to 10 Gbps. It runs Grafana 9.5.1 (free version).
I access the EC2 Linux machine hosting Grafana through Remote Desktop Connection from Windows 10 Pro (22H2), over a connection with more than 27 Mbps download and 5 Mbps upload bandwidth.
Test results (source: production database)
During the tests, monitoring the two servers (with the HW configuration described above) produced the following results:
- Timescale server
- CPU usage: 50%
- IOPS read: 1,600
- memory usage: 11%
- EC2 machine
- CPU usage: 38%
- Network in (bytes) - 5 minutes: 5M
- Network out (bytes) - 5 minutes: 210M
Test variants that have been tried
To decrease the loading time, I have tried different HW configurations, always obtaining loading times in the same range. The HW configuration described above is the most capable of those I tested.
I have also tried to minimize the number of separate SQL queries the dashboard performs by using common panels, but that has not improved performance at all.
When single queries copied from the Grafana dashboard are tested in a SQL CLI, the response time of each query is acceptable (a few seconds) and does not explain the long loading time of the full dashboard.
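A minimal way to reproduce such a single-query test in psql, with placeholder values substituted for the dashboard variables:
-- \timing prints the wall-clock time of each statement, so individual
-- dashboard queries can be measured one at a time outside Grafana.
\timing on
SELECT mtimestamp, mvalue AS "internal temperature"
FROM periodic_measurements
WHERE mtimestamp >= now() - interval '1 day'
  AND apartmentid = 12 AND roomid = 3 AND metric = 1;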
The question
Any hints as to why my Grafana dashboard becomes so slow to load (more than half an hour) when reading from tables containing thousands of millions of entries?
Thanks a lot in advance,
Bernardo Di Chiara
Solution
The problem was caused by certain queries used in the definition of Grafana variables.
Here is an example:
SELECT DISTINCT roomid FROM periodic_measurements WHERE apartmentid = $apartment AND roomid != 0 ORDER BY roomid;
Those queries were very heavy for the measurement database. One likely reason is that the periodic_measurements table has no dedicated index on apartmentid alone, and no index on roomid either.
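One possible remedy would have been to add an index matching that variable query; a sketch (not the route I took):
-- Hypothetical index, not applied in this case: leading with apartmentid
-- lets "SELECT DISTINCT roomid ... WHERE apartmentid = X" be answered
-- from the index instead of scanning the whole hypertable.
CREATE INDEX periodic_measurements_apartment_room_idx
    ON periodic_measurements (apartmentid, roomid);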
Instead, I solved the issue by changing the data source of those problematic queries.
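For instance, if the apartment/room combinations are also available in a small metadata table (called rooms here purely for illustration), the variable query can read from it instead of the billion-row hypertable:
-- Hypothetical replacement source: "rooms" is an illustrative small
-- lookup table of room ids per apartment, not a table from the post.
SELECT DISTINCT roomid
FROM rooms
WHERE apartmentid = $apartment AND roomid != 0
ORDER BY roomid;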
The load time has decreased to a couple of minutes or less. The amount of transferred data has also decreased on both the Timescale server and the Grafana EC2 machine, as has the CPU utilization on the Timescale server.
Thanks for the hints!