How lemmy.ca tracked down why PostgreSQL was at 100% CPU and crashing the Lemmy website this past weekend: PostgreSQL auto_explain
IIRC, it was lemmy.ca's full copy of live data that was used (a copy made on a development system, not the live server, if I'm following). This procedure was shared on GitHub on Saturday, July 22:
...
Notable are the auto_explain SQL activation statements:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
SET auto_explain.log_nested_statements = true;
This technique would be of great use for developers making changes and studying PostgreSQL activity. Thank you!
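For anyone who wants this on every connection instead of just the session where LOAD was run, a minimal sketch (assuming superuser access; the settings take effect for new sessions after a reload) could look like:
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = 0;          -- log a plan for every statement, no time threshold
ALTER SYSTEM SET auto_explain.log_analyze = true;            -- include actual timings/row counts (adds overhead)
ALTER SYSTEM SET auto_explain.log_nested_statements = true;  -- also log statements run inside functions
SELECT pg_reload_conf();                                     -- pick up the settings without a restart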
Thank you for sharing. I had not come across the auto_explain technique and was manually pulling SQL out of pg_stat_statements and trying to match up runtime parameters for $1, $2, etc. with EXPLAIN.
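For reference, a rough sketch of that manual workflow (column names assume PostgreSQL 13+, where total_exec_time replaced total_time):
SELECT calls, mean_exec_time, total_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- pg_stat_statements only stores normalized query text, so you have to
-- hand-substitute real values for $1, $2, ... and then re-run the statement
-- yourself under EXPLAIN (ANALYZE, BUFFERS) to see the actual plan.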
Oof yeah, that's not gonna be a fun way to troubleshoot!
Thanks for taking the ball and running with this; it looks like it's getting a lot of attention now, and hopefully the next release will have some juicy SQL improvements!
I've really been bothered by the server crashes, especially on lemmy.ml, since even before I created my first Lemmy account. So far, I haven't seen anyone able to reproduce the overloads/crashes outside of production servers.
If we can get pg_stat_statements output, with pg_stat_statements.track = all, from a real Lemmy 0.18.3 database that has been running for 12 or more hours and is having crashes, I think we might find something that's being overlooked.
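As a sketch of what that capture could look like (assuming superuser access on the crashing instance, and that a restart to change shared_preload_libraries is acceptable):
-- postgresql.conf (restart required for shared_preload_libraries):
--   shared_preload_libraries = 'pg_stat_statements'
--   pg_stat_statements.track = all                   -- capture nested statements too, not just top-level ones
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;    -- exposes the pg_stat_statements view
SELECT pg_stat_statements_reset();                    -- zero the counters at the start of the 12+ hour window
-- after the window, dump the pg_stat_statements view (e.g. ORDER BY total_exec_time DESC) and share it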