Postgres WAL Files and Sequence Numbers
The Postgres Write Ahead Log (WAL) is a functional component to the database. WAL makes a lot of key functionality possible, like Point-in-Time-Recovery backups, recovering from an event, streaming replication, and more. From time to time, those deep inside the database will need to work directly with WAL files to diagnose or recover.
Recently in working with one of Crunchy Data's customers, I came across a situation where understanding the names and sequence numbers was important. In working with several of my colleauges that commit to the Postgres project, I collected notes on some of the details inside WAL. The goal today to look at the LSN and naming convention for WAL to help users understand WAL files a little better.
Log Sequence Number
Transactions in PostgreSQL create WAL records which are ultimately appended to
the WAL log (file). The position where the insert occurs is known as the
Log Sequence Number (LSN). The values of LSN (of type
pg_lsn
) can be compared to determine the amount of WAL generated between two
different offsets (in bytes). When using in this manner, it is important to know
the calculation assumes the full WAL segment was used (16MB) if multiple WAL
logs are used. A similar calculation to the one used here is often used to
determine latency of a replica.
The LSN is a 64-bit integer, representing a position in the write-ahead log stream. This 64-bit integer is split into two segments (high 32 bits and low 32 bits). It is printed as two hexadecimal numbers separated by a slash (XXXXXXXX/YYZZZZZZ). The 'X' represents the high 32-bits of the LSN and ‘Y’ is the high 8 bits of the lower 32-bits section. The 'Z' represents the offset position in the file. Each element is a hexadecimal number. The 'X' and 'Y' values are used in the second part of the WAL file on a default PostgreSQL deployment.
WAL File
The WAL file name is in the format TTTTTTTTXXXXXXXXYYYYYYYY. Here 'T' is the timeline, 'X' is the high 32-bits from the LSN, and 'Y' is the low 32-bits of the LSN.
Start by looking at the current WAL LSN and insert LSN. The pg_current_wal_lsn
is the location of the last write. The pg_current_wal_insert_lsn
is the
logical location and reflects data in the buffer that has not been written to
disk. There is also a flush value that shows what has been written to durable
storage.
[postgres] # select pg_current_wal_lsn(), pg_current_wal_insert_lsn();
pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
76/7D000000 | 76/7D000028
(1 row)
Although you can guess the name of the WAL file based on the above output, it is
best to use the pg_walfile_name
function.
[postgres] # select pg_walfile_name('76/7D000028');
pg_walfile_name
--------------------------
00000001000000760000007D
(1 row)
Looking at the file system we see that indeed segment 00000001000000760000007D
is the latest modified file. Note that if the database is idle then
00000001000000760000007D could have been the oldest file (based on O/S last
modified date). The reason that is possible is due to PostgreSQL reusing older
WAL segments. During the pg_wal_switch
, the older file was renamed. The O/S
date/time modified will only change when the file is written to. Using -i in the
ls command shows the inode (first column) for the file. When files are reused,
this number does not change as the file is just renamed.
$ ls -larti 00*
169034323 -rw------- 1 bpace staff 376 Jan 30 09:13 0000000100000075000000A0.00000060.backup
169564733 -rw------- 1 bpace staff 16777216 Feb 13 15:49 00000001000000760000007E
167120667 -rw------- 1 bpace staff 16777216 Feb 13 15:50 00000001000000760000007F
167120673 -rw------- 1 bpace staff 16777216 Feb 13 16:00 00000001000000760000007B
167120686 -rw------- 1 bpace staff 16777216 Feb 13 16:18 00000001000000760000007C
169564722 -rw------- 1 bpace staff 16777216 Feb 13 16:18 00000001000000760000007D
Let's create a small table and perform a WAL switch.
[postgres] # create table test (a char(1));
CREATE TABLE
Time: 23.770 ms
[postgres] # select pg_switch_wal();
pg_switch_wal
--------------
76/7D018FD8
(1 row)
Looking again at the files we now see that 00000001000000760000007D file has
been updated (date/time changed from O/S perspective). With some other items
occurring in the background the next segment 00000001000000760000007E has also
received some writes after the pg_switch_wal
.
$ ls -larti 00*
169034323 -rw------- 1 bpace staff 376 Jan 30 09:13 0000000100000075000000A0.00000060.backup
167120667 -rw------- 1 bpace staff 16777216 Feb 13 15:50 00000001000000760000007F
167120673 -rw------- 1 bpace staff 16777216 Feb 13 16:00 000000010000007600000080
167120686 -rw------- 1 bpace staff 16777216 Feb 13 16:18 000000010000007600000081
169564722 -rw------- 1 bpace staff 16777216 Feb 13 16:24 00000001000000760000007D
169564733 -rw------- 1 bpace staff 16777216 Feb 13 16:24 00000001000000760000007E
In the newly created table, insert one record. Ensure the database is quiet and grab the current WAL LSN before and after the change. Note that we are inserting 1 byte ('a') into a single table with a single column.
[postgres] # select pg_current_wal_lsn(), pg_current_wal_insert_lsn();
pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
76/7E000060 | 76/7E000060
(1 row)
[postgres] # insert into test (a) values ('a');
INSERT 0 1
[postgres] # select pg_current_wal_lsn(), pg_current_wal_insert_lsn();
pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
76/7E000108 | 76/7E000108
(1 row)
Using the two LSN positions we can calculate the amount of WAL that was
generated by the INSERT
(168 bytes in this case).
[postgres] # select '76/7E000108'::pg_lsn - '76/7E000060'::pg_lsn size_bytes;
size_bytes
------------
168
(1 row)
Grab the current WAL LSN and WAL insert LSN and perform one more switch. Then list out the files.
[postgres] # select pg_switch_wal();
pg_switch_wal
---------------
76/7E0001D0
(1 row)
$ ls -larti 00*
169034323 -rw------- 1 bpace staff 376 Jan 30 09:13 0000000100000075000000A0.00000060.backup
167120673 -rw------- 1 bpace staff 16777216 Feb 13 16:00 000000010000007600000080
167120686 -rw------- 1 bpace staff 16777216 Feb 13 16:18 000000010000007600000081
169564722 -rw------- 1 bpace staff 16777216 Feb 13 16:24 000000010000007600000082
169564733 -rw------- 1 bpace staff 16777216 Feb 13 16:26 00000001000000760000007E
167120667 -rw------- 1 bpace staff 16777216 Feb 13 16:26 00000001000000760000007F
With the information we captured in the previous steps, use pg_waldump
to get
a human readable summary of the WAL segment contents. In the following command
the starting position (-s) and ending position (-e) are specified along with the
WAL file name (00000001000000760000007E). The start position was the
current_wal_lsn
before our INSERT
and the ending position was the
current_wal_lsn
after our insert. Previous, just using LSNs, we determined
there were 168 bytes written to WAL from our transaction. Looking at the waldump
reveals that the INSERT
took 103 bytes (57 for the INSERT
, 46 for the
COMMIT
).
$ pg_waldump -s 76/7E000060 -e 76/7E000108 00000001000000760000007E
rmgr: Heap len (rec/tot): 57/ 57, tx: 59555584, lsn: 76/7E000060, prev 76/7E000028, desc: INSERT+INIT off 1 flags 0x08, blkref #0: rel 1663/5/53434 blk 0
rmgr: Transaction len (rec/tot): 46/ 46, tx: 59555584, lsn: 76/7E0000A0, prev 76/7E000060, desc: COMMIT 2023-02-13 16:25:19.441483 EST
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 76/7E0000D0, prev 76/7E0000A0, desc: RUNNING_XACTS nextXid 59555585 latestCompletedXid 59555584 oldestRunningXid 59555585
Looking at the entire WAL file from the point before our INSERT
, the INSERT
itself can be seen, then the COMMIT
. Finally, the Checkpoint is logged and the
switch from the pg_wal_switch()
that was performed. The RUNNING_XACTS
entries are added if wal_level
is set to replica or higher. The
RUNNING_XACTS
entries capture details of the current snapshot (active
transactions). The last entry, SWITCH
, is the pg_switch_wal
that was
performed.
$ pg_waldump 00000001000000760000007E
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 76/7E000028, prev 76/7D018FC0, desc: RUNNING_XACTS nextXid 59555584 latestCompletedXid 59555583 oldestRunningXid 59555584
rmgr: Heap len (rec/tot): 57/ 57, tx: 59555584, lsn: 76/7E000060, prev 76/7E000028, desc: INSERT+INIT off 1 flags 0x08, blkref #0: rel 1663/5/53434 blk 0
rmgr: Transaction len (rec/tot): 46/ 46, tx: 59555584, lsn: 76/7E0000A0, prev 76/7E000060, desc: COMMIT 2023-02-13 16:25:19.441483 EST
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 76/7E0000D0, prev 76/7E0000A0, desc: RUNNING_XACTS nextXid 59555585 latestCompletedXid 59555584 oldestRunningXid 59555585
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 76/7E000108, prev 76/7E0000D0, desc: RUNNING_XACTS nextXid 59555585 latestCompletedXid 59555584 oldestRunningXid 59555585
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 76/7E000140, prev 76/7E000108, desc: CHECKPOINT_ONLINE redo 76/7E000108; tli 1; prev tli 1; fpw true; xid 0:59555585; oid 61620; multi 799; offset 1657; oldest xid 716 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 59555585; online
rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: 76/7E0001B8, prev 76/7E000140, desc: SWITCH
Wrapping Up
I hope you don’t need to dig into WAL too often and may your pg_wal_switch
events be few and far between. As a reminder, various initdb options and compile
time options can change the outcome of the calculations and assumptions made.
WAL is a complicated topic and any element related to WAL should be treated with
great care.
Need help with WAL? Contact us and let's see if we can help.