Devious SQL: Dynamic DDL in PostgreSQL
Supporting PostgreSQL DBAs is an important part of daily life here at
Crunchy Data. I’ve recently run across a few use
cases where utility queries based on the current state of the database are
needed. A simple example could be where you have a table that is the target of
logical replication and the id
column becomes out of sync with the sequence
that generated the data. This would result in new rows having primary key
conflicts. To correct this issue, you would need to set the sequence to generate
values past the current max value in the table.
This example is part of a larger class of problems which are best solved with functionality that SQL by itself does not directly provide: Dynamic DDL. Data Definition Language (DDL) in SQL itself is notoriously non-dynamic, with strict parsing rules, predefined data types, table structures, and queries based on known and articulated columns.
So how can we bend SQL to our will and execute Dynamic DDL Postgres queries without having to manually write these queries each time? In this next installment of my Devious SQL series (see posts #1 and #2), I’ll show you some SQL approaches to get the job done.
Altering sequence restart values
Let us again consider a scenario where we want to explicitly provide the
RESTART
value for a
sequence via a
query. This is an easy thing to express in terms of what we would like to do: we
want to reset a sequence to start after the current maximum value of the table
it is associated with.
Trying the naïve approach, we get:
ALTER SEQUENCE big_table_id_seq RESTART (SELECT max(id) + 1 FROM big_table);
ERROR: syntax error at or near "(", at character 41
STATEMENT: ALTER SEQUENCE big_table_id_seq RESTART (SELECT max(id) + 1 FROM big_table);
As we can see, this approach isn't supported by the PostgreSQL grammar, as it is expecting an actual value here, not a subquery (as nice as that would be).
So what are some approaches here?
Using psql
variable substitution
If we are using psql
, we have a few options on how to solve this problem. One
approach is using
psql
variables
and first selecting the value we want into a variable, then substituting this
value into the expression we pass to psql:
-- use \gset to set a psql variable with the results of this query
SELECT max(id) + 1 as big_table_max from big_table \gset
-- substitute the variable in a new query
ALTER SEQUENCE big_table_id_seq RESTART :big_table_max ;
ALTER SEQUENCE
In this example, we are using the \gset
command to capture the results of the
first query and store it for use later in the psql
session. We then
interpolate this variable into our expression using the :big_table_max
syntax,
which will be passed directly to the PostgreSQL server.
Using psql
's \gexec
command
Another method of utilizing psql
for dynamic SQL is constructing the query as
a SELECT
statement returning the statements you wish to run, then using the
\gexec
command to execute the underlying queries. First let's look at making
ourselves a query that returns the statement we want, then we'll run this
statement using \gexec
:
SELECT 'ALTER SEQUENCE big_table_id_seq RESTART ' || max(id) + 1 as query FROM big_table;
SELECT 'ALTER SEQUENCE big_table_id_seq RESTART ' || max(id) + 1 as query FROM big_table \gexec
query
ALTER SEQUENCE big_table_id_seq RESTART 100001
ALTER SEQUENCE
A benefit of this approach compared to the variable substitution one is that
this can work with more complex statements and multiple return values, so you
could construct queries based on arbitrary conditions and generate more than one
SQL query; the first implementation is limited to queries that return single
rows at a time. This also gives you a preview of the underlying SQL statement
that you will be running before you execute it against the server with
\gexec
, so provides some level of safety if you were doing some sort of
destructive action in the query.
Dynamic SQL without psql
Not everyone uses psql
as the interface to PostgreSQL, despite its obvious
superiority :-), so are there ways to support dynamic SQL using only server-side
tools? As it so happens there are several, using basically the same approach of
writing a plpgsql
snippet to generate the query, then EXECUTE
to run the
underlying utility statement. These roughly correlate to the approaches in the
psql
section above in that they work best for single or multiple dynamic
statements.
DO
blocks
To use server-side Dynamic SQL we will need to construct our queries using
plpgsql
and execute the underlying text as if we were issuing the underlying
query ourselves.
DO $$
BEGIN
EXECUTE format('ALTER SEQUENCE big_table_id_seq RESTART %s', (SELECT max(id) + 1 FROM big_table));
END
$$
LANGUAGE plpgsql;
DO
In this case we are using PostgreSQL's built-in format()
function which
substitutes arguments similar to printf()
in C-based languages. This allows us
to interpolate the subquery result we were wanting in this case, resulting in a
string that PostgreSQL can EXECUTE
and giving us the result we want.
Create an exec()
function
Almost identical in function to the DO
block, we can also create a simple
plpgsql
function that simply calls EXECUTE
on it input parameter like so:
CREATE OR REPLACE FUNCTION exec(raw_query text) RETURNS text AS $$
BEGIN
EXECUTE raw_query;
RETURN raw_query;
END
$$
LANGUAGE plpgsql;
SELECT exec(format('ALTER SEQUENCE big_table_id_seq RESTART %s', (SELECT max(id) + 1 FROM big_table)));
CREATE FUNCTION
exec
ALTER SEQUENCE big_table_id_seq RESTART 100001
This may seem like a fairly pointless change compared to the previous approach, as we have basically only moved our query into a parameter that we pass in, but what it buys us is the ability to call this function against a list of queries that we construct using normal SQL, giving us the option of running each in turn.
Restrictions
So what type of SQL can be run in each of these sorts of approaches, and are there any restrictions in what we can run via Dynamic SQL with these methods? The main consideration about the different approaches is related to commands that need to be run outside of an explicit transaction block.
Consider if we wanted to run a REINDEX CONCURRENTLY
on all known indexes, so
we used the exec()
approach to construct a REINDEX CONCURRENTLY
statement
for all indexes in the public
schema:
SELECT
exec(format('REINDEX INDEX CONCURRENTLY %I', relname))
FROM
pg_class
JOIN
pg_namespace
ON pg_class.relnamespace = pg_namespace.oid
WHERE
relkind = 'i' AND
nspname = 'public'
ERROR: REINDEX CONCURRENTLY cannot be executed from a function
CONTEXT: SQL statement "REINDEX INDEX CONCURRENTLY big_table_pkey"
PL/pgSQL function exec(text) line 3 at EXECUTE
As you can see here, this won't work as a function due to REINDEX CONCURRENTLY
needing to manage its own transaction state; in PostgreSQL, functions inherently
run inside a transaction to allow the impact of a function to either completely
succeed or completely fail. (Atomicity in ACID.)
Let's try this using \gexec
:
SELECT
format('REINDEX INDEX CONCURRENTLY %I', relname)
FROM
pg_class
JOIN
pg_namespace
ON pg_class.relnamespace = pg_namespace.oid
WHERE
relkind = 'i' AND
nspname = 'public'
\gexec
REINDEX
Since the \gexec
handling is done by psql
, the resulting statement is
effectively run at the top-level as if it appeared literally in the SQL file.
More advanced usage
Look for a followup blog article where I go into more advanced techniques using
Dynamic SQL, particularly using the \gexec
function or exec()
itself. Until
next time, stay devious1!
Footnotes
-
Devious: longer and less direct than the most straightforward way. ↩