Postgres Data Types in PL/Python
I've been digging a little bit into PL/Python lately, inspired by our recent launch of Crunchy Bridge: a multi-cloud Postgres service with PL/Python and PL/R built right in. One thing that has tripped me up a few times is how PostgreSQL data types work with Python, especially when using external packages like NumPy or SciPy.
Data type mapping is something you'll deal with whenever you work across languages or try to get separate systems to talk to one another. It isn't unique to PL/Python; in fact, one colleague once wrote a program in C to interoperate between Ruby and Python to make that sort of thing easier. For each of the procedural languages available in Postgres, there's at least some documentation on which data types in that language correspond to which types in Postgres (and vice versa). For PL/Python, there are a few differences to note in the mappings for Python 2 and Python 3; I'd recommend studying the docs as they're pretty comprehensive. Just a quick reminder that Python 2.7 reached end-of-life earlier in 2020, so the code samples here (as well as our other posts on PL/Python) will be using Python 3 (plpython3u).
With all of that said, I'll share a few interesting things I encountered in PL/Python data type mapping, especially when I added NumPy and SciPy to the picture. I'll assume you have Python knowledge and are familiar with Postgres user-defined functions, but if not, I've made sure to link to documentation along the way.
Arrays in NumPy/Python vs Postgres
As I mentioned in an earlier post, you can use external packages in PL/Python as long as they're installed on the same host as Postgres and your Python installation. Let's do a quick, basic example of a function that takes input parameters and uses NumPy:
CREATE OR REPLACE FUNCTION create_array_from_range (start integer, stop integer)
RETURNS int[]
AS $$
import numpy as np
return np.arange(start, stop).tolist()
$$ LANGUAGE 'plpython3u';
The arange() function in NumPy returns a one-dimensional NumPy array, which behaves like built-in container types in Python such as lists and dictionaries. The PL/Python docs on array types state that if you want to get a Postgres array out of the PL/Python function, you should return a Python list. So we're also calling tolist() to ensure that we're returning a list value, and ultimately an array back out to Postgres.
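Before pointing it at table data, here's a quick sanity check with literal arguments (note that np.arange() excludes the stop value, so the array ends at stop - 1):

SELECT create_array_from_range(3, 7);

 create_array_from_range
-------------------------
 {3,4,5,6}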
Let's see what happens when we use this function with values from a sample table int_table:

 id | col1 | col2
----+------+------
  1 |    1 |    7
  2 |    5 |   56
  3 |   99 |  107
SELECT create_array_from_range(col1, col2) FROM int_table;
create_array_from_range
----------------------------------------
{3,4,5,6}
{5,6,7, ... ,53,54,55}
{99,100,101, ... ,104,105,106}
We get a one-dimensional array for each row. Working with multi-dimensional arrays is fine too: the "inner" arrays on the same level have to be of the same size, and again you'll want to convert the entire array into a nested list, as in the sketch below.
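Here's a minimal two-dimensional variant (create_2d_array is a made-up name for illustration, and the reshape assumes the range length splits evenly into two rows):

CREATE OR REPLACE FUNCTION create_2d_array (start integer, stop integer)
RETURNS int[][]
AS $$
import numpy as np
# reshape into 2 rows of equal length; Postgres requires the inner
# lists on the same level to be the same size
return np.arange(start, stop).reshape(2, -1).tolist()
$$ LANGUAGE 'plpython3u';

SELECT create_2d_array(1, 9);

    create_2d_array
------------------------
 {{1,2,3,4},{5,6,7,8}}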
Passing table columns as array arguments
You can also pass in entire table columns as arrays to the function. As you might expect, Postgres arrays are likewise passed into PL/Python as Python lists. In the next function example we'll work with data on red wine quality saved in a table called winequality_r, and this time we'll also play around with SciPy, just for the sake of demonstration.
This data is available from the UCI Machine Learning Repository. The table has columns that store numeric data on attributes of 1599 red wines such as sulphates, residual sugar, pH, etc., and one column for the "quality" score for that wine, in integers from 0 to 10.
As a disclaimer, I'm not a data scientist or analyst, but I do like wine (for reds I like cabernets, and you can never go wrong with Spanish wines; something I also recently discovered but haven't had much luck finding locally is orange wine). So I thought I'd try my hand at some simple correlation testing between a given physicochemical attribute and the quality of the wine:
CREATE OR REPLACE FUNCTION correlation_test (attribute_col numeric[], quality_col int[])
RETURNS text
AS $$
import numpy as np
from scipy.stats import pearsonr
# numeric[] arrives as a list of Decimal values; convert to float arrays
attribute = np.array(attribute_col, dtype='float')
quality = np.array(quality_col, dtype='float')
return pearsonr(attribute, quality)
$$ LANGUAGE 'plpython3u';
If I wanted to see whether there was a correlation between alcohol content and quality:
SELECT correlation_test(
array_agg(alcohol ORDER BY id),
array_agg(quality ORDER BY id)
) FROM winequality_r;
correlation_test
---------------------------------------------
(0.4761663239992742, 2.8314769799724706e-91)
In addition to using array_agg(), you could also use the ARRAY constructor to pass arrays into the function:
SELECT correlation_test(
(SELECT ARRAY(SELECT alcohol FROM winequality_r ORDER BY id)),
(SELECT ARRAY(SELECT quality FROM winequality_r ORDER BY id))
);
The SciPy pearsonr() function takes two input arrays, but not quite in the form PL/Python hands them to us: a Postgres numeric[] arrives as a Python list of Decimal values. So the Postgres arrays are converted to NumPy float arrays first using the np.array() function, before we can call pearsonr(). The output is the Pearson correlation coefficient and p-value between the two arrays.
Out of curiosity, I used the utility function plpy.info() to check what data type was being returned:
plpy.info(type(pearsonr(attribute, quality)))
INFO: <class 'tuple'>
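Along the same lines, you can confirm what the input arrays look like from inside the function; per the PL/Python docs, numeric maps to Python's Decimal (here's a line you could drop into the function body, before the conversion):

plpy.info(type(attribute_col[0]))

INFO: <class 'decimal.Decimal'>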
I've just set the function return type as text, but you could of course choose another return type such as an array (which should be compatible with Python tuples); a quick sketch of that variant follows. After that, we'll try returning the result as multiple values, i.e. into multiple columns, out of PL/Python.
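Here's that sketch (correlation_test_arr is a hypothetical name; converting the tuple to a list keeps us on the documented list-to-array path):

CREATE OR REPLACE FUNCTION correlation_test_arr (attribute_col numeric[], quality_col int[])
RETURNS double precision[]
AS $$
import numpy as np
from scipy.stats import pearsonr
attribute = np.array(attribute_col, dtype='float')
quality = np.array(quality_col, dtype='float')
# pearsonr returns a 2-tuple (coefficient, p-value); return it as a list
return list(pearsonr(attribute, quality))
$$ LANGUAGE 'plpython3u';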
Output parameters to return composite results
User-defined functions in Postgres can also be defined with output parameters, which means that the function returns those parameters as part of the result.
1. Let's call an additional SciPy function
To add on to our example, let's say we want to make our correlation_test function above do a little bit more: in addition to giving us the Pearson correlation coefficient and corresponding p-value, we'll also get it to calculate the Spearman correlation coefficient and p-value between two datasets. In SciPy, we can use the spearmanr() function for exactly that.
The interesting thing is that when you return only the spearmanr() result as text, the value looks like a different type from what we got with pearsonr():
# (PL/Python function body)
import numpy as np
from scipy.stats import spearmanr
attribute = np.array(attribute_col,dtype='float')
quality = np.array(quality_col,dtype='float')
return spearmanr(attribute, quality)
SpearmanrResult(correlation=0.47853168747024344, pvalue=2.7268377398474203e-92)
And if we combine the two expressions for the return value (i.e. return pearsonr(attribute, quality) + spearmanr(attribute, quality)), it looks like the pearsonr() tuple again:
(0.4761663239992742, 2.8314769799724706e-91, 0.47853168747024344, 2.7268377398474203e-92)
I tried to double-check the spearmanr() return type with plpy.info() but it wasn't much help for me:

INFO: <class 'scipy.stats.stats.SpearmanrResult'>
It doesn't seem that SpearmanrResult is well-documented, but my googling seems to indicate that this is an example of a Python named tuple. A named tuple is composed of fieldname=value pairs. And, since it's still a tuple, you can concatenate it with another tuple to form a new tuple, which is why the + operator in the return expression above works and just returns all four values in one new tuple.
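Here's a quick standalone illustration in plain Python (the Wine type is made up for the example):

from collections import namedtuple

Wine = namedtuple('Wine', ['quality', 'alcohol'])
w = Wine(quality=6, alcohol=9.4)

print(w)            # Wine(quality=6, alcohol=9.4)
print(w[0])         # 6 -- still indexable like a regular tuple
print(w + (0.56,))  # (6, 9.4, 0.56) -- concatenating yields a plain tuple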
2. Let's use a dictionary to return multiple values
So we get all four statistics in one very simple function, which is all well and good. But you might be thinking, "what if I need to send this value to a client application, and/or an app developer or analyst needs to work with this value?" Sending a string that doesn't explicitly indicate all of the elements it contains isn't particularly helpful.
Let's go ahead and modify our function:
CREATE OR REPLACE FUNCTION correlation_test (
    attribute_col numeric[],
    quality_col int[],
    OUT pearson_c double precision,
    OUT pearson_pv double precision,
    OUT spearman_c double precision,
    OUT spearman_pv double precision)
AS $$
import numpy as np
from scipy.stats import spearmanr, pearsonr
attribute = np.array(attribute_col,dtype='float')
quality = np.array(quality_col,dtype='float')
pearson_result = pearsonr(attribute, quality)
spearman_result = spearmanr(attribute, quality)
result = {}
result['pearson_c'] = pearson_result[0]
result['pearson_pv'] = pearson_result[1]
result['spearman_c'] = spearman_result[0]
result['spearman_pv'] = spearman_result[1]
return result
$$ LANGUAGE 'plpython3u';
We store the Pearson tuple and Spearman named tuple in two separate variables, then initialize an empty dictionary and populate it with individual key-value pairs for the four data points. A dictionary isn't the only option, either: the PL/Python docs on composite types note that a sequence, or an object with matching attributes, works for this scenario as well.
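For instance, here's a tiny sketch of the object flavor (min_max is just a hypothetical example):

CREATE OR REPLACE FUNCTION min_max (vals int[], OUT min_val int, OUT max_val int)
AS $$
# any object whose attributes match the OUT parameter names will do
class Result:
    pass
r = Result()
r.min_val = min(vals)
r.max_val = max(vals)
return r
$$ LANGUAGE 'plpython3u';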
If we try calling our modified correlation_test function like so:
SELECT correlation_test(
array_agg(alcohol ORDER BY id),
array_agg(quality ORDER BY id)
) FROM winequality_r;
It still looks like the result from the earlier version:
correlation_test
--------------------------------------------------------------------------------------
(0.4761663239992742,2.8314769799724706e-91,0.47853168747024344,2.7268377398474203e-92)
But the output is now actually a record (and if you run the query with a GUI like pgAdmin it'll indicate that).
All we need to do to return this record as multiple values is to write the SELECT statement like this:
SELECT (correlation_test(
array_agg(alcohol ORDER BY id),
array_agg(quality ORDER BY id)
)).* FROM winequality_r;
(Note that the function call itself is wrapped in parentheses.)
pearson_c | pearson_pv | spearman_c | spearman_pv
--------------------+------------------------+---------------------+------------------------
0.4761663239992742 | 2.8314769799724706e-91 | 0.47853168747024344 | 2.7268377398474203e-92
This way, we have our statistics laid out nicely in a table-like format with multiple columns. This will make it a bit easier in your application code as well to just reference individual data points if that's what you need.
The result dictionary keys in the function match the OUT parameters in the function definition. That's how Postgres knows which dictionary value maps to which parameter (and in which column they go).
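Equivalently, you could put the function call in the FROM clause and get the same four columns, moving the aggregates into subqueries:

SELECT * FROM correlation_test(
    (SELECT array_agg(alcohol ORDER BY id) FROM winequality_r),
    (SELECT array_agg(quality ORDER BY id) FROM winequality_r)
);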
From NumPy, to standard Python, to Postgres… Oh my!
To sum up what we covered in terms of what you might possibly deal with in data type mapping in PL/Python:
- Stick to converting arrays to Python lists if you want to ultimately get back a Postgres array when calling your PL/Python function.
- plpy utility functions are your friend and can help you uncover a little bit of how data is processed in your PL/Python code (especially if you're using external Python packages). I've also started digging a bit into plpy functions if you're keen on reading more.
- You can take advantage of SciPy and NumPy for processing data in your function and get back multiple data points as output, and still make it easy to parse using output parameters.
Keen on trying this all out yourself? Crunchy Bridge is the only cloud PostgreSQL service that includes PL/Python with NumPy and SciPy (among other packages!). Create an account and let us know what you think.