Exploring PL/Python: Turn Postgres Table Data Into a NumPy Array
PL/Python can be a convenient and powerful way to take advantage of your PostgreSQL database. In an earlier post, I talked a little bit about how you can use modules included in the Python standard library with PL/Python functions. In this post, we'll try running NumPy in a simple user-defined function which also takes advantage of PL/Python database access functions. The function will show a working example of how to easily convert a data table in Postgres to a NumPy array.
Perhaps you've got a data file, and you may be using standard Python modules like `csv` and/or functions such as NumPy's `genfromtxt()` to ingest the data for analysis. But if you already have the data in Postgres, you have the option to carry out some processing on the database level as well.
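For comparison, the file-based route might look something like this (a minimal sketch, assuming a local, semicolon-delimited copy of the UCI red wine CSV named `winequality-red.csv`):

```python
import numpy as np

# Assumed local copy of the UCI red wine dataset: semicolon-delimited, one header row
data = np.genfromtxt('winequality-red.csv', delimiter=';', skip_header=1)

print(data.shape)  # expect (1599, 12) if the file matches the table described below
```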
In playing around with NumPy in PL/Python I wondered if it was possible to take an entire Postgres table and "transform" it into a NumPy array. For this exercise I used data on red wine quality (available from the UCI Machine Learning Repository), stored in a Postgres table called `winequality_r`. Here's a sampling of the table data (total rows = 1599):
| id | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfurdioxide | total_sulfurdioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 2 | 7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
I wanted this table to end up looking like the following with NumPy - an array with a `shape` value of (1599, 12):

```
array([[7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5.0],
       [7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5.0],
       ...
      ])
```
I figured that I would need to execute some SQL inside the PL/Python function, so here's what I eventually decided to do (a toy sketch of the plan follows this list):

- Initialize an empty list, which would eventually become a list of lists that can then be turned into a two-dimensional NumPy array: `new_list = []`
- Select all table rows from `winequality_r`, but exclude the id values: `SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM winequality_r`
- Find a way to append each row as a list within `new_list`. Basically, if I could get the results of that `SELECT` statement to be an iterable, I could presumably convert each row into a list and then add all of them to the "outer" `new_list`; perhaps something that looks like this: `for r in results: new_list.append(r)`
- Finally, convert `new_list` to a NumPy array: `np.asarray(new_list, dtype='float')`
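Stripped of the database specifics, the core of that plan is just building a list of lists and handing it to NumPy. Here's a toy sketch with a couple of made-up rows standing in for the query results:

```python
import numpy as np

# Made-up stand-in for the query results: each inner list is one table row
results = [
    [7.4, 0.7, 0.0, 1.9],
    [7.8, 0.88, 0.0, 2.6],
]

new_list = []
for r in results:
    new_list.append(r)

arr = np.asarray(new_list, dtype='float')
print(arr.shape)  # (2, 4) here; the full table should give (1599, 12)
```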
Database access functions in plpy
In checking the official docs I saw that PL/Python automatically imports a module called `plpy`, which comes with database access functions that allow you to execute Postgres commands. Two options looked promising: `plpy.execute()` and `plpy.cursor()`. Both take in a SQL query string and return a result object.

According to the docs, calling `plpy.execute()` causes the entire result set to be read into memory; on the other hand, `plpy.cursor()` returns a cursor object with a fetch method, so you can process the result in smaller batches.

At this point, all I needed was to run a simple `SELECT`, and it wasn't a massive dataset, so I tried `plpy.execute()` first. This function returns an object that "emulates a list or dictionary object" - lists and dictionaries are iterables, so it seemed like it should work.
So I now have the beginnings of my function:
```sql
CREATE OR REPLACE FUNCTION table_to_narray ()
RETURNS numeric[]
AS $$
import numpy as np
new_list = []
wine_r = plpy.execute("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM winequality_r")
for r in wine_r:
    new_list.append(r)
return np.asarray(new_list,dtype='float')
$$ LANGUAGE 'plpython3u';
```
But running `SELECT table_to_narray();` returns an error that indicates I don't quite have the right data type in my list:

```
ERROR: TypeError: float() argument must be a string or a number, not 'dict'
CONTEXT: Traceback (most recent call last):
PL/Python function "table_to_narray", line 19, in <module>
return np.asarray(new_list,dtype='float')
PL/Python function "table_to_narray", line 491, in asarray
PL/Python function "table_to_narray"
```
Utility functions in plpy to the rescue
From the error, 'dict' seemed to refer to the return object from `plpy.execute()`. I'll spare you the gory details of the various other things I tried to get the for loop to produce the list that I wanted, but my colleague Steve Pousty was kind enough to point out the `plpy` utility functions, which allow PL/Python to send messages and exceptions to the client - helping you debug Python in Postgres!
You'll want to take note of the `client_min_messages` runtime config variable: the default level is `NOTICE`, but `INFO` level messages are always sent to the client as well. I'll go with `plpy.info()` here, but `plpy.notice()` works just fine too.
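If you've changed that setting in your session, something like the following (standard Postgres, nothing PL/Python-specific) makes sure NOTICE-level messages come through:

```sql
-- Messages at NOTICE level and above are sent to the client;
-- INFO-level messages from plpy.info() arrive regardless of this setting
SET client_min_messages TO notice;
```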
I wanted to look at what exactly I was getting from `plpy.execute()`, so I just added this immediately after the SQL statement execution:

```python
plpy.info(wine_r)
```

```
INFO: <PLyResult status=5 nrows=1599 rows=[{'fixed_acidity': Decimal('7.4'), 'volatile_acidity': Decimal('0.70'), … }, … }]>
```
That does indeed look like something that emulates a dictionary object. I do know that I only want the dictionary values (e.g. 7.4) and not the dictionary keys or anything else from the result object. `PLyResult` isn't well-documented (though you can take a look at the source code), but it does appear that, like the cursor object returned from `plpy.cursor()`, when you iterate over `PLyResult` you get each table row already in dictionary format. In any case, the built-in `dict.values()` method in Python worked well enough in extracting just the values for adding to `new_list`.
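In plain Python terms, with a toy dict standing in for one row of the result (numeric columns arrive as Decimal, as the INFO output above shows):

```python
from decimal import Decimal

# Toy stand-in for a single row from the result object
r = {'fixed_acidity': Decimal('7.4'), 'volatile_acidity': Decimal('0.70')}

print(list(r.values()))  # [Decimal('7.4'), Decimal('0.70')] -- just the values, no keys
```

The later `np.asarray(..., dtype='float')` call takes care of coercing those Decimals to floats.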
From table to NumPy array in a user-defined function
Here's the version that I was happy with:
```sql
CREATE OR REPLACE FUNCTION table_to_narray ()
RETURNS numeric[]
AS $$
import numpy as np
new_list = []
wine_r = plpy.execute("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM public.winequality_r")
for r in wine_r:
    # Each row in the result is converted to a list that gets appended into the outer list
    new_list.append(list(r.values()))
# Convert the list of lists to a 2D NumPy array
# But to make the function return a SQL array, a Python list must be returned
return np.asarray(new_list, dtype='float').tolist()
$$ LANGUAGE 'plpython3u';
```
```
table_to_narray
-----------------
{{7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5.0},{7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5.0},
...
{6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6.0}}
```
The main thing I'll point out is that I'm not doing anything else to the NumPy array, so for this function to return the result as a SQL array it has to be converted back to a list. This is a rather superficial example; I'd imagine that for a real use case, you'd go on and actually operate on and process the array, eventually returning a different final (and usable!) value with your PL/Python functions.
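For instance, a variation that actually does something with the array might return the per-column means instead of the raw data. This is a hypothetical sketch along the same lines (the name `table_column_means` is mine, not from the example above):

```sql
CREATE OR REPLACE FUNCTION table_column_means ()
RETURNS numeric[]
AS $$
import numpy as np
new_list = []
wine_r = plpy.execute("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM public.winequality_r")
for r in wine_r:
    new_list.append(list(r.values()))
# axis=0 averages down each column; convert back to a list so Postgres gets a SQL array
return np.asarray(new_list, dtype='float').mean(axis=0).tolist()
$$ LANGUAGE 'plpython3u';
```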
So this little exercise answered my question: you can indeed create a function that takes a table in Postgres and converts it into a NumPy array. This was a great way to learn a bit about database access and utility functions too. The docs do recommend the `plpy.cursor()` method for larger datasets, so that may be something for you to keep in mind.
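If you did want to go the cursor route, the loop could look something like this (a sketch only; the function name and the batch size of 500 are arbitrary choices, and the rest mirrors the function above):

```sql
CREATE OR REPLACE FUNCTION table_to_narray_cursor ()
RETURNS numeric[]
AS $$
import numpy as np
new_list = []
cur = plpy.cursor("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM public.winequality_r")
while True:
    rows = cur.fetch(500)  # read 500 rows at a time instead of the whole table at once
    if not rows:
        break
    for r in rows:
        new_list.append(list(r.values()))
return np.asarray(new_list, dtype='float').tolist()
$$ LANGUAGE 'plpython3u';
```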
What else might you do with Python embedded in Postgres? It seems like there are a lot of possibilities - for example, my colleague Craig Kerstiens dives into setting up a recommendation engine with Python and Pandas inside Postgres in this blog post. You might also want to give Crunchy Bridge a try: Crunchy Data's new Postgres cloud service that includes PL/Python (and packages like Pandas and NumPy) out of the box. In any case - if you have ideas, or are already doing something interesting with PL/Python that you'd like to share, feel free to leave a comment and let us know!