Building a recommendation engine inside Postgres with Python and Pandas
I'm a big fan of data in general. Data can tell you a lot about what users are doing and can help you gain all sorts of insights. One such use is making recommendations based on a user's past history, or on the choices of others with similar tastes. In fact, years ago I wrote a small app to see if I could recommend wines based on how other ones were rated. It was a small app that I shared with just a handful of friends, some with similar taste, some with different taste. At first it was largely an academic exercise in writing a recommendation engine, but if I could find some new wines I liked along the way, then great. Turns out it was a lot more effective at recommending things than I expected, even with only a small handful of wines rated.
The other thing I'm a fan of is Postgres (not a big surprise there), and earlier today I started to wonder why I couldn't do more machine learning directly inside it. Yeah, there is MADlib, but what if I wanted to write my own recommendation engine? So I set out on a total detour of a few hours and lo and behold, I can probably do a lot more of this in Postgres than I realized before. What follows is a quick walkthrough of getting a recommendation engine set up directly inside Postgres on top of Crunchy Bridge, our database as a service.
Starting with existing code
To start, I browsed quickly for some example recommendation engines in Python. For simplicity's sake I wanted something small and concise – I didn't mind so much if it leveraged other libraries. In this case I came across a clean rec-engine example that leveraged Pandas and a simple data model to make life easier.
Setting up my initial table structure
I took the exact data set that the example app used and converted it into SQL to load it up:
CREATE TABLE orders (id int, product_id int);
CREATE TABLE products(id serial, name text);
INSERT INTO orders
VALUES (1,1),(1,2),(2,3),(2,10),(2,13),(3,3),(4,8),(4,9),(4,12),(5,3),(5,5),(5,7),(5,12),(6,1),(7,5),(7,13),(8,4),(9,3),(10,3),(10,13),(11,1),(11,8),(11,4),(12,8),(12,12),(13,5),(13,2),(13,7),(14,3),(14,13),(14,5),(15,3),(15,13);
INSERT INTO products ("name")
VALUES ('Baseball Bat'), ('Baseball Glove'), ('Football'), ('Basketball Hoop'), ('Football Helmet'), ('Batting Gloves'), ('Baseball'), ('Hockey Stick'), ('Ice Skates'), ('Soccer Ball'), ('Goalie Mask'), ('Hockey Puck'), ('Cleats');
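As a quick sanity check on the load, you can join the two tables to see what any one order contains; for example, order 5 should come back with Baseball, Football, Football Helmet, and Hockey Puck:
SELECT o.id AS order_id, p.name
FROM orders o
JOIN products p ON p.id = o.product_id
WHERE o.id = 5
ORDER BY p.name;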
Loading my DataFrame
The Python example loads the DataFrame directly from CSV. In my case I wanted everything in Postgres. With the above I have all the data in tables, but getting it into the DataFrame... well, I didn't want to parse it back into CSV format. There are probably a lot of ways to do this (create a JSONB object, create a custom type, etc.), but I chose something that, while pretty simplistic, is easy to follow: two arrays ordered the same way, then creating the DataFrame from them.
So, in order to do that I'm going to define my function and import pandas:
CREATE OR REPLACE FUNCTION getrecommendations (id integer, orderids int[], orderedproducts int[], productids int[], productnames text[])
RETURNS json
AS $$
import pandas as pd
Note: You'll need to have plpython3u installed for your PostgreSQL database first.
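If the extension isn't enabled yet, it's a one-liner (plpython3u is an untrusted language, so this generally requires superuser privileges; on Crunchy Bridge it ships ready to go):
CREATE EXTENSION IF NOT EXISTS plpython3u;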
You'll notice that instead of passing orders in as a single array or dictionary, I pass in two arrays, and likewise two arrays for products. To pass the data into the SQL function I construct each array like:
(SELECT ARRAY(SELECT id from orders order by id))
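Because both arrays use the same ORDER BY id, position N in the order IDs array and position N in the product IDs array describe the same row, so the pairs stay lined up. A quick way to eyeball the two orders arrays side by side:
SELECT
  (SELECT ARRAY(SELECT id FROM orders ORDER BY id)) AS order_ids,
  (SELECT ARRAY(SELECT product_id FROM orders ORDER BY id)) AS ordered_product_ids;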
Then I can continue on, loading the orders data into my DataFrame:
o = {'order_id': orderids, 'product_id': orderedproducts}
orders = pd.DataFrame(data=o)
The next chunk of logic is identical to the Python example, just embedded in my PostgreSQL function:
orders_for_product = orders[orders.product_id == id].order_id.unique()
relevant_orders = orders[orders.order_id.isin(orders_for_product)]
accompanying_products_by_order = relevant_orders[relevant_orders.product_id != id]
num_instance_by_accompanying_product = accompanying_products_by_order.groupby("product_id")["product_id"].count().reset_index(name="instances")
num_orders_for_product = orders_for_product.size
product_instances = pd.DataFrame(num_instance_by_accompanying_product)
product_instances["frequency"] = product_instances["instances"]/num_orders_for_product
recommended_products = pd.DataFrame(product_instances.sort_values("frequency", ascending=False).head(3))
When I get down to my products section I'm going to do the same thing as I did for orders – create a dictionary and then load the DataFrame. And finally I can return the result set as a JSON object. When I put it all together, the end-to-end function looks like:
CREATE OR REPLACE FUNCTION getrecommendations (id integer, orderids int[], orderedproducts int[], productids int[], productnames text[])
RETURNS json
AS $$
import pandas as pd

# Rebuild the orders table from the two parallel arrays
o = {'order_id': orderids, 'product_id': orderedproducts}
orders = pd.DataFrame(data=o)

# Find every order that contains the product we're asking about
orders_for_product = orders[orders.product_id == id].order_id.unique()
relevant_orders = orders[orders.order_id.isin(orders_for_product)]

# Count how often each other product appears in those orders
accompanying_products_by_order = relevant_orders[relevant_orders.product_id != id]
num_instance_by_accompanying_product = accompanying_products_by_order.groupby("product_id")["product_id"].count().reset_index(name="instances")

# Turn counts into frequencies and keep the top three
num_orders_for_product = orders_for_product.size
product_instances = pd.DataFrame(num_instance_by_accompanying_product)
product_instances["frequency"] = product_instances["instances"]/num_orders_for_product
recommended_products = pd.DataFrame(product_instances.sort_values("frequency", ascending=False).head(3))

# Attach product names and hand the result back as JSON
p = {'product_id': productids, 'name': productnames}
products = pd.DataFrame(data=p)
recommended_products = pd.merge(recommended_products, products, on="product_id")
return recommended_products.to_json(orient="table")
$$ LANGUAGE plpython3u;
Getting your recommendations
And I can get some recommendations by calling the function directly within SQL:
SELECT jsonb_pretty(getrecommendations(
  3,
  (SELECT ARRAY(SELECT id from orders order by id)),
  (SELECT ARRAY(SELECT product_id from orders order by id)),
  (SELECT ARRAY(SELECT id from products order by id)),
  (SELECT ARRAY(SELECT name from products order by id))
)::jsonb);
{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"product_id","type":"integer"},{"name":"instances","type":"integer"},{"name":"frequency","type":"number"},{"name":"name","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"product_id":13,"instances":4,"frequency":0.5714285714,"name":"Cleats"},{"index":1,"product_id":5,"instances":2,"frequency":0.2857142857,"name":"Football Helmet"},{"index":2,"product_id":7,"instances":1,"frequency":0.1428571429,"name":"Baseball"}]}
Just because you can do something doesn't always mean you should. Embedding all of your application logic directly in the database can make tracking migrations and releases difficult. At the same time, a complex pipeline that takes a nightly extract, loads it into Spark, generates results, and then feeds them back into the database isn't exactly lightweight. In the case of plpython3u and Pandas, scheduling something like the above to run daily with pg_cron could be a much simpler solution.
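For example, here's a rough sketch of what that could look like (the recommendations table and the job name are hypothetical, and this assumes a recent pg_cron that supports named jobs):
CREATE TABLE recommendations (product_id int, recs json, generated_at timestamptz);

-- Regenerate recommendations for every product at 3am each day
SELECT cron.schedule('nightly-recommendations', '0 3 * * *', $job$
  INSERT INTO recommendations (product_id, recs, generated_at)
  SELECT p.id,
         getrecommendations(
           p.id,
           (SELECT ARRAY(SELECT id FROM orders ORDER BY id)),
           (SELECT ARRAY(SELECT product_id FROM orders ORDER BY id)),
           (SELECT ARRAY(SELECT id FROM products ORDER BY id)),
           (SELECT ARRAY(SELECT name FROM products ORDER BY id))
         ),
         now()
  FROM products p
$job$);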
With a mix of SciPy, NumPy, and Pandas there is a lot of interesting potential here, and I'd love to hear what practical uses others come up with @crunchydata. Or give it a try yourself: our database as a service, Crunchy Bridge, comes preconfigured with plpython3u along with SciPy, NumPy, and Pandas.