R Predictive Analytics in Data Science Work using PostgreSQL
Greetings friends! We have come to the final blog post in my series about the data science workflow using PostgreSQL. In the last blog post, we used PL/R to create a function that returns the output from a logistic regression model trained on our fire data. We then took that model object and stored it in a separate table.
Today we are going to finish up by showing how to use that stored model to make predictions on new data. By the way, I did all of the Postgres work for the entire blog series in Crunchy Bridge. I wanted to focus on the data and code and not on how to run PostgreSQL.
Numerical models
When people build numerical models, they usually have one of two purposes in mind:
- Using the model to get a better understanding of the processes that lead to the outcome.
- Trying to get the most accurate prediction of the outcome.
These two options can sometimes be at odds in your analysis.
You can usually make a more accurate prediction model using black box techniques such as neural networks or generalized additive models. But these techniques do not allow any inference about magnitude or directions of cause and effect. There are other reasons they can be at odds, such as predictive power versus parsimony of explanatory variables, but we are not going to go in depth on that today.
In our example, we are not particularly interested in an explanatory model, since we have little ability to control the weather. Instead, we want a predictive model that forecasts fire occurrence as accurately as possible.
The full use case for using this model would be:
- Run a weather forecast model predicting tomorrow’s weather for the explanatory variables used in our logistic regression.
- Plug those weather values into our logistic regression model (covered in today's blog post).
- Return the predicted probability of a fire, given tomorrow’s weather.
- Give this information to someone who can use it to make better decisions.
Our use case is very similar to other non-fire use cases. Here are some examples:
- Create a credit card fraud model, store it in the database, and write a trigger on insert that predicts the probability of fraud (see the sketch after this list).
- Create a model that predicts market demand for an item, store it in the database, produce a report every morning that shows store managers how much inventory to order.
- Build a predictive model of your favorite football team (yes I mean the U.S. version and the International version), update based on stats from their last game/match, predict their likelihood of winning their next match.
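To make that first example a bit more concrete, here is a rough sketch of what such an insert trigger could look like. Every name in it (the fraud schema, tables, columns, and the fraud.predict_fraud function, which would be a PL/R function analogous to the one we build later in this post) is hypothetical.

-- Hypothetical sketch: score each new transaction as it is inserted.
-- fraud.predict_fraud(model bytea, txn fraud.transactions) is assumed to exist,
-- built the same way as the fire-prediction function shown below.
CREATE OR REPLACE FUNCTION fraud.score_transaction()
RETURNS trigger AS $$
BEGIN
    -- Grab the newest stored model and attach its prediction to the incoming row
    NEW.fraud_probability := fraud.predict_fraud(
        (SELECT model FROM fraud.models ORDER BY created_at DESC LIMIT 1),
        NEW
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER score_on_insert
    BEFORE INSERT ON fraud.transactions
    FOR EACH ROW
    EXECUTE FUNCTION fraud.score_transaction();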
Let’s get into the final step on our data science journey.
Using the R model to predict fire probability
We are going to use PL/R to do the statistical work, just like we did in the last post. Unless you know R and PL/R, I highly recommend you go back and read it. I am not going to cover the benefits of PL/R or the pieces of setting up a PL/R function.
At the end of last post we took the R object for a logistic regression model and stored it in a bytea column along with some metadata columns (including a timestamp column for record insertion).
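If you are just joining the series, that storage table looked roughly like the sketch below. Only the model and created_at columns are actually used in this post; the id column and the exact definition are my own reconstruction, so adjust to taste.

-- Minimal sketch of the model storage table from the previous post.
-- model holds the serialized R object; created_at records when the row was inserted.
CREATE TABLE IF NOT EXISTS final.logistic_models (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    model      bytea NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);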
The new function
The prize we get for using R and storing the model object in a table is the ability to write a simple function to generate a prediction. It’s so small I am going to paste the entire function here:
CREATE OR REPLACE FUNCTION final.predict_logistic(
    model bytea,
    newdata final.test,
    OUT probability numeric)
AS $$
    # The stored bytea comes back to us as an R model object;
    # wrap the new row in a data frame and ask predict() for the response scale
    probability <- predict(model, data.frame(newdata), type = 'response')
    return(probability)
$$
LANGUAGE 'plr';
We are creating this function in the final schema. The bytea we are passing in will be the R model we stored in the table.

The declaration for the newdata parameter might be new for some people. If you recall, the test table contains data we separated out from the training data set and has the same schema as the data table. By declaring the parameter to be of type final.test, we are indicating that we expect a record to be passed in that matches the schema of the test table in the final schema.
Finally, we declare that we are returning the calculated probability via an OUT parameter.
One line of R code
The actual code of our function is 1 LINE! I could have returned the call to
predict
, but I like separating those calls out for clarity.
This line also shows why R is such a powerful language for statistics and data science. Core functions for most of the methods data scientists and statisticians use are built in; it really is a Domain Specific Language for statistics and data science.
Our call to predict:
- Takes an R model created with a linear or generalized linear model fit (our stored model).
- Takes a data frame which contains one or more rows of data matching the independent variables used in the model.
- Is told, via type = 'response', that we want the prediction on the response (probability) scale rather than on the scale of the linear predictor.
- Returns a value between 0 and 1, representing the probability of fire.
And that’s it!
Our function does all we need it to do. The data scientist gives this function to the DBA and they put it in the database. Now, application developers can just pass in data and get back a prediction of fire probability and the data scientist can sleep soundly because nobody can “mess up” implementing a prediction function.
Calling our function in SQL
Using this function is quite simple since we only need to pass in 2 parameters.
SELECT final.predict_logistic(
    newdata := (SELECT f.*::final.test FROM final.test AS f
                ORDER BY f.date_time LIMIT 1),
    model   := (SELECT model FROM final.logistic_models AS l
                ORDER BY l.created_at DESC LIMIT 1)
) AS fire_probability;
To pass in the new explanatory data, we select final.test.* and cast it to the final.test type, just as the function expects. To limit the input to a single record, we order by date_time and take only the oldest row.

To pass in the R model, we select the model column from the final.logistic_models table. Again we sort the results, but this time we keep only the newest model.
Voila!
To make this easier for application developers to use, you might want to make this query into a view.
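Something along these lines would do the trick. This is only a sketch; the view name is my own invention, and the body is just the query from above.

-- Sketch: hide the prediction query behind a view so application code can simply run
-- SELECT fire_probability FROM final.latest_fire_probability;
CREATE OR REPLACE VIEW final.latest_fire_probability AS
SELECT final.predict_logistic(
    newdata := (SELECT f.*::final.test FROM final.test AS f
                ORDER BY f.date_time LIMIT 1),
    model   := (SELECT model FROM final.logistic_models AS l
                ORDER BY l.created_at DESC LIMIT 1)
) AS fire_probability;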
Wrapping up the whole series
With this post we have come to the conclusion of our series on using Postgres and SQL in a data science/research workflow.
We covered all these topics without using any code running outside of Postgres:
- Cleaning up the data
- Massaging the data to fit our needs
- Subsetting the data
- Combining two data sets to create the data we needed for analysis
- Randomly sampling our table into training and testing data
- Creating a function to add new columns containing the centered and standardized values of the explanatory variables
- Creating another function to fit a logistic regression model with R using the centered and standardized data
- Storing the R logistic regression model in a new table
- Finally, writing a function that accepts the stored model and new data to predict the probability of a new fire
If I had told you, before this series, that we were going to do all this work inside PostgreSQL, I bet the response for most of you would have been “no way!”. And that was the point of this series.
Like my coworker Craig said:
“You probably won’t get them all the way there, but you will have opened their world to see what is possible.”
One of the messages I hope I imparted to you was the clear separation of concerns you can achieve by using PostgreSQL as the central node in your architecture.
The data can be cleaned and duplicated with strong permission handling. The data scientists can work with the db data in their native environment to come up with the algorithms. The DBA or sysadmin can then put those algorithms into the database as functions available to application developers.
Finally, the application developer gets the benefit of not having to rewrite the algorithm in their own language; instead, they can just make a simple SQL call. Every different “user” of the database gets to interact with the database for their needs without reaching into other “users’” domains.
One other benefit of using the approach outlined here is that all these steps can be put in a text file and run against the database. You don't have to go back and copy and paste to recreate data in a spreadsheet. All your steps are clearly laid out in code that can be rerun anytime you need.
My final goal was to show how learning more SQL would lead to big benefits in your day to day work. I was pleasantly surprised by the functionality I never knew existed in PostgreSQL, like randomly sampling a table.
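If that one is new to you as well, here is the flavor of it. A quick sketch: TABLESAMPLE is the built-in clause for pulling a random subset of rows, and final.data stands in for the combined table from earlier in the series (substitute whatever table you like).

-- Sketch: pull roughly a 10% Bernoulli random sample of a table's rows.
-- REPEATABLE makes the sample reproducible across runs.
SELECT *
FROM final.data TABLESAMPLE BERNOULLI (10) REPEATABLE (42);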
Did this series make you think more about how you use PostgreSQL? Have you used PostgreSQL in data science workflows? I would love to hear more about your experience and insights; send us a message on our Twitter account.
Happy data sleuthing!