Data Loading in Postgres for Newbies
So you’re new to Postgres and you want to test it out and see how it works. Or maybe you’re building a new app and you need to seed your database. How do you get data in your Postgres database? There’s a lot out there on the topic of data migration for Postgres and some of it can be complicated and overwhelming. I’m going to give you some quick and simple ways to get test data IN your database.
CSV
If you need to load data from CSV files, the psql \copy command in Postgres is a utility that can come to your rescue. It copies data from a file on disk into a table as a bulk data load. \copy works from the command line, and you’ll need to be connected to the database using psql.
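The general shape of the command looks like this (the table and file names here are just placeholders):
\copy my_table FROM 'my_data.csv' WITH CSV HEADER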
Let’s take an example of storm events data from NOAA. Downloading one of the files locally and unzipping it, you’ll have a CSV ready to copy.
Before you can load the data, you’ll need to create a table in Postgres to store the data and give names to the columns. To get the column headers without reading the whole file, run the Linux head command:
head StormEvents_locations-ftp_v1.0_d2019_c20220425.csv
Resulting in:
YEARMONTH,EPISODE_ID,EVENT_ID,LOCATION_INDEX,RANGE,AZIMUTH,LOCATION,LATITUDE,LONGITUDE,LAT2,LON2
201904,135109,809787,1,.28,"NNE","GENOA",33.3834,-93.9174,3323004,9355044
Now you’ll create a table definition for these. You can peek at the data to make an informed guess about the data types. You can always use text if you’re not sure.
CREATE TABLE weather_events
(YEARMONTH text, EPISODE_ID int, EVENT_ID int,
LOCATION_INDEX int, RANGE float, AZIMUTH text,
LOCATION text, LATITUDE float, LONGITUDE float,
LAT2 int, LON2 int);
Now \copy the data in. Make sure you run with CSV HEADER since your file has headers in it:
\copy weather_events FROM ~/Downloads/StormEvents_locations-ftp_v1.0_d2019_c20220425.csv WITH CSV HEADER
And done!
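If you want a quick sanity check that the load worked, a couple of simple queries will do it (row counts will vary depending on the file you grabbed):
SELECT count(*) FROM weather_events;
SELECT * FROM weather_events LIMIT 3;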
JSON
There’s a lot of JSON data out in the world nowadays. An essential tool for working with JSON is jq, and I’ll show an example of working with a JSON file from NOAA.
In this case, there is no distinct column header, so we will just create a single JSON column and populate a separate row for each JSON array element. We will use the \copy command from earlier, this time using the FROM PROGRAM option, which lets us read the output of a command instead of a data file. Using this approach, we can preprocess the JSON file using jq and load each element of the JSON array into a row of the table.
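If you’d like to see exactly what \copy will be fed before loading anything, you can run the same jq invocation by hand; it should print one compact JSON object per line, one line per array element:
jq -c -r '.[]' ~/Downloads/solar_probabilities.json | head -3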
First make your table:
CREATE TABLE solar_load (data jsonb);
Then load the data:
\copy solar_load FROM PROGRAM 'jq -c -r .[] < ~/Downloads/solar_probabilities.json'
If you want to break each JSON document up into separate columns after this, you can create a new table with the columns spelled out:
CREATE TABLE solar_load_with_columns(date date, c_class_1_day int,
c_class_2_day int, c_class_3_day int, m_class_1_day int,
m_class_2_day int, m_class_3_day int, x_class_1_day int,
x_class_2_day int, x_class_3_day int, "10mev_protons_1_day" int,
"10mev_protons_2_day" int, "10mev_protons_3_day" int,
polar_cap_absorption text);
Followed by an insert to move the data over:
INSERT INTO solar_load_with_columns
SELECT (json_populate_record(NULL::solar_load_with_columns, data::json)).*
FROM solar_load;
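As a side note, you don’t have to materialize a second table at all if you just need a few fields; Postgres’s jsonb operators can pull them straight out of the original column. A quick sketch, assuming the JSON keys match the column names used above:
SELECT (data->>'date')::date AS date,
       (data->>'c_class_1_day')::int AS c_class_1_day
FROM solar_load
LIMIT 5;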
GUIs
There are a lot of user interfaces for Postgres, and if you’re a newbie you will likely want to use one or more of these for working with Postgres. pgAdmin 4 is a very popular one and it has a built-in import/export tool. The downside to pgAdmin imports is that you have to create the table and columns before you can load in data.
I’ve done quite a bit of testing with DBeaver, and that tool does a lot of the work for you when you import data via CSV. To import data into Postgres, you create a new server resource, connect to it, create a database, and use the CSV import tool; the table and its columns are created for you.
The DBeaver import has worked pretty well for me with simple data inserts. It can get a little tricky when you’re loading large files and working with its batch import features, so watch out for that. In my testing, the psql \copy command performs better than DBeaver for big data loads.
Dump and Restore
pg_dump is a common tool for extracting data from your database. It will output your data definitions and your data in raw SQL form, which makes it convenient for moving data between databases, doing upgrades, or keeping a basic level of backup (though dedicated backup tools are a better option here). There are a lot of configuration options for this, so check the Postgres docs before you get in too far.
To run a pg_dump on your database:
pg_dump postgres://postgres:vcnQ1Ay0etSwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nfdy3jsblmo2a5nz2a.db.postgresbridge.com:5432/postgres > backup.sql
From a GUI, you can normally run a dump command as well, which will result in the same file being generated.
To restore the dump into a new database:
CREATE DATABASE restore;
psql -d postgres://postgres:vcnQ1Ay0etSwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nfdy3jsblmo2a5nz2a.db.postgresbridge.com:5432/restore < backup.sql
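pg_dump also has plenty of useful switches beyond the basics. A few I find handy, sketched here with <connection-uri> standing in for your own connection string:
pg_dump --schema-only <connection-uri> > schema.sql
pg_dump --data-only -t weather_events <connection-uri> > weather_events_data.sql
pg_dump -Fc <connection-uri> > backup.dump
pg_restore -d <connection-uri> backup.dump
The first two dump only the table definitions or only the data for a single table, and the custom format (-Fc) pairs with pg_restore for selective restores.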
If you need help understanding these connection strings used above, check out the diagram and information we published in a tutorial on Postgres Database and Schemas.
Foreign Data Wrappers
Getting data into one database from another one is really easy using foreign data wrappers. The postgres_fdw extension will let you connect to an external Postgres resource, query it, and use it from another database. I’ll show you a quick setup involving two instances:
- The FOREIGN instance, the one that has the data.
- The DESTINATION instance, the database querying foreign data.
On the destination, create the extension to allow you to connect to other databases:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
On the foreign side, create a new user for the destination server to connect as:
CREATE USER fdw_user WITH PASSWORD 'pizza1';
GRANT SELECT, INSERT, UPDATE,
DELETE ON TABLE solar_load_with_columns TO fdw_user;
On the destination side, create the foreign server, which tells Postgres where to connect for the remote data:
CREATE SERVER foreigndb_fdw
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'p.vbjrfujv5beutaoelw725gvi3i.db.postgresbridge.com',
port '5432', dbname 'postgres', sslmode 'require');
On the destination side, create the user mapping. This is how Postgres knows which user on the foreign side to connect as. In this case, all users on the destination side will connect as the same user.
CREATE USER MAPPING for PUBLIC SERVER
foreigndb_fdw OPTIONS (user 'fdw_user', password 'pizza1');
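If you’d rather not let every local role use that mapping, you can map a specific role instead of PUBLIC (the role name here is hypothetical):
CREATE USER MAPPING FOR reporting_user SERVER
foreigndb_fdw OPTIONS (user 'fdw_user', password 'pizza1');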
On the destination side, import the schema and limit it to the table names you want. This makes it so Postgres has a local table definition that matches the remote table’s definition and can be queried on the destination server.
IMPORT FOREIGN SCHEMA "public"
LIMIT TO(solar_load_with_columns)
FROM SERVER foreigndb_fdw INTO public;
Now your destination can query the foreign table!
SELECT *
FROM solar_load_with_columns;
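And since fdw_user was granted INSERT, UPDATE, and DELETE on the foreign side, writes through the foreign table work too. For example (the value here is made up):
UPDATE solar_load_with_columns
SET polar_cap_absorption = 'none'
WHERE date = '2019-01-01';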
Generate Data with a Query
Sometimes it is useful to generate mock data in Postgres that is more specific to your data model. Here are some easy queries for making fake data.
To create a series of user ids, use something like:
CREATE TABLE mock_login AS
SELECT 'userid' || s AS username FROM generate_series(1,10000) AS s;
To create random time intervals along with the user ids, do something like:
CREATE TABLE mock_login2 AS SELECT 'userid' || s AS username, now() -
random() * interval '10 days' AS last_login FROM generate_series(1,10000) AS s;
And to add in random md5 passwords, use something like:
CREATE TABLE mock_login3
AS SELECT 'userid' || s as username, md5(random()::text) AS password, now() -
random() * interval '10 days' AS last_login FROM generate_series(1,10000) AS s;
Ok, now we’ve got a seed database of usernames, passwords, and login dates.
SELECT * FROM mock_login3 LIMIT 5;
username | password | last_login
----------+----------------------------------+-------------------------------
userid1 | 774b0cc950b990755199479829f019da | 2022-10-31 02:11:53.26224+00
userid2 | 594ec00eb078f7d8a62150597a779b86 | 2022-10-24 05:17:23.53117+00
userid3 | fde356d4e1257ac9dc7afb50f25fed60 | 2022-10-28 16:55:13.853136+00
userid4 | 2b3166be39ad16699f6e78eeba370c0c | 2022-10-24 13:54:36.446941+00
userid5 | 16541644f364416fbbbb321f36a0e56a | 2022-10-27 11:24:40.139086+00
(5 rows)
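The same generate_series trick works for numeric columns too; just scale random() into whatever range you need. A quick sketch with made-up table and column names:
CREATE TABLE mock_orders AS
SELECT s AS order_id,
round((random() * 500)::numeric, 2) AS order_total,
(floor(random() * 5) + 1)::int AS item_count
FROM generate_series(1,10000) AS s;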
If you’re going for data generation, there are also some really fun tools out there for that, like Mockaroo and the Ruby gem Faker.
pgbench
I see a lot of engineers do a quick data load using pgbench since it comes with the database as a separate binary and it's a quick way to get testing with Postgres, regardless of the content. pgbench is useful if you are testing the performance of the database itself without regard for what data is in it. By default pgbench will insert 100,000 rows into the pgbench_accounts table. You can also add in a scale factor (-s) to determine how large the sample data should be. So a scale factor of 100 would be 10,000,000 rows.
pgbench -i -s 100 -d postgres://postgres:vcnQ1AgQwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nct5fdylmo2a5nz2a.db.postgresbridge.com:5432/postgres
SELECT COUNT(*) FROM pgbench_accounts;
count
----------
10000000
(1 row)
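Once the data is in place, the same tool can run a quick benchmark against it, for example 10 client connections over 2 threads for 60 seconds, using the same connection string as above:
pgbench -c 10 -j 2 -T 60 postgres://postgres:vcnQ1AgQwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nct5fdylmo2a5nz2a.db.postgresbridge.com:5432/postgres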
Load It!
Hopefully now you have some very basic data loading strategies ready to go forward and do some testing. If you’re looking for something to load geospatial data, we covered that a while back in Loading Data into PostGIS: An Overview. I always encourage people to do this kind of work in the cloud so you can really see what Postgres can do when it’s properly tuned and resourced. Try Crunchy Bridge out for a few minutes and you’ll see what I mean.