Populating a PostgreSQL database
It is often the case that we need to populate a table with initial data. A typical approach is to run an SQL script that performs a bulk INSERT, but this does not scale well to tables with millions of rows. For these cases, PostgreSQL provides the COPY command, a very efficient way of loading a large amount of data.
The command is used as follows:
COPY table_name(column1, column2, column3, ...) FROM 'data.csv' DELIMITER ';' CSV HEADER;
where we specify the target table, the columns to fill, and the source file, together with the delimiter and whether the file starts with a header row.
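Note that COPY ... FROM 'file' reads the file on the database server, so the file must be accessible from the server's filesystem and the command requires the appropriate privileges. If the file only exists on the client machine, psql's \copy meta-command accepts the same options but streams the file from the client instead. A minimal sketch, reusing the hypothetical table and file names from above:

\copy table_name(column1, column2, column3) FROM 'data.csv' DELIMITER ';' CSV HEADER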
To illustrate, let's create a table that holds household electric power consumption data. We can use a pre-built Docker image containing a PostgreSQL installation and a public dataset for testing.
$ cat docker-compose
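The contents of the compose file are not shown here, but a minimal sketch could look like the following. It assumes the official postgres image, a service named db (to match the docker-compose up -d db command below), and illustrative credentials; none of these values come from the original setup.

version: "3"
services:
  db:
    image: postgres:16              # assumed version; any recent PostgreSQL image works
    environment:
      POSTGRES_PASSWORD: postgres   # illustrative credentials only
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

With this layout, the CSV file can be copied into the running container so that it ends up at the path the COPY command expects later, for example with docker cp household_power_consumption.csv <container>:/var/lib/postgresql/data/.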
We are now ready to start our database:
$ docker-compose up -d db
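The table definition itself is not shown above. A plausible sketch, inferred from the column list used in the COPY command below, can be run through psql inside the container; the column names match the dataset, but the text types are a deliberately conservative assumption (the raw file uses a day-first date format and placeholder markers for missing values, which stricter types would reject without extra handling):

$ docker-compose exec db psql -U postgres

CREATE TABLE power_consumption (
    Date                  text,
    Time                  text,
    Global_active_power   text,
    Global_reactive_power text,
    Voltage               text,
    Global_intensity      text,
    Sub_metering_1        text,
    Sub_metering_2        text,
    Sub_metering_3        text
);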
Finally, we can load the data into our newly created table:

\timing on
COPY power_consumption(Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3) FROM '/var/lib/postgresql/data/household_power_consumption.csv' DELIMITER ';' CSV HEADER;
Time: 7906.845 ms
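As a quick sanity check, a simple count confirms how many rows made it in:

SELECT count(*) FROM power_consumption;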
It took roughly 8 seconds to insert 2 075 259 records. The operation works best when the target table has no indexes or foreign keys, since each of those adds overhead to every inserted row; it is usually preferable to create them after the bulk load.
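For example, if we later want to filter by date, the index can be built once after the bulk load rather than being maintained row by row during it (the index name and column choice here are only illustrative):

CREATE INDEX idx_power_consumption_date ON power_consumption (Date);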