Importing millions of rows using Postgres COPY
Our team needed to do a one-off backfill of ~9M rows into our production database. We looked into ways to do this most efficiently and decided on Postgres’ COPY command. From some research, it seemed like this command is the de facto method of moving mass amounts of data into (and out of) Postgres.
Copy from CSV
First challenge: decide on the input method. We chose CSV, which worked fine in some test imports we did with small amounts of data.
psql -c "COPY $TABLE (id, user_id, content_id, created_at) FROM 'input.csv' FORMAT csv, HEADER MATCH"
It also passed the test of importing the real data into my local database. But when I ran the command on the production server, it failed because Postgres was looking for the file on the server’s own file system.
The file must be accessible by the PostgreSQL user (the user ID the server runs as) and the name must be specified from the viewpoint of the server. (Source)
(Technically, it didn’t fail, but logged a COPY 0 — meaning it imported nothing.)
Bummer. Getting a file onto the server would be a massive pain, as it’s RDS, and I didn’t know at the time that I could import files from S3. Though even if I did, the S3 approach requires admin privileges on the RDS server, which we don’t have and couldn’t easily obtain.
What I should have tried
Use \copy to perform a frontend (client) copy. It’s identical to the COPY command, but it reads/writes files on the client’s file system, routing the data to the server as needed.
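For our import, that would look something like the following one-liner (same table, columns, and CSV options as the earlier command; note that HEADER MATCH needs PostgreSQL 15+):
psql -c "\copy $TABLE (id, user_id, content_id, created_at) from 'input.csv' with (format csv, header match)"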
But…
We still don’t know if we would have run into a statement timeout error using this method. I’ve run into this issue using the STDIN method, but I don’t see people complaining about hitting it with the CSV method.
Update (2023-11-06): I’ve tried \copy and ran into a statement timeout error, albeit a longer one (~3 minutes).
There’s a spot-on Stack Overflow thread that answers this exact question. The reason STDIN fails while \copy ... from '/path/to/data.csv' doesn’t is that the STDIN method generates one massive string that it then tries to feed as a single argument. No chunking.
In the CSV method, the connection stays alive because the file’s content is being streamed chunk by chunk. Seems to just be the way the Postgres COPY command is built. Maybe we can prove this some day by looking at the source code. 🙂
Copy from STDIN
Well, this is the point where I kinda diverged. I didn’t know why my COPY command was importing 0 rows (RTFM!), so I fell back to the handy method I’d already tested and that had never failed me: COPY ... FROM STDIN
psql -c "COPY $TABLE (id, user_id, content_id, created_at) FROM STDIN DELIMITER ','" << input.csv
Worked for small inputs. Worked for big inputs into my local DB. Failed for big inputs on production because of the statement timeout (60 seconds). It became apparent that there was no built-in chunking with the STDIN method.
Chunking the imports
I created a bash script (with the massive help of GitHub Copilot 🤖) that chunks the data using tail and head and pipes each chunk into the STDIN command.
# This is a simplified version of the script, for brevity
# Usage: sqlcopy.sh <target_table> <file>
# Assumes USER, HOST, PORT and DB are already set in the environment
TABLE=$1
FILE=$2
CHUNK_SIZE=50000

ROWS=`wc -l "$FILE" | awk '{print $1}'`
CHUNKS=`expr $ROWS / $CHUNK_SIZE`
REMAINDER=`expr $ROWS % $CHUNK_SIZE`

# if the remainder is greater than 0, add 1 to the number of chunks
if [ $REMAINDER -gt 0 ]; then
  CHUNKS=`expr $CHUNKS + 1`
fi

for (( i=1; i<=$CHUNKS; i++ ))
do
  START_ROW=`expr $CHUNK_SIZE \* $i - $CHUNK_SIZE + 1`
  END_ROW=`expr $CHUNK_SIZE \* $i`

  # if the end row is greater than the number of rows, set it to the number of rows
  if [ $END_ROW -gt $ROWS ]; then
    END_ROW=$ROWS
  fi

  # stream rows START_ROW..END_ROW of the file into COPY ... FROM STDIN
  tail -n +$START_ROW "$FILE" | head -n $CHUNK_SIZE |
    PGCONNECT_TIMEOUT=6 psql -U $USER -h $HOST -p $PORT -d $DB \
      -c "COPY $TABLE (id, user_id, content_id, created_at) FROM STDIN DELIMITER ','"

  # stop if the copy failed ($? is psql's exit status, the last command in the pipeline)
  if [ $? -ne 0 ]; then
    echo "copy of rows $START_ROW to $END_ROW failed"
    exit 1
  fi

  echo "copied rows $START_ROW to $END_ROW"
  REMAINING=`expr $ROWS - $END_ROW`
  echo "$REMAINING rows remaining"
done
Worked nicely, except that I ran into a couple of timeouts, in which case I had to delete the already-imported rows from the input file because the script isn’t smart enough to continue where it left off.
Now that the data is in…
After I got the data into a temporary table, my next challenge was to move it into the real table, but only after doing a bit of filtering along the way.
INSERT INTO real_table (id, user_id, content_id, created_at)
SELECT tt.id, tt.user_id, tt.content_id, tt.created_at
FROM tmp_table tt
JOIN users u ON u.id = tt.user_id
WHERE tt.created_at < '2022-10-01T12:00:00.000Z'::timestamp
For ~9M rows, this timed out.
Looked into chunking, once again. I finally got it to work by inserting chunks of 1 million rows, using a combination of LIMIT 1000000 and date ranges (... AND tt.created_at > {offset_date}) based on a cursor timestamp that I had to figure out manually for each chunk.
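Roughly, each chunk was a query along these lines (a sketch rather than the exact query; the ORDER BY is assumed here so that the LIMIT and the timestamp cursor stay consistent, and {offset_date} stands for the cursor carried over from the previous chunk):
INSERT INTO real_table (id, user_id, content_id, created_at)
SELECT tt.id, tt.user_id, tt.content_id, tt.created_at
FROM tmp_table tt
JOIN users u ON u.id = tt.user_id
WHERE tt.created_at < '2022-10-01T12:00:00.000Z'::timestamp
  AND tt.created_at > '{offset_date}'::timestamp  -- cursor from the previous chunk
ORDER BY tt.created_at
LIMIT 1000000;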
Was a major pain. I didn’t expect the table-to-table inserts to be an extra obstacle on top of the COPY shenanigans.
What could’ve helped here
I’ve now found a lot of advice that suggests the following to speed up bulk data loading:
- Many have suggested setting your table as UNLOGGED to save on WAL generation, speeding up the imports significantly. [1]
  - This answer also has a great suggestion of running the CREATE TABLE and the COPY in the same transaction, to generate minimal WAL. In general, it highlights a clean way of containing everything in one transaction to avoid losing data integrity. (There’s a sketch of this after the list.)
- Drop and recreate indexes - there’s a lot of overhead in updating the index as you add rows to the table; it’s faster to just recreate the index after the bulk insert
  - Is it possible to CREATE INDEX and DROP INDEX as part of a transaction? Can these operations be rolled back just as any other query? Worth testing out.
  - A safe way to do this approach: clone the entire target table, indexes and all, and play around with that. If it all turns out well, you can just rename the table.
- Drop and recreate foreign keys - if the target table has FKs, each row addition will be checked against the FK constraint. This is likely what caused my issue!
  - These are a bit risky to just drop and recreate, as we won’t know if we’ve violated the constraints during the import; we’ll only find out when recreating the FK.
  - The same safe way can be applied to this approach: clone the table with FKs, drop the FKs, load the data, then recreate the FKs. If anything is messed up, at least you didn’t mess with the real data.
  - Another possible safe way: can we include the FK re-creation as part of a transaction, so we can roll back if restoring the FK fails because of constraint violations? Worth testing out (see the sketch after this list).
- (Update 2023-11-06) AWS recommends importing data into RDS via S3. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.S3Import.html
  - Downside: this needs an extension to be installed, and can only be done by an admin.
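Putting a few of those ideas together, here is a rough sketch of what the one-transaction approach could look like as a psql script. It’s untested and partly hypothetical: the column types, the real_table_created_at_idx index name, and the real_table_user_id_fkey constraint name are placeholders, and the plain (non-CONCURRENTLY) forms of CREATE INDEX and ALTER TABLE are used because those are the ones that can run inside a transaction.
BEGIN;

-- UNLOGGED staging table, so loading it skips most WAL;
-- creating it in the same transaction as the load follows the suggestion above
CREATE UNLOGGED TABLE tmp_table (
  id         bigint,
  user_id    bigint,
  content_id bigint,
  created_at timestamp
);

-- \copy works inside the transaction because psql reuses the same connection
\copy tmp_table (id, user_id, content_id, created_at) from 'input.csv' with (format csv)

-- drop the (hypothetical) index and FK on the target table before the big insert
DROP INDEX real_table_created_at_idx;
ALTER TABLE real_table DROP CONSTRAINT real_table_user_id_fkey;

INSERT INTO real_table (id, user_id, content_id, created_at)
SELECT tt.id, tt.user_id, tt.content_id, tt.created_at
FROM tmp_table tt
JOIN users u ON u.id = tt.user_id
WHERE tt.created_at < '2022-10-01T12:00:00.000Z'::timestamp;

-- recreate the index and the FK; both operations are transactional,
-- so if the FK re-validation finds a violation, everything rolls back
CREATE INDEX real_table_created_at_idx ON real_table (created_at);
ALTER TABLE real_table
  ADD CONSTRAINT real_table_user_id_fkey FOREIGN KEY (user_id) REFERENCES users (id);

DROP TABLE tmp_table;

COMMIT;
If that final ALTER TABLE fails, the whole transaction (including the bulk insert) rolls back, which would answer the rollback question above. The exclusive locks these statements take on real_table are still a good reason to rehearse the whole thing on a cloned table first, as suggested.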