I'm trying to load a CSV of about 100M records (around 8 GB on disk) into Postgres via the COPY command:

COPY mytable FROM 'path/to/myfile.csv' WITH CSV;

I have been monitoring the progress by checking the reported table size in pgAdmin and comparing it with the CSV size. I know that's going to be a loose comparison at best, and I'd love to hear if there's a better way to monitor progress.
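
For reference, the size that pgAdmin reports can also be checked directly in SQL; note that pg_total_relation_size includes indexes and TOAST data, so it will run somewhat larger than the raw heap:

-- Rough progress check: total on-disk size of mytable (heap + indexes + TOAST)
SELECT pg_size_pretty(pg_total_relation_size('mytable'));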

Here's the issue: this load has been running for quite a long time (too long, I think), and as I keep checking the table size, the loading seems to be slowing down. That is, it now takes much longer to load a new 100 MB of data than it did earlier in the load. Why?

Is there any tuning, configuration, or alternate approach I can take for a faster load other than breaking up my CSV into many smaller files?


Update: schema/data specifics

One representative data row:

1234567890,FOOBARF,2010-01-15 03:07:05,0.924700,0.925000

Complete schema definition:

CREATE TABLE mytable
(
  id integer NOT NULL,
  rname character varying(7) NOT NULL,
  ts timestamp without time zone NOT NULL,
  stat1 numeric NOT NULL,
  stat2 numeric NOT NULL,
  CONSTRAINT pk_id PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE mytable OWNER TO postgres;
G__

1 Answer

You might have better luck if you can disable indexes during the load, but it's hard to say more without additional information about the table.

Please post the table definition, constraints, indexes, and triggers.

Also, are you sure the CSV data is correct and matches your table?
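
For example, a quick way to see all of that in one place (a sketch, assuming you have psql access; \d lists the columns, indexes, constraints, and triggers of a table):

\d mytable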

Leo
  • See edits on post. There are no indexes (yet; I was planning on adding them after I got the data in). – G__ Jul 08 '10 at 17:21
  • I've run some tests. (First: you clearly do have an index on id, via the primary key.) I created a 10M row file with your data, incrementing only the ID. On a very nice server it took 1m40s to load. Then I dropped your index and it took 55s to load. – Leo Jul 08 '10 at 19:22
  • Removing the PK did the trick. I guess the issue is that the PK needed to ensure uniqueness, and therefore as the list grew it had more and more existing records to check to ensure each new one was unique. The result: every subsequent record takes a bit longer to insert than the previous one. – G__ Jul 08 '10 at 22:22
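
For reference, a minimal sketch of the load pattern that resolved this, using the table and constraint names from the schema above: drop the primary key, bulk load, then re-add it so the index is built once at the end rather than maintained row by row.

-- Drop the primary key so COPY does not have to maintain the index
-- and check uniqueness for every inserted row
ALTER TABLE mytable DROP CONSTRAINT pk_id;

-- Bulk load the data
COPY mytable FROM 'path/to/myfile.csv' WITH CSV;

-- Rebuild the primary key in one pass (this will fail if duplicate ids exist)
ALTER TABLE mytable ADD CONSTRAINT pk_id PRIMARY KEY (id);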