18

This question is reposted from Stack Overflow based on a suggestion in the comments, apologies for the duplication.

Questions

Question 1: as the size of the database table gets larger how can I tune MySQL to increase the speed of the LOAD DATA INFILE call?

Question 2: would using a cluster of computers to load different csv files, improve the performance or kill it? (this is my bench-marking task for tomorrow using the load data and bulk inserts)

Goal

We are trying out different combinations of feature detectors and clustering parameters for image search, as a result we need to be able to build and big databases in a timely fashion.

Machine Info

The machine has 256 gig of ram and there are another 2 machines available with the same amount of ram if there is a way to improve the creation time by distributing the database?

Table Schema

the table schema looks like

+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| match_index   | int(10) unsigned | NO   | PRI | NULL    |                |
| cluster_index | int(10) unsigned | NO   | PRI | NULL    |                |
| id            | int(11)          | NO   | PRI | NULL    | auto_increment |
| tfidf         | float            | NO   |     | 0       |                |
+---------------+------------------+------+-----+---------+----------------+

created with

CREATE TABLE test 
(
  match_index INT UNSIGNED NOT NULL,
  cluster_index INT UNSIGNED NOT NULL, 
  id INT NOT NULL AUTO_INCREMENT,
  tfidf FLOAT NOT NULL DEFAULT 0,
  UNIQUE KEY (id),
  PRIMARY KEY(cluster_index,match_index,id)
)engine=innodb;

Benchmarking so far

First step was to compare bulk inserts vs loading from a binary file into an empty table.

It took:  0:09:12.394571  to do  4,000  inserts with 5,000 rows per insert
It took: 0:03:11.368320 seconds to load 20,000,000 rows from a csv file

Given the difference in performance I have gone with loading the data from a binary csv file, first I loaded binary files containing 100K, 1M, 20M, 200M rows using the call below.

LOAD DATA INFILE '/mnt/tests/data.csv' INTO TABLE test;

I killed the 200M row binary file (~3GB csv file) load after 2 hours.

So I ran a script to create the table, and insert different numbers of rows from a binary file then drop the table, see the graph below.

enter image description here

It took about 7 seconds to insert 1M rows from the binary file. Next I decided to benchmark inserting 1M rows at a time to see if there was going to be a bottleneck at a particular database size. Once the Database hit approximately 59M rows the average insert time dropped to approximately 5,000/second

enter image description here

Setting the global key_buffer_size = 4294967296 improved the speeds slightly for inserting smaller binary files. The graph below shows the speeds for different numbers of rows

enter image description here

However for inserting 1M rows it didn't improve the performance.

rows: 1,000,000 time: 0:04:13.761428 inserts/sec: 3,940

vs for an empty database

rows: 1,000,000 time: 0:00:6.339295 inserts/sec: 315,492

Update

Doing the load data using the following sequence vs just using the load data command

SET autocommit=0;
SET foreign_key_checks=0;
SET unique_checks=0;
LOAD DATA INFILE '/mnt/imagesearch/tests/eggs.csv' INTO TABLE test_ClusterMatches;
SET foreign_key_checks=1;
SET unique_checks=1;
COMMIT;
enter image description here

So this looks quite promising in terms of the database size that is being generated but the other settings don't appear to affect the performance of the load data infile call.

I then tried loading multiple files from different machines but the load data infile command locks the table, due to the large size of the files causing the other machines to time out with

ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

Increasing the number of rows in binary file

rows:  10,000,000  seconds rows:  0:01:36.545094  inserts/sec:  103578.541236
rows:  20,000,000  seconds rows:  0:03:14.230782  inserts/sec:  102970.29026
rows:  30,000,000  seconds rows:  0:05:07.792266  inserts/sec:  97468.3359978
rows:  40,000,000  seconds rows:  0:06:53.465898  inserts/sec:  96743.1659866
rows:  50,000,000  seconds rows:  0:08:48.721011  inserts/sec:  94567.8324859
rows:  60,000,000  seconds rows:  0:10:32.888930  inserts/sec:  94803.3646283

Solution: Precomputing the id outside of MySQL instead of using auto increment

Building the table with

CREATE TABLE test (
  match_index INT UNSIGNED NOT NULL,
  cluster_index INT UNSIGNED NOT NULL, 
  id INT NOT NULL ,
  tfidf FLOAT NOT NULL DEFAULT 0,
  PRIMARY KEY(cluster_index,match_index,id)
)engine=innodb;

with the SQL

LOAD DATA INFILE '/mnt/tests/data.csv' INTO TABLE test FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';"

enter image description here

Getting the script to pre-compute the indexes appears to have removed the performance hit as the database grows in size.

Update 2 - using memory tables

Roughly 3 times faster, without taking into account the cost of moving an in-memory table to disk-based table.

rows:  0  seconds rows:  0:00:26.661321  inserts/sec:  375075.18851
rows:  10000000  time:  0:00:32.765095  inserts/sec:  305202.83857
rows:  20000000  time:  0:00:38.937946  inserts/sec:  256818.888187
rows:  30000000  time:  0:00:35.170084  inserts/sec:  284332.559456
rows:  40000000  time:  0:00:33.371274  inserts/sec:  299658.922222
rows:  50000000  time:  0:00:39.396904  inserts/sec:  253827.051994
rows:  60000000  time:  0:00:37.719409  inserts/sec:  265115.500617
rows:  70000000  time:  0:00:32.993904  inserts/sec:  303086.291334
rows:  80000000  time:  0:00:33.818471  inserts/sec:  295696.396209
rows:  90000000  time:  0:00:33.534934  inserts/sec:  298196.501594

by loading the data into a memory based table and then copying it to a disk based table in chunks had an overhead of 10 min 59.71 sec to copy 107,356,741 rows with the query

insert into test Select * from test2;

which makes it approximately 15 minutes to load 100M rows, which is approximately the same as directly inserting it into a disk based table.

Ben
  • 283
  • 2
  • 6
  • 1
    I think changing the primary key to just `id` should be faster. (Although I think you are not looking for this) – DavidEG Oct 02 '11 at 10:25
  • Hi David, thanks for the comment, unfortunately without the key the queries we need to do aren't fast enough ( the logic behind the primary key selection is outlined in this post http://stackoverflow.com/questions/4282526/mysql-group-by-optimization ) – Ben Oct 02 '11 at 12:17
  • 1
    Is this just for testing? You might want to look at the MySQL MEMORY engine: http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html If you are planning to deploy this as an architecture I'm curious how you plan to recover from failures, it seems like something that would be better handled by MapReduce/Hadoop. – polynomial Oct 02 '11 at 17:21
  • Hi polynomial, thanks for the tip, at the moment we are just testing different feature detectors at different scales, once the database is generated it wont change much (in the current spec anyway) – Ben Oct 03 '11 at 01:39

3 Answers3

4

Good question - well explained.

how can I tune MySQL to increase the speed of the LOAD DATA INFILE call?

You've already got a high(ish) setting for the key buffer - but is it enough? I'm assuming this is a 64-bit installation (if not then the first thing you need to do is upgrade) and not running on MSNT. Have a look at the output of mysqltuner.pl after running a few tests.

In order to use the cache to best effect, you may find benefits in batching/pre-sorting the input data (most recent versions of the 'sort' command have a lot of functionality for sorting large datasets). Also if you generate the ID numbers outside of MySQL, then it may be more efficient.

would using a cluster of computers to load different csv files

Assuming (again) that you want to have the output set behave as a single table, then the only benefits you'll get are by distributing the work of sorting and generating ids - which you don't need more databases for. OTOH using a database cluster, you will get problems with contention (which you shouldn't see other than as performance problems).

If you can shard the data and handle the resulting datasets independently, then yes, you will get performance benefits - but this does not negate the the need to tune each node.

Check you've got at least 4 Gb for the sort_buffer_size.

Beyond that, the limiting factor on performance is all about disk I/O. There's lots of ways to address this - but you should probably be considering a mirrored set of striped datasets on SSDs for optimal performance.

symcbean
  • 19,931
  • 1
  • 29
  • 49
1
  • Consider your limiting factor. It's almost certainly single-threaded CPU processing.
  • You've already determined that load data... is faster than insert, so use that.
  • You've already determined that really large files (by row number) slow things down a lot; you want to break them up into pieces.
  • Using non-overlapping primary keys, queue up at least N*CPU sets, using no more than one million rows... probably less (benchmark).
  • Use sequential blocks of primary keys in each file.

If you want to be really spiffy, you can create a multi-threaded program to feed a single file to a collection of named pipes and manage the insert instances.

In summary, you don't tune MySQL for this so much as you tune your workload to MySQL.

Jeff Ferland
  • 20,239
  • 2
  • 61
  • 85
-1

I do not remember exactly the syntacx but if it's inno db you can turn off foreign key check.

Also you can create the index after the import, it's can be a really performance gain.

  • Defering the index rebuild only improves performance where the number of rows already in the table is significantly smaller than the number of rows you are adding. – symcbean Oct 03 '11 at 12:38