10

I have a table with 1.4 billion records. The table structure is as follows:

CREATE TABLE text_page (
    text VARCHAR(255),
    page_id INT UNSIGNED
) ENGINE=MYISAM DEFAULT CHARSET=ascii

The requirement is to create an index over the column text.

The table size is about 34G.

I have tried to create the index with the following statement:

ALTER TABLE text_page ADD KEY ix_text (text)

After 10 hours of waiting I finally gave up on this approach.

Is there any workable solution to this problem?

UPDATE: the table is unlikely to have rows updated, inserted, or deleted. The reason for creating an index on the column text is that this kind of SQL query will be executed frequently:

SELECT page_id FROM text_page WHERE text = ?

UPDATE: I have solved the problem by partitioning the table.

The table is partitioned into 40 pieces on column text. Creating the index then takes about 1 hour to complete.

It seems that MySQL index creation becomes very slow when the table is very large, and partitioning breaks the table into smaller chunks.

SiLent SoNG
  • What's wrong with using the normal `CREATE INDEX` statement? –  May 10 '10 at 17:06
  • I'd suggest this question might be better off on ServerFault - it's more of a DB admin than a programming question. – therefromhere May 10 '10 at 17:07
  • @Derk: the normal CREATE INDEX approach is too slow. I have to complete the task within 1 day. –  May 10 '10 at 17:09
  • Hmm... I don't think that you can get around this. Building the index requires the DBMS to scan through all the records, gather their "text" fields and insert/change the corresponding tree nodes/subtrees. And this takes much time for 34G... – chiccodoro May 10 '10 at 17:09
  • How much memory does your DB server have? Have you configured MySQL to use all that memory, or is it limiting itself? –  May 10 '10 at 17:10
  • @tster: I have configured the following memory settings: key_buffer_size: 1G, sort_buffer_size: 32M, myisam_sort_buffer_size: 32M; the tmp file is located on a disk with 100GB of available space. –  May 10 '10 at 17:13
  • That looks like how many McDonald's has served. Just limit your data to `Where NutritionalVaue > 0` and it's quite manageable without an index. – John K May 10 '10 at 18:01

7 Answers

4

Could it be your system just is not up to the task? I do not use MySQL (SQL Server here), but I know the pain of indexing an 800 million entry table. Basically.... you need the right hardware for that (as in: lots of fast discs). I now use nearly a dozen Velociraptors and performance is great ;)

SQL servers (meaning database servers that speak SQL, not just MS SQL Server) live and die by disc access, and normal discs are just not up to the task of larger operations.

TomTom
  • My suspicion is that index creation is usually very fast if the record count is small, say millions, but when the count is in the billions index creation becomes very slow. It seems like the time growth is exponential. –  May 10 '10 at 17:16
  • Should not really be. MySQL in general has limits, but it is not a crap database, and that would be VERY bad. Index generation gets slower, but by log(n), not (n), so it should not really be THAT bad. – TomTom May 10 '10 at 17:34
4

You might want to create an index on the first (for instance, 10) characters of the text field.

From the Docs:

Indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:

CREATE INDEX ix_text ON text_page (text(10))
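
If you are unsure which prefix length keeps enough selectivity, a quick check along these lines can help. This is only a sketch: the length of 10 and the one-million-row sample are arbitrary assumptions.

-- How much selectivity does a 10-character prefix keep, compared to the
-- full column? Run on a sample to avoid scanning all 1.4 billion rows.
SELECT COUNT(DISTINCT LEFT(text, 10)) / COUNT(*) AS prefix_selectivity,
       COUNT(DISTINCT text) / COUNT(*) AS full_selectivity
FROM (SELECT text FROM text_page LIMIT 1000000) AS sample;

The closer prefix_selectivity is to full_selectivity, the less the shorter index gives up on lookups.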
4

I have solved the problem by partitioning the table.

The table is partitioned into 40 pieces on column text. Creating the index then takes about 1 hour to complete.

It seems that MySQL index creation becomes very slow when the table is very large, and partitioning breaks the table into smaller chunks.
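
For reference, a sketch of what the partitioning step could look like. The exact statements are an assumption (they are not shown in the post); PARTITION BY KEY hashes the column, so the equality lookups can also benefit from partition pruning.

-- Hypothetical: split the table into 40 hash partitions on `text`,
-- then build the index; each partition is sorted separately.
ALTER TABLE text_page PARTITION BY KEY (text) PARTITIONS 40;
ALTER TABLE text_page ADD KEY ix_text (text);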

SiLent SoNG
3

Set the sort_buffer_size to 4GB (or however much you can depending on how much memory you have).

Right now the CREATE INDEX is doing a sort, but since you have a 32 MB sort_buffer_size it is basically thrashing the hard drive needlessly.
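
A sketch of applying this only to the session that runs the ALTER. The 4 GB values follow the suggestion above and are illustrative; note that for a MyISAM index build done by "repair by sorting" it is myisam_sort_buffer_size that is actually used, and see the comment below on how large sort_buffer_size itself should be.

SET SESSION myisam_sort_buffer_size = 4294967296;  -- 4 GB, used by MyISAM "repair by sorting"
SET SESSION sort_buffer_size = 4294967296;         -- 4 GB, as suggested above; see the comment below
ALTER TABLE text_page ADD KEY ix_text (text);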

tster
  • These posts are pretty much directly disagreeing with you: http://www.xaprb.com/blog/2010/05/09/how-to-tune-mysqls-sort_buffer_size/ and better http://ronaldbradford.com/blog/more-on-understanding-sort_buffer_size-2010-05-10/ It sounds like that's not a global value, it's per-query, so that's 4GB per query you're recommending. Also, when it exceeds 256K it gets mem-mapped to disk rather than being actual in-memory memory. If you keep it small it requires multiple passes, but it avoids disk (it doesn't swap). – Ry4an Brase Jul 19 '10 at 14:11
3

If you do not need to make queries like:

SELECT page_id FROM text_page WHERE text LIKE '?%';

I would suggest creating a new hash column and indexing the table by that column. The overall size of the table + index might be much smaller.
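
A sketch of that idea, assuming a CRC32-based hash (the column and index names are made up; since CRC32 collides, the query re-checks the exact string):

-- Hypothetical 4-byte hash column; note that each of these statements is
-- itself a full-table operation on 1.4 billion rows.
ALTER TABLE text_page ADD COLUMN text_crc INT UNSIGNED;
UPDATE text_page SET text_crc = CRC32(text);
ALTER TABLE text_page ADD KEY ix_text_crc (text_crc);

-- Lookups filter on the small integer index first, then verify the string.
SELECT page_id FROM text_page
WHERE text_crc = CRC32('some text') AND text = 'some text';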

UPD: By the way, 1.4 billion primary-key integers occupy about 6 GB; set against the 34 GB table size, that means the average string length is under 30 characters, so indexing on a prefix might be preferable.

You should also take a look at the MERGE storage engine.

newtover
2

One way to do this is to create a new table with the index set and copy the data to the new table.

Also, make sure you have enough temp space.
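
A minimal sketch of that approach (the new table name is made up); as the comments below note, the bulk copy itself can be the slow part on a table this size.

-- Create an empty copy that already has the index, then bulk-copy the rows.
CREATE TABLE text_page_indexed LIKE text_page;
ALTER TABLE text_page_indexed ADD KEY ix_text (text);
INSERT INTO text_page_indexed SELECT * FROM text_page;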

decompiled
  • I have tried this approach. After 10 hours less than 1% of the data had been copied over to the new table. –  May 10 '10 at 17:08
  • Dude... it's 1.4 BILLION records. Not million, BILLION. That's a lot. It's gonna take a while regardless. –  May 10 '10 at 17:09
  • If you do choose this method, break the copy into smaller chunks, say about 100 to 200 million rows per copy. –  May 10 '10 at 17:16
  • @decompiled, breaking it into smaller chunks won't do anything (actually, it might make it less efficient). @Bryan, even with 1.4 billion records, it shouldn't take 1,000 hours. –  May 10 '10 at 17:35
0

In case you're still wondering how to do this best, I would suggest using an online alter-table tool.

There are a lot of them on the internet; some of the well-known ones are:

We have had the same issue with big tables (more than 500 million records) and the altering works perfectly. The tool creates a new tmp table, adds triggers on the original table (to capture new update/delete/insert records), and in the meantime copies all the records to the new table (with the new structure).
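
A rough sketch of the pattern such tools automate; every name here is hypothetical, and real tools also add the matching UPDATE/DELETE triggers, copy in throttled chunks, and clean up after themselves.

-- Hypothetical outline of what an online alter tool does under the hood.
CREATE TABLE text_page_new LIKE text_page;
ALTER TABLE text_page_new ADD KEY ix_text (text);

-- A trigger keeps the copy in sync while the bulk copy runs.
CREATE TRIGGER text_page_ai AFTER INSERT ON text_page
FOR EACH ROW INSERT INTO text_page_new VALUES (NEW.text, NEW.page_id);

-- ...matching UPDATE/DELETE triggers and a chunked copy go here...
INSERT INTO text_page_new SELECT * FROM text_page;

-- Finally swap the tables.
RENAME TABLE text_page TO text_page_old, text_page_new TO text_page;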

Good luck!

Ali Alwash