
I have read here and there that the utf8_unicode_ci collation gives better treatment of Unicode text (for example, it knows how to expand characters such as 'œ' into 'oe' for searching and ordering) than the default utf8_general_ci, which basically just strips diacritics. Unfortunately, the same sources indicate that utf8_unicode_ci is slightly slower than utf8_general_ci.
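As a quick illustration of the difference (a sketch, assuming a utf8 connection character set; exact results may vary by MySQL version):

SET NAMES utf8;
-- utf8_general_ci has no expansions, so the ligature does not match: returns 0
SELECT 'œ' = 'oe' COLLATE utf8_general_ci;
-- utf8_unicode_ci expands 'œ' into 'oe': returns 1
SELECT 'œ' = 'oe' COLLATE utf8_unicode_ci;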

So my question is: what does "slightly slower" mean? Has anyone run benchmarks? Are we talking about a -0.01% performance impact or rather something like -25%?

Thanks for your help.

MiniQuark
  • As far as a benchmark goes, why not use the query time? I may be an idiot, but what if you ran up a VM and tested the query time of a large, complicated query under both collations? (I have not seen benchmarking done for this before.) – Ablue Dec 30 '10 at 13:00

2 Answers


Well, I did not find any benchmarks on the Internet, so I decided to make some myself.

I created a very simple table with 500000 rows:

CREATE TABLE test(
  ID INT(11) DEFAULT NULL,
  Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

Then I filled it with random data by running this stored procedure:

CREATE PROCEDURE randomizer()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE random CHAR(20);

  theloop: LOOP
    -- Build a random alphanumeric string by converting a random decimal
    -- number to base 36 (the source base for CONV must be 10 here, since
    -- FLOOR produces a decimal number).
    SET random = CONV(FLOOR(RAND() * 99999999999999), 10, 36);

    INSERT INTO test VALUES (i + 1, random);

    SET i = i + 1;

    IF i = 500000 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END
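(If you are pasting these procedures into the mysql command-line client, switch the statement delimiter first, e.g. with DELIMITER $$, so the semicolons inside the bodies do not terminate the CREATE PROCEDURE early.) The table is then populated with a single call:

CALL randomizer();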

Then I created the following stored procedures to benchmark simple SELECT, SELECT with LIKE, and sorting (SELECT with ORDER BY):

CREATE PROCEDURE benchmark_simple_select()
BEGIN
  DECLARE i INT DEFAULT 0;

  theloop: LOOP
    -- There is no index on Description, so every row's value is
    -- compared against 'test' under the given collation.
    SELECT * FROM test WHERE Description = 'test' COLLATE utf8_general_ci;

    SET i = i + 1;

    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END

CREATE PROCEDURE benchmark_select_like()
BEGIN
  DECLARE i INT DEFAULT 0;

  theloop: LOOP
    -- The leading wildcard rules out index use, so every row is
    -- pattern-matched under the given collation.
    SELECT * FROM test WHERE Description LIKE '%test' COLLATE utf8_general_ci;

    SET i = i + 1;

    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END

CREATE PROCEDURE benchmark_order_by()
BEGIN
  DECLARE i INT DEFAULT 0;

  theloop: LOOP
    -- Pick a random ID threshold between 1 and 400000; every row above
    -- it must be compared during the sort, even though only 1000 rows
    -- are returned.
    SELECT * FROM test WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
    ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;

    SET i = i + 1;

    IF i = 10 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END

The stored procedures above use the utf8_general_ci collation, but during the tests I of course ran them with both utf8_general_ci and utf8_unicode_ci.
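The only difference between the two variants is the COLLATE clause; for example, the utf8_unicode_ci variant of the simple SELECT reads:

SELECT * FROM test WHERE Description = 'test' COLLATE utf8_unicode_ci;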

I called each stored procedure 5 times for each collation (5 times for utf8_general_ci and 5 times for utf8_unicode_ci) and then calculated the average values.
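For reference, a run is just a series of CALLs, recording the execution time reported by the client for each one (I am assuming client-reported wall-clock time here):

CALL benchmark_simple_select();
CALL benchmark_select_like();
CALL benchmark_order_by();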

Here are the results:

benchmark_simple_select() with utf8_general_ci: 9957 ms
benchmark_simple_select() with utf8_unicode_ci: 10271 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 3.2%.

benchmark_select_like() with utf8_general_ci: 11441 ms
benchmark_select_like() with utf8_unicode_ci: 12811 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 12%.

benchmark_order_by() with utf8_general_ci: 11944 ms
benchmark_order_by() with utf8_unicode_ci: 12887 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 7.9%.

nightcoder

I didn't see any benchmarks either, but you can run your own using the BENCHMARK function:

BENCHMARK(count,expr)
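For instance, a rough micro-benchmark of the two collations could look like this (a sketch only: BENCHMARK measures expression evaluation on the server, and the absolute numbers are only meaningful relative to each other):

SELECT BENCHMARK(1000000, 'text' = 'test' COLLATE utf8_general_ci);
SELECT BENCHMARK(1000000, 'text' = 'test' COLLATE utf8_unicode_ci);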

As advised by Matthew, you may run a parallel installation of MySQL, but keep in mind that results can differ considerably between architectures (SPARC, Intel, 32-bit, 64-bit, ...).

tmow