MySql: Daily crashes for the last 3 weeks on beefy server

Question

My MySQL server running the latest version of MySql on Ubuntu 16.x keeps crashing once or twice a day. Sometimes it repairs itself fairly quickly (10 minutes). Sometimes I have to reboot and do an fsck to get things running again.

What would be causing this?

Things I've tried so far:

increased RAM from 1.5GB up to 5GB.
Hardware Upgrades: MotherBoard, processor, RAM (DDR4) but that didn't help (it was running a 7 year old processor, now running 7th Gen Core I5).
Setup UFW firewall to ensure it wasn't being caused by bots attacking MySQL or other services.
In my.cnf, changed innodb_buffer_pool_size from 128MB to 500MB. didn't help but still in place
I've run: mysqlcheck -u root -p --auto-repair --optimize --all-databases multiple times. didn't help
In my.cnf, Decreased mysql max_connections from 151 to 80 and restarted mysql. didn't help
Decreased apache MaxRequestWorkers from 150 to 100. Didn't help. Still crashing.
I already had a 1GB Swap file. Left it.
Scoured through Apache2 logs, SysLog, any other log that seemed appropriate but haven't found anything that caught my eye.
Shutdown the server and tried to move the VM to another drive but it fails with File Error.
My latest suspicion is this is being caused by a bad block but running badblocks seemed to trigger a crash at 25% complete. During the fsck I see this: fsck critical medium error, dev sda, sector 147306432

Here is a typical mysql error log:

2017-04-20T18:43:46.958430Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11791ms. The settings might not be optimal. (flu shed=92 and evicted=0, during the time.)
2017-04-20T18:44:11.989905Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6822ms. The settings might not be optimal. (flushed=8 and evicted=0, during the time.)
2017-04-20T18:44:49.145162Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5021ms. The settings might not be optimal. (flus hed=0 and evicted=0, during the time.)
2017-04-20T18:45:22.322429Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 26338ms. The settings might not be optimal. (flu shed=10 and evicted=0, during the time.)
2017-04-20T18:45:53.926808Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4510ms. The settings might not be optimal. (flus hed=0 and evicted=0, during the time.)
2017-04-20T18:46:03.097400Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5384ms. The settings might not be optimal. (flus hed=13 and evicted=0, during the time.)
2017-04-20T18:46:39.247467Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 14848ms. The settings might not be optimal. (flu shed=8 and evicted=0, during the time.)
2017-04-20T18:47:16.271672Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 29107ms. The settings might not be optimal. (flu shed=8 and evicted=0, during the time.)
2017-04-20T18:47:53.669557Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5969ms. The settings might not be optimal. (flus hed=37 and evicted=0, during the time.)
2017-04-20T18:50:23.879411Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 37671ms. The settings might not be optimal. (flu shed=6 and evicted=0, during the time.)
2017-04-20T18:55:07.190725Z 0 [Warning] Changed limits: max_open_files: 1024 (requested 5000) 2017-04-20T18:55:07.235759Z 0 [Warning] Changed limits: table_open_cache: 431 (requested 2000)
2017-04-20T18:55:10.486670Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_times tamp server option (see documentation for more details).
2017-04-20T18:55:11.563578Z 0 [Note] /usr/sbin/mysqld (mysqld 5.7.17-0ubuntu0.16.04.2) starting as process 24701 ...
2017-04-20T18:55:21.979225Z 0 [Note] InnoDB: PUNCH HOLE support available
2017-04-20T18:55:21.979250Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-04-20T18:55:21.979253Z 0 [Note] InnoDB: Uses event mutexes 2017-04-20T18:55:21.979256Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-04-20T18:55:21.979259Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.8
2017-04-20T18:55:21.979262Z 0 [Note] InnoDB: Using Linux native AIO 2017-04-20T18:55:22.004800Z 0 [Note] InnoDB: Number of pools: 1
2017-04-20T18:55:22.060762Z 0 [Note] InnoDB: Using CPU crc32 instructions
2017-04-20T18:55:22.104584Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2017-04-20T18:55:24.184701Z 0 [Note] InnoDB: Completed initialization of buffer pool
2017-04-20T18:55:24.210160Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2017-04-20T18:55:26.405242Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2017-04-20T18:55:27.508456Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 35288448161
2017-04-20T18:55:27.508478Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 35288448170
2017-04-20T18:55:27.508630Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 35288448170
2017-04-20T18:55:27.508634Z 0 [Note] InnoDB: Database was not shutdown normally!
2017-04-20T18:55:27.508637Z 0 [Note] InnoDB: Starting crash recovery.
2017-04-20T18:56:16.516761Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2017-04-20T18:56:16.516785Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2017-04-20T18:56:16.516817Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2017-04-20T18:56:16.621736Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2017-04-20T18:56:16.622203Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2017-04-20T18:56:16.622211Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2017-04-20T18:56:16.622565Z 0 [Note] InnoDB: Waiting for purge to start
2017-04-20T18:56:16.672708Z 0 [Note] InnoDB: 5.7.17 started; log sequence number 35288448170
2017-04-20T18:56:16.672708Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 52462ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2017-04-20T18:56:16.673192Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2017-04-20T18:56:16.702959Z 0 [Note] Plugin 'FEDERATED' is disabled.
2017-04-20T18:56:16.851553Z 0 [ERROR] Function 'archive' already exists
2017-04-20T18:56:16.851568Z 0 [Warning] Couldn't load plugin named 'archive' with soname 'ha_archive.so'.
2017-04-20T18:56:16.851574Z 0 [ERROR] Function 'blackhole' already exists
2017-04-20T18:56:16.851575Z 0 [Warning] Couldn't load plugin named 'blackhole' with soname 'ha_blackhole.so'.
2017-04-20T18:56:16.851578Z 0 [ERROR] Function 'federated' already exits 2017-04-20T18:56:16.851579Z 0 [Warning] Couldn't load plugin named 'federated' with soname 'ha_federated.so'.
2017-04-20T18:56:16.851582Z 0 [ERROR] Function 'innodb' already exists 2017-04-20T18:56:16.851583Z 0 [Warning] Couldn't load plugin named 'innodb' with soname 'ha_innodb.so'.
2017-04-20T18:56:17.044733Z 0 [Warning] Failed to set up SSL because of the following SSL library error: SSL context is not usable without certificate and private key
2017-04-20T18:56:17.044754Z 0 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
2017-04-20T18:56:17.044761Z 0 [Note] - '0.0.0.0' resolves to '0.0.0.0';
2017-04-20T18:56:17.044779Z 0 [Note] Server socket created on IP: '0.0.0.0'. 2017-04-20T18:56:18.483575Z 0 [Note] Event Scheduler: Loaded 0 events
2017-04-20T18:56:18.483706Z 0 [Note] Executing 'SELECT * FROM INFORMATION_SCHEMA.TABLES;' to get a list of tables using the deprecated partition engine. You may use the startup option '--disable-partition-engine-check' to skip this check.
2017-04-20T18:56:18.483716Z 0 [Note] Beginning of list of non-natively partitioned tables
2017-04-20T18:56:25.478293Z 0 [Note] InnoDB: Buffer pool(s) load completed at 170420 13:56:25
2017-04-20T18:56:26.091240Z 0 [Note] End of list of non-natively partitioned tables
2017-04-20T18:56:26.091423Z 0 [Note] /usr/sbin/mysqld: ready for connections. Version: '5.7.17-0ubuntu0.16.04.2' socket: '/var/run/mysqld/mysqld.sock' port: 3306 (Ubuntu)
2017-04-20T18:56:26.155810Z 4 [ERROR] /usr/sbin/mysqld: Table './example/wp_options' is marked as crashed and should be repaired
2017-04-20T18:56:26.155889Z 5 [ERROR] /usr/sbin/mysqld: Table './example/wp_options' is marked as crashed and should be repaired
2017-04-20T18:56:26.156037Z 4 [Warning] Checking table:'./example/wp_options'
2017-04-20T18:56:35.816730Z 4 [ERROR] /usr/sbin/mysqld: Table './example/wp_usermeta' is marked as crashed and should be repaired
2017-04-20T18:56:35.816875Z 4 [Warning] Checking table: './example/wp_usermeta'

score 2 · Accepted Answer · answered Apr 23 '17 at 04:28

The last two bullet points indicate the problem. Your bad blocks suspicion seems well founded.

File error when trying to move the VM
Badblocks crashing at about the same spot every time.

While the server is running, dump the databases to a file on the host OS. Since the server is crashing, and you don't know exactly what table, database, or record it's accessing when it does go down, take the time to dump each database, maybe even each table, separately. Hopefully the bad blocks don't occur within your data, but within some file the system is trying to use. In any case, if one of the dumps triggers a crash, twice if you want to double check it, then consider that table or database as suspect and review it by hand as best you can.

Then create a new VM, on a different physical disk, with all the needed installs. Import the dumped data, including the inspected version of any suspect data. On all tables do some random sanity checks with the data, especially for the tables built from any suspect dumps. Then do whatever level of testing you deem appropriate to ensure that the new VM and database are working properly and has valid data.

Make the new VM the "live" server, retire the old VM, and begin backup/recovery for the rest of the physical drive that held the crashing server VM. Once you have retrieved all, or all available, data from that disk, you can determine its health (suspect) and whether or not you are willing to trust it with further use for important data. Maybe it can be used as a place to place your /tmp directory, or other transient structures, or as swap space, freeing up space on another, presumably good, disk for important data.

Since I know the bad sector from the fsck (added after your answer), it seems there must be an easier way to isolate the bad sector and move the VM to a new drive. Any idea how I can do that? — EfficionDave, Apr 23 '17 at 14:12
The crash history indicates that the bad sector is in the VM's disk image. So _moving_ the image file will still try to read that sector, and fail, every time. You can _try_ copying the MySQL data files directly, from the running VM to a physical disk, and use those to shorten the reset on a new VM image. Doing that you will probably find which database/table is stored in that sector, and only have to rebuilt that one. Still, if data integrity is important, you can only take so many shortcuts. All the time spent finding work-arounds could be better spent provisioning the new VM. — , Apr 23 '17 at 20:34

MySql: Daily crashes for the last 3 weeks on beefy server

1 Answers1