What should I look for, and where, when renting a server for high-memory computational needs?

0

I am in the process of parsing machine-readable data from Wiktionary into a SQL database using wikokit, but have since realized that, according to wikokit's own time estimate, it will take over a month to finish (53,499 minutes, roughly 37 days). This is after following general recommendations for increasing MySQL performance, including running mysqltuner and following its directions.

I am currently using my personal computer to parse Wiktionary. Its specs are a 4.2 GHz CPU, 8 GB of RAM, and a 1 TB HDD. Very naively calculated, I imagine a 256 GB RAM server would be able to parse Wiktionary in about a day. I think I need to rent a server that is high in RAM, average in CPU, and minimal in HDD space, billed per second, for a day or two. However, I'm not sure whether that would even work (would scaling up memory by roughly 30x really help?), where to find such a thing, or even what to google to get started.
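The naive estimate above can be written out as a short Python sketch. It assumes, very optimistically, that runtime shrinks in direct proportion to added RAM, which is exactly the assumption in doubt; the variable names are only illustrative.

```python
# Figures taken from the question.
estimated_minutes = 53499      # wikokit's time estimate on the current machine
current_ram_gb = 8
rented_ram_gb = 256

current_days = estimated_minutes / (60 * 24)

# ASSUMPTION: runtime scales inversely with RAM. Nothing in wikokit or MySQL
# guarantees this; it is only the naive model described in the text.
ram_factor = rented_ram_gb / current_ram_gb
naive_days = current_days / ram_factor

print(f"current estimate: {current_days:.1f} days")
print(f"RAM factor: {ram_factor:.0f}x")
print(f"naive estimate on the rented server: {naive_days:.2f} days")
```

If the parser is actually limited by disk or by row-by-row inserts rather than by memory, the real speedup could be far smaller than this.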

Some more details: Wikokit works by taking a raw, unparsed SQL database of Wiktionary, parsing it through its Java code using Connector/J, then loading it into a formatted SQL database that starts off empty. Essentially raw_enwikt -> wikt_parser.java -> parsed_enwikt. The guide can be found here.

I'm not very experienced with servers, so I am also unsure whether I would be able to set up MySQL and run Java code on any arbitrary rented server. I need the parsed database fairly soon, within the next 3-4 days.

stripedneck

Posted 2019-07-26T20:51:23.083

Reputation: 3

Question was closed 2019-07-27T19:35:10.287

What is the model number of your CPU? (We need more information than just the clock speed, such as the thread count.) – zandermar18 – 2019-07-26T20:55:17.553

This question does not appear to be about computer software or computer hardware within the scope defined in the help center. Amazon and Azure can both be used for computational tasks. However, a 256 GB virtual machine will be extremely expensive. You might be better off building your own server. – Ramhound – 2019-07-26T20:55:24.130

Should I ask on Server Fault? I posted here because I am not an IT professional by any means and was hoping personal MySQL use is covered under SU. The CPU is an i5-4670K OC'd to 4.2 GHz. – stripedneck – 2019-07-26T21:07:50.523

To put it into perspective, a 64-thread, 256 GB Azure VM would be $4.5K per month. You could build a similar server for the same price. – Ramhound – 2019-07-26T21:22:33.337

Server Fault will not make service recommendations. There isn’t an SE site for that type of question. Your personal computer is indeed within scope, but your question isn’t about your current configuration; it’s about whether a VPS would be appropriate for this task (a matter of opinion). – Ramhound – 2019-07-26T21:24:47.027

Answers

1

I had a quick look at the parser pages. I'm not sure if you need a totally up-to-date copy, but it looks like the bulk of the hard work has been done for you. Their page at http://whinger.krc.karelia.ru/soft/wikokit/index.html has a downloadable SQL file that you can just load into the database. It's already been run through their parser. The big downside is that it's from 2015/16.

If you must have an up-to-date version, then a 256GB system looks like massive overkill to me, and possibly solves the wrong problem. Beyond a certain point, extra memory just doesn't make a difference. I'd be tempted to try a smaller 16GB or 32GB system, but run the parser, its input files and MySQL from a ramdrive and see if that goes faster. Mostly that takes out any filesystem bottlenecks. Be warned: if the machine crashes, reboots or runs out of memory, all the current progress is lost.
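Before trying the ramdrive approach, it may be worth checking that the files would actually fit. The following is a minimal Python sketch of that check; the function names and the default of keeping half the RAM free for MySQL and the Java parser are assumptions for illustration, not recommendations from wikokit or MySQL.

```python
import os


def tree_size_bytes(root):
    """Return the total size in bytes of regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_ramdisk(paths, ram_bytes, reserve_fraction=0.5):
    """Check whether the directories in `paths` fit in the RAM left for a ramdrive.

    `reserve_fraction` is the (assumed) share of RAM kept free for the
    database server and the parser themselves.
    """
    budget = ram_bytes * (1.0 - reserve_fraction)
    needed = sum(tree_size_bytes(p) for p in paths)
    return needed <= budget
```

For example, `fits_in_ramdisk(["/path/to/mysql-data", "/path/to/parser-input"], 32 * 1024**3)` (placeholder paths) would indicate whether those two directories fit in half of a 32GB machine. Keep in mind that the parsed database grows as the parser runs, so the final size matters more than the starting size.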

The other thing I would try is taking the database part out of the parser. This assumes that the MySQL writes are the slow bit. Have the Java parser write all its SQL commands to a text file instead. Then, taking a leaf from mysqldump's book, turn off all of MySQL's key processing, load the data, and turn the keys and constraints back on. That will be considerably faster than processing each row one by one. This won't work if the database has autonumbered keys used in relations, though.
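The dump-then-bulk-load idea can be sketched roughly as follows. This is a Python illustration rather than wikokit's actual Java code, and the function names, the `page` table, its columns and the batch size are all hypothetical. The generated script turns off unique and foreign-key checks and wraps the data in `ALTER TABLE ... DISABLE KEYS`, much as a mysqldump file does, and additionally loads everything in one transaction. Note that `DISABLE KEYS` only affects non-unique indexes on MyISAM tables; on InnoDB it has no effect.

```python
import io


def sql_literal(value):
    """Render a Python value as a SQL literal (simplified escaping, for illustration)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def write_bulk_load(out, table, columns, rows, batch_size=1000):
    """Write a SQL script that loads `rows` with checks off and multi-row INSERTs."""
    # Switch off per-row work before loading.
    out.write("SET autocommit=0;\n")
    out.write("SET unique_checks=0;\n")
    out.write("SET foreign_key_checks=0;\n")
    out.write(f"ALTER TABLE `{table}` DISABLE KEYS;\n")

    column_list = ", ".join(f"`{c}`" for c in columns)
    header = f"INSERT INTO `{table}` ({column_list}) VALUES\n"

    def flush(batch):
        # One INSERT statement carrying many rows.
        if batch:
            out.write(header + ",\n".join(batch) + ";\n")

    batch = []
    for row in rows:
        batch.append("(" + ", ".join(sql_literal(v) for v in row) + ")")
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    flush(batch)

    # Rebuild indexes, commit, and restore the checks.
    out.write(f"ALTER TABLE `{table}` ENABLE KEYS;\n")
    out.write("COMMIT;\n")
    out.write("SET unique_checks=1;\n")
    out.write("SET foreign_key_checks=1;\n")


# Hypothetical usage: two rows for an imaginary `page` table.
buffer = io.StringIO()
write_bulk_load(buffer, "page", ["id", "title"], [(1, "cat"), (2, "it's")])
script = buffer.getvalue()
```

Writing `script` to a file and feeding it to the `mysql` command-line client would then load all rows in one pass. As noted above, this only works if the parser does not need to read back auto-generated keys while it runs.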

After writing all of this I also read the code for Main.java and realised it appears to be pulling its data from the live database. An easy optimisation would be to download the latest database snapshot (https://dumps.wikimedia.org/enwiktionary/latest/) and run everything from a local database. Even if it's just dumped onto another machine on the local network, that alone should give you a good speed boost because you're not contending for bandwidth and server time with everyone using Wiktionary.

Greig

Posted 2019-07-26T20:51:23.083

Reputation: 161

I am indeed looking for a more up-to-date parsed database. Could you point me to a resource where I can learn more about running a ramdrive/ramdisk setup on a rented server or VPS? I've never come across such a thing, though the concept makes sense to me.

I'm not sure how I'd go about altering the code to make it print to text, but thank you for providing another idea. As for using the live database, to my knowledge it doesn't pull information from Wiktionary online. Where in the code are you seeing that? – stripedneck – 2019-07-27T06:53:07.297

I'm looking at Main.java and the connect strings, but I haven't followed it back to see exactly where those db strings are defined. If you already have a local db it's probably worth running mysqltuner on it too. I would also experiment with mounting the db on a separate machine as that would take some of the work off the main parsing system. – Greig – 2019-07-27T10:53:32.960

I haven't done much with ramdrives on Linux, but once you have a cloud instance it seems pretty flexible in terms of what you can do with it. You can find instructions for mounting a ramdrive at https://kerneltalks.com/linux/how-to-create-ram-disk-in-linux/ – Greig – 2019-07-27T10:55:23.540