Convert Large HTML Table to CSV

4

1

I've got some data (basic personal details) which I need to export from an old legacy system to a newer system (MySQL DB). It's approximately 200MB split across 6 files.

The system I'm exporting from can export all the required data as an Excel file. However, its method of creating an Excel file is to create an HTML table and save it with a .xls extension. When viewed in a text editor, the files just look like this (except 600,000 lines long):

<table>
<tr>

            <td class="tableh">
                <b>Birth&nbsp;Date</b>
            </td>

            <td class="tableh">
                <b>First&nbsp;Name</b>
            </td>

Much to my surprise Excel can actually open such a document. Unfortunately it takes about half an hour to do so (per file), and almost as long to save it into a proper format.

Does anyone have any suggestions for converting these (faster than the 45-60mins it takes Excel)?

Dracs

Posted 2013-07-10T07:28:58.740

Reputation: 2 616

open it in a browser and copy/paste from there to excel maybe? – mcalex – 2013-07-10T07:58:51.307

@mcalex Just gave it a shot, Chrome crashed when trying to copy it all. – Dracs – 2013-07-10T08:16:11.830

1Could you specify "some data" and "one system" and "another" ? If the data is in an HTML file, you could use the import tools to show it in Excel and then save as CSV. you would not normally muck around with HTML source in Excel. – teylyn – 2013-07-10T09:09:42.943

@teylyn I've clarified my question a bit. I just tried using Excel's Data import tools. It took ~20 minutes for one file. It's a small improvement, but it did seem more stable doing it. – Dracs – 2013-07-10T10:35:55.827

I would try changing the extension to .html, open it in Excel, then Save As a csv file. If this didn't behave I would use a good editor (Notepad++ or Sublime Text) to clean up (remove stuff!) in the HTML file before opening it in Excel again. – Andy G – 2013-07-10T18:41:29.350

@AndyG Sorry should have specified, I renamed the files to .html when doing the Excel data import. I just had a go at using some regex with Sublime to convert it to CSV. Had to force close it though after about half an hour of it processing. – Dracs – 2013-07-10T23:15:01.780

Answers

2

Is it just a single table? Are you sure your system cannot export as CSV? (It seems unusual that it can export to Excel, or HTML, but not CSV.)

I would be tempted to split the file(s) up (again) and then re-assemble them later. In ST or another editor, I would first replace text that doesn't require regex. A regex replace is likely much slower than a standard replace, so I would use standard replace as much as possible, even just to reduce the number of regex replacements needed. I would experiment a little to work out how long the whole process might take.
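The split-and-reassemble idea can be sketched in a few lines of Python; the `.partN` naming and the chunk size are just assumptions for illustration:

```python
# Rough sketch: break the huge export into fixed-size line chunks so an
# editor can cope with each piece; reassembly is simple concatenation.

def write_chunk(path, index, lines):
    """Write one chunk of lines to <path>.part<index>."""
    with open(f"{path}.part{index}", "w", encoding="utf-8") as out:
        out.writelines(lines)

def split_file(path, lines_per_chunk=100_000):
    """Split `path` into chunks of at most `lines_per_chunk` lines each.

    Returns the number of chunk files written.
    """
    chunk, count = [], 0
    with open(path, encoding="utf-8") as src:
        for line in src:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                write_chunk(path, count, chunk)
                count += 1
                chunk = []
    if chunk:  # leftover partial chunk
        write_chunk(path, count, chunk)
        count += 1
    return count
```

Splitting on line boundaries (rather than byte offsets) keeps each `<td>` line intact, which matters if you then run replacements on the pieces.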

Alternatively, I might attempt to convert the file to XML format, adding tags at the beginning and, again, doing a bit of replacing. Then skip Excel altogether; I found this for MySQL:

LOAD XML LOCAL INFILE '/pathtofile/file.xml' 
INTO TABLE my_tablename(personal_number, firstname, ...);

to import XML. LOAD XML INFILE is available as of MySQL 5.5.
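The HTML-to-XML conversion could be sketched in Python using the `<field name="...">` row format that LOAD XML accepts. The column names below are assumptions, and the regexes assume the export is as regular as the sample in the question:

```python
# Sketch: turn the exported HTML table into XML that LOAD XML can read.
# Column names (birth_date, first_name) are assumptions; adjust them to
# match the real export, and skip the header row before loading.
import re
from html import unescape

COLUMNS = ["birth_date", "first_name"]

def table_to_loadxml(html):
    out = ["<resultset>"]
    for row in re.findall(r"<tr[^>]*>(.*?)</tr>", html, flags=re.S):
        cells = re.findall(r"<td[^>]*>(.*?)</td>", row, flags=re.S)
        # Strip inner tags (e.g. <b>) and collapse whitespace runs.
        cells = [re.sub(r"<[^>]+>|\s+", " ", unescape(c)).strip() for c in cells]
        fields = "".join(
            f'<field name="{name}">{value}</field>'
            for name, value in zip(COLUMNS, cells)
        )
        out.append(f"<row>{fields}</row>")
    out.append("</resultset>")
    return "\n".join(out)
```

The resulting file can then be fed to the LOAD XML statement above, bypassing Excel entirely.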

You might wait to see whether someone else contributes. Good luck.

Andy G

Posted 2013-07-10T07:28:58.740

Reputation: 231

I might even search for an application that would convert the HTML to suitable XML for me. – Andy G – 2013-07-10T23:48:55.180

I should have mentioned that the system does support CSV. But it gives no way of encapsulating or escaping commas already in the data (which are reasonably common in my data set). I've already done the conversion using Excel on another computer at work (so the question is academic at this point). But I did have a look into the XML import options and it seems like it would be flexible enough to work. Unfortunately I can only access the DB via phpMyAdmin, which was giving 'Out of Memory' errors when I tested it. – Dracs – 2013-07-12T00:15:50.233

2

Is a non-Excel solution also suitable? If so, how fast are your upload and download speeds?

This online service did a good job in my quick test.
Depending on your Internet connection, it might be faster than Excel.

Input file

<table>
<tr>
            <td class="tableh">
                <b>Birth&nbsp;Date</b>
            </td>
            <td class="tableh">
                <b>First&nbsp;Name</b>
            </td>
</tr>
</table>

Output file

Birth Date, First Name
  • You can choose between different delimiters (comma, semicolon, vertical bar, tabulation)

Of course, you have to ask yourself if those personal details can be uploaded to an online service or not. I can't tell you that.

nixda

Posted 2013-07-10T07:28:58.740

Reputation: 23 233

I tried some test data myself and it does seem to work quite well. But yes as you pointed out, due to the personal information, I'd be uncomfortable uploading it to third party. – Dracs – 2013-07-11T23:55:15.863

0

Sorry, I don't have code to share, but maybe your best strategy is to cut out the Excel part of the process and go straight to comma-delimited text.

Two ideas:

  1. Make the HTML file a proper XML file (should only need a couple of header tags and an overall wrapper tag), make an XSLT stylesheet (should be very simple), and run the XML file through an XSLT processor like xsltproc.

  2. If you're on Linux / Unix, consider writing a script using a stream-processing tool like awk or sed to strip out the tags and output the fields.

In either case the logic would be something like:

  • if you encounter a <tr>, output a newline
  • if you encounter a <td> </td> pair, output its contents enclosed in quotes
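The logic above can be sketched with Python's standard-library HTML parser (not XSLT or awk, but the same idea; the quoting rule and the stripping of inner tags like `<b>` are assumptions):

```python
# Sketch of the logic above: each </tr> ends a CSV line, and each
# <td>...</td> pair becomes one double-quoted field. Quoting every
# field protects the commas already present in the data.
from html.parser import HTMLParser

class HtmlTableToCsv(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []       # finished CSV lines
        self.fields = []      # fields of the current row
        self.in_cell = False
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
            # Escape embedded double quotes, then wrap the field in quotes.
            text = "".join(self.buf).strip().replace('"', '""')
            self.fields.append(f'"{text}"')
        elif tag == "tr" and self.fields:
            self.lines.append(",".join(self.fields))
            self.fields = []

    def handle_data(self, data):
        if self.in_cell:  # inner tags like <b> are skipped automatically
            self.buf.append(data)

def to_csv(html):
    parser = HtmlTableToCsv()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Because the parser streams through the input tag by tag, it should handle a 600,000-line file without loading a DOM, which is where browsers and editors were choking.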

There's a great XSLT community on SO; it's worth posting there for code samples.

user257337

Posted 2013-07-10T07:28:58.740

Reputation: 1