Utility or Script to Split Large HTML Files

I have a workflow that generates HTML files in excess of 10 MB. I would like to split those files into smaller ones, closer to the 1 MB range.

The HTML file has internal links and anchors in it, so whatever utility I use would need to alter the <a href="../foo">foo</a> and the <a id="fooID"></a> appropriately to use the names of the smaller files.

What I'd imagine doing is having some utility break the file at a predetermined location, probably each point.

Does anyone know of a command-line utility or script that automates this?

UPDATE: I ended up creating my own bash script to do this. I regret I can't share the code, but I'll paraphrase it:

  • Create a subdirectory to hold the resulting split HTML files.
  • Split the file (I used XSLT 2.0):
    • I put the original <head> in all the files.
    • I used an XPath expression that split the file on a specific block-level tag.
  • Recurse (XSLT 2.0) through the directory of new files to create a temporary XML file with two fields:
    • the filename
    • the @id found
  • Recurse (XSLT 2.0) through the directory of new HTML files again. Each @href is compared to the XML file of @id/filename pairs, and the href is corrected to point at the new filename if needed.

It works so far, though it's not an answer to this question because it isn't a purpose-built utility available to all.
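
For anyone wanting to try the last two steps without XSLT, here is a rough Python sketch of the @id/filename lookup and the href correction. It is not the bash/XSLT code described above. It assumes the internal links have the form href="#someID" with double-quoted attributes, and the names build_id_index and fix_hrefs, along with the part1.html/part2.html filenames, are made up for illustration. A real HTML parser would be more robust than regular expressions.

```python
import re

# Assumption: attributes are double-quoted and internal links look like href="#id".
ID_RE = re.compile(r'\sid="([^"]+)"')
HREF_RE = re.compile(r'href="#([^"]+)"')


def build_id_index(pages):
    """Map each id value to the name of the split file that contains it."""
    index = {}
    for filename, html in pages.items():
        for anchor_id in ID_RE.findall(html):
            # Keep the first file seen if an id is (incorrectly) duplicated.
            index.setdefault(anchor_id, filename)
    return index


def fix_hrefs(pages, index):
    """Rewrite href="#id" to href="file.html#id" when the id lives in another file."""
    fixed = {}
    for filename, html in pages.items():
        def repl(match, current=filename):
            target = index.get(match.group(1))
            if target is None or target == current:
                return match.group(0)  # unknown id or same file: leave untouched
            return f'href="{target}#{match.group(1)}"'
        fixed[filename] = HREF_RE.sub(repl, html)
    return fixed


if __name__ == "__main__":
    # Hypothetical contents of two already-split files.
    pages = {
        "part1.html": '<a href="#fooID">foo</a>',
        "part2.html": '<a id="fooID"></a>',
    }
    print(fix_hrefs(pages, build_id_index(pages))["part1.html"])
```

In practice the pages dictionary would be filled by reading every file in the subdirectory of split files, and the corrected strings would be written back out under the same names.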

Paulb

Posted 2012-11-26T12:14:47.907

"I have a workflow that generates HTML" what is this workflow? An application perhaps? – siebz0r – 2012-11-26T12:42:57.320

HTML has tags that start at the beginning of the document, and end at the end - such as html and body. So how do you propose that you get a sane portion of a document when you slice it in the middle? – Paul – 2012-11-26T13:06:16.103

siebz0r: The "workflow" that creates the HTML is a database application. It's really not in my authority to change that, so I'll have to live with getting the large file and splitting it myself. – Paulb – 2012-11-27T10:53:57.663

Paul: I don't have a good answer to your Q at the moment. There are patterns in the text. For example, maybe keep everything from the top of the file ('<html>' to '<body>') in memory, split the file from '<body>' to '</body>', beginning and ending a new file at every point where an '<h3>' is found, and manually paste '</body>' and '</html>' at the end of each new smaller file. Seems basic, which made me think there was something that already did this. – Paulb – 2012-11-27T11:01:01.710
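
The idea in the comment above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the input contains exactly one <body>...</body> pair, every '<h3>' sits directly inside <body> (an '<h3>' nested in an open <div> or <table> would leave unbalanced markup in the pieces), and the function name split_at_h3 is hypothetical.

```python
import re


def split_at_h3(html):
    """Return a list of complete HTML documents, one per <h3> section."""
    body_open = re.search(r"<body\b[^>]*>", html, flags=re.IGNORECASE)
    body_close = re.search(r"</body\s*>", html, flags=re.IGNORECASE)
    if body_open is None or body_close is None:
        raise ValueError("expected both <body> and </body> in the input")

    # Everything from the top of the file through the opening <body> tag
    # is kept in memory and repeated at the start of every piece.
    preamble = html[:body_open.end()]
    body = html[body_open.end():body_close.start()]

    # Zero-width lookahead: split *before* each <h3> so the heading stays
    # at the start of its own chunk (needs Python 3.7+ for empty matches).
    chunks = re.split(r"(?=<h3\b)", body, flags=re.IGNORECASE)

    # Close each piece with the tags that were cut off by the split.
    return [preamble + chunk + "</body></html>" for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = (
        "<html><head><title>t</title></head><body>"
        "<p>intro</p><h3>A</h3><p>a</p><h3>B</h3><p>b</p>"
        "</body></html>"
    )
    for number, piece in enumerate(split_at_h3(sample), start=1):
        print(number, len(piece))
```

Any content before the first '<h3>' becomes its own piece. Writing each piece to a numbered file and then fixing the internal links would be separate steps.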

No answers