I have a workflow that generates HTML files in excess of 10 MB. I would like to split those 10 MB files into smaller files, more in the 1 MB range.
The HTML file has internal links and anchors in it, so whatever utility I use would need to alter the <a href="../foo">foo</a> and the <a id="fooID"></a> appropriately to use the names of the smaller files.
What I'd imagine doing is having some utility break the file at a predetermined location, probably each point.
Does anyone know of a command-line utility or script that automates this?
UPDATE: I ended up creating my own bash script to do this. I regret I can't share the code, but I'll paraphrase it:
- create sub directory to put the resulting split html files
- split the file (I used xslt 2.0)
  - I put the original <head> in all the files
  - I used an xpath that split the file on a specific block-level tag
- Recurse (xslt 2.0) through the directory of new files to create a temporary XML file with two fields:
  - the filename
  - the @id found
- Recurse (xslt 2.0) through the directory of new html files again. Each @href was compared to the XML file of @id/filename pairs, and the href was corrected to the new filename if needed
It works so far, though it's not an answer to this Q because it isn't a purpose-built utility available to all.
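For anyone who wants to try the last two steps (the id/filename index and the href correction) without XSLT, here is a rough Python sketch. It is not the bash/XSLT script described above. The names `build_id_index` and `fix_links` are made up for illustration, and it assumes the internal links are written as double-quoted fragment links such as href="#fooID"; links in any other form are left untouched.

```python
import re

# Assumption: ids and hrefs are double-quoted, and internal links look like href="#someId".
ID_RE = re.compile(r'(?<![\w-])id\s*=\s*"([^"]+)"')
HREF_RE = re.compile(r'href\s*=\s*"#([^"]+)"')


def build_id_index(chunks):
    """Map each id value to the name of the first split file that contains it."""
    index = {}
    for filename, html in chunks.items():
        for anchor_id in ID_RE.findall(html):
            index.setdefault(anchor_id, filename)
    return index


def fix_links(chunks):
    """Return a copy of chunks with fragment links pointed at the right file.

    chunks is a dict of {filename: html_text} for the already split files.
    """
    index = build_id_index(chunks)
    fixed = {}
    for filename, html in chunks.items():
        def repl(match, current=filename):
            target = index.get(match.group(1))
            # Leave the link alone if the id is unknown or lives in this same file.
            if target is None or target == current:
                return match.group(0)
            return 'href="{}#{}"'.format(target, match.group(1))
        fixed[filename] = HREF_RE.sub(repl, html)
    return fixed
```

Reading the split files into the dict and writing the results back out is left to the caller. A real HTML parser would be more robust than regular expressions if the markup is irregular.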
"I have a workflow that generates HTML" what is this workflow? An application perhaps? – siebz0r – 2012-11-26T12:42:57.320
HTML has tags that start at the beginning of the document, and end at the end - such as html and body. So how do you propose that you get a sane portion of a document when you slice it in the middle? – Paul – 2012-11-26T13:06:16.103
siebz0r: The "workflow" that creates the html is a database application. It's really not in my authority to change that, so I'll have to live with getting the large file and splitting it myself. – Paulb – 2012-11-27T10:53:57.663
Paul: I don't have a good answer to your Q at the moment. There are patterns in the text. For example, maybe keep everything from the top of the file ('<html>' to '<body>') in memory, split the file from '<body>' to '</body>', beginning and ending a new file at every point where an '<h3>' is found, and manually paste a '</body>' and '</html>' at the end of each new smaller file. Seems basic, which made me think there was something that already did this. – Paulb – 2012-11-27T11:01:01.710
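The idea in the last comment could be sketched in Python roughly like this. The function name `split_at_h3` is made up, and the sketch assumes every '<h3>' is a direct child of '<body>'; if a heading sits inside an open '<div>' or table, the pieces will contain unbalanced tags. It also cuts at every '<h3>' regardless of size, so it does not aim for a particular file size.

```python
import re


def split_at_h3(html):
    """Return a list of complete HTML documents, one per <h3> section.

    Everything up to and including the opening <body> tag is copied to the
    top of each piece, and '</body></html>' is appended to the end of each.
    """
    body_open = re.search(r'<body[^>]*>', html, re.IGNORECASE)
    body_closes = list(re.finditer(r'</body\s*>', html, re.IGNORECASE))
    if body_open is None or not body_closes:
        raise ValueError("expected a <body>...</body> pair")

    preamble = html[:body_open.end()]
    body = html[body_open.end():body_closes[-1].start()]

    # Cut points: start of the body, start of every <h3 ...>, end of the body.
    starts = [m.start() for m in re.finditer(r'<h3[\s>]', body, re.IGNORECASE)]
    cuts = sorted(set([0] + starts + [len(body)]))
    pieces = [body[a:b] for a, b in zip(cuts, cuts[1:])]

    # Skip whitespace-only pieces (e.g. nothing before the first <h3>).
    return [preamble + p + '</body></html>' for p in pieces if p.strip()]
```

Each returned string can then be written to its own file, for example part1.html, part2.html and so on, before any links are corrected.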