Archiving several hundred webpages to a doc or PDF


I work for a university library and have to archive our University Bulletin (the rules, regulations, departments, courses, etc.). I have done the most recent one, but now have to do every year back to 1997. The 2014-15 bulletin was 3,100 pages long and took around 1.5 weeks to put together. I did each page one at a time, copied it into Word, formatted it a bit, and then saved all the Word docs to PDF. I need the documents to be searchable, and to maintain some type of table of contents/bookmarks so they can be easily navigated. I found a way to download an HTML file for each department page by grabbing the links through Inspect Element in the browser and then using HTTrack to download them.

But now I face basically the same issue as going to each site and copy-pasting text: there's no good way to do it. I have Acrobat X, but that doesn't maintain any of the bookmarks/headings, and it's pretty tough to maintain a similar look from page to page. I also keep having issues with getting "clean" documents that I can turn into PDF/A. I'm just looking to see if anyone has dealt with something similar or can think of a solution I could put together. I can upload the 2014-15 file if it helps. I don't even know if this is the best Exchange to use. Any help would be greatly appreciated!
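For reference, this is roughly the link-grabbing step as a script instead of Inspect Element; the bulletin URL and the "department" filter below are placeholders, not our actual site. It writes the links to a plain-text list (one URL per line) that could then be handed to HTTrack or any other downloader:

    # Sketch: collect department links from the bulletin index page and
    # write them to a text file, one URL per line.
    # INDEX_URL and the "department" filter are placeholders.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    INDEX_URL = "http://bulletin.example.edu/2014-15/"  # placeholder URL

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # Keep only anchor tags whose href looks like a department page.
            if tag == "a":
                href = dict(attrs).get("href")
                if href and "department" in href:  # adjust filter to the real site
                    self.links.append(urljoin(INDEX_URL, href))

    page = urlopen(INDEX_URL).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(page)

    with open("bulletin-links.txt", "w") as f:
        f.write("\n".join(sorted(set(collector.links))))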

crabab11

Posted 2015-01-20T20:29:43.260

Reputation: 19

Question was closed 2015-01-22T03:06:25.097

Answers


You might use the WebCapture feature of Acrobat Pro, with appropriate settings (stay on same server, stay within branch, etc.).

Max Wyss

Posted 2015-01-20T20:29:43.260

Reputation: 1,481

It sort of works, but the site is apparently too large to capture, and there's no way to do it in multiple sets. I get an "Out of Memory" error after around 300 pages. The way the site is set up, you can't just choose a section and capture that. It's all XML and doesn't seem to have a simple (tree?) structure. If I can capture all the links with Inspect Element, I can then download all the HTML files. If I put local links to those into a single HTML file, could I then run Acrobat to capture the entire local "site"? – crabab11 – 2015-01-21T18:53:13.920
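A minimal sketch of the wrapper index the comment describes, assuming HTTrack saved the pages into a local folder (the folder name "httrack_output" and the use of filenames as link text are placeholders). Acrobat's WebCapture could then be pointed at the single local index.html as the "site" root:

    # Sketch: build one index.html whose links point at every downloaded
    # page, so a capture tool can start from a single local root.
    # "httrack_output" is a placeholder for wherever HTTrack saved the pages.
    from pathlib import Path

    root = Path("httrack_output")
    pages = sorted(p for p in root.rglob("*.html") if p.name != "index.html")

    # One list item per downloaded page, linked by relative path.
    items = "\n".join(
        f'  <li><a href="{p.relative_to(root).as_posix()}">{p.stem}</a></li>'
        for p in pages
    )

    (root / "index.html").write_text(
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        "<title>University Bulletin 2014-15</title></head>\n"
        "<body>\n<h1>University Bulletin 2014-15</h1>\n"
        f"<ul>\n{items}\n</ul>\n</body></html>",
        encoding="utf-8",
    )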