How to extract text from websites

I am looking for a way to automate extracting text from multiple websites into a Word document. When the text is pasted into the Word document, it needs to be pasted using "Merge Formatting". This is how I would like the extracted text to look after it is pasted into the Word document - Example

I need the text extractor to only extract text starting from the chapter title "Book 1, Chapter 1 – Early Morning at a Township" to the end of the chapter where it says "previous chapter" but not including "previous chapter". I then need it to move to the next chapter until it reaches the end of Book 21 Chapter 44.
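A minimal sketch of that start/stop rule in Python, assuming the page has already been reduced to plain text. The function name `slice_chapter` and the sample string are hypothetical and only illustrate the idea: keep everything from the chapter title up to, but not including, "previous chapter".

```python
import re


def slice_chapter(page_text, start_marker, end_marker="previous chapter"):
    """Return the text from start_marker (inclusive) to end_marker (exclusive)."""
    start = page_text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")

    # Look for the end marker only after the title, ignoring letter case.
    pattern = re.compile(re.escape(end_marker), re.IGNORECASE)
    match = pattern.search(page_text, start + len(start_marker))

    # If the end marker is missing, keep everything to the end of the page.
    end = match.start() if match else len(page_text)
    return page_text[start:end].strip()


# Hypothetical page text, for illustration only.
sample = (
    "Site menu\n"
    "Book 1, Chapter 1 – Early Morning at a Township\n"
    "Some chapter text.\n"
    "Previous Chapter | Next Chapter"
)
chapter = slice_chapter(sample, "Book 1, Chapter 1")
```

Running the same function once per chapter page, with that chapter's title as `start_marker`, would cover the "move to the next chapter" step.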

Coiling Dragon: Book 1 - Chapter 1 going until Coiling Dragon: Book 21 - Chapter 44

Woz

Posted 2016-01-14T03:14:59.230

Reputation: 303

Websites ARE text (markup), so you will be looking for a parser. For a simple HTML-only site, a generic XML parser could be used to extract element bodies, but you would have to write script code to handle the text accumulation, filtering, copying and pasting. For more complicated sites that generate dynamic content server-side or in client scripts, you may find it a little more difficult. Good luck. – Frank Thomas – 2016-01-14T04:37:36.260
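The parser approach in the comment above could be sketched roughly as follows. This swaps the generic XML parser for Python's standard-library `html.parser`, which tolerates markup that is not well-formed XML. The class name `TextExtractor` and the helper `html_to_text` are hypothetical, and the sketch only handles static HTML, not content generated by client scripts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulate visible text, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical markup, for illustration only.
page = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x=1;</script></body></html>"
)
text = html_to_text(page)
```

The filtering and copying steps mentioned in the comment would then operate on the returned text rather than on the raw markup.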

What code have you got so far? – Burgi – 2016-01-14T09:55:41.410

I do not have any code written for this yet; to be honest, I am not even sure how to write scripts. Does anyone have a suggestion for an XML parser? – Woz – 2016-01-22T10:13:28.893

@Wozzie honestly you can do this within Word pretty easily using VBA. – BigElittles – 2016-07-08T17:07:00.870

No answers