Extracting Text from HTML Files

In Windows, how would I go about parsing a folder full of HTML files and extracting all the strings between a particular tag pair?

Ideally, this would all go into a CSV file, with one field for the filename, a second field for each string (say, everything within an H2 tag), and one or more records from each file.

boobounder

Posted 2019-03-07T15:26:26.463

Reputation: 11

HTML and regex are not good friends. Use a parser; it is simpler, faster, and much more maintainable. Parsing HTML with regex is a hard job.

– Toto – 2019-03-07T15:43:45.803

I am better at regex; it is the looping through the folder that I don't know how to do. When I looked at the link, I wondered what those code blocks run in. It took me a while to see that it is Perl, which I don't have on my PC and which I haven't coded in since 2003. That doesn't mean it can't be done, but I was thinking of something more like PowerShell or some other Windows tool (I am at a university, and not in the CS department, so the IT people frown on us having language environments installed). – boobounder – 2019-03-07T17:26:10.273
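
For illustration, the task in the question (loop over a folder, pull out every string inside a given tag, write filename/string rows to a CSV) could be sketched roughly as below. This is only a sketch under assumptions: it assumes Python 3 is available, which the asker may not have, and it uses only the standard library (`html.parser`, `csv`, `pathlib`), so nothing extra would need installing. The names `TagTextExtractor`, `extract_tag_text`, and `folder_to_csv` are made up for this example, and the files are assumed to be UTF-8 with `.htm` or `.html` extensions.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class TagTextExtractor(HTMLParser):
    """Collects the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__(convert_charrefs=True)
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.depth = 0           # > 0 while we are inside the target tag
        self.parts = []          # text fragments of the element being read
        self.results = []        # one string per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.parts = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_tag_text(html, tag="h2"):
    """Return a list with the text of every <tag>...</tag> element in html."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return parser.results


def folder_to_csv(folder, out_csv, tag="h2"):
    """Write one CSV record (filename, text) per matching element per file."""
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "text"])
        for path in sorted(Path(folder).glob("*.htm*")):
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            html = path.read_text(encoding="utf-8", errors="replace")
            for text in extract_tag_text(html, tag):
                writer.writerow([path.name, text])
```

A call such as `folder_to_csv(r"C:\some\folder", "headings.csv")` (both paths are placeholders) would then produce one record per H2 element, so a file with three H2 headings contributes three records. Text in nested tags such as `<em>` inside the H2 is kept, and the markup itself is dropped.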

No answers