Stripping and reformatting specific HTML tags from content

0

I'm currently working on a study involving Stack Exchange content and trying to find an efficient way to bring the content into my CAQDAS. The CAQDAS I'm using is DeDoose.

My issue is that its support for HTML is fairly poor: it doesn't keep the formatting, so the imported content is a mess. I've tried various options like printing to PDF (but the PDF support is pretty bad too). So I'm looking to turn the content into rich text.

I have thousands of pages to process. I checked out a Chrome extension called ReaderView. It does a pretty good job but removes some things I need, like formatting and user information. Currently, the best option I can think of with my skills is copy-pasting each page's HTML into InDesign and creating a script using GREP to get rid of the stuff I don't need.

My dream app would be an HTML editor where I can just uncheck all the HTML/CSS stuff I don't want and export as RTF or something along those lines...

Do any of you know of a better way to do this?

This looks like a good lead but I'm not Python savvy at all...

curious

Posted 2017-07-28T22:36:50.467

Reputation: 161

Answers

1

Well, I ended up solving my own problem, though this question did earn me the tumbleweed badge... not sure if that's a good thing!

I managed to get what I wanted by using the Selenium API: I selected the elements I needed with different selectors and extracted each one as HTML or plain text, depending on what I needed.
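
For anyone landing here later, below is a rough sketch of the "keep only the tags I want" step. It is not the Selenium script described above: to stay self-contained it swaps in Python's standard-library `html.parser`, and the names `KEEP_TAGS`, `DROP_WITH_CONTENT`, `TagFilter` and `clean_html` are made up for illustration. It assumes you already have each page's HTML as a string, and that dropping all attributes except `href` on links is acceptable.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags whose formatting should survive the cleanup.
KEEP_TAGS = {"p", "b", "strong", "i", "em", "code", "pre", "a",
             "ul", "ol", "li", "blockquote", "h1", "h2", "h3"}
# Elements that are removed together with everything inside them.
DROP_WITH_CONTENT = {"script", "style"}


class TagFilter(HTMLParser):
    """Re-emits only whitelisted tags; text inside other tags is kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the cleaned document
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP_TAGS and not self.skip_depth:
            # Attributes (class, style, ...) are dropped, except href on links.
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append(f'<a href="{escape(href, quote=True)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP_TAGS and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so the output is still valid HTML.
            self.out.append(escape(data, quote=False))


def clean_html(html):
    """Return html with non-whitelisted tags and all styling removed."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<div class="post"><script>var x = 1;</script>'
          '<p style="color:red">Hello <b>world</b> &amp; <span>all</span></p>'
          '<a href="/users/1" class="u">curious</a></div>')
print(clean_html(sample))
```

Editing `KEEP_TAGS` is the closest this gets to "unchecking" the HTML you don't want. Converting the cleaned HTML to RTF would still need a separate tool, which this sketch does not cover.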

curious

Posted 2017-07-28T22:36:50.467

Reputation: 161