Since the HTML was assembled from heterogeneous sources, chances are that you're not interested in any scripts it may contain. So your approach looks viable to me.
I believe the most practical and maintainable approach is to make a list of allowed HTML tags and attributes. You most likely don't want any CSS at all, rather than merely not needing it, so the `id`, `style` and `class` attributes can be discarded altogether. Tags like `IFRAME` and attributes such as `onfocus` would very likely not be on the list at all.
Once you have this, you can run your HTML through a parser that will drop any unknown tags and attributes. You may want to do this in two stages: first a "tagsoup" cleaner (HTML Tidy or TagSoup or HTMLClean or BeautifulSoup - I'd try HTML Tidy first), then a full-fledged HTML parser.
This way you'd end up with well-formed, defanged HTML that can then be manipulated much more easily and safely.
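To make the allowlist idea concrete, here is a minimal sketch in Python. It uses the standard library's `html.parser` as a stand-in for the filtering stage only; it does not repair tag soup, so you would still run something like HTML Tidy first. The `ALLOWED`, `DROP_CONTENT` and `SAFE_SCHEMES` tables and the `sanitize` function are names and values I made up for illustration, so adjust them to your own list.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> permitted attribute names.
# Note that id, style and class appear nowhere, so they are always dropped.
ALLOWED = {
    "p": set(),
    "b": set(),
    "i": set(),
    "ul": set(),
    "li": set(),
    "a": {"href"},
}

# Tags whose *content* is discarded too, not just the tag itself.
DROP_CONTENT = {"script", "style", "iframe"}

# Assumption: only absolute links with these prefixes are kept
# (relative URLs are dropped by this sketch as well).
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # e.g. javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(sanitize('<p class="x" onfocus="evil()">Hi <b>there</b></p><script>alert(1)</script>'))
```

The point of the design is that anything not explicitly listed disappears by default, so a tag or attribute you never thought of cannot slip through the way it could with a blocklist or a regexp.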
I'd avoid regexp matching unless I really had no other alternative.