How to extract terms from an HTML document

2

I have an HTML document filled with terms that I need to put into a spreadsheet.

They follow this basic pattern:

<ul>
     <li class="name"><a href="spot.html">Spot</a></li>
     <li class="type">Dog</li>
     <li class="color">Red</li>
</ul>
<ul>
     <li class="name"><a href="mittens.html">Mittens</a></li>
     <li class="type">Cat</li>
     <li class="color">Brown</li>
</ul>
<ul>
     <li class="name"><a href="squakers.html">Squakers</a></li>
     <li class="type">Little Parrot</li>
     <li class="color">Rainbow</li>
</ul>

It's very consistent.

I need to extract the string within the li.name a (so, "Spot") but only if the type is "Dog" or "Parrot", and put them in a spreadsheet.

I've been trying to use Sublime Text's ability to Find with regex, but I'm really struggling, and since regex and HTML usually don't play nice, I was wondering if there is a better and easier way to accomplish this. Thanks.

bookcasey

Posted 2012-06-21T14:28:18.557

Reputation: 207

Answers

4

Here's a JavaScript implementation that actually uses the DOM: for each list it checks the item with the type class and, if that item contains one of the wanted words, writes out the contents of the item with the name class. If more types are necessary, just add them to the searchfor variable with a pipe (|) separating them.

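// Types to match; add more alternatives separated by a pipe (|).
var searchfor = /Dog|Parrot/i;

// The matching names are written to a new window.
var win = window.open();

var lists = document.body.getElementsByTagName("ul");

// Index-based loops: for...in over an HTMLCollection can also visit
// non-element properties such as "length".
for (var i = 0; i < lists.length; i++) {
    var points = lists[i].getElementsByTagName("li");

    for (var j = 0; j < points.length; j++) {
        // Is this <li> the one with the "type" class?
        if ((" " + points[j].className + " ").indexOf(" type ") > -1) {
            if (searchfor.test(points[j].innerHTML)) {
                // Find the "name" <li> in the same list and write it out.
                for (var k = 0; k < points.length; k++) {
                    if ((" " + points[k].className + " ").indexOf(" name ") > -1) {
                        win.document.writeln(points[k].innerHTML + "<br />");

                        break;
                    }
                }
            }
        }
    }
}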

Tested on jsFiddle: http://jsfiddle.net/wdR5Y/

The easiest way to use it is to convert it to a bookmarklet with something like this: http://userjs.up.seesaa.net/js/bookmarklet.html

As JavaScript, it's OS independent and supported by most popular web browsers.

How to import the output into a spreadsheet depends on your spreadsheet application, but copy and paste is often enough (the output is written to a new window).
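
If copy and paste is not enough, most spreadsheet applications can open a CSV file instead. The following is only a rough sketch of building CSV text from rows of values; toCsv is a hypothetical helper name, not part of the script above, and the sample rows are taken from the question's HTML.

```javascript
// Hypothetical helper: turn rows (arrays of values) into CSV text.
function toCsv(rows) {
    return rows.map(function (row) {
        return row.map(function (field) {
            var text = String(field);
            // Quote fields containing a comma, a double quote or a line
            // break, doubling any embedded double quotes.
            if (/[",\r\n]/.test(text)) {
                text = '"' + text.replace(/"/g, '""') + '"';
            }
            return text;
        }).join(",");
    }).join("\r\n");
}

// Sample rows from the question's HTML (name, type).
console.log(toCsv([["Spot", "Dog"], ["Squakers", "Little Parrot"]]));
```

Saving the resulting text with a .csv extension should let the spreadsheet application open it directly.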


If it were ID, not class, this would have been a fair bit easier... ah well. Credit to a Stack Overflow answer for getting the element by class name.
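
In browsers that support querySelector and querySelectorAll, the class lookup can be written with CSS selectors instead. This is a minimal sketch, assuming the markup matches the sample in the question; extractNames is a hypothetical helper name, and it returns the link text rather than writing to a new window.

```javascript
// Hypothetical helper: collect the link text of every <ul> whose
// li.type text matches `pattern`. Assumes the markup from the question.
// Pass a pattern without the "g" flag, since test() on a global regex
// keeps state between calls.
function extractNames(root, pattern) {
    var names = [];
    var lists = root.querySelectorAll("ul");

    for (var i = 0; i < lists.length; i++) {
        var type = lists[i].querySelector("li.type");
        var link = lists[i].querySelector("li.name a");

        // Skip lists that lack either element.
        if (type && link && pattern.test(type.textContent)) {
            names.push(link.textContent.trim());
        }
    }
    return names;
}

// In a browser, on the page to parse: print one name per line.
if (typeof document !== "undefined") {
    console.log(extractNames(document, /Dog|Parrot/i).join("\n"));
}
```

Passing the root element as an argument keeps the function usable on any part of the page, not only on document.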

Bob

Posted 2012-06-21T14:28:18.557

Reputation: 51 526

Thanks, Bob! I can't get the bookmarklet to work, but the idea still applies! – bookcasey – 2012-06-21T17:43:04.843

@bookcasey It works for me™ with Firefox, Chrome or Opera with your sample HTML. Just add the bookmarklet as a bookmark, and use it on the page you want to parse. If your sample HTML doesn't match the real one, then I can't guarantee anything (perhaps you can modify it yourself?). – Bob – 2012-06-22T03:07:18.180

7

Don't use regex to parse XML or HTML; use an XML or HTML parser.

Another approach is to convert the XML or HTML to text, then use grep.

See Application for extracting XML tags from a document
See Is there a native tool for parsing xml files available on RedHat?
See Scripting: what is the easiest to extract a value in a tag of a XML file?

RedGrittyBrick

Posted 2012-06-21T14:28:18.557

Reputation: 70 632