Question

What tool (preferably for Linux) can select the content of an HTML element based on its CSS path?

Example

Consider the following HTML document:

<html>
<body>
  <div class="header">
    <h1>Header</h1>
  </div>
  <div class="content">
    <table>
      <tbody>
        <tr><td class="data">Tabular Content 1</td></tr>
        <tr><td class="data">Tabular Content 2</td></tr>
      </tbody>
    </table>
  </div>
  <div class="footer">
    <p>Footer</p>
  </div>
</body>
</html>

What command-line program (e.g., a kind of "cssgrep") can extract values using a CSS selector? That is:

cssgrep page.html "body > div.content > table > tbody > tr > td.data"

The program would write the following to standard output:

Tabular Content 1
Tabular Content 2

Thank you!

Dave Jarvis

Posted 2013-01-06T00:07:36.943

Reputation: 2 126

Answers

Use the W3C HTML-XML-utils (package html-xml-utils) to parse HTML/XML and extract content using CSS selectors. For example:

hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "td.data"

This produces the desired output:

Tabular Content 1
Tabular Content 2

A line length of 240 characters keeps elements whose content fits within that limit from being wrapped across multiple lines; longer content can still be split. The hxnormalize -x command converts the input into a well-formed XML document, which hxselect requires.

Dave Jarvis

Posted 2013-01-06T00:07:36.943

Reputation: 2 126

For macOS users: brew install html-xml-utils. – anishpatel – 2018-05-05T20:11:26.220

CSS Solution

The Element Finder command (elfinder) will partially accomplish this task. For example:

elfinder -j -s td.data -x "html"

This renders the result in JSON format, from which the matching values can then be extracted.

XML Solution

The XML::Twig module ("sudo apt-get install xml-twig-tools") comes with a tool named xml_grep that is able to do just that, provided that your HTML is well-formed, of course.

I'm sorry I'm not able to test this at the moment, but something like this should work:

xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' file.html
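To illustrate the same path-based idea without xml_grep, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is not the xml_grep implementation; the function name xpath_select_text and the PAGE constant are made up for this example. Like xml_grep, it assumes the input is well-formed XML and raises a parse error otherwise. Note that [@class='data'] compares the whole attribute value, so an element with class="data wide" would not match, unlike the CSS selector td.data.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (well-formed, so an XML parser accepts it).
PAGE = """<html>
<body>
  <div class="header"><h1>Header</h1></div>
  <div class="content">
    <table>
      <tbody>
        <tr><td class="data">Tabular Content 1</td></tr>
        <tr><td class="data">Tabular Content 2</td></tr>
      </tbody>
    </table>
  </div>
  <div class="footer"><p>Footer</p></div>
</body>
</html>"""

# Path relative to the root <html> element, mirroring the CSS child chain.
FULL_PATH = "./body/div[@class='content']/table/tbody/tr/td[@class='data']"


def xpath_select_text(xml_text, path):
    """Return the stripped text of every element matching an ElementTree path.

    Hypothetical helper for illustration; raises ET.ParseError if the
    input is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    return ["".join(node.itertext()).strip() for node in root.findall(path)]


if __name__ == "__main__":
    # Prints one matching cell per line.
    print("\n".join(xpath_select_text(PAGE, FULL_PATH)))
```

A shorter path such as ".//td[@class='data']" selects the same cells at any depth, which is handy when the exact ancestry is not known.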

ZeroOne

Posted 2013-01-06T00:07:36.943

Reputation: 171

pup (https://github.com/ericchiang/pup) has a CSS-based query language that conforms closely to your example. In fact, with your input, the following command:

pup "body > div.content > table > tbody > tr > td.data text{}" < page.html

produces:

Tabular Content 1
Tabular Content 2

The trailing text{} prints only the text of the selected elements, without the HTML tags.

A nice feature is that the full path need not be given. Again with your example:

$ pup 'td.data text{}' < input.html
Tabular Content 1
Tabular Content 2

One advantage of pup is that it parses its input with the golang.org/x/net/html package, an HTML5-compliant parser, so the HTML does not have to be well-formed XML.
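Why both the full path and the short selector match the same cells can be sketched in Python with the standard library's html.parser. This is only an illustration, not how pup works internally: the names ChildChainSelector, css_select, and PAGE are made up, and the sketch assumes each selector step is either tag or tag.class joined by ">" and that the input is reasonably well nested. A chain of child combinators matches when it lines up with the tail of the stack of open elements, so "td.data" (a chain of one) matches wherever the full path does.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

PAGE = """<html><body>
<div class="header"><h1>Header</h1></div>
<div class="content"><table><tbody>
<tr><td class="data">Tabular Content 1</td></tr>
<tr><td class="data">Tabular Content 2</td></tr>
</tbody></table></div>
<div class="footer"><p>Footer</p></div>
</body></html>"""


def parse_selector(selector):
    """Split 'div.content > td.data' into [('div', 'content'), ('td', 'data')]."""
    steps = []
    for part in selector.split(">"):
        tag, _, css_class = part.strip().partition(".")
        steps.append((tag, css_class))
    return steps


class ChildChainSelector(HTMLParser):
    """Collect the text of elements matching a chain of child combinators."""

    def __init__(self, selector):
        super().__init__()
        self.steps = parse_selector(selector)
        self.stack = []            # open elements as (tag, set of classes)
        self.capture_depth = None  # stack depth of the element being captured
        self.results = []

    def _tail_matches(self):
        # The chain matches if it equals the innermost open elements.
        if len(self.stack) < len(self.steps):
            return False
        tail = self.stack[-len(self.steps):]
        return all(
            tag == want_tag and (not want_class or want_class in classes)
            for (tag, classes), (want_tag, want_class) in zip(tail, self.steps)
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))
        if self.capture_depth is None and self._tail_matches():
            self.capture_depth = len(self.stack)
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID or tag not in [open_tag for open_tag, _ in self.stack]:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        while self.stack.pop()[0] != tag:
            pass
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.results[-1] += data


def css_select(html, selector):
    """Return the stripped text of each element matching the selector."""
    parser = ChildChainSelector(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


if __name__ == "__main__":
    print("\n".join(css_select(PAGE, "td.data")))
```

Calling css_select(PAGE, "body > div.content > table > tbody > tr > td.data") returns the same two strings, while "body > td.data" returns an empty list because td is not a direct child of body.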

peak

Posted 2013-01-06T00:07:36.943

Reputation: 111

Node.js can do this with jQuery and a simulated DOM (jsdom).

I made a Docker image for that (https://hub.docker.com/r/phil294/jquery-jsdom/):

docker run --rm -i phil294/jquery-jsdom '$("body > div.content > table > tbody > tr > td.data").text()' < page.html

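The argument after the image name is JavaScript code, so you can do anything you want, really. Note that jQuery's .text() concatenates the text of all matched elements without a separator.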

phil294

Posted 2013-01-06T00:07:36.943

Reputation: 123
