Using sed to extract data from an HTML table

0

Can I use sed to extract data from an HTML file? For example, from a file like this:

<html>
...
<table>
<tr>
 <td>R1A</td><td>R1B</td>
 <td>R1C</td><td>R1D</td>
</tr>
<tr>
 <td>R2X</td><td>R2Y</td>
 <td>R2W</td><td>R2Z</td>
</tr>
</table>
....
</html>

I want to extract this output:

R1A R1B R1C R1D
R2X R2Y R2W R2Z

In my text editor I use the following regular expression:

/<tr>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?</tr>/s
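To make the behaviour of that expression concrete, here is a minimal sketch in Python (not sed) that applies the same pattern to the sample table. The variable names `html`, `ROW` and `rows` are chosen here for illustration, and `re.S` stands in for the trailing `/s` flag. It only demonstrates what the regex captures on this well-formed sample; it is not a general HTML parser.

```python
import re

# Sample input copied from the question (the dotted lines stand for elided markup).
html = """<html>
...
<table>
<tr>
 <td>R1A</td><td>R1B</td>
 <td>R1C</td><td>R1D</td>
</tr>
<tr>
 <td>R2X</td><td>R2Y</td>
 <td>R2W</td><td>R2Z</td>
</tr>
</table>
....
</html>
"""

# The regex from the question; re.S makes "." match newlines, like the /s flag.
ROW = re.compile(
    r"<tr>.*?<td>(.*?)</td>.*?<td>(.*?)</td>"
    r".*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?</tr>",
    re.S,
)

# findall returns one 4-tuple of captured cells per matched row.
rows = [" ".join(cells) for cells in ROW.findall(html)]
print("\n".join(rows))
```

The pattern assumes exactly four `<td>` cells per row and no attributes on the tags, as in the sample.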

Sebtm

Posted 2010-03-07T14:38:15.027

Reputation: 393

3 This is not a "do my work for me" site. I could solve this easily, but you haven't even bothered to make a polite question out of it. – Nifle – 2010-03-07T14:46:08.583

1

This way madness lies: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Paused until further notice. – 2010-03-07T15:00:30.937

Politeness helps – fpmurphy – 2010-03-07T15:10:08.213

@Nifle: I understand. I was too direct. Sorry. – Sebtm – 2010-03-07T15:45:11.250

No need to all pile onto a new user. Politeness from old users helps, too. – JRobert – 2010-03-08T00:10:20.087

Answers

1

Not a sed solution, but an XSLT one:

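<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />

  <!-- For each table row, print its first four cells separated by single spaces -->
  <xsl:template match="//table/tr">
     <xsl:value-of select="descendant::td[1]"/>
     <xsl:text> </xsl:text>
     <xsl:value-of select="descendant::td[2]"/>
     <xsl:text> </xsl:text>
     <xsl:value-of select="descendant::td[3]"/>
     <xsl:text> </xsl:text>
     <xsl:value-of select="descendant::td[4]"/>
     <!-- End each row with a newline -->
     <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Suppress text outside the table rows, which the built-in rules would otherwise copy -->
  <xsl:template match="text()"/>

</xsl:stylesheet>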

fpmurphy

Posted 2010-03-07T14:38:15.027

Reputation: 1 260

But does this solution work if the HTML is malformed? – Sebtm – 2010-03-07T15:48:40.150
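
On the malformed-input question: XSLT operates on well-formed XML, so a processor would generally reject tag soup unless it is first read by an HTML-aware parser. As one possible alternative (not sed, and not part of the answer above), here is a minimal sketch using Python's standard `html.parser`, which tolerates missing closing tags. The class name `TableText` and the sample string `broken` are hypothetical, invented for this illustration.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of each <td>, grouped by <tr> (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self.in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row starts, even if the last was never closed
            self.in_cell = False
        elif tag == "td":
            if not self.rows:         # <td> with no enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("")  # a new cell starts
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


# Deliberately sloppy input: no closing </td> or </tr> tags.
broken = "<table><tr><td>R1A<td>R1B<tr><td>R2X<td>R2Y</table>"

parser = TableText()
parser.feed(broken)
parser.close()
lines = [" ".join(cell.strip() for cell in row) for row in parser.rows]
print("\n".join(lines))
```

This sketch ignores nested tables and `<th>` cells; it is only meant to show that an event-based HTML parser does not need matching close tags.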