Trying to extract two tables at a time from HTML files with 40 tables each

0

I have about 20 webpages. Each page has top banner navigation and then has information on up to 20 vehicles. There are 2 tables per vehicle.
The logical flow is: page navigation, table 1 for vehicle 1, table 2 for vehicle 1, table 1 for vehicle 2, table 2 for vehicle 2, ... end of page.
Example of tables included below.

I want to get the information out of the html pages and into a database.
The plan: separate out the data for each individual vehicle into individual files & then parse/extract the data from the files.

I do not understand awk so I am using sed.

Extraction plan: find the line w/ "car_photo", go back 4 lines (which will be the table tag), extract from that line until the second /table tag. Repeat until the final set of tables.
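A minimal sketch of this plan, written in awk rather than sed because counting to the second closing tag is easier to express there. The function name `split_vehicles` and the output file prefix are hypothetical. It assumes the `<table` that opens a vehicle block is the last one seen before the `car_photo` line, and that the block ends at the second `</table` after that marker.

```shell
# Hypothetical helper: split_vehicles PAGE PREFIX
# Writes PREFIX01.html, PREFIX02.html, ... with one vehicle block per file.
split_vehicles() {
    awk -v prefix="$2" '
    {
        if (!inblock) {
            # Keep only the lines since the most recent <table line.
            if (/<table/) buf = ""
            buf = buf $0 "\n"
            if (!/car_photo/) next
            # Marker found: open a new file and flush the buffered lines.
            inblock = 1; closed = 0
            out = sprintf("%s%02d.html", prefix, ++n)
            printf "%s", buf > out
        } else {
            print > out
        }
        # Count closing table tags on this line; stop after the second one.
        tmp = $0
        closed += gsub(/<\/table/, "", tmp)
        if (closed >= 2) { close(out); inblock = 0; buf = "" }
    }' "$1"
}
```

Called as `split_vehicles page1.html veh`, this would leave header tables alone, since they contain no `car_photo` marker.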

I've looked online for examples of how to get sed to extract from a given line number until the next instance of a regular expression, but it keeps extracting until the final instance. Even if it did work, I want it to extract up to the 2nd instance.

Here is a sample of a file, with the data replaced w/more generic info.


32321 Make: Model:
Year:
VIN:
Color:
Year Acquired:
Mileage: Last Oil Change: Insurance Due: Registration Expires:
32322 Make: Model:
Year:
VIN:
Color:
Year Acquired:
Mileage: Last Oil Change: Insurance Due: Registration Expires:
32321 Make: Model:
Year:
VIN:
Color:
Year Acquired:
Mileage: Last Oil Change: Insurance Due: Registration Expires:
32323 Make: Model:
Year:
VIN:
Color:
Year Acquired:
Mileage: Last Oil Change: Insurance Due: Registration Expires:
32324 Make: Model:
Year:
VIN:
Color:
Year Acquired:
Mileage: Last Oil Change: Insurance Due: Registration Expires:
32325 Make: Model:
Year:
VIN:
Color:
Year Acquired:
Mileage: Last Oil Change: Insurance Due: Registration Expires:


I tried to create a loop that would run 20 times. Each time, sed would extract lines 1 through the line with </table>, then run again to delete those lines. It would then extract lines 1 through the next line with </table> (to get the 2nd table) and delete that second table as well.

Each time `sed` extracts a table, it concatenates to a new file using the loop counter.

The problem is that sed is not stopping at the first occurrence of </table>. It is stopping at the LAST occurrence.
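For reference, sed addresses work on whole lines, so a range or a `q` command ends at the first line that matches, as long as the tags sit on separate lines. If the whole page is one long line, or a substitution uses a greedy `.*`, the match runs to the last `</table>`. That may be what is happening here, though the question does not show the exact command. A small sketch with hypothetical helper names:

```shell
# Hypothetical helpers; both assume each </table> sits on its own line.

# Print everything up to and including the first line containing </table>,
# then quit so later tables are never read.
first_table() {
    sed '/<\/table>/q' "$1"
}

# Print what is left after that first table. The range starts on line 1 and
# ends at the first later line containing </table>.
after_first_table() {
    sed '1,/<\/table>/d' "$1"
}
```

Running `first_table` and then `after_first_table` in turn reproduces the extract-then-delete loop described above, one table per pass.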

Mike

Posted 2013-07-20T09:43:38.167

Reputation: 11

We will need an example of your input file and an example of your desired output. Do the tables have any IDs or NAMEs? What lines is sed deleting? Do you want to end up with N files each of which contains two tables and nothing else? Do you want to have real html files named after each of the cars? If you don't show us your data we cannot help you. – terdon – 2013-07-20T14:07:22.023

I'm including a sample of one set of data. There are basically two tables per vehicle. 20 vehicles per page. I want to first separate each vehicle data into a separate HTML. I will then parse the data fields and put them into a database. – Mike – 2013-07-28T01:00:48.790

<table>

<tr><td width="90">

<div class="car_photo"> <div class="space"> <img src="../photos/veh5.jpeg">

</div> </div> </div> </td> <td align="right" class="car_details" width="400"> <table> <tr> <td class="line_bottom" width="190">

<div class="text_left">32325</a> </div>

</td> </tr> </table> <div class="line_bottom"> Make: </div> <div class="line_bottom">Model: <br>Year: <br /> </div> </td> <td class="car_details" width="400">

<div class="line_bottom">Mileage:</div>

<div class="line_bottom">Oil Change: </div>

<div class="line_bottom">Registration:</div> <br> </td> </tr> </table> – Mike – 2013-07-28T01:09:13.450

The page I am extracting from contains other header tables, so I can't just say "every two tables". I want sed to find the div class "car_photo" tag, back up 3 lines to the table tag, and extract down to the end of the second table. Export that set of 2 tables, and repeat until the end of the file. – Mike – 2013-07-28T01:25:14.847

Please don't add information in the comments, it is hard to read and easy to miss. [Edit] it into your question instead. To clarify, you want to match the "car_photo" tag (you might have mentioned that in your question) then "three lines". What do you mean by three lines? Are these lines in the code or in the rendered html? Do you mean table rows? Do you want to match the entire table you posted in your last comment? Including the nested tables? Please post the table in your question, we will need to see what is on the same lines as what in your file since both sed and awk parse by lines. – terdon – 2013-07-28T15:23:00.913

Answers

0

If I were doing this often, I would be using XPath parsing via something like the Nokogiri gem for Ruby.

However, here's something that could work, though without a bash script to combine the steps it will require a couple of steps per file (I guess that's 20 files in your case).

Step 1: Break the HTML into one tag per line, as far as possible, so that awk can process it line by line.

Starting with the HTML from your comment saved as car.html, I ran

awk -F'> ' '{ for (i = 1; i <= NF; i++) if ($i != "") printf("%s%s\n", $i, (i < NF ? ">" : "")) }' car.html > new.html

which gave me a new.html file like

<table>
<tr><td width="90">
<div class="car_photo">
<div class="space">
<img src="../photos/veh5.jpeg">
</div>
</div>
</div>
</td>
<td align="right" class="car_details" width="400">
<table>
<tr>
<td class="line_bottom" width="190">
<div class="text_left">32325</a>
</div>
</td>
</tr>
</table>
<div class="line_bottom">
Make: </div>
<div class="line_bottom">Model: <br>Year: <br />
</div>
</td>
<td class="car_details" width="400">
<div class="line_bottom">Mileage:</div>
<div class="line_bottom">Oil Change: </div>
<div class="line_bottom">Registration:</div>
<br>
</td>
</tr>
</table>

Step 2: Take that file and put it through an awk script that I put into its own file called awko

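#!/usr/bin/awk -f

# Split each line on ">" so that every field looks like "text<tag".
BEGIN { FS=">" }

# Track how deeply nested in tables the current line is.
$1 ~ /<table/ { table_cnt++ }

$1 ~ /<\/table/ { table_cnt-- }

# Inside a table, print the text that precedes each tag.
table_cnt > 0 {
    for( i = 1; i <= NF; i++ ) {
        split( $i, arr, "<" )
        if( length( arr[ 1 ] ) > 0 )
            printf( "%s\n", arr[ 1 ] )
    }

}

Running this (after chmod +x awko) like

./awko new.html

gives a result like:

32325
Make: 
Model: 
Year: 
Mileage:
Oil Change: 
Registration: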
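One possible next step toward a database import is to fold label lines like the ones above into a single quoted CSV row. This is only a sketch: the function name `fields_to_csv` is made up, and it assumes every line is either a bare value or `Label: value`.

```shell
# Hypothetical helper: read "Label: value" lines on stdin, print one CSV row.
fields_to_csv() {
    awk '
        {
            value = $0
            sub(/^[^:]*:[ \t]*/, "", value)   # drop a leading "Label:" if present
            sub(/[ \t]+$/, "", value)         # trim trailing blanks
            gsub(/"/, "\"\"", value)          # CSV-escape embedded double quotes
            row = row (NR > 1 ? "," : "") "\"" value "\""
        }
        END { if (NR) print row }
    '
}
```

Quoting every field keeps values that contain commas intact. Used as `./awko new.html | fields_to_csv`, it would emit one row for the whole input, so it fits best when each file holds a single vehicle.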

The output of awko could be modified to produce CSV instead, to make it easier to import into a DB. And again, these different steps could be combined in a shell script to do the "heavy filename lifting" in a proper loop, but I don't have time for that now.
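A rough sketch of what such a loop might look like. The function name `extract_pages` and the `.txt` output naming are assumptions, and the Step 2 script is assumed to be saved as `./awko` in the current directory. The Step 1 command here skips empty fields and does not append a second `>` to the last tag on a line.

```shell
# Hypothetical wrapper: run both steps on every page given as an argument.
# Assumes the Step 2 script is saved as ./awko; writes PAGE.txt per PAGE.html.
extract_pages() {
    for page in "$@"; do
        # Step 1: one tag per line. Step 2: pull the text out of the tables.
        awk -F'> ' '{ for (i = 1; i <= NF; i++) if ($i != "") printf("%s%s\n", $i, (i < NF ? ">" : "")) }' "$page" |
            awk -f ./awko > "${page%.html}.txt"
    done
}
```

It could be invoked as `extract_pages page*.html`, with the file names adjusted to whatever the 20 pages are actually called.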

awko essentially prints the text found on each line between the start (<table>) and end (</table>) tags you specified.

Oops. I just noticed this question is old. Oh well, committing this answer anyway.

n0741337

Posted 2013-07-20T09:43:38.167

Reputation: 101