Bash: return all the characters between the nth occurrence of two different strings within a string

1

In a bash script (on Ubuntu 14.04) I'm running the command:

WP055="$(wget -qO - 'http://alerts.weather.gov/cap/wwaatmget.php?x=CAZ055&y=1')"

Within the WP055 variable string there will be an unknown number of '<title>' and '</title>' pairs. I need to search within each of these pairs for the string 'by NWS'; a title that contains it holds the start and end time of a particular weather advisory. That title (all the characters between the opening and closing title tags) is what I want to capture into another variable so that I can drop it into an index.html file the script is building.

I was planning on looping through the WP055 variable x number of times analyzing the text within each pair of tags until I find the correct one.

I can't search WP055 for 'by NWS' because there may be more than one occurrence within WP055 (multiple advisories within the WP055 string).

(The above wget command will definitely have a 'by NWS' string within the 2nd title pair until March 07 at 3:00AM PST, when the current wind advisory will be cancelled.)

wdavro

Posted 2016-03-06T07:15:47.120

Reputation: 13

Wow. Thanks a lot @G-Man. I've been struggling with this and string indexes for two full weekends (and failing). Your solution is so much cleaner. I'll work this into my program late tonight and next weekend. Thanks. – wdavro – 2016-03-06T21:51:40.943

You're welcome.  Just so you know, the system notified me that you accepted my answer, but it didn't alert me to your comment (above), even though you said "@G-Man".  You can "ping" a person that way only if you comment on a post that he wrote, or under one of his comments.  So, if you want to say something to somebody who answered your question, you should comment on the answer. – G-Man Says 'Reinstate Monica' – 2016-03-06T22:01:12.520

Answers

0

A little unpolished, but it seems to work:

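# The URL is quoted so that the shell does not treat & as a control operator or ? as a glob character.
WP055="$(wget -qO - 'http://alerts.weather.gov/cap/wwaatmget.php?x=CAZ055&y=1')"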
remainder=${WP055#*<title>}
if [ "$WP055" = "$remainder" ]
then
        echo "No title found"
        exit
fi
while true
do
        this_title=${remainder%%</title>*}
        if [ "$remainder" = "$this_title" ]
        then
                echo "</title> not found"
                exit
        fi
        if [[ "$this_title" == *"by NWS"* ]]
        then
                echo "$this_title contains \"by NWS\""
                # You probably want to do something here, like return.
        fi
        new_remainder=${remainder#*<title>}
        if [ "$new_remainder" = "$remainder" ]
        then
                echo "No more titles"
                exit
        fi
        remainder=$new_remainder
done

remainder=${WP055#*<title>} is a form of parameter expansion that removes the shortest prefix matching a pattern.  Here the pattern *<title> matches everything up to and including the first <title>, so it sets remainder to

  • the first title in the string (excluding the introductory <title>),
  • the trailing </title>, and
  • all the rest of the string after that (including all the subsequent titles).

If "$WP055" = "$remainder", that means that the shell didn't find <title> in the string.
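As a small illustration of the prefix removal described above, here is a sketch run on a short made-up string (the variable names sample, no_title and unchanged are assumptions for the example; the real feed is much longer):

```shell
#!/bin/bash
# Hypothetical sample input standing in for the wget output.
sample='<feed><title>Alerts</title><entry><title>Wind Advisory by NWS</title></entry></feed>'

# Remove the shortest prefix matching "*<title>",
# i.e. everything up to and including the first <title>.
remainder=${sample#*<title>}
echo "$remainder"

# When the pattern does not match, the expansion leaves the value unchanged,
# which is what the "No title found" test relies on.
no_title='plain text with no tags'
unchanged=${no_title#*<title>}
echo "$unchanged"
```

The first echo should show the string starting at Alerts and still containing the second title; the second shows the input untouched.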

this_title=${remainder%%</title>*} similarly sets this_title to be $remainder up to but not including the first </title>.
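The doubled %% matters here: it removes the longest matching suffix, so the cut lands at the first </title>.  A sketch contrasting it with a single %, again on a made-up string (up_to_last is a name invented for the example):

```shell
#!/bin/bash
# Hypothetical value of remainder after the first <title> has been stripped.
remainder='Alerts</title><entry><title>Wind Advisory by NWS</title></entry>'

# %% removes the longest suffix matching "</title>*",
# so only the text before the FIRST </title> is kept.
this_title=${remainder%%</title>*}
echo "$this_title"

# A single % removes the shortest matching suffix,
# so everything before the LAST </title> is kept instead.
up_to_last=${remainder%</title>*}
echo "$up_to_last"
```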

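if [[ something1 == something2 ]], with the double brackets ([[ … ]]) and double equal sign (==), does a pattern match: unquoted parts of the right-hand side are treated as a glob pattern, so in *"by NWS"* the asterisks match any text and the quoted "by NWS" is matched literally.  Everything else is repetition.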
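A minimal sketch of how quoting changes that match, using a hypothetical title string (matched and literal are names invented for the example):

```shell
#!/bin/bash
# Hypothetical title text.
this_title='Wind Advisory issued by NWS'

# Unquoted asterisks are wildcards; the quoted middle part is literal.
if [[ "$this_title" == *"by NWS"* ]]; then
    matched=yes
else
    matched=no
fi

# With the whole right-hand side quoted, the asterisks are literal
# characters, so this compares against the exact string "*by NWS*".
if [[ "$this_title" == "*by NWS*" ]]; then
    literal=yes
else
    literal=no
fi

echo "pattern match: $matched, fully quoted: $literal"
```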

This might behave oddly on malformed input; i.e., text where <title> and </title> do not occur in alternating pairs.
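If the goal is to capture the matching title into a variable, one possible arrangement is to wrap the same expansions in a function and use command substitution.  This is only a sketch: first_title_containing is a hypothetical helper name, the sample string is made up, and the function simply gives up (non-zero status) when an opening <title> has no closing </title> after it.

```shell
#!/bin/bash
# Hypothetical helper: print the first title that contains the given text.
# Usage: first_title_containing "$text" "needle"
first_title_containing() {
    local remainder=$1 needle=$2 this_title
    # Continue while at least one opening tag remains.
    while [[ "$remainder" == *"<title>"* ]]; do
        # Drop everything up to and including the next <title>.
        remainder=${remainder#*<title>}
        # Malformed input: an opening tag with no closing tag after it.
        [[ "$remainder" == *"</title>"* ]] || return 1
        # Keep only the text before the first </title>.
        this_title=${remainder%%</title>*}
        if [[ "$this_title" == *"$needle"* ]]; then
            printf '%s\n' "$this_title"
            return 0
        fi
    done
    return 1
}

# Made-up input with two advisories; only the first match is returned.
sample='<title>Alerts</title><title>Wind Advisory by NWS</title><title>Flood Watch by NWS</title>'
advisory=$(first_title_containing "$sample" "by NWS")
echo "$advisory"
```

Returning only the first match is a design choice for the sketch; printing every match instead would mean removing the return 0 and tracking whether anything was found.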

G-Man Says 'Reinstate Monica'

Posted 2016-03-06T07:15:47.120

Reputation: 6 509