sed: extracting value of a key-value pair in a URL query string

5

2

I am trying to use sed to extract the value part of one of the many key-value pairs in a URL's query string

This is what I am trying:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | sed 's@^https?://(www.)?youtube.com/(watch\\?)?.*?v(=|/)([a-zA-Z0-9\-_]*)(&.*)?$@$4@'

but it always outputs the input URL as is.

What am I doing wrong?

Update 1

To clarify some issues:

  1. The regex is more complicated than it has to be because I am also trying to check the validity of the input and generate the output only if the input is valid. So a stricter match.
  2. The desired output is the value of the key 'v' in the query string.
  3. Have been unable to find the version of sed that I am using, but it's the one that comes with Mac OS X (10.7.5).
  4. (Retracted) I originally wrote that in my version of sed $1, $2 etc. seem to be the matches, and that \1, \2 etc. give the error: sed: 1: "s@^https?://(www.)?yout ...": \4 not defined in the RE. That was not correct, as I found out later. Apologies for causing the confusion.

Update 2

Have updated the sed RE to make it more specific based on suggestion by @slhck below, but the issue remains as before.

Update 3

Based on the man page for this version of sed it appears that this is a BSD-flavoured version.

markvgti

Posted 2013-06-04T12:48:10.263

Reputation: 503

1what is your desired output? – Endoro – 2013-06-04T13:02:13.690

1@Endoro I can only guess since the regex here is more complicated than it has to be, but I'd say since the OP wants the fourth capture group, they want everything between v= and the next &, so the video ID. – slhck – 2013-06-04T13:07:51.350

@Endoro: I've answered both points in the update to the original question above. Thanks for seeking the clarification. – markvgti – 2013-06-05T03:42:32.937

@markvgti -- If you ask further questions, please always state which OS and which other programs you are using. – Endoro – 2013-06-05T04:20:55.847

Answers

12

Even simpler, if you just want the abc:

 echo 'http://www.youtube.com/watch?v=abc&g=xyz' | awk -F'[=&]' '{print $2}'

If you want the xyz :

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | awk -F'[=&]' '{print $4}'

EXPLANATION:

  • awk : is a scripting language that automatically processes input files line by line, splitting each line into fields. So, when you process a file with awk, for each line, the first field is $1, the second $2 etc up to $N. By default awk uses blanks as the field separator.

  • -F'[=&]' : -F is used to change the field delimiter from spaces to something else. In this case, I am giving it a class of characters. Square brackets ([ ]) are used by many languages to denote groups of characters. So, specifically, -F'[=&]' means that awk should use both & and = as field delimiters.

  • Therefore, given the input string from your question, using & and = as delimiters, awk will read the following fields:

    http://www.youtube.com/watch?v=abc&g=xyz
    |----------- $1 -------------| --- - ---
                                    |  |  |
                                    |  |  +----- $4
                                    |  +-------- $3
                                    +----------- $2
    

    So, all you need to do is print whichever field you want, for example {print $4} for xyz.
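Picking a field by position only works while the parameters keep the same order. As a rough sketch (not part of the original answer), awk can also look the key up by name. The function name get_param is made up for illustration, and the code assumes the URL has no #fragment and that values are not percent-encoded:

```shell
# Hypothetical helper: print the value of query-string key $2 in URL $1.
get_param() {
  printf '%s\n' "$1" | awk -v key="$2" '{
    # Drop everything up to and including the first "?".
    q = $0
    sub(/^[^?]*\?/, "", q)
    # Split the query string into key=value pairs.
    n = split(q, pairs, "&")
    for (i = 1; i <= n; i++) {
      eq = index(pairs[i], "=")
      # Compare the text before "=" with the requested key.
      if (eq > 0 && substr(pairs[i], 1, eq - 1) == key) {
        print substr(pairs[i], eq + 1)
        exit
      }
    }
  }'
}

get_param 'http://www.youtube.com/watch?v=abc&g=xyz' v   # prints abc
get_param 'http://www.youtube.com/watch?g=xyz&v=abc' v   # prints abc
```

If the key is missing, nothing is printed, so the caller can test for an empty result.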


You said you also want to check that the string is a valid YouTube URL. A plain sed substitution will not do that on its own: if the line does not match the regex you give it, sed simply prints the entire line unchanged (unless you suppress the default output with -n and print explicitly with the p flag). You can use a tool like Perl to only print if the regex matches:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | 
  perl -ne 's/http.*www.youtube.com\/watch\?v=(.+?)&.+/$1/ && print'
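For comparison, sed can be made to behave the same way by combining -n (suppress the default output) with the p flag on the substitution (print only if it succeeded). This is a sketch rather than a tested recipe for every sed: it assumes a sed that accepts -E (BSD sed and reasonably recent GNU sed do), and the function name extract_id and the exact URL pattern are my own choices for illustration:

```shell
# Hypothetical helper: print the value of "v" only if the whole line looks
# like a YouTube watch URL; print nothing otherwise.
extract_id() {
  printf '%s\n' "$1" |
    sed -n -E 's@^https?://(www\.)?youtube\.com/watch\?(.*&)?v=([-_a-zA-Z0-9]+)(&.*)?$@\3@p'
}

extract_id 'http://www.youtube.com/watch?v=abc&g=xyz'   # prints abc
extract_id 'http://example.com/watch?v=abc'             # prints nothing
```

Note that sed still exits with status 0 when nothing matched, so a script has to check for empty output rather than the exit code.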

Finally, to simply print abc you can use the standard UNIX tool cut:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | 
  cut -d '=' -f 2 | cut -d '&' -f 1

terdon

Posted 2013-06-04T12:48:10.263

Reputation: 45 216

Cool, this works! Phew! On the downside this means I'll have to learn to use awk now :-(... thanks, but I am going to hold off on marking this as the accepted answer, in the hopes that with help I can debug sed. Many, many thanks though. – markvgti – 2013-06-05T03:51:14.390

2@markvgti The thing is, sed is not the best tool for the job when it comes to capturing patterns. It is extremely powerful and fast, and it can do it, but it's more complicated than necessary. I added an explanation of how the awk command works, you might find it easier to understand now. I also added a Perl and a cut solution just for the sake of completeness :). – terdon – 2013-06-05T13:24:34.337

I have marked your answer as the accepted one, even though it wasn't a direct answer to my question. Your answer explained so much and provided so much knowledge that I thought it worthy of highlighting. Thanks! Using sed was essentially a case of "If all you have is a hammer, every problem looks like a nail" -- was reluctant to learn One More New Thing but awk seems fairly easy and so perhaps worth the time investment. – markvgti – 2013-06-06T12:54:41.390

@markvgti in the *nix world, you never just have a hammer, all *nix systems will have sed, awk and perl installed. – terdon – 2013-06-06T13:00:12.947

2

If you need "xyz", try this (GNU sed):

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | sed 's/.*=\([[:alnum:]]*\).*/\1/'

Endoro

Posted 2013-06-04T12:48:10.263

Reputation: 2 036

Beware that the \w is not POSIX compatible, so this command isn't portable. – slhck – 2013-06-04T13:26:13.377

You are right, I changed it to [[:alnum:]], thanks! – Endoro – 2013-06-04T13:35:47.193

@Endoro Your answer put me on the right path (thanks!) and I was able to come up with the desired sed command (though I needed something more specific than [[:alnum:]]) and will be adding it as an answer. – markvgti – 2013-06-05T04:08:39.950

2

Experimenting with sed based off the answers given by @Endoro and @slhck led me to the final answer (the one I wanted). This is what works for me with the version of sed on Mac OS X (10.7.5):

echo 'http://www.youtube.com/watch?v=dnCkNz_xrpg' | sed -E 's@https?://(www\.)?youtube.com/(watch\?).*v=([-_a-zA-Z0-9]*).*@\3@'

Explanation:

  1. -E is to make sed use extended RE. In other versions of sed -r may be the equivalent option.
  2. The seemingly more-complicated-than-it-needs-to-be RE is to also verify that this is a valid YouTube link. Modify the beginning parts of this RE as required (e.g., https?://(www\.)?example.com/(.*\?).*key=([^&]*).*)
  3. The \3 in the replacement refers to the text captured by the 3rd expression in parentheses, so that text is printed as the answer/match (which is what I want).
  4. Using 's@@@' instead of the usual 's///' so that I don't have to escape the many forward slashes (/) in a URL.
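To illustrate point 4, here is the same substitution written with both delimiters. This is only a sketch: the pattern is a slightly shortened variant of the command above (so the captured group is \2 rather than \3), and the variable names are arbitrary:

```shell
url='http://www.youtube.com/watch?v=dnCkNz_xrpg'

# With "/" as the delimiter, every slash inside the pattern needs a backslash.
with_slash=$(printf '%s\n' "$url" |
  sed -E 's/https?:\/\/(www\.)?youtube\.com\/watch\?.*v=([-_a-zA-Z0-9]*).*/\2/')

# With "@" as the delimiter, the slashes can stay as they are.
with_at=$(printf '%s\n' "$url" |
  sed -E 's@https?://(www\.)?youtube\.com/watch\?.*v=([-_a-zA-Z0-9]*).*@\2@')

printf '%s\n' "$with_slash"   # prints dnCkNz_xrpg
printf '%s\n' "$with_at"      # prints dnCkNz_xrpg
```

Any character that does not occur in the pattern or the replacement can serve as the delimiter.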

Hope this helps others too as I have been helped.

markvgti

Posted 2013-06-04T12:48:10.263

Reputation: 503

As I mentioned in my answer, -E is the BSD sed option for extended regular expressions, so I take it you're on OS X? -r is used for GNU sed which is standard on Linux. – slhck – 2013-06-05T06:04:54.717

@slhck Yes, as I mentioned in Update 1 to my question, I am on OS X (10.7.5). – markvgti – 2013-06-05T06:33:56.420

Didn't see your edit. Glad you got it figured out! – slhck – 2013-06-05T06:49:22.950

1

If you really just want the video ID – so, anything between v= and the next & – just use:

sed -r 's/.*v=([[:alnum:]]*).*/\1/'

Here's what's wrong with your command:

  • The -r is needed to use extended regular expressions. If you leave that out, sed interprets the parentheses literally, so there won't be any match groups. With BSD sed, use the -E option instead.

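  • You use $1-style references to refer to matches, but in sed you should use \1. The $1 form is Perl's syntax (and, in the shell, a positional parameter passed to the current script); inside a sed replacement it is just literal text.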

  • You should use a character class like [[:alnum:]] (or [a-zA-Z0-9_] depending on how the IDs are set up) to match the parameter value, since otherwise the next & will be captured as well. The regex is greedy and will just match abc&g=xyz if you use .*?, since lazy quantification is not supported in BRE/ERE, and only in Perl regex or other "modern" flavors.
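The greediness issue from the last point can be seen directly. A small sketch, assuming a sed that accepts -E (use -r instead on older GNU sed); the variable names are arbitrary, and [^&]* is used here as a looser alternative to [[:alnum:]]*:

```shell
url='http://www.youtube.com/watch?v=abc&g=xyz'

# A bare .* is greedy and runs to the end of the line.
greedy=$(printf '%s\n' "$url" | sed -E 's/.*v=(.*)/\1/')

# A negated bracket expression stops at the next "&".
bounded=$(printf '%s\n' "$url" | sed -E 's/.*v=([^&]*).*/\1/')

printf '%s\n' "$greedy"    # prints abc&g=xyz
printf '%s\n' "$bounded"   # prints abc
```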

slhck

Posted 2013-06-04T12:48:10.263

Reputation: 182 472

As in the update above, 1) haven't been able to find sed version, and 2) \1, \2 etc. throws errors. This version of sed says that -r is an illegal argument. – markvgti – 2013-06-05T03:45:16.727

even when I changed the (.*?) of the desired match to ([a-zA-Z0-9\-_]*) (since this value is Base64 encoded), it still doesn't work. Good suggestion on making the match more specific though. – markvgti – 2013-06-05T03:48:33.787

0

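It always displays the URL because sed is not matching it.

    echo 'http://www.youtube.com/watch?v=abc&g=xyz' | sed 's!^http://www.youtube.com/watch[?]\(.*=.*\)&\(.*=.*\)!\1!'

This will display v=abc (the question mark is written as [?] because GNU sed treats \? in a basic regular expression as a quantifier rather than a literal character).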

PraveenMak

Posted 2013-06-04T12:48:10.263

Reputation: 1