Shortest URL regex match in JavaScript

16

8

Create the shortest regular expression that will roughly match a URL in text when run in JavaScript

Example:

"some text exampley.com".match(/your regular expression goes here/);

The regular expression needs to

  • capture all valid URLs that are for http and https.
  • not worry about not matching for URL-looking strings that aren't actually valid URLs like super.awesome/cool
  • be valid when run as a JavaScript regex

Test criteria:

Match:

Not Match:

  • example
  • super/cool
  • Good Morning
  • i:can
  • hello.

Here is a test that might help clarify a bit http://jsfiddle.net/MikeGrace/gsJyr/
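
A minimal sketch of how a candidate regex could be checked against the "Not Match" list above. The names notMatch and rejectsAll are illustrative only and are not taken from the linked fiddle.

```javascript
// Strings from the "Not Match" list in the question.
var notMatch = ["example", "super/cool", "Good Morning", "i:can", "hello."];

// Returns true when the candidate regex matches none of the strings.
// (Hypothetical helper, named here for illustration only.)
function rejectsAll(re) {
  return notMatch.every(function (s) {
    re.lastIndex = 0; // guard against state left over by a /g or /y flag
    return !re.test(s);
  });
}

console.log(rejectsAll(/.*/)); // false: /.*/ matches every string
```

A check like this only covers the negative cases; a candidate still has to be run against real URLs separately.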

I apologize for the lack of clarity; I hadn't realized how awful matching URLs was.

Mike Grace

Posted 2011-02-04T04:15:17.477

Reputation: 263

Ahgrrrr! I miss my edit privileges! If you're going to restrict the game to one language perhaps you should tag it with that language. – dmckee --- ex-moderator kitten – 2011-02-04T05:25:12.763

What constitutes a valid URL character? I ask because I could simply use \w for everything. Do you expect backreferences for different URL components? – Ming-Tang – 2011-02-04T05:49:13.013

1

"A URI is a sequence of characters from a very limited set, i.e. the letters of the basic Latin alphabet, digits, and a few special characters," according to RFC 2396.

– RunnerRick – 2011-02-04T06:18:17.207

Mike: I guess there is still some clarification in order. As it stands now I can just use /:/ as the regular expression and match valid URIs and not match all your examples on the »Not match« list. As long as you're going that route it's simply the question: What is the shortest regular expression that will not match any of the example strings but still catch all URIs. – Joey – 2011-02-04T09:10:44.620

I think this question seems to be a "give me teh codez" question. – None – 2011-02-04T17:32:07.670

@M28 the lack of clarity may seem that way but I did learn a lot from it and I'm still working on my own answer. If you think it should be deleted we can do that if it is better for the community. – Mike Grace – 2011-02-04T18:15:03.510

1

Just try to write a longer challenge with more details. – None – 2011-02-04T22:06:58.420

Answers

1

/.+\.\w\w.*/

It doesn't match the 3 strings that it shouldn't, and matches almost anything else ;)
Update: it still doesn't match any of the 5.

www0z0k

Posted 2011-02-04T04:15:17.477

Reputation: 200

14

This one works:

var re = /(^|\s)((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/gi;

/*
(^|\s)                            : ensure that we are not matching a URL
                                    embedded in another string
(https?:\/\/)?                    : the http or https schemes (optional)
[\w-]+(\.[\w-]+)+\.?              : domain name with at least two components;
                                    allows a trailing dot
(:\d+)?                           : the port (optional)
(\/\S*)?                          : the path (optional)
*/

Passes the tests at http://jsfiddle.net/9BYdp/1/

Also matches:

  • example.com. (trailing dot)
  • example.com:8080 (port)
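
Because the pattern starts with (^|\s), the full match can include a leading whitespace character; the URL itself sits in capture group 2. A small sketch of pulling that group out, assuming a hypothetical helper named extractUrls and an invented sample sentence:

```javascript
var re = /(^|\s)((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/gi;

// Collects capture group 2 (the URL without the leading whitespace)
// for every match in the text. Helper name is illustrative only.
function extractUrls(text) {
  var urls = [];
  var m;
  re.lastIndex = 0; // the /g flag makes exec() stateful, so start clean
  while ((m = re.exec(text)) !== null) {
    urls.push(m[2]);
  }
  return urls;
}

console.log(extractUrls("see example.com:8080 and https://example.org/a?b=1 ok"));
// [ 'example.com:8080', 'https://example.org/a?b=1' ]
```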

Arnaud Le Blanc

Posted 2011-02-04T04:15:17.477

Reputation: 2 286

works for me. ty :) – STEEL – 2016-02-12T11:20:16.583

This allows spaces – brenjt – 2013-11-06T17:52:48.110

Works nice, but not for domains with user/password parts e.g. http://user:password@domain.com/path – Radon8472 – 2018-08-14T12:55:44.673

Sweetness!!!!!!! – Mike Grace – 2011-02-04T09:33:37.877

2

Wouldn't you want to match a hostname with only one component as well (e.g. localhost)? – RunnerRick – 2011-02-04T17:20:25.933

5

This obviously doesn't do what you intend, but it meets your criteria:

 /.*/
  • "match all valid URLS that are for http and https."

    yep, definitely will match.

  • "not worry about not matching for URL looking strings that aren't actually valid URLS like 'super.awesome/cool'"

    yeah, sure, there will be lots of false positives, but you said that doesn't matter.

  • be valid when run as a JavaScript regex

    sure as eggs works as you say it should.

If this result is NOT a right answer, then you need to be more selective with your criteria.

In order to be a rule that works as you intend, you actually do need to implement a full RFC compliant matcher, and a full RFC compliant matcher will "worry about not matching".

So, in terms of "permit not matching", you need to specify exactly which deviations from RFC are permissible.

Anything else, and this whole exercise is a sham, because people will just write whatever works for them, or how they like it, and sacrifice "making any sense" in favour of being short (like I did).

On your update

The most naive regex I can come up with that matches (and captures) all your pasted examples so far is:

/(\S+\.[^/\s]+(\/\S+|\/|))/g;

It's quite simple in nature, and assumes only 3 basic forms are possible.

x.y
x.y/
x.y/z 

z can be anything that is not whitespace. x can be anything that is not whitespace. y can be anything that is neither whitespace nor a '/' character.

Lots of things will be valid under this rule, but they'll at least look like a valid URI to a human; they just won't be specification-compliant.

eg:

hello.0/1  # valid 
1.2/1 # valid 
muffins://¥.µ/€  # probably valid
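
A quick sketch of running the naive pattern over a sentence, using the loose examples above. The helper name findCandidates and the sample sentence are made up for illustration.

```javascript
var naive = /(\S+\.[^/\s]+(\/\S+|\/|))/g;

// Returns every x.y, x.y/ or x.y/z shaped token in the text.
// (Illustrative helper; match() with /g returns null when nothing matches.)
function findCandidates(text) {
  return text.match(naive) || [];
}

console.log(findCandidates("go to hello.0/1 or 1.2/1 today"));
// [ 'hello.0/1', '1.2/1' ]
```

Note that "hello." from the question's "Not Match" list is rejected, because y must contain at least one character.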

I think the sane approach is to extract things that are likely to be URIs, then validate them with something stricter. I'm looking at working out how to use the browser's URI class to validate them =).

But you can see the above reasoning working on this sample here: http://jsfiddle.net/mHbXx/
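
One way the extract-then-validate idea might look, as a sketch only. It assumes the standard URL constructor (available in current browsers and Node.js) as the stricter validator, which may not be the class the answer had in mind; isHttpUrl and extractHttpUrls are hypothetical names.

```javascript
var candidate = /(\S+\.[^/\s]+(\/\S+|\/|))/g;

// Stricter second stage: accept only strings the URL parser accepts
// and whose scheme is http or https. (Assumed validator, see lead-in.)
function isHttpUrl(s) {
  try {
    var u = new URL(s); // throws a TypeError on unparsable input
    return u.protocol === "http:" || u.protocol === "https:";
  } catch (e) {
    return false;
  }
}

// First stage: loose regex extraction; second stage: strict filtering.
function extractHttpUrls(text) {
  return (text.match(candidate) || []).filter(isHttpUrl);
}

console.log(extractHttpUrls("try hello.0/1 or http://example.com/path today"));
// [ 'http://example.com/path' ]
```

The trade-off is that scheme-less candidates such as example.com are dropped by this particular validator, since the URL constructor needs an absolute URL when no base is given.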

Kent Fredric

Posted 2011-02-04T04:15:17.477

Reputation: 181

Thanks Mike =). I don't wish to compete myself in a more serious manner, the other suggestions are more useful, I just wished to point out the problem with the initial premise so that the question quality could improve =) – Kent Fredric – 2011-02-06T05:02:08.727

Is it only me or is this matching "www .google .com"? – Schiavini – 2012-08-23T14:36:14.433

He changed the question, but you can do better anyway with /:/ even after the edit :-) – Joey – 2011-02-04T09:11:07.053

1

/https?\:\/\/\w+((\:\d+)?\/\S*)?/

Try that.

I'm including the leading and trailing slashes that delimit the regular expression, so hopefully that doesn't hurt my character count!

This pattern limits the protocol to either http or https, allows for an optional port number, and then allows any character except whitespace.
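
A short sketch of what the pattern returns on two sample inputs (both invented for illustration; firstMatch is a hypothetical helper). One behavior worth knowing: the host is matched with \w+, so on a dotted hostname the match ends at the first dot.

```javascript
var re = /https?\:\/\/\w+((\:\d+)?\/\S*)?/;

// Returns the matched text, or null when the pattern finds nothing.
function firstMatch(text) {
  var m = text.match(re);
  return m ? m[0] : null;
}

console.log(firstMatch("see http://localhost:8080/path now")); // http://localhost:8080/path
console.log(firstMatch("https://example.com/path"));           // https://example
```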

RunnerRick

Posted 2011-02-04T04:15:17.477

Reputation: 111