Copying text from YouTube to Clipboard introduces dashes?

Here's an example of a link I found on YouTube in the comments section of a video.

gnu.org/distros/free-distros.h­tml

This is the way it shows up in the comment.

If I highlight this link and copy to clipboard (ctrl+c), then go to a new browser tab and paste it (ctrl+v) in the address bar, then this is how it shows up.

gnu.org/distros/free-distros.h­tml

It looks the same, right? But if I hit Enter I get an error.

404 - Page Not Found

The page you were looking for could not be found on the GNU web server.

If you followed a link that turned out to be broken, and the page with the broken link mentions an explicit address to which to report bugs, please use that address.

The URL also changes to the following.

http://www.gnu.org/distros/free-distros.h%C2%ADtml%EF%BB%BF

If I remove %C2%ADtml%EF%BB%BF and type in tml so that I get back the address http://www.gnu.org/distros/free-distros.html and then hit Enter, well now it works, and the page loads.
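One way to check what those percent-escapes stand for is to decode them and list any character outside the ASCII range. This is a minimal sketch, assuming Python 3 and only its standard library; the variable names are illustrative.

```python
from urllib.parse import unquote
import unicodedata

# The address bar contents quoted above.
url = "http://www.gnu.org/distros/free-distros.h%C2%ADtml%EF%BB%BF"

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = unquote(url)

# Collect every character above U+007F with its code point and Unicode name.
hidden = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
    for ch in decoded
    if ord(ch) > 0x7F
]

for code_point, name in hidden:
    print(code_point, name)
# U+00AD SOFT HYPHEN
# U+FEFF ZERO WIDTH NO-BREAK SPACE
```

So %C2%AD is the UTF-8 encoding of a single character, U+00AD, and %EF%BB%BF is the UTF-8 encoding of U+FEFF.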

I thought to myself that this is very strange so I tried pasting the same text from clipboard to a plain text editor (notepad) and this is what I got.

gnu.org/distros/free-distros.h­-tml

How was the dash between h and tml introduced? This is why I was getting the 404 error. But the URL appears correctly when pasted to the address bar. Is this some kind of hidden character perhaps?

Also, if I go back to YouTube and highlight the link, I can see that there is a bump on the last three letters. The highlighting is taller around "tml". You can see that in the screen capture below.

screen1

screen2

Why is this happening? What's going on? Could it be that Google is somehow intentionally salting the link?

Update

If I paste into Notepad++ (version 6.3) I get the following.

gnu.org/distros/free-distros.h­tml?

If I try to paste into the address bar of the Google Chrome browser, there appears to be some kind of hidden character at the end of the URL. See screen capture below.

screen3

That's not a white space. It's something else... something alien! Something from planet X?

Note: The vertical line at the end is not the one I mean, that's just the text input cursor blinking.

Update 2

Inspecting the html code in Firefox by using the element inspection tool.

screen4

Why is there a square within the opening wbr tag?

Update 3

The "square" appears to be the soft hyphen character entity. Here follows the actual source code of this particular line.

<p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>

The soft hyphen is the &shy; you see here. HTML tags, such as <wbr> or <b> (for bold text), are not selectable. When you highlight text on a web page in a browser, you are not selecting the HTML tags. Nothing within <> is shown.

So it seems that soft hyphen is the root cause of the copy and paste issue. It is not displayed on the web page, but it is selected when you highlight the text.
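The difference between the tag and the entity can be reproduced outside the browser. The following is a rough sketch, assuming Python 3; the regular expression is a crude stand-in for what a browser does when it drops tags from a selection, and is only meant for this one line, not for HTML in general.

```python
import html
import re

# The source line quoted above.
source_line = "<p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>"

# Drop anything inside <...>, which is roughly what is left when tags are not selected.
without_tags = re.sub(r"<[^>]+>", "", source_line)

# Turn character entities such as &shy; into the characters they stand for.
text = html.unescape(without_tags)

print(repr(text))
# 'gnu.org/distros/free-distros.h\xadtml'
```

The <wbr> tag disappears completely, while &shy; survives as one extra character (U+00AD) between h and tml.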

Update 4

This is what it looks like when I paste the URL into Microsoft Word 2010 and view hidden characters.

screen5

To move the text cursor from .|html to .ht|ml requires pressing the arrow key three times. You can tell by the image above why that is. It's because of this hidden character. With the cursor in front of that strange looking character, pressing Alt+X shows 0068. With the cursor behind that character, and in front of the letter t, it reveals nothing at all. The 0068 is just the Unicode code point for the letter h.

Samir

Posted 2013-08-07T07:05:21.373

Reputation: 17 919

Is it possible to have a link to this Youtube page ? – Levans – 2013-08-07T07:09:15.300

I am using Firefox 22 on Windows Vista 64-bit SP2. But I just tried pasting into Google Chrome and I still get the 404 error. – Samir – 2013-08-07T07:10:02.643

@Levans It's "Richard Stallman Talks About Ubuntu" by Muktware. – Samir – 2013-08-07T07:11:43.807

@Levans http://youtu.be/CP8CNp-vksc – Samir – 2013-08-07T07:12:05.433

Lesson learned: soft hyphens are nasty! =) – Samir – 2013-08-07T09:45:40.277

Answers

Yes it is a nuisance.

There are two hyphens: the normal one, \u002D, and the funny one, \u00AD. The funny one is sometimes used within YouTube comments and comes up as hidden.

Paste into Notepad (to remove formatting; Notepad also shows the character), and then into MS Word (or just do Paste Special, unformatted Unicode, in MS Word). Put your cursor to the right of the hyphen, or any character, and press Alt+X, and you see the ASCII or Unicode code for it.

This may seem strange. Be aware that a few characters come in two different types: a type you usually use, which is within the 0-7F range, and a type people tend to not use much or at all, which is >7F. There are two types of spaces (a normal one and another called the non-breaking space, code 160, \u00A0, which can be of use). There are two types of pipes, 7C and A6. The A6 one is just asking for problems, as it causes failures on the command line. And there are two types of hyphens; the second one, as you see, behaves funny too, as YouTube comments sometimes use it, hide it, and don't display it as a hyphen.
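The pairs mentioned above can be listed side by side with their official Unicode names. This is a small sketch, assuming Python 3; the `PAIRS` table is just the three pairs from the paragraph above written as code points.

```python
import unicodedata

# (everyday character, look-alike above 7F), as code points.
PAIRS = [
    (0x20, 0xA0),  # the two spaces
    (0x7C, 0xA6),  # the two pipes
    (0x2D, 0xAD),  # the two hyphens
]

names = [
    (unicodedata.name(chr(low)), unicodedata.name(chr(high)))
    for low, high in PAIRS
]

for (low, high), (low_name, high_name) in zip(PAIRS, names):
    print(f"U+{low:04X} {low_name:<14} U+{high:04X} {high_name}")
# U+0020 SPACE          U+00A0 NO-BREAK SPACE
# U+007C VERTICAL LINE  U+00A6 BROKEN BAR
# U+002D HYPHEN-MINUS   U+00AD SOFT HYPHEN
```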

Another funny character I see used by YouTube in comments is \uFEFF. You can run Notepad2 (download it), choose File, Encoding, UTF-8, then paste the text in and search for \uFEFF, replacing it with nothing (check the box that says transform backslashes).

Similarly, you can open Notepad2, search for \u00AD (that funny hyphen) and replace it with a regular hyphen. EditPad free might be able to do it, though I use the pro version for its regex support.
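The same search-and-replace can be done with a few lines of script instead of an editor. This is only a sketch, assuming Python 3; `strip_hidden` is a made-up helper name, and it deletes both characters outright (rather than swapping in a regular hyphen), since that is what a pasted URL needs.

```python
# Map the two hidden characters to None so that str.translate() deletes them.
HIDDEN = {0x00AD: None, 0xFEFF: None}

def strip_hidden(text: str) -> str:
    """Return text with every U+00AD and U+FEFF removed."""
    return text.translate(HIDDEN)

pasted = "gnu.org/distros/free-distros.h\u00adtml\ufeff"
cleaned = strip_hidden(pasted)
print(cleaned)
# gnu.org/distros/free-distros.html
```

To get a visible hyphen instead, as with the Notepad2 replacement, the entry for 0x00AD could map to "-" rather than None.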

I'd note that charmap doesn't copy the funny hyphen correctly. (So if you want to experiment and you choose copy and paste it into a piece of software and it displays funny, blame charmap.) But it copies fine (as in, with the character) from your link in my browser (Chrome). Better if the character wasn't there though, it is a nuisance! But you can see the code of it in MS Word, and you can search for it and remove it in Notepad2.

You see from charmap that it (\u00AD) is called the "Soft Hyphen" (I'm just glad they didn't hyphenate that title!)

In the pic I used Ms Word and did ALT-x

enter image description here

barlop

Posted 2013-08-07T07:05:21.373

Reputation: 18 677

I look at the source code now and I see <p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>. So the reason we get this problem is because of the soft hyphen and not so much because of the wbr tag? – Samir – 2013-08-07T08:37:22.633

http://www.ascii.cl/htmlcodes.htm Hex AD, &shy; It's the &shy; that is the soft hyphen (the weird hyphen!) which is the issue. And &shy; is right there in the HTML you quoted there – barlop – 2013-08-07T08:50:14.080

If you look at the source in Chrome for your question, where you included a failing link, there, rather than an &shy;, it literally has the soft hyphen within the letters html but displays nothing for it. E.g. paste it into the URL bar so you're in edit type mode, and move your cursor through it. If you move your cursor through it (with the arrow keys) you see there's a funny character between h and t of html. I experimented with these things once, you can fit tons of these characters in there consecutively, which show up in one program but in another occupy no space. – barlop – 2013-08-07T09:04:19.757

You have lost me. Define "edit type mode". I did try pasting into MS Word 2010 and I see that I have to press the arrow key three times to move the text cursor from .|html to .ht|ml. It should be enough to press it two times to move the cursor two steps. This is because there is a hidden character there. – Samir – 2013-08-07T09:19:56.217

Also, when the cursor is in front of the t (.h|tml) the Alt+X doesn't show any ASCII code in MS Word. But I can see it by viewing hidden characters in Word (see screen capture above). – Samir – 2013-08-07T09:30:10.883

Ah, to get the Alt+X in MS Word on that "hyphen": if you paste directly into MS Word it comes up in a silly formatting style like the first line in the screenshot. If you paste that into Notepad, then from Notepad back into MS Word (thus getting rid of the formatting), then do Alt+X on that hyphen glyph, it shows 0xAD (the soft hyphen). – barlop – 2013-08-07T10:01:23.260

it may be that youtube have changed their comments and don't have any of the really weird characters within [\u007F-\uFFFF] (unless somebody types a foreign language). Certainly the \uFEFF isn't in the comment anymore when it was before, so looks like they've definitely removed that one. – barlop – 2013-12-26T11:35:29.470

Looking at the source code of this portion of page, I see this :

<p>gnu.org/distros/free-distros.h<wbr>­tml</p>

It seems that Youtube automatically inserted a <wbr> tag. It's a word-break opportunity, it tells the browser that if needed, the word may be broken to insert a newline.

On UTF-8 encoded pages, this is displayed as a ZERO-WIDTH SPACE, not showing anything, but allowing a newline. That's what causes your encoding issue.

It looks like YouTube has an algorithm to auto-insert <wbr> into long words at good places (not cutting a syllable in two), but as the http:// was missing at the beginning of the URL, the algorithm didn't recognize it as such, and thus assumed it was a word that could be broken.

Levans

Posted 2013-08-07T07:05:21.373

Reputation: 2 010

The "dotted square" that you saw in Firefox was not a character and was not in the markup. It was a visual guide as part of your developer tools showing where you could click if you wanted to add attributes to that element. You don't see this on other elements because they already have attributes, and you have specifically selected the <wbr>. Firefox's and Firebug's UI have changed since 2013 so you wouldn't see this today. – AndrewF – 2014-12-30T23:33:06.207

But there is no line break? The dash is not seen on YouTube? Only when copy and pasted? – Samir – 2013-08-07T07:41:00.147

There is no line break because it is not needed to display the content, yet the invisible character is still here. The dash on copy-paste is probably the result of poor encoding translation from UTF-8 to the one used by Windows, while the URL is translated to URL encoding, with poor results as well. And I missed something, <wbr> is not supposed to insert a -. I'll correct. – Levans – 2013-08-07T07:46:25.723

Why is there a dotted square within the opening wbr tag? See the screen capture above. Shouldn't it just say "<wbr>" and nothing else? I would describe the above as "<wbr?>" where the ? marks the position of this strange looking, square like character. That's what I see when I inspect the element inside Firefox. I mean wbr alone should not cause this problem, right? – Samir – 2013-08-07T08:00:57.027

Right, "the <wbr> element does not introduce a hyphen at the line break point." – Samir – 2013-08-07T08:05:33.313

@Sammy Indeed, I looked a little more, and it seems this URL's encoding is quite screwed, and that is probably what caused Youtube to insert a <wbr>. Probably the one who posted it had an encoding issue with his own computer, and omitting the http:// caused Youtube algorithm to act strangely. – Levans – 2013-08-07T08:09:18.670

When I visit YouTube, Firefox uses UTF-8 encoding. While on other websites it might use Windows 1252. It chooses encoding automatically. It doesn't remove this character if I manually change encoding to Windows 1252, it rather introduces new strange characters. – Samir – 2013-08-07T08:18:01.863

@Sammy Your comment there gives me a thought. Using privoxy and writing a search and replace to remove it. – barlop – 2013-08-07T08:24:11.277

@barlop Haha! Yeah, that sounds like fun. – Samir – 2013-08-07T08:30:17.477

There is not an opening and closing wbr tag, there is just <wbr>. So the </wbr> is not necessary. "A wbr element must have a start tag but must not have an end tag." Source: W3C – Samir – 2013-08-07T08:42:44.793

You have omitted the soft hyphen character (&shy;) in the code you quoted above. This is most likely the cause of the problem, and not the wbr tag. – Samir – 2013-08-07T08:50:46.360

@Sammy I just checked again, my firefox (22.0) doesn't show a &shy; at all in source code. – Levans – 2013-08-07T08:53:19.197

@Sammy I suppose it depends how, within the browser, you are looking at it. It might show &shy;, or it might show nothing, or, as with Notepad2 with UTF-8 encoding, it might show a hyphen. Or charmap, which, when asked to copy it, messes it up and displays what look like a few hyphens. So, good to specify the method you view it with where you see this or that. – barlop – 2013-08-07T09:02:53.000

How many lines of code do you see? Are you using the built-in source code viewer or an external, third party source viewer like Notepad, Notepad++ or Notepad2? If you click on "all comments" and then view the source for that page you should have about 37000 lines of code. You will find <p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p> on line 1920. – Samir – 2013-08-07T09:06:04.580

@Sammy Built in. Here is a pic viewing source in Chrome: http://i.imgur.com/6I9S1l0.png Here it is in Firefox: http://i.imgur.com/xg7Tjdf.png Both display the funny hyphen with the hyphen glyph (rather than the alternative &shy;). In the pages shown in the pics above, there are 1256 lines in Chrome http://i.imgur.com/Y71P1zc.png and in FF http://i.imgur.com/cMb3kla.png 1224 lines. Not near 37000, so I guess we're doing something differently? Hopefully my screenshots will help determine what. – barlop – 2013-08-07T09:53:11.200