JavaScript trick? How does Scribd make it difficult to even copy and paste text?

Lately, I have seen that Scribd makes it very difficult for free users to browse through a document hosted on their site. There is no way to search within a document, let alone download it.

Using JavaScript, they load pages on demand in the browser, so the browser's "Save as" feature does not help much.

To my amazement, I saw that even copying and pasting text puts gibberish on the clipboard! To find out what was wrong, I turned off JavaScript in the browser and loaded the same document again. Voilà, the page itself showed the gibberish. So it looks like Scribd's JavaScript somehow decodes the gibberish text and then displays it in the browser.

Here is what puzzles me: even with JavaScript enabled and the text rendered properly in the browser, if I look at the DOM objects corresponding to the text I select, I still see the gibberish text.

So now I am confused. The text is displayed correctly to the user, but the DOM objects still contain gibberish. What kind of JavaScript hooks or code is the site using to keep the gibberish in the DOM objects and still render the decoded text?

Is there a way I can access the decoded text? My intention is not to reverse engineer the decoding algorithm, but to locate where the decoded text is being stored.

Example document is:

http://www.scribd.com/doc/143886351/OCP-Upgrade-to-Oracle-Database-12c-Student-Guide-vol-1-Exam-1Z0-060

See what happens when you turn JavaScript on and off!

javascript

user2497064

Posted 2013-06-19T08:55:04.480

Reputation: 191

It's sort of simple. They created a JavaScript PDF viewer. Mozilla did something similar with Firefox. Since your PDF viewer is not actually being used to display the content, they can control nearly every aspect of the viewing experience. – Ramhound – 2013-06-19T11:53:51.097

Answers

Look at the font-family for the span. They use a custom font (in this case ff6).

They must do this so that more PDF documents are displayed correctly, since there is no requirement that the text in a PDF document use a standard character set. It need only use one whose codes map to the glyphs in the embedded font. That would explain why the DOM and the clipboard show the raw character codes while the custom font draws the intended glyphs on screen.
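To check which font renders the text you have selected, something like the following could be pasted into the browser console. It is only a sketch: `selectedFontFamily` is a made-up name, and the two parameters exist only so the function can be tried outside a browser; in the console you would call it with no arguments.

```javascript
// Report the raw (DOM) text of the current selection and the font that draws it.
// Both parameters default to the browser globals; they are injectable
// so the function can also be exercised outside a browser.
function selectedFontFamily(
  selection = globalThis.getSelection(),
  getStyle = (el) => globalThis.getComputedStyle(el)
) {
  if (!selection || selection.rangeCount === 0) return null;
  let node = selection.anchorNode;
  // Text nodes have no computed style, so climb to the enclosing element.
  if (node && node.nodeType === 3) node = node.parentNode; // 3 = TEXT_NODE
  if (!node) return null;
  return {
    rawText: selection.toString(),
    fontFamily: getStyle(node).fontFamily,
  };
}
```

If the reported `fontFamily` is a custom name such as ff6 rather than a system font, the readable shapes come from that font and not from the characters stored in the DOM.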

Dan D.

Posted 2013-06-19T08:55:04.480

Reputation: 5 138

If you compare the displayed text with the "gibberish", you can see that some letters are the same while others are substituted. For example, "Mltmrprfsm Jblbemr" is "Enterprise Manager". Given enough text, you should be able to build a quick translation table. Already we know that M -> E, L -> N, F -> I, and that T, R, P, and S map to themselves. Given some time, detective work, and modest programming skills, one could translate the whole document.
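A rough sketch of such a translation table, assuming a simple one-to-one character substitution (which the sample above is consistent with, but which is not confirmed for the whole document). The function names `buildTable` and `decode` are made up for illustration.

```javascript
// Build a character-level translation table from one aligned sample:
// `garbled` is what the DOM/clipboard holds, `readable` is what is shown on screen.
function buildTable(garbled, readable, table = new Map()) {
  const from = Array.from(garbled);
  const to = Array.from(readable);
  if (from.length !== to.length) {
    throw new Error("samples must be the same length");
  }
  for (let i = 0; i < from.length; i++) {
    // A conflict means the samples are misaligned or the mapping is not 1:1.
    if (table.has(from[i]) && table.get(from[i]) !== to[i]) {
      throw new Error(`conflicting mapping for "${from[i]}"`);
    }
    table.set(from[i], to[i]);
  }
  return table;
}

// Translate text, marking characters that have not been seen yet with "?".
function decode(text, table) {
  return Array.from(text, (ch) => (table.has(ch) ? table.get(ch) : "?")).join("");
}

const table = buildTable("Mltmrprfsm Jblbemr", "Enterprise Manager");
console.log(decode("Jblbemr", table)); // "Manager"
console.log(decode("mltmr", table));   // "enter"
```

Passing an existing table as the third argument lets you keep feeding in more sample pairs until the "?" marks disappear. Upper and lower case are treated as separate entries.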

Of course, there's no guarantee that the next document would use the same ff6 font that Dan D. mentioned, so grabbing that font for local use should be your next step if you want to save the text for later.

MattDMo

Posted 2013-06-19T08:55:04.480

Reputation: 4 968

You can try capturing a screenshot with the browser's crop tool and then use any of the readily available online OCR sites to convert it to text. That should sort you out.

guy841185

Posted 2013-06-19T08:55:04.480

Reputation: 1