Save a website containing JavaScript after it has been interpreted

7

There's a website I want to save that uses JavaScript to update what the user sees (if it helps, the site is vyou.com; here's a link to Andrew W. K.'s user page). I would like to save the page once the list of links to a user's video answers is fully expanded. I do not intend to save the videos those links lead to; I just want to save the state my browser is in. It doesn't matter to me which browser I use.

Has anyone done something similar, or does anyone know how to achieve this?

G. Bach

Posted 2013-04-03T00:08:14.557

Reputation: 245

Answers

9

There are (at least) two reasons that your saved file will not look exactly like the live website you saved:

  1. Some or many of the links to images on the page could be "relative" links. Similarly, links to ".css" and ".js" files on the page could be "relative" links.
  2. Some links to images and other files can be contained inside these ".css" and ".js" files.

For example, let's say you look at a page:

http://example.com/something/index.php

On that page is a "relative" link to an image file:

"../images/picture.jpg"

Also on that page is a "relative" link to a .css file:

"../css/style.css"

So, when you save the ".html" file for the page, it contains these "relative" links. When you open your saved page in your browser, it looks for these image and css files in the folder where you saved the .html file. If these image and css files are not in the folder where you saved the .html file, the page will not display properly.
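To make the resolution rule concrete, here is a minimal Python sketch (Python is used only for illustration, and the local path of the saved copy is a made-up assumption). It resolves the two relative links from the example, first against the live page URL and then against a saved copy of the page:

```python
from urllib.parse import urljoin

# URL of the live page, taken from the example above.
LIVE_PAGE = "http://example.com/something/index.php"
# Hypothetical location of the saved copy (an assumption for illustration).
SAVED_PAGE = "file:///C:/saved/page/index.html"

RELATIVE_LINKS = ["../images/picture.jpg", "../css/style.css"]


def resolve_all(page_url, links):
    """Resolve each relative link against the URL of the page that contains it."""
    return [urljoin(page_url, link) for link in links]


if __name__ == "__main__":
    live = resolve_all(LIVE_PAGE, RELATIVE_LINKS)
    saved = resolve_all(SAVED_PAGE, RELATIVE_LINKS)
    for live_url, saved_url in zip(live, saved):
        # The same relative link points somewhere else once the page has moved.
        print(live_url, "->", saved_url)
```

The same relative link ends up pointing at a location on the local disk once the page has been saved, which is why the files are not found unless they were saved alongside it.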

There are a few things you can do to "resolve" this.

  1. Choose File-->Save as...-->Webpage, complete (or similar wording) when you save the webpage to your computer. This saves copies of the image and .css/.js files on your computer, and modifies the links in the saved html file to point to those local copies. This is not "foolproof". It seems this process will frequently "miss" some files. In that case, you will have to locate and download the missing files manually, and edit the links in your saved html file to "point" to the files saved on your computer.
  2. Save the html file as a "Web-archive" file (".mht")
  3. Add a "base href..." line to the <head> section in the saved copy of the html file. Using the above URL as an example:

    http://example.com/something/index.php
    

    Removing "index.php" from the webpage URL gives you:

    http://example.com/something/
    

    Add this to the <head> section in the saved copy of the webpage, like this:

    <head>
    <base href="http://example.com/something/">
    <...>
    <...>
    </head>
     ...
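If you would rather not edit the file by hand, the "base href" step can be scripted. The following is a rough Python sketch, not a robust HTML rewriter: it assumes the saved file has a `<head>` tag and no existing `<base>` tag, and the function names are made up for this example.

```python
import re
from urllib.parse import urljoin


def base_url_of(page_url):
    """Drop the file name (e.g. index.php) and keep the directory part of the URL."""
    return urljoin(page_url, ".")


def add_base_href(html, page_url):
    """Insert a <base> tag right after the opening <head> tag.

    Assumption: the document has a <head> tag and no <base> tag yet.
    """
    base_tag = '\n<base href="%s">' % base_url_of(page_url)
    return re.sub(r"(<head\b[^>]*>)",
                  lambda match: match.group(1) + base_tag,
                  html, count=1, flags=re.IGNORECASE)


if __name__ == "__main__":
    saved = "<html>\n<head>\n<title>Example</title>\n</head>\n<body></body>\n</html>"
    print(add_base_href(saved, "http://example.com/something/index.php"))
```

To use it on a real file, read the saved html into a string, pass it through `add_base_href` together with the original page URL, and write the result back.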
    



Edit (2013-04-04):

Using Internet Explorer, the best way (perhaps not perfect) to save a page that also saves the "result" of the JavaScript on the page, is to use Microsoft Developer Tools, and then view and save the DOM source for the page.

I say "perhaps not perfect" ...

Suppose you have a webpage that uses JavaScript to "generate" HTML code that adds an image to the webpage.

If you view the webpage online, you will see the image. If you view the page source (View-->Source) or save the page source to a file (File-->Save as...), you will see the JavaScript, but you won't see the HTML <img...> code.

Now, if you use the Developer Tools to view and save the DOM source for the page, and then open the saved file in a text editor, you will see the original JavaScript is included in the saved file, then below the JavaScript, you will see the <img...> code that was generated by the JavaScript.

Then, if you open the saved page in a browser, you will see the image twice. This is because when you open the saved page, the JavaScript will execute again and generate the code to show the image, and below that is the HTML code for the image that was saved to the file.

You can "fix" this by editing the saved DOM source and removing (or commenting out) the JavaScript. Then when you open the saved page in a browser, you will see the image only once.
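Removing the JavaScript can also be done with a short script instead of a text editor. This is a minimal Python sketch under simple assumptions: the sample markup and the file name `picture.jpg` are invented for illustration, and the regular expression only handles ordinary `<script>...</script>` blocks, so check the result on real pages.

```python
import re

# Matches an opening <script ...> tag, its content, and the closing </script> tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)

# Invented stand-in for a saved DOM source: the script plus the markup it generated.
SAVED_DOM = """<body>
<script type="text/javascript">
 document.write('<img src="picture.jpg" />');
</script>
<img src="picture.jpg" />
</body>"""


def strip_scripts(html):
    """Remove every <script>...</script> block, keeping the generated markup."""
    return SCRIPT_RE.sub("", html)


if __name__ == "__main__":
    print(strip_scripts(SAVED_DOM))
```

After stripping, only the generated `<img>` element is left, so the image is shown once when the file is opened.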


Edit (2013-04-05):

It seems there may be some confusion about saving webpages that contain relative links, from the browser, so I decided to provide a working example.

Here is a webpage I created to demonstrate this:
Waterfall-Lighthouse pictures

Here is the HTML code of that page:

<html>
<head>
<title>Waterfall and Lighthouse</title>
</head>
<body>
<img src="../images/imagesCAIPHDL5.jpg" /><br />
<br /><hr align="left" width="284" /><br />
<script type="text/javascript">document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');</script>
</body>
</html>


If you view the page with the browser (I'm using IE9), you'll see the expected webpage correctly, with 2 pictures.

While viewing the page, you can save the source of the page by clicking: View-->Source, or by clicking File-->Save as...-->Webpage, HTML only. Then save the file. Either way, you'll get the same HTML code:

<html>
<head>
<title>Waterfall and Lighthouse</title>
</head>
<body>
<img src="../images/imagesCAIPHDL5.jpg" /><br />
<br /><hr align="left" width="284" /><br />
<script type="text/javascript">document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');</script>
</body>
</html>


But, if you view the saved file in the browser, you'll get a blank page, with no pictures. This is because the image link in the saved file, and the image link written by JavaScript are both "relative" links... the browser cannot tell the domain or path of where to find the pictures. You can see what that looks like here:
View-Source
HTML-only

If you edit this saved file and add the line:

<base href="http://viewthis.info/superuser577187/page/">


the file will look like this:

<html>
<head>
<title>Waterfall and Lighthouse</title>
<base href="http://viewthis.info/superuser577187/page/">
</head>
<body>
<img src="../images/imagesCAIPHDL5.jpg" /><br />
<br /><hr align="left" width="284" /><br />
<script type="text/javascript">document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');</script>
</body>
</html>


Now, if you view the edited file in the browser, you'll get a page with both pictures displayed correctly. This is because the "base href" line tells the browser where to look (domain and path) for the "missing" pictures. You can see what that looks like here:
Source-with-base-href

While viewing the page online, you can also save the source of the page by clicking:
File-->Save as...-->Webpage, complete.

If you view the source of this saved file, you'll see this HTML code:

<!-- saved from url=(0042)http://viewthis.info/superuser577187/page/ -->
<html>
<head>
<title>Waterfall and Lighthouse</title>
<meta content="text/html; charset=windows-1252" http-equiv=Content-Type>
<meta name=GENERATOR content="MSHTML 9.00.8112.16470">
</head>
<body>
<img src="Waterfall-and-Lighthouse_files/imagesCAIPHDL5.jpg" /><br />
<br /><hr align=left width=284 /><br />
<script type=text/javascript>document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');</script>
</body>
</html>


If you view this saved file in the browser, you'll get a page with the first (top) picture displayed correctly, but the second picture is not displayed (missing). This is because when saving with Webpage, complete, the browser saves a copy of the first image on your hard drive, and modifies the link in the saved file to point to the local copy of the image. The image link for the second picture is not present in the saved file. The JavaScript code that creates the second image link is saved in the file, but the actual link is not part of the source of the page so the second image link is not saved, and the second image file is also not saved.
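The effect can be reproduced outside the browser. The sketch below uses Python's standard `html.parser`, which, like any tool that only reads the static markup, treats the content of a `<script>` element as opaque text. It is only an illustration of the principle, not a description of how IE implements "Webpage, complete"; the class and function names are made up.

```python
from html.parser import HTMLParser

# The page source from the example above (raw string so that \n stays as typed).
PAGE_SOURCE = r"""<html>
<head>
<title>Waterfall and Lighthouse</title>
</head>
<body>
<img src="../images/imagesCAIPHDL5.jpg" /><br />
<br /><hr align="left" width="284" /><br />
<script type="text/javascript">document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');</script>
</body>
</html>"""


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag found in the static markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def static_image_links(html):
    """Return the image links visible without running any JavaScript."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    print(static_image_links(PAGE_SOURCE))
```

Only the first image link is found; the second one exists only as a string inside the script, so a static save has nothing to download for it.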

Again, if you edit this saved file and add the line:

<base href="http://viewthis.info/superuser577187/page/">


and then view the edited file in the browser, you'll get a page with both pictures displayed correctly.

Another way you can save the page, while viewing the page online is by clicking:
File-->Save as...-->Web Archive, single file-->Save.

If you view this saved file in the browser, you'll get a page with both pictures displayed correctly. This is because the "Archive" format stores the first image inside the archive file (encoded), and also records the web address (domain name and path) of the webpage, which lets the browser locate the second image file.
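A web archive (".mht") is a MIME multipart document, so its contents can be inspected with a mail parser. The following Python sketch builds a tiny stand-in archive and lists its parts. The sample content, URLs and function names are assumptions for illustration, and a real file saved by IE may carry additional headers.

```python
from email import encoders, message_from_bytes
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_sample_archive():
    """Build a tiny stand-in for a .mht file (placeholder content, not real IE output)."""
    archive = MIMEMultipart("related")

    page = MIMEText("<html><body><img src='../images/a.jpg'></body></html>", "html")
    page["Content-Location"] = "http://example.com/page/"
    archive.attach(page)

    image = MIMEBase("image", "jpeg")
    image.set_payload(b"\xff\xd8\xff\xd9")  # placeholder bytes, not a real picture
    encoders.encode_base64(image)           # the image is stored encoded in the file
    image["Content-Location"] = "http://example.com/images/a.jpg"
    archive.attach(image)

    return archive.as_bytes()


def list_parts(archive_bytes):
    """Return (content type, original URL) for every part stored in the archive."""
    message = message_from_bytes(archive_bytes)
    return [(part.get_content_type(), part.get("Content-Location", ""))
            for part in message.walk()
            if not part.is_multipart()]


if __name__ == "__main__":
    for content_type, location in list_parts(build_sample_archive()):
        print(content_type, location)
```

To look inside a real archive, read the .mht file in binary mode and pass its bytes to `list_parts`.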

In all these example cases, the "result" of the JavaScript (the current state of the page after processing the JavaScript), which is the second image link, is not contained in the saved file, only the JavaScript is saved.

Keep in mind that in these examples, the "result" of the JavaScript is very "simplistic", an almost "trivial" use of JavaScript. In "real" webpages, the JavaScript can be very complex, and can generate a great many pages (limited only by the amount of available memory).

Now, how to save the page with the "result" from the JavaScript. We'll do this using Microsoft Developer Tools (the download link is shown earlier in this answer).

After installing the Developer Tools, and while viewing the page online, press the F12 key or click:
Tools-->F12 Developer Tools

Then on the window that opens, click:
View-->Source-->DOM (page).

A new window opens. Click File-->Save, and then save the file.

If you view the source of this saved file, you'll see this HTML code:

<html>
<head>
<title>Waterfall and Lighthouse</title>
</head>
<body>
<img src="../images/imagesCAIPHDL5.jpg" /><br />
<br /><hr width="284" align="left" /><br />
<script type="text/javascript">
 document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');
</script>
<img src="../images/imagesCAG7M85E.jpg" /><br />
</body>
</html>


Notice in the source of this saved file, you'll see that the JavaScript is saved and the "result" of the JavaScript is also saved:

...
<script type="text/javascript">
 document.write("\n"+'<img src="../images/imagesCAG7M85E.jpg" /><br />');
</script>
<img src="../images/imagesCAG7M85E.jpg" /><br />
...


I think this is what you wanted. But, there are two problems.

First, as before, if you view this saved file in the browser, you'll get a blank page with no pictures. This is because the image links in the saved file are "relative" links... the browser cannot tell the domain or path of where to find the pictures. You can see what that looks like here:
DevTools-DOM

Again, if you edit this saved file and add the line:

<base href="http://viewthis.info/superuser577187/page/">


and then view the edited file in the browser, you'll get a page with both pictures displayed. You can see what that looks like here:
DevTools-DOM-with-base-href

Here you'll notice the second problem. The first image (the waterfall) is shown correctly (once), but the second image (the Lighthouse) is shown twice. This happens because when the saved page is loaded, the JavaScript executes again generating an image link for the second image, and, the image link for the second image is also saved in the file.

To fix this you need to edit the saved file again, and remove the JavaScript (remove the <script...> and </script> tags and everything in between them). Now, the source of the edited file looks like this:

<html>
<head>
<title>Waterfall and Lighthouse</title>
<base href="http://viewthis.info/superuser577187/page/">
</head>
<body>
<img src="../images/imagesCAIPHDL5.jpg" /><br />
<br /><hr width="284" align="left" /><br />
<img src="../images/imagesCAG7M85E.jpg" /><br />
</body>
</html>


Now, the saved file contains the "result" of the JavaScript as you wanted, and if you view the edited file in the browser, you'll get a page with only one of each of the two pictures, displayed correctly. You can see what that looks like here:
DevTools-DOM-Final

Now, this may seem all very complicated, but it's really not...

After downloading and installing the Developer Tools, it's just 4 fairly simple steps... While viewing (in the browser) the page you want to save:

  1. Press the F12 key or click: Tools-->F12 Developer Tools
  2. On the window that opens, click: View-->Source-->DOM (page).
  3. On the new window, click File-->Save, and then save the file.
  4. Edit the file you saved: add the "base href" line, and remove the <script...> ... </script> block(s).

Kevin Fegan

Posted 2013-04-03T00:08:14.557

Reputation: 4 077

Browsers automatically translate links when saving as HTML, so that's not a problem. OP is asking how to save the current state of page after processing JS. – gronostaj – 2013-04-03T05:32:03.793

@gronostaj - I have verified that for IE9 and (all or most) previous versions of IE, relative links are not "translated" as you have described. I have added a lot to my answer, including info about saving the current state of the page after processing the JavaScript. – Kevin Fegan – 2013-04-05T11:46:15.543

+1, longest and most complete answer I have ever seen! =) – Coops – 2013-05-02T11:27:06.387

2

When using Firefox, you can press CTRL+A to select all, right-click the page and choose View Selection Source. You will see the full HTML as it is displayed, including the elements inserted at runtime. From the source-code window you can save this HTML to a file.

There is also Firebug, a powerful tool to debug websites that allows you to inspect generated HTML code to achieve a similar result.

Havenard

Posted 2013-04-03T00:08:14.557

Reputation: 788

While I can't get it to fully look like the live website does, these options definitely helped; thanks! – G. Bach – 2013-04-03T02:17:40.023

2

I found that the Firefox add-on Mozilla Archive Format (http://maf.mozdev.org/) has a Faithful Save option, which produces "effective CSS" and strips <script> elements (it can export to MHTML, MAFF and Complete Webpage, and convert between these). It did the job for a simple page with a few scripts that I needed to snapshot in HTML format.

user221535

Posted 2013-04-03T00:08:14.557

Reputation: 21