Is there a way to dump a webpage source without having to interact directly with browser?


Is there a way to dump the source of a webpage automatically, without interacting with the browser itself, i.e. without right-clicking the page and selecting "View source"? For example, I have Internet Explorer open and displaying certain content, and I want to get the HTML source of that page into a file that I can read. Maybe there is an option to keep the current page's source somewhere on disk?

A few points:

  1. The webpage I am running is a local web service that is open in kiosk mode (no menu or address field).
  2. Since this service requires authentication, I can't just open and access the page directly. There is a certain procedure that I undertake to get to this page.
  3. I want to get the source of the page as it currently is, because I constantly make changes to it (filling in values, choosing combobox values, etc.), and I want those changes to be reflected in the source. If I open the same page in a new window, it won't contain my edited fields.

Solutions like Selenium won’t help me because I don’t want to run the browser through it in the first place.

Eugene S

Posted 2015-03-06T05:42:56.063

Reputation: 2 088

Answers


I realize you are on Windows, but in the Linux/Mac OS X world one could use curl or wget if you know the target URL of a page and want to save it to a file. There is a Windows version of curl available here, as well as other builds on the official curl site, so this may work for you if you are comfortable with the command line.

For example, using curl you could save the contents of the main Google index page like this from a command line:

curl -L google.com > google_index.html

The curl command itself is straightforward; the -L option tells curl to follow any redirects it might encounter when accessing the URL. google.com is the target URL, and the > redirects the output of curl -L google.com to the file named google_index.html.

After running that command, the contents of google_index.html will be exactly what you would see if you viewed the source from a web browser.

But keep this in mind: All a curl command like that would do is fetch the raw contents returned by the URL. It would not give you any of the graphics, CSS, JavaScript or any other ancillary content that would be connected to that HTML.

For more complex and sophisticated fetching of full site content, wget is the way to go. There appears to be a Windows version of wget hosted over here, but I'm unsure how out of date it is compared to the core GNU version of wget, so try it at your own risk.
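As a sketch of what that could look like (the URL is just a placeholder, and the option names assume a reasonably current GNU wget build):

wget --page-requisites --convert-links --adjust-extension http://example.com/page.html

Here --page-requisites downloads the images, CSS and JavaScript the page needs, --convert-links rewrites the links so the saved copy works locally, and --adjust-extension saves files with a suitable extension such as .html.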

JakeGould

Posted 2015-03-06T05:42:56.063

Reputation: 38 217

Thank you for your answer. Command line is perfect; however, I'm not sure I will be able to implement your solution. 1. The webpage I am running is a local web service that is open in kiosk mode (no menu or address field), so there is no clear URL I can copy. 2. Since this service requires authentication, I can't just open and access the page directly, even if I had the URL. 3. I want to get the source of the current actual page as I make changes to it, and I want these changes to be reflected in the source; if I run the same page in a new window, I won't have my edited fields – Eugene S – 2015-03-06T06:01:07.947

@EugeneS Well, as far as point 2 goes, curl allows for authentication from the command line, so that shouldn’t be an impediment. But it does seem like you have other idiosyncrasies that might stand in the way of simply accessing the content. I would recommend you add those details to your question so there is no confusion as to what you are attempting to do and what tools you might need. Good luck! – JakeGould – 2015-03-06T06:04:37.107
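As a sketch of what command-line authentication with curl could look like (the URL, credentials and form field names here are purely hypothetical, and which variant applies depends on whether the service uses HTTP authentication or a login form):

curl -L -u username:password http://localhost/service/page > page.html

The -u option sends HTTP (basic) credentials with the request. If the service uses a login form instead, curl can POST the form once and carry the session cookie forward, e.g. curl -c cookies.txt --data "user=me&pass=secret" http://localhost/login followed by curl -b cookies.txt http://localhost/service/page > page.html.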


I assume you are trying to break into a kiosk, LOL?

Joking aside, you need Fiddler installed on the client machine. If the site uses HTTPS it's even harder, because you have to trust Fiddler's certificates; otherwise you may run into untrusted-certificate warnings while using it. Fiddler intercepts connections and listens to all HTTP traffic, decoding it and streaming it back to the browser. It is essentially a proxy, used for web development and debugging.

This question should not be on Super User; it's web-development related.

afifio

Posted 2015-03-06T05:42:56.063

Reputation: 21


As of PowerShell 3.0, you can use Invoke-WebRequest.

Invoke-WebRequest

Gets content from a web page on the Internet.

Detailed Description

The Invoke-WebRequest cmdlet sends HTTP, HTTPS, FTP, and FILE requests to a web page or web service. It parses the response and returns collections of forms, links, images, and other significant HTML elements.

This cmdlet was introduced in Windows PowerShell 3.0.

The PowerShell alias for Invoke-WebRequest is actually wget.
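For example, a minimal sketch (the URL and output path are placeholders; a real call against an authenticated service would also need credentials or an established session):

Invoke-WebRequest -Uri "http://localhost/service/page" -OutFile "C:\temp\page.html"

This sends a GET request and writes the raw response body to the file, much like the curl example above. Without -OutFile, the cmdlet instead returns an object whose Content property holds the HTML and whose Links, Forms and ParsedHtml properties expose the parsed elements.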

Lieven Keersmaekers

Posted 2015-03-06T05:42:56.063

Reputation: 1 088

Hi, and thanks for your valuable input. However, this solution brings me to the same point I discussed above: I would have to issue the request myself in order to retrieve the source. I, however, have to go through certain steps before I reach the page whose source I want to view. Thanks – Eugene S – 2015-03-10T01:35:01.297

Your point only became clear after we answered the original question; it was not at all clear before. I have been playing around with dumping the process and searching its memory for the entire page, but I cannot reliably automate it (the page is in memory, sure enough). Perhaps it would be better if you explained in your question what your actual goal is. Currently, this is sounding a bit like an XY Problem

– Lieven Keersmaekers – 2015-03-10T20:11:25.653

Hi, you are right. My initial question probably should have been more detailed; I thought the points I added later would do the job, and I apologize if they didn't. The actual goal is to extract the source of the currently open webpage, regardless of what it took to get there (logging in, filling data, clicking buttons). I have a test automation framework that interacts with the visual content only, which makes it a problem to find data on the page. If I had a way of dumping the source of the current webpage, I could parse that source to find the desired information. – Eugene S – 2015-03-11T02:00:32.027

Unfortunately, the page source doesn't seem to be in one contiguous memory block. I think your best option is to write a little application and use ReadProcessMemory to extract the source.

– Lieven Keersmaekers – 2015-03-11T06:33:26.677
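For anyone who wants to experiment with the ReadProcessMemory idea from the comment above, here is a rough PowerShell sketch. It assumes the page is displayed by an iexplore.exe process (a guess based on the question); ReadProcessMemory is the real Win32 API, but reading from the main module's base address is purely illustrative, since a real tool would have to enumerate the process's memory regions (e.g. with VirtualQueryEx) and search each readable block for the HTML:

Add-Type @"
using System;
using System.Runtime.InteropServices;
public static class Native {
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool ReadProcessMemory(
        IntPtr hProcess, IntPtr lpBaseAddress, byte[] lpBuffer,
        int dwSize, out IntPtr lpNumberOfBytesRead);
}
"@

# Assumption: the kiosk page is hosted by an iexplore.exe process.
$proc = Get-Process iexplore | Select-Object -First 1

$buffer = New-Object byte[] 65536
$read = [IntPtr]::Zero

# Illustration only: read 64 KB starting at the main module's base address.
# A real tool would walk the process's readable memory regions and search
# each block for the page's HTML.
[Native]::ReadProcessMemory($proc.Handle, $proc.MainModule.BaseAddress, $buffer, $buffer.Length, [ref]$read) | Out-Null

"Read $read bytes from process $($proc.Id)"

Note that reading another process's memory like this generally requires running the script with sufficient privileges, and with the same bitness as the target process.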