Why is 'wget --page-requisites' extremely slow when compared to a real browser request?


My goal is to download a single web page so that it is fully functional offline, in roughly the same time it takes a browser to request and render the page. That doesn't seem to be happening with the command I am using.

The following command downloads a page and makes it fully functional offline, but it takes approximately 35 seconds, whereas a hard-refreshed browser requests and renders the same page in about 5 seconds.

Can someone please help me understand why my wget command is taking so much longer and how I can make it faster?

wget --page-requisites --span-hosts --convert-links --adjust-extension --execute robots=off --user-agent Mozilla --random-wait https://www.invisionapp.com/inside-design/essential-steps-designing-empathy/
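
For what it's worth, the ~35 seconds was measured simply by timing the run; a minimal sketch, assuming a POSIX shell with the time built-in and the same flags as above:

# Time the full download to compare against the browser's hard refresh.
time wget --page-requisites --span-hosts --convert-links --adjust-extension \
     --execute robots=off --user-agent Mozilla --random-wait \
     https://www.invisionapp.com/inside-design/essential-steps-designing-empathy/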

More info and attempted solutions

  • I removed --random-wait because I thought it might be adding time for each file request, but this did nothing.
  • I thought the https protocol might slow it down with extra calls back and forth for each file so I added --no-check-certificate, but this did nothing.
  • I read there could be an issue with IPv6 so I added --inet4-only, but this did nothing.
  • I read the DNS could slow things down so I added --no-dns-cache, but this did nothing.
  • I thought perhaps wget was downloading the assets sequentially, one at a time, so I tried running between 3 and 16 concurrent processes: I removed --convert-links and added --no-clobber, hoping multiple files would be downloaded at the same time, and planned to run the command once more afterwards without --no-clobber and --page-requisites but with --convert-links to make the page fully functional offline (see the sketch after this list). I also thought multiple processes would hide the latency of the HTTPS handshakes by overlapping them, but I didn't observe any speedup.
  • I read an article about running the command as root user in case there were any limits on a given user, but this did nothing.
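
For reference, the concurrent two-pass attempt described above looked roughly like the sketch below (the process count of 4 is illustrative; anywhere from 3 to 16 was tried):

# Pass 1: several wget processes fetch the page and its requisites in
# parallel; --no-clobber stops one process from re-downloading a file
# that another process has already saved.
for i in 1 2 3 4; do
  wget --page-requisites --span-hosts --adjust-extension \
       --execute robots=off --user-agent Mozilla --no-clobber \
       https://www.invisionapp.com/inside-design/essential-steps-designing-empathy/ &
done
wait

# Pass 2: re-run without --no-clobber and --page-requisites, adding
# --convert-links so the downloaded copy works offline.
wget --span-hosts --convert-links --adjust-extension \
     --execute robots=off --user-agent Mozilla \
     https://www.invisionapp.com/inside-design/essential-steps-designing-empathy/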

– JustCodin – 2019-06-16T00:05:43.680

It could be that the website is detecting a non-human browser (i.e. a bot that uses curl or wget) accessing its content and is purposefully slowing down requests. Perhaps trying to use -e robots=off as part of your string will work. That said, checking that site's robots.txt shows nothing indicating that robots are asked to slow down.

– JakeGould – 2019-06-16T00:17:16.227

Or maybe adding a user-agent string that makes the server believe you are a real web browser could help? Something like --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"? All browsers have different user-agent strings, so maybe pick one based on the browser you are testing with; a user-agent lookup site will show you yours.

– JakeGould – 2019-06-16T00:18:16.090

@JakeGould I posted the wget command I'm using in the question; I'm using both of those flags already. – JustCodin – 2019-06-16T05:23:51.090

Look at your --user-agent= string compared to mine. I don’t believe --user-agent Mozilla (which is in your string) will work. – JakeGould – 2019-06-16T15:01:57.163
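
For illustration, the distinction being drawn in the comments is between a bare token and a full, quoted browser string as the user agent; a sketch of both forms (only the --user-agent value differs, the other flags from the question would stay the same):

# Bare token, as in the question's command:
wget --user-agent Mozilla --page-requisites \
     https://www.invisionapp.com/inside-design/essential-steps-designing-empathy/

# Full browser string, as suggested in the comments:
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" \
     --page-requisites \
     https://www.invisionapp.com/inside-design/essential-steps-designing-empathy/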

No answers