Using wget to download PDF files from a site that requires cookies to be set

8

7

I want to access a newspaper site and then download their epaper copies (in PDF). The site requires me to log in using my email address and password and then it permits me to access those PDF URLs.

I'm having trouble 'setting my session' in Wget. When I log in to the site from my browser, it sets two cookie values:

UserID=abc@gmail.com
Password=12345

I tried:

wget --post-data "UserID=abc@gmail.com&Password=12345" http://epaper.abc.com/login.aspx

However, that just downloaded the login page and saved it locally.

The FORM on the login page has two fields:

txtUserID
txtPassword

And radiobuttons like this:

<input id="rbtnManchester" type="radio" checked="checked" name="txtpub" value="44">

Another button:

<input id="rbtnLondon" type="radio" name="txtpub" value="64">

If I post these fields to the login.aspx page, I get the same output:

wget --post-data "txtUserID=abc@gmail.com&txtPassword=12345&txtpub=44" http://epaper.abc.com/login.aspx

If I add:

--save-cookies abc_cookies.txt

the cookie file doesn't seem to contain anything other than the default content.

If I also add --debug to that last command, it prints:

...
Set-Cookie: ASP.NET_SessionId=05kphcn4hjmblq45qgnjoe41; path=/; HttpOnly
...
Stored cookie epaper.abc.com -1 (ANY) / <session> <insecure> [expiry none] ASP.NET_SessionId 05kphcn4hjmblq45qgnjoe41
Length: 107253 (105K) [text/html]
Saving to: `login.aspx'
...
Saving cookies to abc_cookies.txt.

However, abc_cookies.txt shows ONLY the following:

# HTTP cookie file.
# Generated by Wget on 2011-08-16 08:03:05.
# Edit at your own risk.

(I am not sure why I'm not getting any responses to this question on Stack Overflow; perhaps Super User is a better site for it.)


EDIT 1

C:\Temp>wget --cookies=on --keep-session-cookies --save-cookies abc_cookies.txt --post-data "txtUserID=abc%40gmail.com&txtPassword=password&txtpub=44&chkbox=checkbox&submit.x=48&submit.y=7" http://epaper.abc.com/login.aspx --debug
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
DEBUG output created by Wget 1.11.4 on Windows-MinGW.

--2011-08-18 08:15:59--  http://epaper.abc.com/login.aspx
Resolving epaper.abc.com... seconds 0.00, 999.999.99.99
Caching epaper.abc.com => 999.999.99.99
Connecting to epaper.abc.com|999.999.99.99|:80... seconds 0.00, connected.
Created socket 300.
Releasing 0x00a2ae80 (new refcount 1).

---request begin---
POST /login.aspx HTTP/1.0
User-Agent: Wget/1.11.4
Accept: */*
Host: epaper.abc.com
Connection: Keep-Alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 100

---request end---
[POST data: txtUserID=abc%40gmail.com&txtPassword=password&txtpub=44&chkbox=checkbox&submit.x=48&submit.y=7]
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Connection: keep-alive
Date: Thu, 18 Aug 2011 02:46:17 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=owcrje55yl45kgmhn43gq145; path=/; HttpOnly
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 107253

---response end---
200 OK
Registered socket 300 for persistent reuse.

Stored cookie epaper.abc.com -1 (ANY) / <session> <insecure> [expiry none] ASP.NET_SessionId owcrje55yl45kgmhn43gq145
Length: 107253 (105K) [text/html]
Saving to: `login.aspx.1'

100%[======================================================================================================================>] 107,253     24.9K/s   in 4.2s

2011-08-18 08:16:05 (24.9 KB/s) - `login.aspx.1' saved [107253/107253]

Saving cookies to abc_cookies.txt.
Done saving cookies.

C:\Temp>wget --referer=http://epaper.abc.com/login.aspx --cookies=on --load-cookies abc_cookies.txt --keep-session-cookies --save-cookies abc_cookies.txt http://epaper.abc.com/PagePrint/16_08_2011_001.pdf --debug
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
DEBUG output created by Wget 1.11.4 on Windows-MinGW.


Stored cookie epaper.abc.com -1 (ANY) / <session> <insecure> [expiry none] ASP.NET_SessionId owcrje55yl45kgmhn43gq145
--2011-08-18 08:16:12--  http://epaper.abc.com/PagePrint/16_08_2011_001.pdf
Resolving epaper.abc.com... seconds 0.00, 999.999.99.99
Caching epaper.abc.com => 999.999.99.99
Connecting to epaper.abc.com|999.999.99.99|:80... seconds 0.00, connected.
Created socket 300.
Releasing 0x00598290 (new refcount 1).

---request begin---
GET /PagePrint/16_08_2011_001.pdf HTTP/1.0
Referer: http://epaper.abc.com/login.aspx
User-Agent: Wget/1.11.4
Accept: */*
Host: epaper.abc.com
Connection: Keep-Alive
Cookie: ASP.NET_SessionId=owcrje55yl45kgmhn43gq145

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Connection: keep-alive
Date: Thu, 18 Aug 2011 02:46:30 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
content-disposition: attachement; filename=Default_logo.gif
Cache-Control: private
Content-Type: image/GIF
Content-Length: 4568

---response end---
200 OK
Registered socket 300 for persistent reuse.
Length: 4568 (4.5K) [image/GIF]
Saving to: `16_08_2011_001.pdf'

100%[======================================================================================================================>] 4,568       7.74K/s   in 0.6s

2011-08-18 08:16:14 (7.74 KB/s) - `16_08_2011_001.pdf' saved [4568/4568]

Saving cookies to abc_cookies.txt.
Done saving cookies.

Contents of abc_cookies.txt

epaper.abc.com       FALSE   /       FALSE   0       ASP.NET_SessionId       owcrje55yl45kgmhn43gq145
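
A note on the second response above: it reports Content-Type: image/GIF, a content-disposition filename of Default_logo.gif, and a length of 4568 bytes. The file saved as 16_08_2011_001.pdf is therefore most likely a placeholder image rather than the requested PDF, which suggests the server did not treat the session as logged in. Below is a minimal sketch for checking a download without opening it. It assumes a POSIX shell with head is available (for example Cygwin or MSYS on Windows), and the function name is_pdf is made up for illustration.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the file named by $1 starts with the PDF
# signature "%PDF"; fails otherwise, including when the file is missing.
is_pdf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}

# Usage (not run here):
# is_pdf 16_08_2011_001.pdf || echo "not a PDF, the login probably failed"
```

A GIF file starts with "GIF8" instead, so the check would fail for the file downloaded above.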

siliconpi

Posted 2011-08-16T16:15:37.493

Reputation: 2 067

I suspect you're getting no response because there are few experts on wget's more advanced usage. :( – jcrawfordor – 2011-08-16T18:54:22.647

@Frank - try using --keep-session-cookies in the initial login wget, see my answer below. – EightBitTony – 2011-08-16T19:14:41.210

Is login.aspx the URL of the login page, or the URL that the login page submits to? – Edward Shtern – 2013-01-23T00:33:24.420

Answers

4

I think you need to use --keep-session-cookies to preserve session cookies, rather than just --save-cookies (you need both).

Basically, you run

wget --keep-session-cookies --save-cookies ..... url

to log in and get your session cookie, then

wget --load-cookies ...... url

to download the PDF.
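
The two steps could be spelled out roughly as follows. This is only a sketch: the URL, form field names, and cookie file name are copied from the question, login_result.html is a made-up output name, and whether the site accepts this POST data as-is is an assumption.

```shell
#!/bin/sh
# Values taken from the question; adjust them for the real site.
LOGIN_URL="http://epaper.abc.com/login.aspx"
COOKIE_JAR="abc_cookies.txt"

# Step 1: POST the login form ($1 is the URL-encoded form data) and write
# the cookies, including session cookies, to the cookie file.
login() {
    wget --keep-session-cookies --save-cookies "$COOKIE_JAR" \
         --post-data "$1" -O login_result.html "$LOGIN_URL"
}

# Step 2: request a protected URL ($1), sending the saved cookies back.
fetch() {
    wget --load-cookies "$COOKIE_JAR" "$1"
}

# Usage (not run here):
# login 'txtUserID=abc%40gmail.com&txtPassword=12345&txtpub=44'
# fetch 'http://epaper.abc.com/PagePrint/16_08_2011_001.pdf'
```

Saving the login response to a known file makes it easy to inspect whether the server returned the login form again or a logged-in page.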

EightBitTony

Posted 2011-08-16T16:15:37.493

Reputation: 3 741

:( didn't work... no dice... – siliconpi – 2011-08-17T16:58:53.700

@Frank - So what happened at each stage? Did you get the cookies on disk as expected? If you include headers, what responses are you getting? Can you update the question with what you've now tried and what was returned? – EightBitTony – 2011-08-17T18:41:33.627

Hi Tony - thanks for attempting to help out - I'm puzzled with this whole thing! – siliconpi – 2011-08-18T02:59:12.910

Hi Tony - did you get a chance to look at the detailed Edit1? – siliconpi – 2011-08-19T12:05:40.400

Yes, nothing leaps out. My only query is what's in login.aspx when you get it back from the first wget? Does it indicate you successfully logged in? – EightBitTony – 2011-08-19T12:21:25.243

2

Maybe this will help. The site I was trying to log in to had some hidden fields whose values I needed before I could log in successfully. So the first wget fetches the login page to find the extra fields, the second wget logs in to the site and saves the cookies, and the third one then uses those cookies to get the page you're after.

#!/bin/bash

# Fetch the login page to read the hidden field data.
wget -a log.txt -O loginpage.html http://foobar/default.aspx
hiddendata=$(grep value loginpage.html | grep foobarhidden | tr '=' ' ' | awk '{print $9}' | sed 's/"//g')
rm loginpage.html

# Log in to the page and save the cookies (quoted so '&' is not seen by the shell).
postData="user=fakeuser&pw=password&foobarhidden=${hiddendata}"
wget -a log.txt -O /dev/null --post-data "${postData}" --keep-session-cookies --save-cookies cookies.txt http://foobar/default.aspx

# Fetch the page you're after, sending the saved cookies (URL quoted because of '?').
wget -a log.txt -O results.html --load-cookies cookies.txt "http://foobar/lister.aspx?id=42"
rm cookies.txt
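
For an ASP.NET WebForms page such as the one in the question, the hidden fields are typically named __VIEWSTATE and sometimes __EVENTVALIDATION, and their base64 values contain characters that must be percent-encoded in POST data. The sketch below shows one way to pull out such a value and encode it. It assumes the whole input tag sits on one line with name= before value=, and the function names hidden_value and encode_b64 are made up for illustration.

```shell
#!/bin/sh
# Print the value="..." of the hidden <input> whose name is $1, read from
# the HTML file $2. Assumes the tag is on one line and name= precedes value=.
hidden_value() {
    sed -n "s/.*name=\"$1\"[^>]*value=\"\([^\"]*\)\".*/\1/p" "$2" | head -n 1
}

# Percent-encode the three base64 characters that are special in
# URL-encoded form data: '+', '/' and '='.
encode_b64() {
    printf '%s' "$1" | sed -e 's/+/%2B/g' -e 's,/,%2F,g' -e 's/=/%3D/g'
}

# Usage (not run here):
# vs=$(encode_b64 "$(hidden_value __VIEWSTATE loginpage.html)")
# postData="__VIEWSTATE=${vs}&txtUserID=abc%40gmail.com&txtPassword=12345&txtpub=44"
```

Without the encoding step, a '+' in the value would be decoded by the server as a space, which would corrupt the field.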

There's some useful information on this other SO post:

tpow

Posted 2011-08-16T16:15:37.493

Reputation: 161

Please try to de-personalize your answer (remove the "I"). Other than that, for one of your first answers, you're doing great. – wizlog – 2012-01-19T02:31:58.673