I'm trying to automate the extraction of some information from a website that first requires me to log in. I did this years ago using wget, but that method no longer seems to work, and I don't know why.
I used to do it like this:
MY_USERNAME=username # must be URL-encoded, e.g. with http://lajm.eu/emil/dump/stringfunctions.php
MY_PASSWORD=password # must also be URL-encoded
LOGIN_DATA="action=login&login_nick=$MY_USERNAME&login_pwd=$MY_PASSWORD"
wget --quiet --save-cookies cookiejar --keep-session-cookies --post-data "$LOGIN_DATA" --user-agent 'Firefox' -O um.htm http://ungdomar.se/index.php
Now when I try to run this, I just get sent back to the main page. (I'm not simply feeding it the wrong password; if I were, I would get different markup back.)
I've also tried doing it in Python using mechanize (this would be preferable to wget), but I seem to get the same result. It just boggles my mind why this won't work. This is the part of the website that's dealing with the form. To see the full markup, simply go to ungdomar.se.
<div id="loginLoginbox" style="display:none;">
<form name="login" method="post" action="/">
<table width="250" cellspacing="0" cellpadding="0" border="0">
<tr>
<td colspan="2">
<span class="page_login_text">Användarnamn</span><br />
<input name="login_nick" type="text" style="width:250px;height:16px;line-height:10px;font-size:9px;" maxLength="30">
</td>
</tr>
<tr>
<td colspan="2">
<span class="page_login_text">Lösenord</span><br />
<input name="login_pwd" type="password" style="width:250px;height:16px;line-height:10px;font-size:9px;" maxLength="25"><br />
<img src="/gfx/1x1.gif" width="1" height="5" alt="" />
</td>
</tr>
<tr>
<td width="42%" valign="top">
<span style="vertical-align:super;" class="page_login_text">
<label for="login_auto">Kom ihåg mig</label>
</span>
<input name="login_auto" id="login_auto" type="checkbox" value="1" style="width:12px; height:12px;">
</td>
<td width="58%" align="right" valign="top">
<a class="page_login_text" href="/sendpwd.php">Glömt lösen?</a>
<button class="button_active" type="submit">Logga in</button>
</td>
</tr>
</table>
</form>
</div>
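One way to double-check which fields the form actually submits is to parse the markup instead of reading it by eye. The sketch below uses only Python's standard library. The names `FormFieldLister` and `list_form_fields`, and the trimmed `SAMPLE` markup, are illustrative assumptions and not anything the site provides.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect the action, method and named <input> fields of each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> encountered
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases attribute names
        if tag == "form":
            self._current = {
                "action": attrs.get("action", ""),
                "method": attrs.get("method", "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and "name" in attrs:
            # Record (name, type, default value); value is None when absent.
            self._current["fields"].append(
                (attrs["name"], attrs.get("type", "text"), attrs.get("value"))
            )

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_form_fields(html):
    """Return a list of dicts describing every form found in `html`."""
    parser = FormFieldLister()
    parser.feed(html)
    parser.close()
    return parser.forms


# Trimmed copy of the login form shown above (layout tags removed).
SAMPLE = """
<form name="login" method="post" action="/">
<input name="login_nick" type="text" maxLength="30">
<input name="login_pwd" type="password" maxLength="25"><br />
<input name="login_auto" id="login_auto" type="checkbox" value="1">
<button class="button_active" type="submit">Logga in</button>
</form>
"""

if __name__ == "__main__":
    for form in list_form_fields(SAMPLE):
        print(form["method"], form["action"], form["fields"])
```

For the portion of the markup shown, this reports a POST to `/` with the fields `login_nick`, `login_pwd` and `login_auto`, and no hidden inputs. The wget command, by contrast, posts to `/index.php` and sends an extra `action=login` field. Whether that difference matters is not something this sketch can tell; running it against the full page would at least show any hidden fields outside the excerpt.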
If someone could tell me why this won't work, I would be eternally grateful.
EDIT: I just set up my own little web form (structured exactly like the one on the site), and it worked just fine. Now what the heck could they be doing that makes it so that I can't log in using either wget or mechanize?
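For comparison, here is a minimal sketch of the same login attempt using only Python's standard library instead of mechanize. It is not a known fix, just the wget steps restated under a few assumptions: the POST goes to the form's `action="/"` resolved against `http://ungdomar.se`, only the two fields from the form are sent, and the front page is fetched first in case the site sets a session cookie before login. The function names (`build_login_body`, `make_opener`, `login`) are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumption: the form's action="/" resolved against the site root.
BASE_URL = "http://ungdomar.se/"


def build_login_body(username, password):
    """URL-encode the two form fields into a POST body.

    urlencode() percent-encodes non-ASCII characters as UTF-8 by default;
    the site may expect a different charset, which is not verified here.
    """
    fields = {"login_nick": username, "login_pwd": password}
    return urllib.parse.urlencode(fields).encode("ascii")


def make_opener(user_agent):
    """Return an opener that keeps cookies across requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def login(username, password, user_agent="Mozilla/5.0"):
    """Attempt the login and return (response body, cookie jar)."""
    opener, jar = make_opener(user_agent)
    # Fetch the front page first so any session cookie exists before the POST.
    opener.open(BASE_URL).read()
    # Passing data= makes urllib send a POST request.
    response = opener.open(BASE_URL, data=build_login_body(username, password))
    return response.read(), jar
```

Because `urlencode` does the escaping, the username and password can be passed in unencoded. Inspecting the returned cookie jar after the call shows which cookies the site actually set, which is one way to compare this request against what a browser sends.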
This question may be better suited for stackoverflow.com. – Tim S. Van Haren – 2010-11-18T14:46:42.233
Tim S. Van Haren: Really? I was going to post it there, but I was sure that they were going to refer me here. – Tommy Brunn – 2010-11-18T14:50:04.757
Have you tried setting the user-agent string to something the website expects? Sometimes web logins drop connections from specific UAs because they know their site is getting ripped (read automatically rather than by a human). – RobotHumans – 2010-11-18T15:02:31.420
I tried setting the user agent string to the same as my browser. No luck. – Tommy Brunn – 2010-11-18T15:16:45.190