Logging into webpage via script


I'm trying to automate the extraction of some information from a website that first requires me to log in. I have done this in the past (years ago) using wget, but that method no longer seems to work - and I don't know why.

I used to do it like this:

MY_USERNAME=username # must be URL-encoded, e.g. with http://lajm.eu/emil/dump/stringfunctions.php
MY_PASSWORD=password # must also be URL-encoded

LOGIN_DATA="action=login&login_nick=$MY_USERNAME&login_pwd=$MY_PASSWORD"

# Quote the variable so the shell passes the POST body as a single argument.
wget --quiet --save-cookies cookiejar --keep-session-cookies --post-data "$LOGIN_DATA" --user-agent 'Firefox' -O um.htm http://ungdomar.se/index.php

Now when I run this, I just get sent back to the main page. That rules out a wrong password, because a failed login returns different markup.

I've also tried doing it in Python using mechanize (which I would prefer over wget), but I get the same result. It boggles my mind why this won't work. Below is the part of the page that contains the login form. To see the full markup, simply go to ungdomar.se.

<div id="loginLoginbox" style="display:none;">
    <form name="login" method="post" action="/"> 
        <table width="250" cellspacing="0" cellpadding="0" border="0"> 
            <tr>
                <td colspan="2">
                    <span class="page_login_text">Användarnamn</span><br /> 
                    <input name="login_nick" type="text" style="width:250px;height:16px;line-height:10px;font-size:9px;" maxLength="30">
                </td>
            </tr> 
            <tr>
                <td colspan="2">
                    <span class="page_login_text">Lösenord</span><br /> 
                    <input name="login_pwd" type="password" style="width:250px;height:16px;line-height:10px;font-size:9px;" maxLength="25"><br />
                    <img src="/gfx/1x1.gif" width="1" height="5" alt="" />
                </td>
            </tr> 
            <tr>
                <td width="42%" valign="top">
                    <span style="vertical-align:super;" class="page_login_text">
                        <label for="login_auto">Kom ihåg mig</label>
                    </span>
                    &nbsp;
                    <input name="login_auto" id="login_auto" type="checkbox" value="1" style="width:12px; height:12px;">
                </td> 
                <td width="58%" align="right" valign="top">
                    <a class="page_login_text" href="/sendpwd.php">Glömt lösen?</a> 
                    <button class="button_active" type="submit">Logga in</button>
                </td>
            </tr> 
        </table>
    </form>
</div>

If someone could tell me why this won't work, I would be eternally grateful.

EDIT: I just set up my own little web form (structured exactly like the one on the site), and it worked just fine. Now what the heck could they be doing that makes it so that I can't log in using either wget or mechanize?

Tommy Brunn

Posted 2010-11-18T14:30:06.840


This question may be better suited for stackoverflow.com. – Tim S. Van Haren – 2010-11-18T14:46:42.233

Tim S. Van Haren: Really? I was going to post it there, but I was sure that they were going to refer me here. – Tommy Brunn – 2010-11-18T14:50:04.757

Have you tried setting the user-agent string to something the website expects? Sometimes web logins drop connections from specific UAs because the operators know their site is being scraped (read automatically rather than by a human). – RobotHumans – 2010-11-18T15:02:31.420

I tried setting the user agent string to the same as my browser. No luck. – Tommy Brunn – 2010-11-18T15:16:45.190

Answers


  1. Download Wireshark.
  2. Record a real browser hitting the website.
  3. Set your filter to tcp.port == 80 and find the request you just made.
  4. Right click on a packet and choose Follow TCP Stream and save this text somewhere.

Now you've got the complete, working conversation from your web browser to the website you want to scrape.

Repeat the process for your script and find out where they differ, then make the appropriate changes to fix it. Once they're identical the site cannot see the difference between you and your script.
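Comparing the two conversations by eye works, but a small script can help when the requests are long. The following is a minimal sketch, assuming each saved stream has been trimmed down to the single login request. The names parse_request and diff_requests are made up for illustration, and the two sample requests are invented, not captured from the site.

```python
def parse_request(raw):
    """Split one raw HTTP request into request line, headers and body."""
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    lines = head.split("\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them.
        headers[name.strip().lower()] = value.strip()
    return request_line, headers, body.strip()


def diff_requests(browser_raw, script_raw):
    """Return human-readable differences between two raw requests."""
    b_line, b_headers, b_body = parse_request(browser_raw)
    s_line, s_headers, s_body = parse_request(script_raw)
    diffs = []
    if b_line != s_line:
        diffs.append(f"request line: {b_line!r} != {s_line!r}")
    for name in sorted(set(b_headers) | set(s_headers)):
        if b_headers.get(name) != s_headers.get(name):
            diffs.append(
                f"header {name}: {b_headers.get(name)!r} != {s_headers.get(name)!r}"
            )
    if b_body != s_body:
        diffs.append(f"body: {b_body!r} != {s_body!r}")
    return diffs


# Invented sample data, for illustration only.
SAMPLE_BROWSER = (
    "POST / HTTP/1.1\r\n"
    "Host: ungdomar.se\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "login_nick=%E5sa&login_pwd=secret"
)
SAMPLE_SCRIPT = (
    "POST /index.php HTTP/1.1\r\n"
    "Host: ungdomar.se\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "action=login&login_nick=%C3%A5sa&login_pwd=secret"
)

for difference in diff_requests(SAMPLE_BROWSER, SAMPLE_SCRIPT):
    print(difference)
```

Each reported line is a candidate for what the site treats differently. Cookies and form fields are usually the first things worth checking.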

If you need more flexibility, I suggest writing a simple Python script rather than using wget.
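As a rough idea of what such a script might look like, here is a sketch using only the Python 3 standard library. The URL and the field names (action, login_nick, login_pwd) are taken from the question. The helper names and the charset parameter are assumptions for illustration, and which charset the site actually expects is something only the captured browser request can tell you.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://ungdomar.se/index.php"  # URL used in the question


def build_login_request(username, password, charset="utf-8"):
    """Build the POST request; urlencode handles the percent-encoding."""
    fields = {"action": "login", "login_nick": username, "login_pwd": password}
    # The charset decides how non-ASCII characters are percent-encoded.
    body = urllib.parse.urlencode(fields, encoding=charset).encode("ascii")
    return urllib.request.Request(
        LOGIN_URL, data=body, headers={"User-Agent": "Firefox"}
    )


def make_opener():
    """Create an opener that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(username, password, charset="utf-8"):
    """Send the login request; reuse the returned opener for later pages."""
    opener, _jar = make_opener()
    with opener.open(build_login_request(username, password, charset)) as response:
        html = response.read()
    return opener, html
```

Calling login("username", "password") would perform the request. Any later page should be fetched through the same opener so that the session cookie is sent along.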

Gareth Davidson

Posted 2010-11-18T14:30:06.840


Turns out they had changed the encoding of the username and/or password somehow. Comparing the logs showed my username being encoded slightly differently, which is what caused the login to fail. – Tommy Brunn – 2010-11-18T16:45:21.927
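
The comment above does not say which encodings were involved. One common way for this to happen is a change of character set, since the same non-ASCII character percent-encodes differently in UTF-8 and Latin-1. The sketch below illustrates that case only, as an assumption. The username "åsa" and the helper name encodings_of are hypothetical.

```python
from urllib.parse import quote_plus


def encodings_of(value, charsets=("utf-8", "latin-1")):
    """Percent-encode the same string under several character sets."""
    return {charset: quote_plus(value, encoding=charset) for charset in charsets}


print(encodings_of("åsa"))
# {'utf-8': '%C3%A5sa', 'latin-1': '%E5sa'}
```

If the browser capture shows one form and the script sends the other, re-encoding the value with the matching character set is the first thing to try.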