Send HTTP request to website with password and username, then record results

I need to record certain readings (temperature and others) from a web-based monitoring service (LaCrosse Alerts), which requires logging in. I have an account and have started following the steps outlined here, but they only cover parsing a simple website using Cygwin, not one protected by a username and password. I searched for a way to do this but had no luck. How can I log in to the website and then parse one page using the setup above? Is Cygwin the best way to do it, or is there an easier way to log in and parse the page, such as a batch script? It also looks like I can use Wget to download the page, but I'm not sure how to parse it. That would look like:

# Now grab the page or pages we care about.
wget --load-cookies cookies.txt \
-p http://server.com/interesting/article.php

How would I run that as a scheduled task, and also parse some of the <div> tags in the page?

hichris123

Posted 2013-12-25T22:31:34.920

Reputation: 149

Does it use cookies, or do you need to log in every time? – Thomas Weller – 2013-12-25T23:03:21.167

@ThomasW. If I click a Remember Me button when logging in, yes, it does since it automatically has me logged in. – hichris123 – 2013-12-25T23:08:35.183

There's a good answer for this question here: http://stackoverflow.com/questions/1324421/how-to-get-past-the-login-page-with-wget

– sahmeepee – 2013-12-25T23:38:15.620

Answers

It really depends on how simple or complex the information on the web page is. If it's something that can be grepped out, then you could use the SO answer here (from the comment above). If it can't easily be grepped out, you could write a Python script to do it for you. You would need urllib2 and cookielib to log in and keep the session cookies, and then something like lxml or BeautifulSoup to parse the HTML. The SO answer here is an excellent guide on how you could log in. For ease, I'm going to copy and paste the code here:

import cookielib
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup #you can also use lxml, if you wanted.

# Store the cookies and create an opener that will hold them
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Add our headers
opener.addheaders = [('User-agent', 'RedditTesting')]

# Install our opener (note that this changes the global opener to the one
# we just made, but you can also just call opener.open() if you want)
urllib2.install_opener(opener)

# The action/ target from the form
authentication_url = 'https://ssl.reddit.com/post/login'

# Input parameters we are going to send
payload = {
  'op': 'login-main',
  'user': '<username>',
  'passwd': '<password>'
  }

# Use urllib to encode the payload
data = urllib.urlencode(payload)

# Build our Request object (supplying 'data' makes it a POST)
req = urllib2.Request(authentication_url, data)

# Make the request and read the response
resp = urllib2.urlopen(req)
contents = resp.read()

# Parse the page using BeautifulSoup. You'll have to look at the DOM
# structure to do this correctly, but there are resources all over the
# place that make this really easy.
soup = BeautifulSoup(contents)
myTag = soup.find("<sometag>")
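The listing above uses Python 2 module names (cookielib, urllib2). On Python 3 the same approach would probably look something like the sketch below, which uses only the standard library. The login URL, the user-agent string, and the form field names ('user', 'passwd') are placeholders, not the real LaCrosse Alerts values; you would need to read the actual form action and input names from the site's login page.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical value: replace with the action URL of the real login form.
LOGIN_URL = "https://example.com/login"


def make_opener():
    """Return an opener that keeps cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", "TemperatureLogger")]
    return opener, jar


def build_login_request(url, username, password):
    """Build the login request; supplying data makes it a POST."""
    # Assumed field names: check the <input name="..."> attributes of the form.
    payload = {"user": username, "passwd": password}
    data = urllib.parse.urlencode(payload).encode("ascii")
    return urllib.request.Request(url, data=data)


def fetch_after_login(page_url, username, password):
    """Log in, then fetch page_url with the session cookies from the login."""
    opener, _jar = make_opener()
    opener.open(build_login_request(LOGIN_URL, username, password), timeout=30).close()
    with opener.open(page_url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling opener.open() directly, rather than installing the opener globally, keeps the cookie jar local to this script.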

You can then run this every X minutes from a scheduler, or use Python itself to repeat the function above every X minutes and post or email the results. Depending on what you're trying to do, it might be overkill, but when I've needed to do something similar in the past, this is the route I've taken.
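As a rough illustration of the "let Python do the timing" option, here is a minimal Python 3 sketch that appends timestamped readings to a CSV file and repeats a job at a fixed interval. The function names record_reading and run_every are made up for this example, and the job you pass in would be whatever function logs in and extracts the value.

```python
import csv
import time
from datetime import datetime


def record_reading(csv_file, value, when=None):
    """Append one timestamped reading as a CSV row to an open file object."""
    when = when or datetime.now()
    csv.writer(csv_file).writerow([when.isoformat(timespec="seconds"), value])


def run_every(interval_seconds, job, iterations, sleep=time.sleep):
    """Call job() a fixed number of times, pausing between calls."""
    results = []
    for i in range(iterations):
        results.append(job())
        if i < iterations - 1:
            sleep(interval_seconds)  # no pause after the final call
    return results
```

If you would rather not keep a script running, the alternative is to have the script take one reading and exit, and let Windows Task Scheduler (or cron under Cygwin) start it every X minutes. Open the CSV file in append mode with newline="" so the csv module controls the line endings.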

Karthik Rangarajan

Posted 2013-12-25T22:31:34.920

Reputation: 181

Would a div tag in an HTML structure be easily grepped out? – hichris123 – 2013-12-26T02:22:00.753

Yes, it shouldn't be hard. It makes it easier if the div has an ID or similar unique characteristic. At that point, you would do something like soup.find("div", {"id": "uniqueid"}), and it would find the exact div you want. – Karthik Rangarajan – 2013-12-26T02:24:39.330
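If BeautifulSoup isn't available, the same "find the div by its id" idea can be approximated with Python 3's built-in html.parser. This is only a sketch: the class name DivTextById, the helper div_text, and the id "temp" in the usage note are invented for illustration, and the real page's ids would have to be read from its source.

```python
from html.parser import HTMLParser


class DivTextById(HTMLParser):
    """Collect the text inside the first <div> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the stripped text of the div with the given id, or '' if absent."""
    parser = DivTextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

For example, div_text(page_source, "temp") would return the text of a div with id="temp" if the page has one, and an empty string otherwise.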