Download a forum with wget using a username and password?

4

2

I want to download a forum that I can only access with my username and password.

I tried the following:

C:\wget.exe wget -k -m -E -p -np -R viewtopic.php*p=*,memberlist.php*,faq.php*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0* -o log.txt http://www.myforum1234.com/forum/categories/discussions

This is the command I enter in cmd, because when I double-click wget.exe a black window appears but disappears very quickly. I assume running it from cmd is also a correct way (I use Windows XP)?

My problem is that the results show wget could not download the forum because it could not log in. Only the login page is downloaded, nothing more. I was logged in in my browser when I ran the command.

I am not a wget professional, so I am not sure whether my command is correct; I copied it from another post. A simple C:\wget.exe wget http://www.theforumurl.com did not work either.

EDIT:

I now also tried

C:\wget.exe wget -k -m -E -p -np -R *start=0* -o log.txt http://www.myforum.com/forum/categories/discussions

But the same problem here.

2nd EDIT concerning the link in the first comment:

I now tried

C:\wget.exe wget -k -m -E -p -np -R *start=0* -o log.txt http://www.myforum.com/forum/categories/discussions --post-data="username&password=1234"

But again, same problem!

When I hover over the login button I can see the following URL:

http://www.myforum.com/user/popupLogin

Do I have to use this one?

3rd EDIT:

I also tried adding username:password@ before the www., like this:

C:\wget.exe wget -k -m -E -p -np -R *start=0* -o log.txt http://user:passw@www.myforum.com/forum/categories/

The result is the same, I can see that the login did not work.

4th EDIT:

I also tried according to this thread:

C:\wget.exe wget --save-cookies cookies.txt --post-data 'user=usern&password=passw' http://www.myforum.com/user/popupLogin

C:\wget.exe wget --load-cookies cookies.txt -p http://www.myforum.com/forum/categories/

But again, same problem!!

5th EDIT:

I think I have now isolated the source code of the login button:

<div class="forumSignup">
  <a href="http://www.myforum.com/user/popupLogin" class="Button SignInPopup">Login</a>
</div>

6th EDIT:

I also tried it with HTTrack, but the problem is the same: the login does not work. Another problem seems to be that the forum itself uses the URL www.mywebsite.com/forum, but the login is required for www.mywebsite.com. So when I use something like username:pass@www.mywebsite.com, mywebsite is captured, but not the forum. When I use username:pass@www.mywebsite.com/forum, the login does not work and nothing is captured.

Stat Tistician

Posted 2014-07-29T15:37:36.037

Reputation: 41

I once used some forum software to download a forum. I don't know whether this always works, but this may help: http://stackoverflow.com/questions/5051153/wgetting-a-forum-as-a-registered-user

– barlop – 2014-07-29T15:41:51.193

If it's really important to download the forum and both wget and HTTrack fail, it's probably time to consider using Selenium, possibly writing some code. – kqw – 2014-08-03T16:04:12.970

Answers

2

First of all, you would run C:\wget.exe -k -m …; you don't repeat the wget name.

Since logging into the forum seems complicated (it can get complex even for simple sites), the best solution is probably to log in with your browser and then hand the cookies* to wget (either put them in a file and use --load-cookies, or pass them directly with --header "Cookie: name=value").

* The way of extracting them varies a bit depending on your browser.
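For example, a minimal sketch of the header variant (the cookie name forum_session and its value are made up; the real name and value depend on the forum software and have to be copied from your browser):

C:\wget.exe -k -m -E -p -np --header "Cookie: forum_session=abc123def456" -o log.txt http://www.myforum.com/forum/categories/discussions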

Ángel

Posted 2014-07-29T15:37:36.037

Reputation: 960

0

It's difficult to mirror a site with a login using wget; you need some expertise to get the username and password, the cookies, and the needed switches right.
Additional things to do:
1. Avoid mirroring until everything is OK, as recursively downloading pages can make the web server add your IP to a blacklist. (Try to save a single page first.)
2. Fake wget as a browser, since most web forums hate download managers; a sketch follows this list.
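A hedged sketch of faking a browser (the user-agent string is only an example of a plausible browser identifier, and --wait adds a delay between requests to stay polite):

C:\wget.exe --user-agent="Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0" --wait=2 -k -m -E -p -np -o log.txt http://www.myforum.com/forum/categories/discussions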

Best solution

The best and easiest way to mirror this kind of site is to use ScrapBook, a Firefox plugin. All you need to do is launch Firefox, log in to the site, then right-click -> Save page as and filter by domain. See this answer on how to efficiently mirror a site.

totti

Posted 2014-07-29T15:37:36.037

Reputation: 832

Link-only answers are not acceptable on SU, since linked pages may change or disappear in the future. Please add more information to the answer. – harrymc – 2014-08-04T10:54:49.513

Thanks a lot for your help, I now tried ScrapBook. Unfortunately the same problem arises here as with HTTrack: after a few pages are saved, it saves pages from www.myurl.com/example1 or www.myurl.com/example2, but I only want the forum, so www.myurl.com/forum and the links there, e.g. www.myurl.com/forum/discussion1 or www.myurl.com/forum/whatsnew. Limiting the depth of the links does not help, since that also cuts off the depth of the forum threads, but I need every thread with every post. – Stat Tistician – 2014-08-05T20:31:21.280

I now found the way to limit ScrapBook to a specific subcategory, so I could limit it to /forum. Unfortunately the links are not saved correctly? When I open the start page everything is correct: I see myself logged in and I can see the forum with the threads. When I hover over a thread link I can see that ScrapBook correctly points this link to my offline destination, like C:/sb/ and so on. When I click on it I also get redirected to the offline page. This page has the correct name, like C:/../discussions.html or /whatsnew.html, but it displays the myforumurl.com start page? – Stat Tistician – 2014-08-05T20:45:05.240

So no forum thread, but instead the regular myforumurl.com start page? So also not the forum start page like myforumurl.com/forum, where the actual overview of the forum is, but the other (wrong) myforumurl.com webpage. This is the same for every link I click on. So somehow ScrapBook did not get the right pages? What's the problem here? I am not sure, but could the logoff button be a problem, so that ScrapBook follows the logout button and gets logged off? I think not, because when I check, I am still logged in afterwards? – Stat Tistician – 2014-08-05T20:45:24.373

I think that, while mirroring, the web site feels unsafe and treats you as an attacker. So try to slow down your mirroring, say to 5 pages per minute. This can be achieved by delaying/throttling your internet speed. For Linux, see wondershaper. – totti – 2014-08-06T10:03:38.050

@totti Thanks for the hint, but I use Windows? I am also not sure if this is really the problem, since when I do not limit it to a subcategory I do not have this problem (but I want to limit it to this subcategory, since this is the forum with the interesting entries). – Stat Tistician – 2014-08-07T17:50:57.030

0

Wget interprets the <pass>@serveraddress part of the URL as a port.

To specify a username and password, use the --user and --password switches:

wget --user username --password passw http://...
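Combined with the mirroring switches from the question, that would look like the sketch below. Note that --user/--password send HTTP authentication credentials, which only helps if the site actually uses HTTP authentication rather than a login form:

C:\wget.exe --user usern --password passw -k -m -E -p -np -o log.txt http://www.myforum.com/forum/categories/discussions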

harrymc

Posted 2014-07-29T15:37:36.037

Reputation: 306 093

0

If you have access to the cookie data in a browser (Firefox has its own cookie browser under Options -> Privacy, but there are plugins to ease this task), perform a manual login to the forum, search for all the cookies for that domain and store them in the cookies.txt file; then it would probably work with your previous command:

C:\wget.exe --load-cookies cookies.txt -p http://www.myforum.com/forum/categories/

Some login pages are way too complex to handle in a single command line.

Remember to include ALL the cookies for the whole domain (search for "myforum.com", not just "www.myforum.com")
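For reference, cookies.txt is expected in the tab-separated Netscape format; a sketch with a made-up cookie name and value (the fields are domain, subdomain flag, path, secure flag, expiry, name, value; expiry 0 marks a session cookie):

# Netscape HTTP Cookie File
.myforum.com	TRUE	/	FALSE	0	forum_session	abc123def456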

NuTTyX

Posted 2014-07-29T15:37:36.037

Reputation: 2 448

After I have logged in and click on Options => Privacy => Show cookies, there are no cookies? Just for Google and YouTube, but not for myforum.com or anything like it? – Stat Tistician – 2014-08-09T17:36:24.063

That is pretty strange, as any login must keep a trace on the client side, either as a cookie (most common nowadays) or by writing something like ?sessionid=XXXXXXX at the end of the URL. If that's the case, you can pass it directly to wget. – NuTTyX – 2014-08-09T17:52:19.570

Well, after I log in there is just http://www.myforum.com/forum/categories/discussions displayed, so no session id. When I click on Extras, Settings, Privacy, Show cookies, there are only cookies for Google and YouTube, as I said.

– Stat Tistician – 2014-08-10T19:03:36.403

I really would like to help, but I cannot think of a site that asks for a login but uses neither cookies nor a URL parameter to keep the session... I would suggest you use a proxy (like Burp: http://portswigger.net/burp/downloadfree.html) to capture how the login is done (it should be a POST request you could easily find). The proxy would also show any cookie sent by the server, so you could try to reuse it in the wget command.

– NuTTyX – 2014-08-10T19:10:58.563
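For completeness, a hedged sketch of replaying a captured login POST with wget (the URL and the user/password field names below are assumptions taken from the question and must be replaced with whatever the proxy actually shows; --keep-session-cookies is needed because wget otherwise discards session-only cookies when saving):

C:\wget.exe --save-cookies cookies.txt --keep-session-cookies --post-data "user=usern&password=passw" http://www.myforum.com/user/popupLogin

C:\wget.exe --load-cookies cookies.txt -k -m -E -p -np http://www.myforum.com/forum/categories/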