How can I download an entire (active) phpBB forum?

7

9

One of the forums that I frequent (and have added a LOT of quality content to) seems to be having problems with their server. I am not confident in their ability to sort out the problems they are having, and when I talked to one of the admins he mentioned that they don't back the data up.

As a complete fallback, in case something goes horrifically wrong, I want to download the entire forum. I am aware that I can't download the DB or the PHP files, etc. I just want to make a locally browsable copy of the entire forum.

This means I could (when I have time) transfer the posts to a new site, should they end up starting fresh (on purpose or not).

Are there any tools that would allow this?

Side note: Obviously it's really important that I can browse it locally... which would be very difficult if each of the links still points to 'http://www.thesite.com/forum/specific_page.php' rather than '/forum/specific_page.php'.

user28163

Posted 2010-03-04T18:43:03.877

Reputation: 71

Answers

8

I am doing this right now. Here's the command I'm using:

wget -k -m -E -p -np -R memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0* -o log.txt http://www.example.com/forum/

I wanted to strip out those pesky session id things (sid=blahblahblah). They seem to get added automatically by the index page, and then get attached to all the links in a virus-like fashion. Except for one link squirreled away somewhere, which points to a plain index.php that then continues with no sid= parameter. (Perhaps there's a way to force the recursive wget to start from index.php; I don't know.)
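
(One way to try that, as a rough and untested sketch: simply name index.php as the starting URL instead of the forum directory. Same placeholder URL and reject list as above; whether this actually dodges the sid problem I can't say.)

wget -k -m -E -p -np -R "memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0*" -o log.txt http://www.example.com/forum/index.php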

I have also excluded some other pages that lead to a lot of cruft being saved. In particular, memberlist.php and viewtopic.php where p= is specified can create thousands of files!

Due to this bug in wget (http://savannah.gnu.org/bugs/?20808), it will still download an astounding number of those useless files, especially the viewtopic.php?p= ones, before simply deleting them. So this is going to burn a lot of time and bandwidth.
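
(Newer wget releases, 1.14 and later if I remember right, added --accept-regex/--reject-regex, which are matched against the full URL before it is fetched, so a regex-based reject list avoids most of that wasted download. A rough, untested equivalent of the list above, with the same placeholder URL; depending on how aggressively the board appends sids you may need to drop the sid= alternative so the crawl can still follow links:)

wget -k -m -E -p -np --reject-regex '(memberlist|faq|posting|search|ucp|viewonline)\.php|viewtopic\.php\?.*p=|sid=|view=print|start=0' -o log.txt http://www.example.com/forum/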

Andrew Russell

Posted 2010-03-04T18:43:03.877

Reputation: 1 535

In hindsight, I think that maybe a script that auto-incremented a wget to viewtopic?t=1 and viewforum?f=1 would work better. – Andrew Russell – 2010-03-05T14:29:25.447
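
(A rough sketch of that idea, untested; the ID ranges are pure guesses and would need adjusting for the board in question:)

# walk topic and forum IDs directly; -E adds .html, -k rewrites links, -p grabs images/CSS
for t in $(seq 1 5000); do
  wget -k -E -p "http://www.example.com/forum/viewtopic.php?t=$t"
done
for f in $(seq 1 50); do
  wget -k -E -p "http://www.example.com/forum/viewforum.php?f=$f"
done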

To give you an idea of how much of a problem that bug in wget is, about 92% of the HTML files that were downloaded were deleted. (Although note that far, far fewer files are downloaded than if you mirror with no reject list at all.)

Also note that rejecting all these files makes links on the resulting pages virtually useless. You'd have to throw together a script to fix up the links for you afterwards. – Andrew Russell – 2010-03-05T14:42:54.767
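
(A crude sketch of such a fixup, assuming GNU sed, that the mirror was saved under ./www.example.com, and that stripping the sid parameter out of the saved HTML is the main thing needed:)

# strip sid=... query parameters from every saved page so local links line up with the saved filenames
find ./www.example.com -name '*.html' -exec sed -i \
  -e 's/&amp;sid=[0-9a-f]*//g' \
  -e 's/?sid=[0-9a-f]*&amp;/?/g' \
  -e 's/?sid=[0-9a-f]*//g' {} +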

3

I recently faced a similar issue with a phpBB site I frequent that was facing imminent extinction (sadly, due to the admin passing away). With over 7 years of posts on the forum, I didn't want to see it vanish, so I wrote a Perl script to walk all the topics and save them to disk as flat HTML files. In case anyone else is facing a similar problem, the script is available here:

https://gist.github.com/2030469

It relies on a regex to extract the number of posts in a topic (needed to paginate), but other than that it should generally work. Some of the regexes may need tweaking depending on your phpBB theme.
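
(The underlying idea, for anyone who would rather do it from a shell than read the Perl: phpBB paginates a topic with a start= offset, so once you know the post count you just step through it. The topic ID, post count, and page size below are hypothetical.)

TOPIC=1234      # hypothetical topic ID
POSTS=57        # total posts in the topic - this is the number the script's regex scrapes
PER_PAGE=15     # the board's posts-per-page setting
for start in $(seq 0 "$PER_PAGE" "$POSTS"); do
  wget -k -E "http://www.example.com/forum/viewtopic.php?t=$TOPIC&start=$start"
done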

Evan

Posted 2010-03-04T18:43:03.877

Reputation: 41

1

Try some combination of wget flags like:

wget -m -k www.example.org/phpbb

where -m is "mirror" and -k is "convert links". You may also wish to add -p to download images, as I can't recall whether -m does this.
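
For example (example.org is a placeholder, and -np keeps wget from wandering above the forum directory):

wget -m -k -p -np http://www.example.org/phpbb/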

Phoshi

Posted 2010-03-04T18:43:03.877

Reputation: 22 001

Will this allow me to specify my username and password or a cookie somehow? The stuff I'm most interested in backing up is only visible to non-anonymous (logged-in) users. – user28163 – 2010-03-04T18:53:20.900

Oh, of course. wget DOES have a --load-cookies argument, which apparently takes a filepath, but I have no idea how it works! – Phoshi – 2010-03-04T20:01:44.967

@user28163; http://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html has a section on --load-cookies which explains it better than I could. Sounds like you might be able to accomplish this! :) – Phoshi – 2010-03-04T20:11:43.933
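
(Roughly: export the cookies from a logged-in browser session, via an extension or similar, into a file in the Netscape cookies.txt format that --load-cookies expects, then hand that file to the mirroring run. The file name and URL here are placeholders.)

wget --load-cookies cookies.txt -m -k -p -np http://www.example.com/forum/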

0

Here is some added info to @andrew-russell's answer. There is still lots of noise, but it's a start if you need to log in.

This project looks promising but didn't quite work for me: https://github.com/lairdshaw/fups

Example with login:

PHPBB_URL=http://www.someserver.com/phpbb
USER=MyUser
PASS=MyPass

wget --save-cookies="./session-cookies-$USER" "$PHPBB_URL/ucp.php?mode=login" -O - 1> /dev/null 2> /dev/null

SID=`cat ./session-cookies-$USER | grep _sid | cut -d$'\011' -f7`

echo "Login $USER --> $PHPBB_URL SID=$SID"

wget --save-cookies="./session-cookies-$USER" \
 --post-data="username=$USER&password=$PASS&redirect=index.php&sid=$SID&login=Login" \
 "$PHPBB_URL/ucp.php?mode=login" --referer="$PHPBB_URL/ucp.php?mode=login" \
 -O - 1> /dev/null 2> /dev/null

wget --load-cookies "./session-cookies-$USER" -k -m -E -p -np -R "memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0*" "$PHPBB_URL/viewtopic.php?t=27704"

# Loop through the topics explicitly if needed; the command above should already
# get most of them with those options. See the sketch below:
#wget --load-cookies ./session-cookies-$USER -k -m -E -p -np "$PHPBB_URL/viewtopic.php?t="{1..29700}
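
# A sketch of that loop spelled out - the 1..29700 topic ID range is just a guess for
# this particular board, and it reuses the cookie file saved by the login step above:
for t in $(seq 1 29700); do
  wget --load-cookies "./session-cookies-$USER" -k -E -p "$PHPBB_URL/viewtopic.php?t=$t"
done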

Tilo

Posted 2010-03-04T18:43:03.877

Reputation: 181

-1

HTTrack is a tool that might help you out. I am not sure whether it will work on forums, though.
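
If you want to try it, a minimal command-line invocation looks roughly like this (the URL, output directory, and filter are placeholders; check httrack --help for the real options):

httrack "http://www.example.com/forum/" -O ./forum-mirror "+*.example.com/forum/*" -v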

Sakamoto Kazuma

Posted 2010-03-04T18:43:03.877

Reputation: 833

Of course it does. – user598527 – 2017-05-05T19:02:28.507