Extract text from web

0

I'm looking for an easy tool that can crawl a website I give it and extract all the text elements from it. It would be nice if it could handle plain text, alt and title attributes for images, and the header section, each of them separately if possible. The output should be searchable in some way, or written as text files (XML) for every page it crawled. I need this text so I can pass it to translators.

martin.malek

Posted 2010-09-29T09:24:51.870

Reputation: 191

There are plenty of web crawlers. Here are a few, open source and written in Python.

– Praveen – 2012-10-18T13:09:22.080

Answers

2

Good old Lynx can provide most of the features you requested. Try lynx -dump http://superuser.com/ for example.

You can also use wget to crawl the sites you need recursively, and then process the downloaded files with one of the various converters available, such as htmltidy.
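If the downloaded pages need to be split into the separate categories the question asks for, a small script may be enough. The following is only a rough sketch using Python's standard html.parser module, which is not one of the tools named above; the names TextExtractor and extract and the bucket names are made up for illustration. It assumes each page is already available as an HTML string, for example read from a file saved by wget.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: sorts page text into separate buckets."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}  # contents are code, not translatable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.buckets = {
            "title": [],       # text of the <title> element
            "headings": [],    # text inside <h1>..<h6>
            "text": [],        # all other visible text
            "alt": [],         # alt attributes of images
            "title_attr": [],  # title attributes of any element
        }
        self._stack = []  # currently open tags that change how text is filed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("alt"):
            self.buckets["alt"].append(attrs["alt"])
        if attrs.get("title"):
            self.buckets["title_attr"].append(attrs["title"])
        if tag in self.HEADINGS or tag in self.SKIPPED or tag == "title":
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        current = self._stack[-1] if self._stack else None
        if current in self.SKIPPED:
            return
        if current == "title":
            self.buckets["title"].append(text)
        elif current in self.HEADINGS:
            self.buckets["headings"].append(text)
        else:
            self.buckets["text"].append(text)


def extract(html):
    """Return a dict of lists with the text found in one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.buckets
```

Each bucket could then be written to its own text or XML file per page. Real-world pages with badly nested markup may need a more forgiving parser than this sketch.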

Janne Pikkarainen

Posted 2010-09-29T09:24:51.870

Reputation: 6 717