How can I scrape specific data from a website

2

1

I'm trying to scrape data from a website for research.

The urls are nicely organized in an example.com/x format, with x as an ascending number and all of the pages are structured in the same way. I just need to grab certain headings and a few numbers which are always in the same locations. I'll then need to get this data into structured form for analysis in Excel.

I have used wget before to download pages, but I can't figure out how to grab specific lines of text.

Excel has a feature to grab data from the web (Data->From Web) but from what I can see it only allows me to download tables. Unfortunately, the data I need is not in tables.

Stoney

Posted 2012-09-12T15:47:45.833

Reputation: 224

In the end, I created a .txt file with a list of all the urls I needed and had wget download all the pages in that file. I then used iMacros as suggested by @Lamb to extract data locally. I found the trial of the fulll version easiest to start out with. Some useful beginner features are not available in the firefox plugin gui, even though the same code will work.

– Stoney – 2012-11-15T11:02:07.167

Answers

2

You can use iMacros to automate this task. It is a tool to automate repetative tasks, it can be used to extract data from a site and save it as a CSV file.

As x in example.com/x follows a pattern (i.e. ascending), you can easily program (even without programming exeprience) it to visit each page and extract data from it.

Read more at iMacros Tutorial.

Alternatively you can also try ScraperWiki (some programming experience required).

Ankit

Posted 2012-09-12T15:47:45.833

Reputation: 4 082

Thank you! iMacros appears to be the solution I need. After asking the original question, I had found ScraperWiki. It appears to be a powerful tool, but is a bit complicated programmingwise for me. – Stoney – 2012-09-12T18:48:08.930

1

You can paste this code in a module:

Option Explicit
Sub get_data()
Dim result As String
Dim myURL As String
Dim winHttpReq As Object
Set winHttpReq = CreateObject("WinHttp.WinHttpRequest.5.1")
myURL = "example.com/x format"

winHttpReq.Open "GET", myURL, False
winHttpReq.Send
result = winHttpReq.responseText
Application.ScreenUpdating = True

Range("Sheet1!A1").value = result
End sub

You can simply parse the long string in cell A1 to extract the data you need.

user157938

Posted 2012-09-12T15:47:45.833

Reputation: 69

1It would help if you explain what the code does. – Baarn – 2012-11-15T12:46:34.650