Search for text in web pages given a list of URLs

0

I have a list of several thousand URLs, and I'd like to search each of these pages for a given word. How can I do this programmatically on Windows, preferably using VBScript or PowerShell?

Mark Richman

Posted 2011-07-12T16:05:18.513

Reputation: 252

Answers

1

Edit: The original question didn't specify VBScript or PowerShell. I'm leaving this Python suggestion in the hope that someone will benefit from it in the future.

What is the quickest way to do this programmatically on Windows? I guess 'quickest' is a function of your abilities.

With my skills, I would whip up a Python script for that, as that would be the quickest way for me. The script, as I would write it, would look something like

import urllib.request

search_string = ""                 # String you're searching for
sites_with_str = []                # URLs whose pages contain search_string

with open(r"c:\sites.txt") as url_file:    # "with" closes the read handle for us
    for line in url_file:
        site = line.strip()
        if not site:
            continue                       # skip blank lines
        try:
            with urllib.request.urlopen(site, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):      # unreachable host, HTTP error, bad URL
            continue
        if search_string in html:
            sites_with_str.append(site)

# Print out the sites with the search string in them
print('\n\nSites Containing Search String "' + search_string + '":')
for site in sites_with_str:
    print(site)

Of course, that's only a rough sketch. It uses urllib.request from the standard library to grab each page; swap in another HTTP library if you prefer. And obviously it'd require a little recursive function and some link parsing if you wanted to search all pages within each site referenced in the input file.
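For the "search all pages within each site" case, a minimal sketch of that recursive function might look like the following. It is an illustration under assumptions, not a finished crawler: the names LinkCollector, fetch_page and search_site are made up for this example, the max_depth limit and the same-host rule are arbitrary safeguards, and pages are assumed to decode as UTF-8.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Download a page and return its text, or "" if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def search_site(url, search_string, fetch=fetch_page, max_depth=2, seen=None):
    """Recursively search url and the same-host pages it links to.

    Returns the visited URLs whose text contains search_string.
    max_depth and seen are hypothetical safeguards against endless crawling.
    """
    seen = set() if seen is None else seen
    if url in seen or max_depth < 0:
        return []
    seen.add(url)

    html = fetch(url)
    matches = [url] if search_string in html else []

    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(url).netloc
    for href in collector.links:
        # Resolve relative links against the current page and drop fragments
        link = urljoin(url, href).split("#")[0]
        if urlparse(link).netloc == host:      # stay on the same site
            matches += search_site(link, search_string, fetch, max_depth - 1, seen)
    return matches
```

Calling search_site(site, search_string) for each line of the input file would then replace the single-page check. The fetch parameter exists only so the download step can be swapped out, for example with a stub while trying the function without network access.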

James T Snell

Posted 2011-07-12T16:05:18.513

Reputation: 5 726

Thanks for the suggestion. I've updated my question to indicate VBScript or PowerShell. – Mark Richman – 2011-07-12T16:34:57.553

@Mark -- cries – James T Snell – 2011-07-12T16:42:18.670

Yes, I'm crying too, not having access to a real OS ;) – Mark Richman – 2011-07-12T16:48:04.950

@Mark and you're being forced not to use Python?? What a sadistic situation, my friend :P – James T Snell – 2011-07-12T16:48:50.317

0

I solved my own problem, in case anyone else faces the same requirement:

# Create a web client and identify the script to the servers
$webClient = New-Object System.Net.WebClient
$webClient.Headers.Add("user-agent", "PowerShell Script")

# One URL per line
$urls = Get-Content c:\path\to\file\urls.txt

foreach ($url in $urls) {
  $output = ""    # reset so a failed download can't match the previous page

  $startTime = Get-Date
  $output = $webClient.DownloadString($url)
  $endTime = Get-Date

  # -like is a case-insensitive wildcard match
  if ($output -like "*some dirty word*") {
    "Success`t`t" + $url + "`t`t" + ($endTime - $startTime).TotalSeconds + " seconds"
  }
}

Mark Richman

Posted 2011-07-12T16:05:18.513

Reputation: 252