A Nice, Simple Web Scraping Code Golf

4

2

Introduction

This is a web scraping Code Golf challenge: extract the stock Name and Price columns from the 'Most Active' list on this New York Times URL.

Why is this challenge interesting?

Challenge: Scrape these figures from this NYT page and turn them into a data frame using the smallest number of bytes possible. For example, import pandas as pd (should you choose to use Python) would give you a score of 19 (19 bytes). The data frame must be printed or returned as output.
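The scoring rule can be checked mechanically. Here is a minimal Python sketch, assuming submissions are scored as UTF-8 encoded source text; the helper name byte_score is hypothetical and not part of the challenge.

```python
def byte_score(source: str) -> int:
    """Return the size of a submission in bytes, assuming UTF-8 encoding."""
    return len(source.encode("utf-8"))


# The example from the challenge text: this import line alone costs 19 bytes.
print(byte_score("import pandas as pd"))
```

Note that whitespace and newlines count too, which is why golfed answers tend to strip them.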


The shortest code in bytes wins the challenge. That's it!

Input:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url='https://markets.on.nytimes.com/research/markets/overview/overview.asp'

def pull_active(url):
    # your code
    response = requests.get(url)
    if response.status_code == 200:
      print("Response Successful")
    else:
      print("Response Unsuccessful")
      return()
    bs_object = BeautifulSoup(response.content,'lxml')
    st_value = bs_object.find_all('td',class_='colPrimary')
    st_name = bs_object.find_all('a',class_="truncateMeTo1")
    x = []
    y = []
    for tag in st_name:
      x.append(tag.get_text())
    for tag in st_value:
      y.append(tag.get_text())
    data = {'Name': x[0:8],
            'Price': y[0:8]}
    df = pd.DataFrame(data)
    return(df)
pull_active(url)

Score: 773 bytes. To beat it, submit any code under 773 bytes.

Output:

   Name                          Price
0  General Electric Co           10.11
1  Advanced Micro Devices Inc    33.13
2  PG&E Corp                     6.14
3  Fiat Chrysler Automobiles NV  14.98
4  Neuralstem Inc                1.83
5  AT&T Inc                      38.20
6  GrubHub Inc                   34.00
7  Twitter Inc                   29.86

...

VBAmazing

Posted 2019-10-30T22:34:34.630

Reputation: 57

3 Language restrictions like "Only responses coded in Python, please" are generally discouraged on this site. – pppery – 2019-10-30T22:57:55.100

Fair enough-- I'm still relatively new. I'll modify the post to encourage more participation. – VBAmazing – 2019-10-30T23:02:18.770

Also, great! I'll reflect that as well. – VBAmazing – 2019-10-30T23:12:12.613

Is this JavaScript answer valid? Open browser's console in same page and use this code in console: document.querySelector('tbody').innerText – Night2 – 2019-10-31T12:54:04.750

Or even shorter since jQuery is already available in that page: $('tbody')[0].innerText – Night2 – 2019-10-31T13:04:51.090

Answers

4

R, 97 bytes

library(rvest)
matrix(html_text(html_nodes(html("http://nyti.ms/2BY72Ph"),"td"))[2:25],,3,T)[,-3]

Try it online!

Gets all the td elements of the page. We want elements 2 to 25, which gives everything we need plus the % change, which is removed by omitting the 3rd column of the final matrix.
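For readers less familiar with R, the slice-and-reshape step described above can be sketched in Python. This is only an illustration of the indexing, not a golfed answer: the function name name_price_rows and the sample cell list are hypothetical, and the third-column values are placeholders rather than real figures. R's 2:25 is 1-indexed, so it corresponds to a 0-indexed start of 1 and a count of 24.

```python
def name_price_rows(cells, first=1, count=24, width=3):
    """Slice a flat list of cell texts, group it row-wise, drop the last column."""
    block = cells[first:first + count]
    # Group consecutive cells into rows of `width` (the byrow=T part of matrix()).
    rows = [block[i:i + width] for i in range(0, len(block), width)]
    # Dropping the last entry of each row mirrors the [,-3] in the R answer.
    return [row[:-1] for row in rows]


# Made-up stand-in for the page's <td> texts: one leading cell to skip,
# then (name, price, % change) triples, then trailing cells to ignore.
cells = [
    "skipped cell",
    "General Electric Co", "10.11", "change-placeholder",
    "PG&E Corp", "6.14", "change-placeholder",
    "trailing cell",
]
print(name_price_rows(cells, count=6))
```

With the real page, the same call with the default count of 24 would yield eight name and price pairs, assuming the td layout is as the answer describes.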

Outputs as a character matrix, since this is fewer bytes than a data frame:

     [,1]                  [,2]   
[1,] "General Electric Co" "10.11"
[2,] "PG&E Corp"           "6.14" 
[3,] "Neuralstem Inc"      "1.83" 
[4,] "GrubHub Inc"         "34.00"
[5,] "Twitter Inc"         "29.86"
[6,] "Zynga Inc"           "6.21" 
[7,] "Iveric Bio Inc"      "3.56" 
[8,] "Pfizer Inc"          "38.48"

Robin Ryder

Posted 2019-10-30T22:34:34.630

Reputation: 6 625

3

Bash + lynx, 73

lynx -dump http://nyti.ms/2BY72Ph|grep -Po '\[(1[6-9]|2[0-3])]\K[^.]*...'

Output is as follows:

General Electric Co          10.11
Advanced Micro Devices Inc   33.13
PG&E Corp                    6.14 
Fiat Chrysler Automobiles NV 14.98
Neuralstem Inc               1.83 
AT&T Inc                     38.20
GrubHub Inc                  34.00
Twitter Inc                  29.86
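The grep pattern keeps only text that follows link markers numbered 16 to 23, then takes everything up to the first period plus the next three characters, which covers the period and two decimals of the price. A rough Python equivalent is sketched below. Python's re module has no \K, so a capture group stands in for it. The sample lines are an assumed approximation of the lynx -dump layout, not real output, and the name extract_rows is hypothetical.

```python
import re

# \[(?:1[6-9]|2[0-3])\] matches link markers [16] through [23];
# the group captures what grep would print after \K.
PATTERN = re.compile(r"\[(?:1[6-9]|2[0-3])\]([^.]*...)")


def extract_rows(dump: str) -> list[str]:
    """Return the name-and-price text for each matching line, as grep -o would."""
    rows = []
    for line in dump.splitlines():  # grep works line by line, so do the same
        match = PATTERN.search(line)
        if match:
            rows.append(match.group(1))
    return rows


# Assumed layout: link marker, name, price, then other columns.
sample = (
    "  [15]Some earlier link\n"
    "  [16]General Electric Co 10.11 change-placeholder\n"
    "  [17]Advanced Micro Devices Inc 33.13 change-placeholder\n"
    "  [24]Some later link\n"
)
print(extract_rows(sample))
```

One consequence of matching up to the first period is that a company name containing a period would be cut short. None of the names in the output above do.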

Digital Trauma

Posted 2019-10-30T22:34:34.630

Reputation: 64 644

@RobinRyder, somehow I'd always missed that paragraph before now. – Shaggy – 2019-10-31T10:08:15.890

@640KB It works for me. – Robin Ryder – 2019-11-01T07:28:30.767

2

PHP, 154 153 130 bytes

foreach(DOMDocument::loadHTMLFile('http://nyti.ms/2BY72Ph')->getElementsByTagName(td)as$i)$x<24&&print$x++%3?"$i->nodeValue ":"
";

Try it online!

-23 bytes thx to @Night2!

$ php nyt.php

Agile Therapeutics Inc 1.38
Advanced Micro Devices Inc 34.05
Facebook Inc 191.82
Kraft Heinz Co 32.34
Sirius XM Holdings Inc 6.72
Apple Inc 247.40
Zynga Inc 6.24
Snap Inc 15.11

640KB

Posted 2019-10-30T22:34:34.630

Reputation: 7 149

0

F# (.NET Core), 271 257 251 233 bytes

 let l=HtmlDocument.Load("http://nyti.ms/2BY72Ph").Descendants["td"]|>Seq.filter(fun n->n.HasClass("colText")||n.HasClass("colPrimary"))|>Seq.take 16|>Seq.map(fun n->n.InnerText())|>Seq.toList
 [for i in 0..7-> l.[i*2]+" "+l.[i*2+1]]

Try it online!

Delta

Posted 2019-10-30T22:34:34.630

Reputation: 543