Python | Screen Scraping Web Pages

by Corey Goldberg - September 2007

This tutorial shows how to programmatically retrieve a stock quote from Google Finance. It uses Python's high-level web API (the urllib module) to download the page, and screen scraping with regular expressions to pull the price out of the HTML.


First, let's look at the page we want to get our content from. To get finance data for the ticker "IBM", we use a URL like this:

http://finance.google.com/finance?q=IBM

If you enter this URL in your browser, you can see the page we are going to scrape from.
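
When the symbol comes from user input, it may contain characters that are not safe in a query string, so it can help to let the standard library do the escaping instead of concatenating by hand. A minimal sketch, assuming Python 3; the build_quote_url helper is a hypothetical name, not part of the original tutorial:

```python
from urllib.parse import urlencode

BASE_URL = "http://finance.google.com/finance"  # URL used in this tutorial

def build_quote_url(symbol):
    """Return the quote-page URL for a ticker symbol (hypothetical helper)."""
    # urlencode percent-escapes characters that are unsafe in a query string
    return BASE_URL + "?" + urlencode({"q": symbol})

print(build_quote_url("IBM"))  # http://finance.google.com/finance?q=IBM
```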


To retrieve the content of the page, we can use Python's urllib module:

import urllib
content = urllib.urlopen("http://finance.google.com/finance?q=IBM").read()
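
The snippet above is written for Python 2. In Python 3, urlopen lives in urllib.request and read() returns bytes, which need decoding before a string regex can be applied. A rough equivalent, offered as a sketch: the fetch_page name, the opener parameter (there so the function can be tried without network access), the 10-second timeout, and the UTF-8 assumption are illustrative choices, not part of the original.

```python
from urllib.request import urlopen

def fetch_page(url, opener=urlopen, timeout=10):
    """Return the body of `url` as text (hypothetical helper)."""
    # `opener` defaults to urlopen; a stand-in can be passed for offline runs
    with opener(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not fatal
    return raw.decode("utf-8", errors="replace")
```

With this helper, the download step becomes content = fetch_page("http://finance.google.com/finance?q=IBM").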

Now that we have the content stored, we want to scrape some data from it. If we look inside the content (this is the same as if you did a "View Source" in your browser for this page), we see a line that contains the price quote:

<span class="pr" id="ref_18241_l">116.26</span>

To extract the price quote, we use a regular expression (regex) with matching groups. Regular expressions are a powerful tool for doing pattern matching and text extraction/parsing. Regexes may seem a little arcane (unless you are a Perl hacker), but they allow you to search and manipulate text using a concise syntax.

In our regex, we use a matching group: the piece of the regex enclosed in parentheses, which captures the text we want to extract. In this case, we use (.*?) to define the group. Inside it, . matches any character except a newline, * repeats it zero or more times, and ? makes the repetition non-greedy, so the group stops at the first < that follows. This is the pattern that matches our stock quote value:

class="pr".*?>(.*?)<
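
To see why the ? in .*? matters, the sketch below runs the tutorial's pattern and a greedy variant against a small fragment. The second span (class "chg") is made up for illustration and is not taken from the real page.

```python
import re

# Assumption: a second, made-up span follows the price on the same line
html = ('<span class="pr" id="ref_18241_l">116.26</span>'
        '<span class="chg">+0.52</span>')

# Non-greedy: each .*? stops as early as it can
lazy = re.search(r'class="pr".*?>(.*?)<', html)
# Greedy: each .* grabs as much as it can, then backtracks
greedy = re.search(r'class="pr".*>(.*)<', html)

print(lazy.group(1))    # 116.26
print(greedy.group(1))  # +0.52
```

The greedy version skips past the price and captures the contents of the last tag on the line, which is why the tutorial's pattern uses the non-greedy form.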

Once we do the search, we can get the text we extracted with the matching group. Since we only used one group, our text will be contained in m.group(1):

import re
m = re.search('class="pr".*?>(.*?)<', content)
if m:
    quote = m.group(1)
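
The extracted value is a string. If a number is needed, it can be converted, with the no-match case handled explicitly, since re.search returns None when nothing matches. A minimal sketch; the parse_price name and the treatment of commas as thousands separators are assumptions for illustration:

```python
import re

def parse_price(content):
    """Return the quote as a float, or None if none is found (hypothetical helper)."""
    m = re.search(r'class="pr".*?>(.*?)<', content)
    if m is None:
        return None  # pattern not present in the page
    try:
        # Assumption: commas are thousands separators, e.g. "1,234.50"
        return float(m.group(1).replace(",", ""))
    except ValueError:
        return None  # captured text was not numeric

print(parse_price('<span class="pr" id="ref_18241_l">116.26</span>'))  # 116.26
```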

Regular expressions are compiled into RegexObject instances. If we are going to use a regex frequently, we can optimize it by compiling it once and then using our compiled version:

regex = re.compile('class="pr".*?>(.*?)<')
m = regex.search(content)
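
As a sketch of the reuse described above, a pattern can be compiled once at module level and then applied to many pages. The pages dictionary is made-up sample data standing in for downloaded content:

```python
import re

# Compiled once, reused for every page
QUOTE_RE = re.compile(r'class="pr".*?>(.*?)<')

# Assumption: made-up fragments standing in for fetched pages
pages = {
    "IBM": '<span class="pr" id="ref_18241_l">116.26</span>',
    "XYZ": '<div>no price here</div>',
}

quotes = {}
for symbol, content in pages.items():
    m = QUOTE_RE.search(content)
    quotes[symbol] = m.group(1) if m else None

print(quotes)  # {'IBM': '116.26', 'XYZ': None}
```

The re module also caches recently compiled patterns, so for a single pattern the speedup from compiling explicitly tends to be small; the clearer gain is often a named, reusable object.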

We can put all of this into a function. We pass in a ticker symbol, and it will return a price quote:

import urllib
import re

def get_quote(symbol):
    base_url = 'http://finance.google.com/finance?q='
    content = urllib.urlopen(base_url + symbol).read()
    m = re.search('class="pr".*?>(.*?)<', content)
    if m:
        quote = m.group(1)
    else:
        quote = 'no quote available for: ' + symbol
    return quote
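
The function above is written for Python 2 and returns either a price or a message string, which the caller cannot easily tell apart. One possible variation, sketched for Python 3: it returns None when no quote is found and takes a fetch argument so the scraping logic can be tried offline. The download helper, the fetch parameter, and the canned page are hypothetical additions, not part of the original.

```python
import re
from urllib.request import urlopen

QUOTE_RE = re.compile(r'class="pr".*?>(.*?)<')

def download(url):
    """Fetch a URL and return its body as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def get_quote(symbol, fetch=download):
    """Return the price string for `symbol`, or None if no quote is found.

    `fetch` is a hypothetical hook: any callable taking a URL and returning text.
    """
    base_url = 'http://finance.google.com/finance?q='
    m = QUOTE_RE.search(fetch(base_url + symbol))
    return m.group(1) if m else None

# Offline usage with a canned page instead of a live request
canned = lambda url: '<span class="pr" id="ref_18241_l">116.26</span>'
print(get_quote("IBM", fetch=canned))  # 116.26
```

Calling get_quote("IBM") without the fetch argument performs a live request, as in the original function.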

About the author:
Corey Goldberg is a software engineer from Boston with over 10 years of experience as a developer and tester. He has contributed to many open source projects and has developed and maintained several of his own. Corey has a Master's Degree in Computer Information Systems from Boston University. For questions and more information, visit his web site and blog at: www.goldb.org

Copyright © 2006-2007 Corey Goldberg