Python and HTML Processing

To harvest some links from the Coursera lectures quickly and save myself some valuable time, I decided to use Python with Beautiful Soup.

Beautiful Soup is a Python library designed for quick-turnaround projects like screen scraping. It parses anything you give it and does the tree traversal for you. You can tell it “Find all the links”, or “Find all the links of class externalLink”, or “Find all the links whose URLs match foo.com”, or “Find the table heading that’s got bold text, then give me that text.”
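Those four requests map fairly directly onto `find_all` calls. Here is a minimal sketch, assuming bs4 is importable; the HTML snippet, the URLs in it and the variable names are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Made-up markup, for illustration only.
html = """
<html><body>
<a href="http://foo.com/a" class="externalLink">A</a>
<a href="http://bar.org/b">B</a>
<a href="/local">C</a>
<table><tr><th>Plain</th><th><b>Bold heading</b></th></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = [a.get("href") for a in soup.find_all("a")]

# "Find all the links of class externalLink"
external = [a.get("href") for a in soup.find_all("a", class_="externalLink")]

# "Find all the links whose URLs match foo.com"
foo_links = [a.get("href") for a in soup.find_all("a", href=re.compile(r"foo\.com"))]

# "Find the table heading that's got bold text, then give me that text"
bold_heading = next(
    th.b.get_text() for th in soup.find_all("th") if th.b is not None
)

print(all_links, external, foo_links, bold_heading)
```

Passing a compiled regular expression as an attribute value makes Beautiful Soup search the attribute with it, which is why the `foo.com` query needs no manual loop.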

First, you need to download the library from here and untar the bs4 folder in your workspace.
The library is really simple and intuitive to use, since it represents the HTML document as a nested data structure.
Enough chatter; now I’m going to show you the Python code 🙂

#!/usr/bin/env python3

from urllib.request import urlopen

# Beautiful Soup, the library described above.
from bs4 import BeautifulSoup

# Get a file-like object for the Coursera course page.
f = urlopen("https://class.coursera.org/gametheory/lecture/preview")
# Read from the object, storing the page's contents in 's'.
s = f.read()

# BeautifulSoup object, which represents the document as a nested data structure
soup = BeautifulSoup(s, "html.parser")

# Find all the links with rel 'lecture-link'
for video in soup.find_all('a', rel="lecture-link"):
    link = video.get('href')

    # Fetch the page the previous link points to in the same way
    f1 = urlopen(link)
    s1 = f1.read()
    soup1 = BeautifulSoup(s1, "html.parser")

    # Find all the source tags in the document and filter for 'video/mp4'
    for source in soup1.find_all('source'):
        if source.get('type') == "video/mp4":
            # Prefix each URL with 'wget' so the output can be saved as a download script
            print('wget ' + source.get('src'))

Full documentation can be found here.

Make sure the bs4 folder and a file containing the code above are in the same workspace. Then you can run your Python file as usual and get the links to download the video lectures you want.
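If you want to check the filtering step before hitting the real site, you can run it on an inline snippet first. This is only a sketch: the helper name `mp4_sources`, the sample markup and the example.com URLs are my own inventions, not anything taken from the Coursera pages.

```python
from bs4 import BeautifulSoup


def mp4_sources(html):
    """Return the src of every <source> tag whose type is 'video/mp4'."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        tag.get("src")
        for tag in soup.find_all("source")
        if tag.get("type") == "video/mp4" and tag.get("src")
    ]


# Hypothetical lecture page fragment, for illustration only.
sample = """
<video>
  <source type="video/webm" src="http://example.com/lecture1.webm">
  <source type="video/mp4" src="http://example.com/lecture1.mp4">
</video>
"""

# One 'wget' line per matching video, ready to be saved as a shell script.
commands = ["wget " + url for url in mp4_sources(sample)]
print("\n".join(commands))
```

Keeping the extraction in a small function like this makes it easy to test on saved pages, and the extra `tag.get("src")` check skips any source tag that has no URL.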

Cheers!
