Table of Contents
- Part-1 : Introduction to Beautiful Soup (this blog)
- Part-2 : Real world web scraping example (here)
There are multiple libraries and tools for web scraping, but BeautifulSoup shines in this area with its gentle learning curve and powerful search capabilities.
Beautiful Soup is a Python web scraping library that makes it easy to extract data from web pages. It is meant to parse a single web page, then analyse and extract its data. In this respect it differs from Scrapy, a web crawling library with more advanced functionality that is meant to crawl across websites. You can check out the blog I have written on Scrapy here.
Reference HTML
We will be using the below HTML document as our reference HTML for this tutorial.

<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<p>I am in p-1</p>
<p>I am in p-2</p>
<p>I am in p-3</p>
<div id="card1" class="card" style="padding:0.1em 0.6em 0.5em;">
  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
  <ul>
    <li><p>card1 - List index -1</p></li>
    <li><p>card1 - List index -2</p></li>
    <li><p>card1 - List index -3</p></li>
  </ul>
  <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
  <div id="card1-footer" class="otd-footer hlist noprint" style="text-align: right;">
    <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
    <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li>
    <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
  </div>
</div>
<div id="card2" class="card" style="padding:0.1em 0.6em 0.5em;">
  <a href="/link2" title="Link2">This is link2</a>
  <ul>
    <li><p>card2 - List index -1</p></li>
    <li><p>card2 - List index -2</p></li>
    <li><p>card2 - List index -3</p></li>
  </ul>
  <span customattr="Card2-Custom-Attr">Card2-Custom-Attribute</span>
  <div id="card2-footer" class="otd-footer hlist noprint" style="text-align: right;">
    <ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li>
    <li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li>
    <li><b><a href="/c2-link4" title="c2-Link4">This is Card2-link4</a></b></li></ul>
  </div>
</div>
</body>
</html>
To use BeautifulSoup in Python, you first need to install the library.
pip install beautifulsoup4
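Python code to parse the HTML
from bs4 import BeautifulSoup

# Parse the reference HTML (saved as bs.html) with Python's built-in parser
with open("bs.html") as file:
    bs = BeautifulSoup(file, 'html.parser')

# print() with parentheses works on Python 3 as well as Python 2
print(bs.prettify())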
This will result in the following output:
<html>
 <head>
  <title>
   Beautiful Soup Tutorial
  </title>
 </head>
 <body>
  <!-- The p tags become an array of objects -->
  <p>
   I am in p-1
  </p>
  <p>
   I am in p-2
  </p>
  <p>
   I am in p-3
  </p>
  <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
   <a href="/c1-link1" title="C1-Link1">
    This is Card1-link1
   </a>
   <ul>
    <li>
     <p>
      card1 - List index -1
     </p>
    </li>
    <li>
     <p>
      card1 - List index -2
     </p>
    </li>
    <li>
     <p>
      card1 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card1-Custom-Attr">
    Card1-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c1-link2" title="C1-Link2">
        This is Card1-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c1-link3" title="C1-Link3">
        This is Card1-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c1-link4" title="C1-Link4">
        This is Card1-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
  <div class="card" id="card2" style="padding:0.1em 0.6em 0.5em;">
   <a href="/link2" title="Link2">
    This is link2
   </a>
   <ul>
    <li>
     <p>
      card2 - List index -1
     </p>
    </li>
    <li>
     <p>
      card2 - List index -2
     </p>
    </li>
    <li>
     <p>
      card2 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card2-Custom-Attr">
    Card2-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card2-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c2-link2" title="c2-Link2">
        This is Card2-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c2-link3" title="c2-Link3">
        This is Card2-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c2-link4" title="c2-Link4">
        This is Card2-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
 </body>
</html>
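If you do not have bs.html saved on disk, the same parse should work on a string of markup. This is only a minimal sketch: the variable name html_doc is chosen here for illustration, and the snippet is a trimmed-down copy of the reference HTML rather than the full document.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the reference HTML, held in a string
# instead of being read from bs.html.
html_doc = """<html><head><title>Beautiful Soup Tutorial</title></head>
<body><p>I am in p-1</p><p>I am in p-2</p></body></html>"""

# BeautifulSoup accepts either an open file handle or a string of markup.
bs = BeautifulSoup(html_doc, "html.parser")

title_text = bs.title.string              # text inside the <title> tag
paragraph_count = len(bs.find_all("p"))   # number of <p> tags in the snippet

print(title_text)
print(paragraph_count)
```

Parsing from a string is handy for quick experiments, since every example can carry its own markup.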
When BeautifulSoup parses the reference document, it constructs four types of objects internally:
- BeautifulSoup : Parsed document as a whole.
- Tag : Each tag such as <head>, <body>, <div>, <p>, <a>, etc. becomes an object with its corresponding attributes and pointers to its children and parent.
- NavigableString : Captures the text inside a tag. For example: <a href="/somelink"> Captured Text </a>
- Comment : All comments in the HTML get captured as Comment objects.
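The four object types listed above can be observed directly with isinstance. The sketch below uses a tiny made-up snippet (not the reference HTML) that contains one comment and one link, so that each type shows up once.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: an illustrative snippet with one comment and one link.
html_doc = '<p><!-- a note --><a href="/somelink">Captured Text</a></p>'
bs = BeautifulSoup(html_doc, "html.parser")

link = bs.a                    # a Tag object
link_text = link.string        # the NavigableString inside the <a> tag
comment = bs.p.contents[0]     # the first child of <p> is the comment

print(isinstance(bs, BeautifulSoup))            # the parsed document as a whole
print(isinstance(link, Tag))
print(isinstance(link_text, NavigableString))
print(isinstance(comment, Comment))
```

Comment is a subclass of NavigableString, so code that should skip comments needs to check for Comment explicitly.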
Let's start by navigating the constructed BeautifulSoup object:
bs.title
# <title>Beautiful Soup Tutorial</title>
type(bs.title)
# <class 'bs4.element.Tag'>
bs.title.text
# 'Beautiful Soup Tutorial'
bs.head
# <head><title>Beautiful Soup Tutorial</title></head>
bs.head.parent.name   # .parent returns the Tag object of the current Tag's parent
# 'html'
[child.name for child in bs.head.children]
# ['title'] --> head has only one child.

# bs.<tag name> returns the first tag with that name.
bs.p
# <p>I am in p-1</p>
bs.p[1]   # bs.p returns only a single Tag, not an array, and indexing a Tag looks up an attribute
# *** KeyError: 1
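The KeyError arises because square brackets on a Tag look up an HTML attribute, not a position. A small sketch of attribute access, using one card and one link taken from the reference HTML:

```python
from bs4 import BeautifulSoup

# Assumption: a single card and link copied from the reference HTML.
html_doc = (
    '<div id="card1" class="card">'
    '<a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>'
    "</div>"
)
bs = BeautifulSoup(html_doc, "html.parser")

link = bs.a
href = link["href"]            # square brackets read an attribute value
missing = link.get("target")   # .get() returns None instead of raising KeyError
classes = bs.div["class"]      # class is multi-valued, so it comes back as a list

print(href, missing, classes)
```

Using .get() is the safer choice when an attribute may be absent on some of the tags being scraped.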
Searching the Tree
Find all <p> tags in the document
bs.find_all('p')
# [<p>I am in p-1</p>,
#  <p>I am in p-2</p>,
#  <p>I am in p-3</p>,
#  <p>card1 - List index -1</p>,
#  <p>card1 - List index -2</p>,
#  <p>card1 - List index -3</p>,
#  <p>card2 - List index -1</p>,
#  <p>card2 - List index -2</p>,
#  <p>card2 - List index -3</p>]
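find_all returns Tag objects, so a common next step is to pull out their text or narrow the search. The sketch below, on a shortened copy of the reference HTML, shows get_text along with the recursive and limit arguments of find_all.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the reference HTML with two top-level
# paragraphs and one paragraph nested inside a list.
html_doc = (
    "<body><p>I am in p-1</p><p>I am in p-2</p>"
    "<ul><li><p>card1 - List index -1</p></li></ul></body>"
)
bs = BeautifulSoup(html_doc, "html.parser")

# Text of every <p> tag anywhere in the document.
paragraph_texts = [p.get_text() for p in bs.find_all("p")]

# recursive=False restricts the search to direct children of <body>.
top_level_only = [p.get_text() for p in bs.body.find_all("p", recursive=False)]

# limit stops the search after the given number of matches.
first_two = bs.find_all("p", limit=2)

print(paragraph_texts)
print(top_level_only)
print(len(first_two))
```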
Find the element having id='card1' in the document
bs.find(id='card1')
# Returns the whole <div id="card1"> Tag, including its children:
# the Card1-link1 link, the three list items, the Card1-Custom-Attribute span
# and the card1-footer div.
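The Tag returned by find can itself be searched, which scopes later lookups to that element. A minimal sketch, assuming a cut-down version of the two cards from the reference HTML:

```python
from bs4 import BeautifulSoup

# Assumption: both cards reduced to a single list item each.
html_doc = """
<div id="card1" class="card"><ul><li><p>card1 - List index -1</p></li></ul></div>
<div id="card2" class="card"><ul><li><p>card2 - List index -1</p></li></ul></div>
"""
bs = BeautifulSoup(html_doc, "html.parser")

card1 = bs.find(id="card1")

# Searching from card1 only looks at its descendants, so card2 is ignored.
card1_items = [p.get_text() for p in card1.find_all("p")]

print(card1["id"])
print(card1_items)
```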
find vs find_all
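find fetches the first element that matches the given search criteria.
bs.find(class_='email')   # Note: use class_ when searching on the class attribute, since class is a reserved word in Python
# <a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>
find_all fetches all the elements matching the given search criteria.
bs.find_all(class_='email')
# [<a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>,
#  <a class="email" href="/c2-link3" title="c2-Link3">This is Card2-link3</a>]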
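The two methods also differ when nothing matches: find returns None, while find_all returns an empty result. A short sketch follows; the class name phone is made up for illustration and does not appear in the reference HTML.

```python
from bs4 import BeautifulSoup

# Assumption: a single link taken from the reference HTML.
html_doc = '<a href="/c1-link3" class="email">This is Card1-link3</a>'
bs = BeautifulSoup(html_doc, "html.parser")

no_match = bs.find(class_="phone")         # nothing matches, so this is None
no_matches = bs.find_all(class_="phone")   # nothing matches, so this is empty

# Guard against None before reading anything from the result of find.
match = bs.find(class_="email")
link_text = match.get_text() if match is not None else None

print(no_match, len(no_matches), link_text)
```

Checking for None before touching the result of find avoids an AttributeError on pages where the element is missing.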
Advanced Search
BeautifulSoup supports searching for elements in nested hierarchies using CSS selector syntax.
Find all <b> tags in the <div id="card1-footer">
This can be achieved in multiple ways, as shown below:
- bs.select('div[id=card1-footer] b')
- bs.select('[id=card1-footer] b')
- bs.select('#card1 div.otd-footer b')
# [<b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b>,
#  <b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b>,
#  <b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b>]
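To check that the three selectors really pick out the same elements, they can be run side by side. This is a sketch on a shortened card1 with two footer links instead of three; recent releases of beautifulsoup4 rely on the soupsieve package for select, which pip normally installs alongside it.

```python
from bs4 import BeautifulSoup

# Assumption: card1 from the reference HTML, cut down to two footer links.
html_doc = """
<div id="card1" class="card">
  <div id="card1-footer" class="otd-footer hlist noprint">
    <ul><li><b><a href="/c1-link2">This is Card1-link2</a></b></li>
    <li><b><a href="/c1-link3" class="email">This is Card1-link3</a></b></li></ul>
  </div>
</div>
"""
bs = BeautifulSoup(html_doc, "html.parser")

selectors = [
    "div[id=card1-footer] b",    # attribute selector on a <div>
    "[id=card1-footer] b",       # attribute selector on any tag
    "#card1 div.otd-footer b",   # id selector, then class selector
]

# Collect the text of the matched <b> tags for each selector.
results = [[b.get_text() for b in bs.select(selector)] for selector in selectors]

for selector, texts in zip(selectors, results):
    print(selector, texts)
```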
Extract the text of the class="email" element inside <div id="card2-footer">
bs.select('#card2-footer a.email')[0].string
# 'This is Card2-link3'
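When only the first match is needed, select_one can stand in for select(...)[0]; it returns None rather than raising IndexError when nothing matches. A minimal sketch on a shortened card2 footer:

```python
from bs4 import BeautifulSoup

# Assumption: the card2 footer from the reference HTML, cut down to one link.
html_doc = """
<div id="card2-footer" class="otd-footer hlist noprint">
  <ul><li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li></ul>
</div>
"""
bs = BeautifulSoup(html_doc, "html.parser")

# select_one returns the first matching Tag, or None if there is no match.
link = bs.select_one("#card2-footer a.email")
text = link.string if link is not None else None
href = link["href"] if link is not None else None

absent = bs.select_one("#card2-footer a.phone")   # hypothetical class, no match

print(text, href, absent)
```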