Table of Contents
- Part-1 : Introduction to Beautiful Soup (this blog)
- Part-2 : Real world web scraping example (here)
There are multiple libraries and tools for web scraping, but BeautifulSoup shines in this area with its relatively gentle learning curve and powerful search capabilities.
Beautiful Soup is a Python web scraping library that makes it easy to extract data from web pages. It is meant to parse a single web page and then analyse and extract its data. In this respect it differs from Scrapy, a web crawling library that has more advanced functionality and is meant to crawl across websites. You can check out the blog I have written on Scrapy here.
Reference HTML
We will use the HTML document below as the reference HTML for this tutorial.
<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<!-- The p tags become an array of objects -->
<p>I am in p-1</p>
<p>I am in p-2</p>
<p>I am in p-3</p>
<div id="card1" class="card" style="padding:0.1em 0.6em 0.5em;">
  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
  <ul>
    <li><p>card1 - List index -1</p></li>
    <li><p>card1 - List index -2</p></li>
    <li><p>card1 - List index -3</p></li>
  </ul>
  <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
  <div id="card1-footer" class="otd-footer hlist noprint" style="text-align: right;">
    <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
    <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li>
    <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
  </div>
</div>
<div id="card2" class="card" style="padding:0.1em 0.6em 0.5em;">
  <a href="/link2" title="Link2">This is link2</a>
  <ul>
    <li><p>card2 - List index -1</p></li>
    <li><p>card2 - List index -2</p></li>
    <li><p>card2 - List index -3</p></li>
  </ul>
  <span customattr="Card2-Custom-Attr">Card2-Custom-Attribute</span>
  <div id="card2-footer" class="otd-footer hlist noprint" style="text-align: right;">
    <ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li>
    <li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li>
    <li><b><a href="/c2-link4" title="c2-Link4">This is Card2-link4</a></b></li></ul>
  </div>
</div>
</body>
</html>
To use BeautifulSoup in Python, you first need to install the library:
pip install beautifulsoup4
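Python code to parse the HTML
from bs4 import BeautifulSoup
# Parse the reference HTML (saved as bs.html) with Python's built-in parser
with open("bs.html") as file:
    bs = BeautifulSoup(file, 'html.parser')
# Pretty-print the parsed tree, one tag or string per line
print(bs.prettify())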
This will result in the following output:
<html>
 <head>
  <title>
   Beautiful Soup Tutorial
  </title>
 </head>
 <body>
  <!-- The p tags become an array of objects -->
  <p>
   I am in p-1
  </p>
  <p>
   I am in p-2
  </p>
  <p>
   I am in p-3
  </p>
  <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
   <a href="/c1-link1" title="C1-Link1">
    This is Card1-link1
   </a>
   <ul>
    <li>
     <p>
      card1 - List index -1
     </p>
    </li>
    <li>
     <p>
      card1 - List index -2
     </p>
    </li>
    <li>
     <p>
      card1 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card1-Custom-Attr">
    Card1-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c1-link2" title="C1-Link2">
        This is Card1-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c1-link3" title="C1-Link3">
        This is Card1-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c1-link4" title="C1-Link4">
        This is Card1-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
  <div class="card" id="card2" style="padding:0.1em 0.6em 0.5em;">
   <a href="/link2" title="Link2">
    This is link2
   </a>
   <ul>
    <li>
     <p>
      card2 - List index -1
     </p>
    </li>
    <li>
     <p>
      card2 - List index -2
     </p>
    </li>
    <li>
     <p>
      card2 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card2-Custom-Attr">
    Card2-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card2-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c2-link2" title="c2-Link2">
        This is Card2-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c2-link3" title="c2-Link3">
        This is Card2-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c2-link4" title="c2-Link4">
        This is Card2-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
 </body>
</html>
When BeautifulSoup parses this document it constructs 4 types of objects internally:
- BeautifulSoup : The parsed document as a whole.
- Tag : Each tag such as <head>, <body>, <div>, <p>, <a>, etc. becomes an object with its corresponding attributes and pointers to its children and parent.
- NavigableString : Captures the text inside a tag. For example, in <a href="/somelink"> Captured Text </a> the NavigableString holds " Captured Text ".
- Comment : All comments in the HTML are captured as Comment objects.
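To see these four object types side by side, here is a minimal sketch. It uses a trimmed-down stand-in for the reference HTML (the `html` string and the variable names are my own, chosen only for illustration) and simply prints the class name of each object.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: a tiny stand-in for the reference HTML, not the full document.
html = "<html><body><!-- a comment --><p>I am in p-1</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.p                   # the first <p> element
p_text = p_tag.string            # the text inside that <p>
comment = soup.body.contents[0]  # the first child of <body> is the comment

print(type(soup).__name__)     # BeautifulSoup
print(type(p_tag).__name__)    # Tag
print(type(p_text).__name__)   # NavigableString
print(type(comment).__name__)  # Comment
```

Note that Comment is a subclass of NavigableString, so an `isinstance(x, NavigableString)` check is also true for comments.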
Let's start by navigating the constructed BeautifulSoup object:
bs.title
# <title>Beautiful Soup Tutorial</title>
type(bs.title)
# <class 'bs4.element.Tag'>
bs.title.text
# 'Beautiful Soup Tutorial'
bs.head
# <head><title>Beautiful Soup Tutorial</title></head>
bs.head.parent.name  # .parent returns the Tag object of the current Tag's parent
# 'html'
[child.name for child in bs.head.children]
# ['title'] --> head has only one child.
# Accessing a tag name as an attribute (bs.p) returns the first tag with that name.
bs.p
# <p>I am in p-1</p>
bs.p[1]  # bs.p returns only a single Tag. It does not return an array.
# KeyError: 1
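The KeyError happens because square brackets on a Tag look up an HTML attribute, not a position. The sketch below, again on a shortened stand-in for the reference HTML (the `html` string and variable names are assumptions for illustration), shows how to reach the second <p> and how attribute lookup behaves.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the reference HTML.
html = (
    "<body><p>I am in p-1</p><p>I am in p-2</p>"
    '<a href="/c1-link1" title="C1-Link1">This is Card1-link1</a></body>'
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                  # shorthand for soup.find("p")
second_p = soup.find_all("p")[1]  # find_all returns a list-like ResultSet
link_href = soup.a["href"]        # square brackets read an attribute value
missing = soup.a.get("class")     # .get returns None instead of raising KeyError

print(first_p.text)   # I am in p-1
print(second_p.text)  # I am in p-2
print(link_href)      # /c1-link1
print(missing)        # None
```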
Searching the Tree
Find all <p> tags in the document
bs.find_all('p')
# [<p>I am in p-1</p>,
#  <p>I am in p-2</p>,
#  <p>I am in p-3</p>,
#  <p>card1 - List index -1</p>,
#  <p>card1 - List index -2</p>,
#  <p>card1 - List index -3</p>,
#  <p>card2 - List index -1</p>,
#  <p>card2 - List index -2</p>,
#  <p>card2 - List index -3</p>]
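In practice you usually want the text of each match rather than the Tag objects themselves. A minimal sketch, assuming a shortened version of the reference HTML (the `html` string is my own), that collects the text of every <p> and also shows the `limit` argument of find_all:

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the reference HTML.
html = """
<p>I am in p-1</p>
<p>I am in p-2</p>
<ul><li><p>card1 - List index -1</p></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all searches the whole tree, so the nested <p> is included too.
all_text = [p.get_text() for p in soup.find_all("p")]

# limit stops the search after the given number of matches.
first_two = [p.get_text() for p in soup.find_all("p", limit=2)]

print(all_text)   # ['I am in p-1', 'I am in p-2', 'card1 - List index -1']
print(first_two)  # ['I am in p-1', 'I am in p-2']
```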
Find the element having id='card1' in the document
bs.find(id='card1')
# <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">...</div>
# (the whole card1 element is returned, including its link, list, span and footer)
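Once you have the element, you can read its attributes or keep searching inside it. The sketch below uses a reduced copy of the two cards (the `html` string is an assumption for illustration) and also shows the `attrs` dictionary, which is handy for attributes such as customattr.

```python
from bs4 import BeautifulSoup

# Assumption: a reduced copy of the two cards from the reference HTML.
html = """
<div id="card1" class="card">
  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
  <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
</div>
<div id="card2" class="card">
  <span customattr="Card2-Custom-Attr">Card2-Custom-Attribute</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

card = soup.find(id="card1")
card_classes = card["class"]  # class is multi-valued, so this is a list
link_title = card.a["title"]  # navigate inside the matched element

# Search by an arbitrary attribute using the attrs dictionary.
span = soup.find("span", attrs={"customattr": "Card2-Custom-Attr"})

print(card_classes)  # ['card']
print(link_title)    # C1-Link1
print(span.text)     # Card2-Custom-Attribute
```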
find vs find_all
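find fetches the first element that matches the given search criteria.
bs.find(class_='email')  # Note: use class_ when searching by class, since class is a reserved word in Python
# <a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>
find_all fetches all the elements matching the given search criteria.
bs.find_all(class_='email')
# [<a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>,
#  <a class="email" href="/c2-link3" title="c2-Link3">This is Card2-link3</a>]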
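The two methods also differ when nothing matches: find returns None, while find_all returns an empty list. A small sketch, assuming a cut-down list of the email links and a class name `phone` that I made up because it does not occur in the reference HTML:

```python
from bs4 import BeautifulSoup

# Assumption: a cut-down list holding the two email links.
html = (
    '<ul><li><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></li>'
    '<li><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find(class_="email")          # a single Tag
every = soup.find_all(class_="email")      # a list of Tags
nothing = soup.find(class_="phone")        # "phone" is a made-up class
none_list = soup.find_all(class_="phone")  # no matches at all

print(first.text)  # This is Card1-link3
print(len(every))  # 2
print(nothing)     # None
print(none_list)   # []
```

Because find can return None, it is worth checking the result before accessing `.text` or an attribute on it.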
Advanced Search
BeautifulSoup supports searching for elements in nested hierarchies using CSS selector syntax.
Find all <b> tags in the <div id="card1-footer">
This can be achieved in multiple ways, as shown below:
- bs.select('div[id=card1-footer] b')
- bs.select('[id=card1-footer] b')
- bs.select('#card1 div.otd-footer b')
# [<b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b>,
#  <b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b>,
#  <b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b>]
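To check that the three selectors really pick the same elements, here is a rough sketch that runs each one against a reduced copy of the two card footers (the `html` string is my own shortened version) and compares the extracted text. It assumes a beautifulsoup4 install that ships with CSS selector support, which is what `pip install beautifulsoup4` normally gives you.

```python
from bs4 import BeautifulSoup

# Assumption: a reduced copy of the two card footers from the reference HTML.
html = """
<div id="card1" class="card">
  <div id="card1-footer" class="otd-footer hlist noprint">
    <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
    <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li></ul>
  </div>
</div>
<div id="card2" class="card">
  <div id="card2-footer" class="otd-footer hlist noprint">
    <ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = [
    "div[id=card1-footer] b",   # tag name plus attribute match
    "[id=card1-footer] b",      # attribute match on any tag
    "#card1 div.otd-footer b",  # id, then class, then descendant <b>
]

# Collect the text of every <b> matched by each selector.
results = [[b.get_text() for b in soup.select(s)] for s in selectors]

for selector, texts in zip(selectors, results):
    print(selector, "->", texts)
```

All three selectors return only the card1 links; the <b> inside card2-footer is never matched.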
Extract the text of the class="email" element inside <div id="card2-footer">
bs.select('#card2-footer a.email')[0].string
# 'This is Card2-link3'
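The same idea extends to pulling out several fields at once. A minimal sketch, assuming a reduced copy of the card2 footer (the `html` string is my own), that uses select_one for a single match and collects the href and text of every link in the footer:

```python
from bs4 import BeautifulSoup

# Assumption: a reduced copy of the card2 footer from the reference HTML.
html = (
    '<div id="card2-footer" class="otd-footer hlist noprint"><ul>'
    '<li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li>'
    '<li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li>'
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
email_link = soup.select_one("#card2-footer a.email")

# Pair each link's href attribute with its visible text.
links = [(a["href"], a.get_text()) for a in soup.select("#card2-footer a")]

print(email_link.string)   # This is Card2-link3
print(email_link["href"])  # /c2-link3
print(links)
```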