Table of Contents
- Part-1 : Introduction to Beautiful Soup (this blog)
- Part-2 : Real world web scraping example (here)
Several libraries and tools can scrape web pages, but BeautifulSoup shines in this area with its relatively easy learning curve and powerful search capabilities.
Beautiful Soup is a Python web scraping library that makes it easy to extract data from web pages. It is meant to parse a single web page and then analyse and extract its data. In this respect it differs from Scrapy, a web crawling library that has more advanced functionality and is meant to crawl across websites. You can check out the blog I have written on Scrapy here.
Reference HTML
We will be using the HTML document below as our reference HTML for this tutorial.

```html
<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<p>I am in p-1</p>
<p>I am in p-2</p>
<p>I am in p-3</p>
<div id="card1" class="card" style="padding:0.1em 0.6em 0.5em;">
  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
  <ul>
    <li><p>card1 - List index -1</p></li>
    <li><p>card1 - List index -2</p></li>
    <li><p>card1 - List index -3</p></li>
  </ul>
  <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
  <div id="card1-footer" class="otd-footer hlist noprint" style="text-align: right;">
    <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
    <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li>
    <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
  </div>
</div>
<div id="card2" class="card" style="padding:0.1em 0.6em 0.5em;">
  <a href="/link2" title="Link2">This is link2</a>
  <ul>
    <li><p>card2 - List index -1</p></li>
    <li><p>card2 - List index -2</p></li>
    <li><p>card2 - List index -3</p></li>
  </ul>
  <span customattr="Card2-Custom-Attr">Card2-Custom-Attribute</span>
  <div id="card2-footer" class="otd-footer hlist noprint" style="text-align: right;">
    <ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li>
    <li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li>
    <li><b><a href="/c2-link4" title="c2-Link4">This is Card2-link4</a></b></li></ul>
  </div>
</div>
</body>
</html>
```
In order to use BeautifulSoup in Python, you need to install the library.

```shell
pip install beautifulsoup4
```
Python Code to parse the html
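```python
from bs4 import BeautifulSoup

# Parse the reference HTML (saved as bs.html) with Python's built-in parser.
with open("bs.html") as file:
    bs = BeautifulSoup(file, 'html.parser')

# prettify() returns the parsed tree as an indented string.
print(bs.prettify())
```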
This will result in the following output:
```html
<html>
 <head>
  <title>
   Beautiful Soup Tutorial
  </title>
 </head>
 <body>
  <!-- The p tags become an array of objects -->
  <p>
   I am in p-1
  </p>
  <p>
   I am in p-2
  </p>
  <p>
   I am in p-3
  </p>
  <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
   <a href="/c1-link1" title="C1-Link1">
    This is Card1-link1
   </a>
   <ul>
    <li>
     <p>
      card1 - List index -1
     </p>
    </li>
    <li>
     <p>
      card1 - List index -2
     </p>
    </li>
    <li>
     <p>
      card1 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card1-Custom-Attr">
    Card1-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c1-link2" title="C1-Link2">
        This is Card1-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c1-link3" title="C1-Link3">
        This is Card1-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c1-link4" title="C1-Link4">
        This is Card1-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
  <div class="card" id="card2" style="padding:0.1em 0.6em 0.5em;">
   <a href="/link2" title="Link2">
    This is link2
   </a>
   <ul>
    <li>
     <p>
      card2 - List index -1
     </p>
    </li>
    <li>
     <p>
      card2 - List index -2
     </p>
    </li>
    <li>
     <p>
      card2 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card2-Custom-Attr">
    Card2-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card2-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c2-link2" title="c2-Link2">
        This is Card2-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c2-link3" title="c2-Link3">
        This is Card2-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c2-link4" title="c2-Link4">
        This is Card2-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
 </body>
</html>
```
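If you do not have bs.html saved on disk, the same parse should work on a plain string. Here is a minimal sketch, assuming a trimmed-down copy of the reference HTML; the `html_doc` name is only for illustration and is not part of the original example.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the reference HTML, kept in a string (illustrative only).
html_doc = """
<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<p>I am in p-1</p>
<div id="card1" class="card">
  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
</div>
</body>
</html>
"""

# BeautifulSoup accepts a string as well as an open file handle.
bs = BeautifulSoup(html_doc, "html.parser")

title_text = bs.title.text   # text inside the <title> tag
first_link = bs.a["href"]    # href of the first <a> tag
print(title_text)
print(first_link)
```

The later snippets in this post can be tried the same way, by swapping in whichever part of the reference HTML they need.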
When BeautifulSoup parses this document, it constructs four types of objects internally:
- BeautifulSoup : The parsed document as a whole.
- Tag : Each tag such as <head>, <body>, <div>, <p>, <a>, etc. becomes an object with its corresponding attributes and pointers to its children and parent.
- NavigableString : Captures the text inside a tag. For example, in <a href="/somelink">Captured Text</a> the NavigableString is "Captured Text".
- Comment : All comments in the HTML get captured as Comment objects.
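To see the four object types side by side, here is a small sketch. It assumes a one-line HTML snippet of my own (not the reference document) that contains both a comment and some text inside a <p> tag.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative snippet: one tag holding a comment followed by text.
soup = BeautifulSoup("<p><!-- a note -->Hello</p>", "html.parser")

p_tag = soup.p
# The <p> tag has two children: the comment and the text node.
comment, text = p_tag.contents

kinds = {
    "document": type(soup).__name__,
    "tag": type(p_tag).__name__,
    "comment": type(comment).__name__,
    "text": type(text).__name__,
}
print(kinds)
```

Note that Comment is a subclass of NavigableString, so an `isinstance(node, NavigableString)` check is true for comments too. Check for Comment first if you need to tell them apart.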
Let's start by navigating the constructed BeautifulSoup object:
```python
bs.title
# <title>Beautiful Soup Tutorial</title>

type(bs.title)
# <class 'bs4.element.Tag'>

bs.title.text
# 'Beautiful Soup Tutorial'

bs.head
# <head><title>Beautiful Soup Tutorial</title></head>

# .parent returns the Tag object of the current Tag's parent.
bs.head.parent.name
# 'html'

[child.name for child in bs.head.children]
# ['title'] --> head has only one child.

# bs.<tag name> returns the first tag with that name.
bs.p
# <p>I am in p-1</p>

# bs.p returns only a single Tag, not a list. Square brackets on a Tag
# look up an attribute, so an integer index fails.
bs.p[1]
# *** KeyError: 1
```
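The KeyError above comes from the fact that square brackets on a Tag read its attributes rather than index into a list. A short sketch of attribute access, assuming a cut-down card1 snippet (the variable names are mine):

```python
from bs4 import BeautifulSoup

# Cut-down version of card1 from the reference HTML (illustrative only).
html_doc = (
    '<div id="card1" class="card">'
    '<a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>'
    '</div>'
)
bs = BeautifulSoup(html_doc, "html.parser")
link = bs.a

href = link["href"]             # raises KeyError if the attribute is absent
target = link.get("target")     # .get() returns None instead of raising
card_classes = bs.div["class"]  # class is multi-valued, so this is a list
attrs = link.attrs              # all attributes as a dict

# An integer is treated as an attribute name too, hence the KeyError.
try:
    link[1]
    index_error = None
except KeyError as exc:
    index_error = exc
print(href, target, card_classes, attrs)
```

If you want the second <p> tag, index into the list returned by `find_all` instead, as covered in the next section.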
Searching the Tree
Find all <p> tags in the document
```python
bs.find_all('p')
# [<p>I am in p-1</p>,
#  <p>I am in p-2</p>,
#  <p>I am in p-3</p>,
#  <p>card1 - List index -1</p>,
#  <p>card1 - List index -2</p>,
#  <p>card1 - List index -3</p>,
#  <p>card2 - List index -1</p>,
#  <p>card2 - List index -2</p>,
#  <p>card2 - List index -3</p>]
```
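Since `find_all` returns a list-like result, you can index it, loop over it, or cap the number of matches. A minimal sketch, assuming only the three top-level <p> tags from the reference HTML:

```python
from bs4 import BeautifulSoup

# Only the three top-level <p> tags from the reference HTML (illustrative).
html_doc = "<body><p>I am in p-1</p><p>I am in p-2</p><p>I am in p-3</p></body>"
bs = BeautifulSoup(html_doc, "html.parser")

paragraphs = bs.find_all("p")

# Pull just the text out of every match.
texts = [p.get_text() for p in paragraphs]

# Regular list indexing works on the result.
second = paragraphs[1]

# limit stops the search after the given number of matches.
first_two = bs.find_all("p", limit=2)
print(texts)
```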
Find the element having id='card1' in the document
```python
bs.find(id='card1')
# <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
#   <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
#   <ul>
#     <li><p>card1 - List index -1</p></li>
#     <li><p>card1 - List index -2</p></li>
#     <li><p>card1 - List index -3</p></li>
#   </ul>
#   <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
#   <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
#     <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
#     <li><b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b></li>
#     <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
#   </div>
# </div>
```
find vs find_all
```python
# find fetches the first element that matches the given search criteria.
# Note: use class_ when searching on the class attribute, because class
# is a reserved word in Python.
bs.find(class_='email')
# <a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>

# find_all fetches all the elements matching the given search criteria.
bs.find_all(class_='email')
# [<a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>,
#  <a class="email" href="/c2-link3" title="c2-Link3">This is Card2-link3</a>]
```
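One practical difference worth knowing: when nothing matches, `find` returns None while `find_all` returns an empty list. A small sketch, assuming a two-link snippet and a made-up class name `does-not-exist`:

```python
from bs4 import BeautifulSoup

# Two links, one with class="email" (cut down from the reference HTML).
html_doc = (
    '<a href="/c1-link2" title="C1-Link2">This is Card1-link2</a>'
    '<a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a>'
)
bs = BeautifulSoup(html_doc, "html.parser")

# "does-not-exist" is a hypothetical class that matches nothing.
missing_one = bs.find(class_="does-not-exist")      # None
missing_all = bs.find_all(class_="does-not-exist")  # empty list

# Guard against None before reading attributes from a find() result.
link = bs.find("a", class_="email")
href = link["href"] if link is not None else None
print(missing_one, missing_all, href)
```

Forgetting the None check is a common source of `'NoneType' object is not subscriptable` errors when a page's layout changes.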
Advanced Search
BeautifulSoup supports searching for elements in nested hierarchies using CSS selector syntax.
Find all <b> tags in the <div id="card1-footer">
This can be achieved in multiple ways, as shown below:
- bs.select('div[id=card1-footer] b')
- bs.select('[id=card1-footer] b')
- bs.select('#card1 div.otd-footer b')

```python
# [<b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b>,
#  <b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b>,
#  <b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b>]
```
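To check that the three selectors really pick the same elements, here is a rough sketch. It assumes a shortened version of the two cards (fewer links than the reference HTML) and compares the text of whatever each selector returns.

```python
from bs4 import BeautifulSoup

# Shortened cards: two links in card1's footer, one in card2's (illustrative).
html_doc = """
<div id="card1" class="card">
  <div id="card1-footer" class="otd-footer hlist noprint">
    <ul><li><b><a href="/c1-link2">This is Card1-link2</a></b></li>
    <li><b><a href="/c1-link3" class="email">This is Card1-link3</a></b></li></ul>
  </div>
</div>
<div id="card2" class="card">
  <div id="card2-footer" class="otd-footer hlist noprint">
    <ul><li><b><a href="/c2-link2">This is Card2-link2</a></b></li></ul>
  </div>
</div>
"""
bs = BeautifulSoup(html_doc, "html.parser")

selectors = [
    "div[id=card1-footer] b",
    "[id=card1-footer] b",
    "#card1 div.otd-footer b",
]

# Collect the text of every <b> each selector matches.
results = [[b.get_text() for b in bs.select(s)] for s in selectors]

# select_one returns only the first match (or None if nothing matches).
first_b = bs.select_one("#card1-footer b")
print(results)
```

All three lists come out identical here, and none of them picks up the <b> tag in card2's footer.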
Extract the text of the class="email" element inside <div id="card2-footer">

```python
bs.select('#card2-footer a.email')[0].string
# 'This is Card2-link3'
```
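`.string` works here because the <a> tag holds exactly one piece of text. On a tag with several children it returns None, and `get_text()` is the safer choice. A minimal sketch, assuming a shortened card2 footer with two links:

```python
from bs4 import BeautifulSoup

# Shortened card2 footer with two links (illustrative only).
html_doc = (
    '<div id="card2-footer">'
    '<ul><li><b><a href="/c2-link2">This is Card2-link2</a></b></li>'
    '<li><b><a href="/c2-link3" class="email">This is Card2-link3</a></b></li></ul>'
    '</div>'
)
bs = BeautifulSoup(html_doc, "html.parser")

link = bs.select_one("#card2-footer a.email")
link_text = link.string   # the single string inside the <a> tag
link_href = link["href"]  # attributes are read with square brackets

footer = bs.select_one("#card2-footer")
footer_string = footer.string  # None: the <ul> inside has more than one child

# get_text() joins every string under the tag; the separator is optional.
footer_text = footer.get_text(" ", strip=True)
print(link_text, link_href, footer_string, footer_text)
```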