A gentle introduction to BeautifulSoup (Web Scraping Library)

Part-1 : Introduction to Beautiful Soup (this blog)
Part-2 : Real world web scraping example (here)

In our day-to-day work, we often need to extract data from the web for data mining and analytics. It can even be for a simple personal project, where you write scripts to fetch the latest price of your stocks, get a weather report, check the confirmation status of a ticket, etc.

There are multiple libraries and tools that get this job done, but BeautifulSoup shines in this area with its gentle learning curve and powerful search capabilities.

Beautiful Soup is a Python web scraping library that makes it easy to extract data from web pages. It is meant to parse a single web page and to analyse and extract its data. In this respect it differs from Scrapy, a web crawling library with more advanced functionality that is meant to crawl across websites. You can check out the blog I have written on Scrapy here.

Reference HTML

We will use the HTML document below as our reference HTML for this tutorial.

<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<p>I am in p-1</p>
<p>I am in p-2</p>
<p>I am in p-3</p>
<div id="card1" class="card" style="padding:0.1em 0.6em 0.5em;">
 <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
 <ul>
 <li><p>card1 - List index -1</p></li>
 <li><p>card1 - List index -2</p></li>
 <li><p>card1 - List index -3</p></li>
 </ul>
 <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
 <div id="card1-footer" class="otd-footer hlist noprint" style="text-align: right;">
  <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
  <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li>
  <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
 </div>
</div>
<div id="card2" class="card" style="padding:0.1em 0.6em 0.5em;">
 <a href="/link2" title="Link2">This is link2</a>
 <ul>
 <li><p>card2 - List index -1</p></li>
 <li><p>card2 - List index -2</p></li>
 <li><p>card2 - List index -3</p></li>
 </ul>
 <span customattr="Card2-Custom-Attr">Card2-Custom-Attribute</span>
 <div id="card2-footer" class="otd-footer hlist noprint" style="text-align: right;">
  <ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li>
  <li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li>
  <li><b><a href="/c2-link4" title="c2-Link4">This is Card2-link4</a></b></li></ul>
 </div>
</div>
</body>
</html>

To use BeautifulSoup in Python, you first need to install the library.

pip install beautifulsoup4

Python code to parse the HTML

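from bs4 import BeautifulSoup

# bs.html holds the reference HTML document
with open("bs.html") as file:
    bs = BeautifulSoup(file, 'html.parser')
    # print() with parentheses works on both Python 2 and Python 3
    print(bs.prettify())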

This will result in the following output:

<html>
 <head>
  <title>
   Beautiful Soup Tutorial
  </title>
 </head>
 <body>
  <p>
   I am in p-1
  </p>
  <p>
   I am in p-2
  </p>
  <p>
   I am in p-3
  </p>
  <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
   <a href="/c1-link1" title="C1-Link1">
    This is Card1-link1
   </a>
   <ul>
    <li>
     <p>
      card1 - List index -1
     </p>
    </li>
    <li>
     <p>
      card1 - List index -2
     </p>
    </li>
    <li>
     <p>
      card1 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card1-Custom-Attr">
    Card1-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c1-link2" title="C1-Link2">
        This is Card1-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c1-link3" title="C1-Link3">
        This is Card1-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c1-link4" title="C1-Link4">
        This is Card1-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
  <div class="card" id="card2" style="padding:0.1em 0.6em 0.5em;">
   <a href="/link2" title="Link2">
    This is link2
   </a>
   <ul>
    <li>
     <p>
      card2 - List index -1
     </p>
    </li>
    <li>
     <p>
      card2 - List index -2
     </p>
    </li>
    <li>
     <p>
      card2 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card2-Custom-Attr">
    Card2-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card2-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c2-link2" title="c2-Link2">
        This is Card2-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c2-link3" title="c2-Link3">
        This is Card2-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c2-link4" title="c2-Link4">
        This is Card2-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
 </body>
</html>
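
BeautifulSoup does not require a file on disk; it can also parse markup held in a string. The snippet below is a minimal sketch of that, assuming a trimmed-down copy of the reference HTML (the `html` variable and its shortened contents are made up for illustration).

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the reference HTML, kept in a string.
html = """
<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<p>I am in p-1</p>
<p>I am in p-2</p>
</body>
</html>
"""

# The constructor accepts a string as well as an open file handle.
bs = BeautifulSoup(html, "html.parser")

title_text = bs.title.string   # text inside <title>
first_p = bs.p.get_text()      # text inside the first <p>

print(title_text)
print(first_p)
```

This is handy when the HTML comes from an HTTP response body rather than a saved file.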

When BeautifulSoup parses this document, it constructs four types of objects internally:

BeautifulSoup : The parsed document as a whole.
Tag : Each tag, such as <head>, <body>, <div>, <p>, <a>, etc., becomes an object with its attributes and pointers to its parent and children.
NavigableString : Captures the text inside a tag. For example, in <a href="/somelink"> Captured Text </a> the NavigableString is " Captured Text ".
Comment : Every comment in the HTML is captured as a Comment object.
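
The four object types can be checked with isinstance. This is a small sketch on a one-line document invented for the purpose (the `html` string is an assumption, not the reference HTML).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: a tiny document containing one comment and one link.
html = '<body><!-- a comment --><a href="/somelink">Captured Text</a></body>'
bs = BeautifulSoup(html, "html.parser")

link = bs.a                     # a Tag
text = link.string              # the NavigableString inside the <a>
comment = bs.body.contents[0]   # the first child of <body> is the comment

print(type(bs).__name__)
print(isinstance(link, Tag))
print(isinstance(text, NavigableString))
print(isinstance(comment, Comment))
```

Comment is a subclass of NavigableString, so code that walks text nodes may need to filter comments out explicitly.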

Let's start by navigating the constructed BeautifulSoup object:

bs.title
# <title>Beautiful Soup Tutorial</title>

type(bs.title)
# <class 'bs4.element.Tag'>

bs.title.text
# 'Beautiful Soup Tutorial'

bs.head
# <head><title>Beautiful Soup Tutorial</title></head>

bs.head.parent.name  # .parent returns the parent Tag; .name is that tag's name
# 'html'

[child.name for child in bs.head.children]
# ['title'] --> head has only one child.

bs.<tag> (for example bs.p) returns the first tag with that name in the document.

bs.p
# <p>I am in p-1</p>
bs.p[1]  # bs.p returns only a single Tag, not a list. Indexing a Tag looks up an attribute by name.
# *** KeyError: 1
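
Since bs.p gives only the first match, reaching the second <p> needs a different route. The sketch below shows two options, assuming a shortened copy of the reference HTML (the `html` string is illustrative).

```python
from bs4 import BeautifulSoup

# Assumption: only the top of the reference HTML is reproduced here.
html = """
<html><head><title>Beautiful Soup Tutorial</title></head>
<body>
<p>I am in p-1</p>
<p>I am in p-2</p>
<p>I am in p-3</p>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

first = bs.p                              # first <p> in the document
second = bs.find_all("p")[1]              # find_all returns a list, so it can be indexed
sibling = first.find_next_sibling("p")    # next <p> at the same level
parent_name = first.parent.name           # name of the enclosing tag

print(second.get_text())
print(sibling.get_text())
print(parent_name)
```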

Searching the Tree

Find all <p> tags in the document

bs.find_all('p')
# [<p>I am in p-1</p>,
#  <p>I am in p-2</p>,
#  <p>I am in p-3</p>,
#  <p>card1 - List index -1</p>,
#  <p>card1 - List index -2</p>,
#  <p>card1 - List index -3</p>,
#  <p>card2 - List index -1</p>,
#  <p>card2 - List index -2</p>,
#  <p>card2 - List index -3</p>]
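
find_all also takes the optional limit and recursive arguments, which narrow the result. The following is a rough sketch on a cut-down version of the reference HTML (the `html` string is an assumption made for brevity).

```python
from bs4 import BeautifulSoup

# Assumption: two top-level <p> tags and one <p> nested inside a card.
html = """
<body>
<p>I am in p-1</p>
<p>I am in p-2</p>
<div id="card1" class="card">
 <ul>
 <li><p>card1 - List index -1</p></li>
 </ul>
</div>
</body>
"""
bs = BeautifulSoup(html, "html.parser")

# Every <p> in the document, at any depth.
texts = [p.get_text() for p in bs.find_all("p")]

# Stop after the first two matches.
first_two = bs.find_all("p", limit=2)

# Only <p> tags that are direct children of <body>.
direct = bs.body.find_all("p", recursive=False)

print(texts)
print(len(first_two), len(direct))
```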

Find the element having id='card1' in the document

bs.find(id='card1')
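# <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
#  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
#  <ul>
#  <li><p>card1 - List index -1</p></li>
#  <li><p>card1 - List index -2</p></li>
#  <li><p>card1 - List index -3</p></li>
#  </ul>
#  <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
#  <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
#   <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
#   <li><b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b></li>
#   <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
#  </div>
# </div>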
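
Once an element has been found, its attributes can be read like dictionary entries. Here is a short sketch, assuming a reduced copy of the card1 markup (the `html` string is illustrative).

```python
from bs4 import BeautifulSoup

# Assumption: card1 from the reference HTML, without its list and footer.
html = """
<div id="card1" class="card" style="padding:0.1em 0.6em 0.5em;">
 <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
 <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

card = bs.find(id="card1")

card_classes = card["class"]          # class is multi-valued, so this is a list
link_href = card.a["href"]            # attribute of the first <a> inside the card
custom = card.span["customattr"]      # non-standard attributes work the same way
missing = card.get("data-missing")    # .get() returns None instead of raising KeyError

print(card_classes, link_href, custom, missing)
```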

find vs find_all

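find fetches the first element that matches the given search criteria

bs.find(class_='email')  # Note: use class_ when searching by class, because class is a reserved word in Python
# <a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>

find_all fetches all the elements that match the given search criteria

bs.find_all(class_='email')
# [<a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>,
#  <a class="email" href="/c2-link3" title="c2-Link3">This is Card2-link3</a>]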
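
The two methods also differ when nothing matches: find returns None, while find_all returns an empty list. The sketch below shows this on a cut-down document (the `html` string and the class name no-such-class are made up for illustration).

```python
from bs4 import BeautifulSoup

# Assumption: only the links from the two card footers are kept.
html = """
<div id="card1-footer">
 <a href="/c1-link2" title="C1-Link2">This is Card1-link2</a>
 <a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a>
</div>
<div id="card2-footer">
 <a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

missing_one = bs.find(class_="no-such-class")       # None
missing_all = bs.find_all(class_="no-such-class")   # empty list

# Tag name and class can be combined in a single search.
hrefs = [a["href"] for a in bs.find_all("a", class_="email")]

print(missing_one, len(missing_all))
print(hrefs)
```

Checking the result of find for None before using it avoids an AttributeError on pages where the element is absent.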

Advanced Search

BeautifulSoup supports searching for elements in nested hierarchies using CSS selector syntax, through the select method.

Find all <b> tags inside <div id="card1-footer">

This can be achieved in multiple ways as shown below:

bs.select('div[id=card1-footer] b')
bs.select('[id=card1-footer] b')
bs.select('#card1 div.otd-footer b')

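# [<b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b>,
#  <b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b>,
#  <b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b>]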
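
To confirm that the three selectors pick the same elements, their results can be compared directly. This is a minimal sketch on a reduced copy of the two cards (the `html` string is an assumption; card2 is included only to show that it is not matched).

```python
from bs4 import BeautifulSoup

# Assumption: both cards reduced to their footers, with fewer links.
html = """
<div id="card1" class="card">
 <div id="card1-footer" class="otd-footer hlist noprint">
  <ul><li><b><a href="/c1-link2">This is Card1-link2</a></b></li>
  <li><b><a href="/c1-link3" class="email">This is Card1-link3</a></b></li></ul>
 </div>
</div>
<div id="card2" class="card">
 <div id="card2-footer" class="otd-footer hlist noprint">
  <ul><li><b><a href="/c2-link2">This is Card2-link2</a></b></li></ul>
 </div>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

selectors = [
    "div[id=card1-footer] b",     # tag name plus attribute match
    "[id=card1-footer] b",        # attribute match on any tag
    "#card1 div.otd-footer b",    # id, then class, then descendant
]

# Collect the text of the matches for each selector.
results = [[b.get_text() for b in bs.select(s)] for s in selectors]

print(results[0])
print(results[0] == results[1] == results[2])
```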

Extract the text of the class="email" element inside <div id="card2-footer">

bs.select('#card2-footer a.email')[0].string
# 'This is Card2-link3'
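
When only one match is expected, select_one saves the [0] indexing and returns None if nothing matches. The sketch below also contrasts .string with get_text(), assuming a reduced copy of the card2 footer (the `html` string and the class name missing are illustrative).

```python
from bs4 import BeautifulSoup

# Assumption: the card2 footer with two of its links.
html = """
<div id="card2-footer" class="otd-footer hlist noprint">
<ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li><li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li></ul>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

link = bs.select_one("#card2-footer a.email")   # first match, or None
link_text = link.get_text()
link_href = link["href"]

no_match = bs.select_one("#card2-footer a.missing")

# .string is None when a tag has more than one child;
# get_text() joins the text of all descendants instead.
ul = bs.select_one("#card2-footer ul")
ul_string = ul.string
ul_text = ul.get_text(" ", strip=True)

print(link_text, link_href, no_match)
print(ul_string, ul_text)
```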
