A gentle introduction to BeautifulSoup (Web Scraping Library)

Table of Contents

  •  Part-1 : Introduction to Beautiful Soup (this blog)
  •  Part-2 : Real world web scraping example (here)
In our day-to-day work, we often need to extract data from the web for data mining and analytics. It can even be for a simple personal project: a script that fetches the latest price of your stocks, a weather report, or the confirmation status of your ticket.

There are multiple libraries and tools to get this job done, but BeautifulSoup shines in this area with its gentle learning curve and powerful search capabilities.

Beautiful Soup is a Python web scraping library that makes it easy to extract data from web pages. It is meant to parse a single web page and then analyse and extract its data. In this respect it differs from Scrapy, a web crawling library with more advanced functionality that is meant to crawl across websites. You can check out the blog I have written on Scrapy here.

Reference HTML

We will be using the below HTML document as our reference HTML for this tutorial.
<html>
<head><title>Beautiful Soup Tutorial</title></head>
<body>
<!-- The p tags become an array of objects -->
<p>I am in p-1</p>
<p>I am in p-2</p>
<p>I am in p-3</p>
<div id="card1" class="card" style="padding:0.1em 0.6em 0.5em;">
 <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
 <ul>
 <li><p>card1 - List index -1</p></li>
 <li><p>card1 - List index -2</p></li>
 <li><p>card1 - List index -3</p></li>
 </ul>
 <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
 <div id="card1-footer" class="otd-footer hlist noprint" style="text-align: right;">
  <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
  <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li>
  <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
 </div>
</div>
<div id="card2" class="card" style="padding:0.1em 0.6em 0.5em;">
 <a href="/link2" title="Link2">This is link2</a>
 <ul>
 <li><p>card2 - List index -1</p></li>
 <li><p>card2 - List index -2</p></li>
 <li><p>card2 - List index -3</p></li>
 </ul>
 <span customattr="Card2-Custom-Attr">Card2-Custom-Attribute</span>
 <div id="card2-footer" class="otd-footer hlist noprint" style="text-align: right;">
  <ul><li><b><a href="/c2-link2" title="c2-Link2">This is Card2-link2</a></b></li>
  <li><b><a href="/c2-link3" class="email" title="c2-Link3">This is Card2-link3</a></b></li>
  <li><b><a href="/c2-link4" title="c2-Link4">This is Card2-link4</a></b></li></ul>
 </div>
</div>
</body>
</html>

To use BeautifulSoup in Python, you first need to install the library:
pip install beautifulsoup4

Python code to parse the HTML
from bs4 import BeautifulSoup

# Parse the reference HTML (saved as bs.html) with Python's built-in parser
with open("bs.html") as file:
    bs = BeautifulSoup(file, 'html.parser')
    print(bs.prettify())

This will result in the following output:
<html>
 <head>
  <title>
   Beautiful Soup Tutorial
  </title>
 </head>
 <body>
  <!-- The p tags become an array of objects -->
  <p>
   I am in p-1
  </p>
  <p>
   I am in p-2
  </p>
  <p>
   I am in p-3
  </p>
  <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
   <a href="/c1-link1" title="C1-Link1">
    This is Card1-link1
   </a>
   <ul>
    <li>
     <p>
      card1 - List index -1
     </p>
    </li>
    <li>
     <p>
      card1 - List index -2
     </p>
    </li>
    <li>
     <p>
      card1 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card1-Custom-Attr">
    Card1-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c1-link2" title="C1-Link2">
        This is Card1-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c1-link3" title="C1-Link3">
        This is Card1-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c1-link4" title="C1-Link4">
        This is Card1-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
  <div class="card" id="card2" style="padding:0.1em 0.6em 0.5em;">
   <a href="/link2" title="Link2">
    This is link2
   </a>
   <ul>
    <li>
     <p>
      card2 - List index -1
     </p>
    </li>
    <li>
     <p>
      card2 - List index -2
     </p>
    </li>
    <li>
     <p>
      card2 - List index -3
     </p>
    </li>
   </ul>
   <span customattr="Card2-Custom-Attr">
    Card2-Custom-Attribute
   </span>
   <div class="otd-footer hlist noprint" id="card2-footer" style="text-align: right;">
    <ul>
     <li>
      <b>
       <a href="/c2-link2" title="c2-Link2">
        This is Card2-link2
       </a>
      </b>
     </li>
     <li>
      <b>
       <a class="email" href="/c2-link3" title="c2-Link3">
        This is Card2-link3
       </a>
      </b>
     </li>
     <li>
      <b>
       <a href="/c2-link4" title="c2-Link4">
        This is Card2-link4
       </a>
      </b>
     </li>
    </ul>
   </div>
  </div>
 </body>
</html>
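If you would rather not save the reference HTML to a file first, the same constructor also accepts the markup as a string. The snippet below is a minimal sketch of that; the shortened `html` string is only a stand-in for the full reference document.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the reference HTML, passed in as a plain string.
html = (
    "<html><head><title>Beautiful Soup Tutorial</title></head>"
    "<body><p>I am in p-1</p></body></html>"
)

# Same parser as in the file-based example above.
bs = BeautifulSoup(html, 'html.parser')

# prettify() returns the parsed tree as an indented string.
pretty = bs.prettify()
print(pretty)
```

Parsing from a string is handy for quick experiments in the interpreter, since you can tweak the markup and re-run without touching any files.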

When BeautifulSoup parses this document, it constructs four types of objects internally:
  • BeautifulSoup : The parsed document as a whole.
  • Tag : Each tag such as <head>, <body>, <div>, <p>, <a> etc. becomes an object with its corresponding attributes and pointers to its children and parent.
  • NavigableString : Captures the text inside a tag. For example, in <a href="/somelink"> Captured Text </a> the NavigableString is " Captured Text ".
  • Comment : All comments in the HTML get captured as Comment objects.
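To see these four object types for yourself, you can inspect a few nodes of a parsed document. This is a small sketch on a made-up one-line document (not the full reference HTML), with a comment included so that a Comment object shows up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A tiny illustrative document containing one comment and one <p> tag.
html = '<html><body><!-- a comment --><p>I am in p-1</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

comment = soup.body.contents[0]  # first child of <body>: the comment node
p_tag = soup.p                   # the first <p> element
p_text = p_tag.string            # the text inside that <p>

print(type(soup).__name__)     # BeautifulSoup
print(type(p_tag).__name__)    # Tag
print(type(p_text).__name__)   # NavigableString
print(type(comment).__name__)  # Comment
```

Note that Comment is a subclass of NavigableString, so if you want to treat comments differently from ordinary text, check for Comment first.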

Let's start by navigating the constructed BeautifulSoup object:
bs.title
# <title>Beautiful Soup Tutorial</title>

type(bs.title)
# <class 'bs4.element.Tag'>

bs.title.text
# 'Beautiful Soup Tutorial'

bs.head
# <head><title>Beautiful Soup Tutorial</title></head>

bs.head.parent.name  # .parent is the Tag object of the current Tag's parent; .name is its tag name
# 'html'

[child.name for child in bs.head.children]
# ['title'] --> head has only one child.

bs.<tag> returns the first element with that tag name found anywhere in the document, not just among the direct children.

bs.p
# <p>I am in p-1</p>
bs.p[1]  --> bs.p returns only a single Tag, not an array. Indexing a Tag looks up an attribute, so this fails.
# *** KeyError: 1
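So how do you get to the second <p>? Two options are sketched below, assuming a shortened copy of the reference HTML: index into the list returned by find_all, or step sideways from the first <p> with find_next_sibling.

```python
from bs4 import BeautifulSoup

# Shortened copy of the reference HTML: just the title and the three <p> tags.
html = """<html><head><title>Beautiful Soup Tutorial</title></head>
<body><p>I am in p-1</p><p>I am in p-2</p><p>I am in p-3</p></body></html>"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p                         # attribute access: first <p> only
second_p = soup.find_all('p')[1]         # index into the list from find_all
next_p = first_p.find_next_sibling('p')  # or walk sideways from the first <p>

print(first_p.get_text())      # I am in p-1
print(second_p.get_text())     # I am in p-2
print(next_p is second_p)      # True: both routes reach the same Tag object
print(soup.title.parent.name)  # head
```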

Searching the Tree

Find all <p> tags in the document

bs.find_all('p')
# [<p>I am in p-1</p>,
#  <p>I am in p-2</p>,
#  <p>I am in p-3</p>,
#  <p>card1 - List index -1</p>,
#  <p>card1 - List index -2</p>,
#  <p>card1 - List index -3</p>,
#  <p>card2 - List index -1</p>,
#  <p>card2 - List index -2</p>,
#  <p>card2 - List index -3</p>]
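Notice that find_all searches the whole subtree, which is why the <p> tags nested inside the cards show up too. Here is a rough sketch, on a trimmed copy of the reference HTML, of how you might narrow that down: call find_all on a sub-element, or pass recursive=False to look at direct children only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the reference HTML: one top-level <p> and one card.
html = """<body>
<p>I am in p-1</p>
<div id="card1" class="card">
 <ul>
 <li><p>card1 - List index -1</p></li>
 <li><p>card1 - List index -2</p></li>
 </ul>
</div>
</body>"""
soup = BeautifulSoup(html, 'html.parser')

# Every <p> in the document, at any depth.
all_p = [p.get_text() for p in soup.find_all('p')]

# Only the <p> tags inside the card: search from the card Tag instead.
card = soup.find(id='card1')
card_p = [p.get_text() for p in card.find_all('p')]

# Only the <p> tags that are direct children of <body>.
top_level_p = [p.get_text() for p in soup.body.find_all('p', recursive=False)]

print(all_p)
print(card_p)
print(top_level_p)
```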

Find the element having id='card1' in the document

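bs.find(id='card1')
# <div class="card" id="card1" style="padding:0.1em 0.6em 0.5em;">
#  <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
#  <ul>
#  <li><p>card1 - List index -1</p></li>
#  <li><p>card1 - List index -2</p></li>
#  <li><p>card1 - List index -3</p></li>
#  </ul>
#  <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
#  <div class="otd-footer hlist noprint" id="card1-footer" style="text-align: right;">
#   <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
#   <li><b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b></li>
#   <li><b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b></li></ul>
#  </div>
# </div>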
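Once you have a Tag in hand, you will usually want its attributes rather than the whole element. The sketch below, on a trimmed copy of card1, shows dictionary-style attribute access and one way to search by a non-standard attribute such as customattr, using the attrs argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of card1 from the reference HTML.
html = """<div id="card1" class="card">
 <a href="/c1-link1" title="C1-Link1">This is Card1-link1</a>
 <span customattr="Card1-Custom-Attr">Card1-Custom-Attribute</span>
</div>"""
soup = BeautifulSoup(html, 'html.parser')

card = soup.find(id='card1')
link = card.find('a')

print(link['href'])        # /c1-link1
print(link.get('title'))   # C1-Link1
print(link.get('target'))  # None: .get() avoids a KeyError for missing attributes
print(card['class'])       # ['card']: class can hold several values, so it is a list

# Search on an arbitrary attribute by passing a dict as attrs.
span = soup.find('span', attrs={'customattr': 'Card1-Custom-Attr'})
print(span.get_text())     # Card1-Custom-Attribute
```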

find vs find_all

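find() fetches the first element that matches the given search criteria
bs.find(class_='email')  # Note: use class_ when searching on the class attribute, since class is a reserved word in Python
# <a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>

find_all() fetches all the elements matching the given search criteria
bs.find_all(class_='email')
# [<a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a>,
#  <a class="email" href="/c2-link3" title="c2-Link3">This is Card2-link3</a>]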
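The two methods also differ in what they give back when nothing matches: find returns None, while find_all returns an empty list. Here is a short sketch of guarding against that; the 'phone' class is a made-up name that does not exist in the markup.

```python
from bs4 import BeautifulSoup

# Two links taken from the card1 footer of the reference HTML.
html = """<ul>
<li><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></li>
<li><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></li>
</ul>"""
soup = BeautifulSoup(html, 'html.parser')

# 'phone' is a hypothetical class that is not present in the document.
missing_one = soup.find(class_='phone')
missing_all = soup.find_all(class_='phone')
print(missing_one)       # None
print(len(missing_all))  # 0

# Check for None before using the result of find().
email = soup.find(class_='email')
href = email['href'] if email is not None else None
print(href)              # /c1-link3
```

Forgetting the None check is a common source of "'NoneType' object is not subscriptable" errors when a page's layout changes.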

Advanced Search

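BeautifulSoup supports searching for elements in nested hierarchies using CSS selector syntax.

Find all <b> tags in <div id="card1-footer">

This can be achieved in multiple ways, each of which returns the same result:
  • bs.select('div[id=card1-footer] b')
  • bs.select('[id=card1-footer] b')
  • bs.select('#card1 div.otd-footer b')
# [<b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b>,
#  <b><a class="email" href="/c1-link3" title="C1-Link3">This is Card1-link3</a></b>,
#  <b><a href="/c1-link4" title="C1-Link4">This is Card1-link4</a></b>]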
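As a quick check that the three selectors really are interchangeable, the sketch below runs them against a trimmed copy of card1. The extra <b>outside the footer</b> element is not in the reference HTML; it is added here only to show that the selectors stay scoped to the footer. It also uses select_one, which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Trimmed copy of card1, plus one hypothetical <b> outside the footer.
html = """<div id="card1" class="card">
 <b>outside the footer</b>
 <div id="card1-footer" class="otd-footer hlist noprint">
  <ul><li><b><a href="/c1-link2" title="C1-Link2">This is Card1-link2</a></b></li>
  <li><b><a href="/c1-link3" class="email" title="C1-Link3">This is Card1-link3</a></b></li></ul>
 </div>
</div>"""
soup = BeautifulSoup(html, 'html.parser')

selectors = [
    'div[id=card1-footer] b',
    '[id=card1-footer] b',
    '#card1 div.otd-footer b',
]
# Collect the text of every match for each selector.
results = [[b.get_text() for b in soup.select(s)] for s in selectors]
for selector, texts in zip(selectors, results):
    print(selector, '->', texts)

# select_one returns a single Tag (or None) rather than a list.
email_text = soup.select_one('#card1-footer a.email').get_text()
print(email_text)  # This is Card1-link3

# The child combinator ">" matches direct children only.
direct_b = [b.get_text() for b in soup.select('#card1 > b')]
print(direct_b)    # ['outside the footer']
```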


Extract the text of the class="email" element inside <div id="card2-footer">

bs.select('#card2-footer a.email')[0].string
# 'This is Card2-link3'
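One thing to watch out for: .string only works when a tag has a single child. If the tag contains several pieces of content, .string is None, and get_text() is the safer choice. The sketch below shows this on one list item from the card2 footer; the trailing " (preferred)" text is not in the reference HTML and is added only to give the <li> a second child.

```python
from bs4 import BeautifulSoup

# One <li> from the card2 footer, with hypothetical extra text after the <b>.
html = (
    '<li><b><a href="/c2-link3" class="email" title="c2-Link3">'
    'This is Card2-link3</a></b> (preferred)</li>'
)
soup = BeautifulSoup(html, 'html.parser')

link = soup.select_one('a.email')
li = soup.li

print(link.string)    # This is Card2-link3
print(soup.b.string)  # This is Card2-link3: <b> has one child, so .string drills down
print(li.string)      # None: <li> has two children (<b> and trailing text)
print(li.get_text())  # This is Card2-link3 (preferred)
```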
