
[Part 2] A gentle introduction to BeautifulSoup (Web Scraping Library)

In Part 1 of this series, I walked through the object structure that BeautifulSoup generates and the techniques for searching and extracting data from the HTML tree.


In this post, I will demonstrate how data can be scraped from live websites. The Python requests library can be used to fetch a page, and its content is then handed to BeautifulSoup for parsing. The code for this step is shown below.

from bs4 import BeautifulSoup
import requests
URL = 'https://www.huffpost.com/'
req = requests.get(URL)
bs = BeautifulSoup(req.content, 'html.parser')
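
The snippet above assumes the request always succeeds. A slightly more defensive sketch follows; the helper name fetch_soup and the 10-second timeout are my own choices for illustration and are not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # response.content is raw bytes; BeautifulSoup works out the encoding.
    return BeautifulSoup(response.content, 'html.parser')
```

Calling fetch_soup('https://www.huffpost.com/') should then give the same kind of object as bs above, provided the site responds normally.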

Huffington Post


Let's fetch the latest news from the Huffington Post. To do that, let's first study the HTML structure used by the site.

For the "Latest News" section, there is a div with id="zone-a" that has two elements under it:
  • zone title (section)
  • zone content (section): this section has cards, each containing one news item
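
To make that structure concrete, here is a small sketch that runs the selector against a hand-written fragment. The markup is an assumption pieced together from the class names used in this post (zone__content, card, card__label, card__headlines); the zone__title class and the example.com links are invented, and the real page is far larger and may change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not copied from the live site.
SAMPLE_HTML = """
<div id="zone-a">
  <div class="zone__title"><h2>Latest News</h2></div>
  <div class="zone__content">
    <div class="card">
      <div class="card__label">Politics</div>
      <div class="card__headlines">
        <a href="https://example.com/story-1"><h3>First headline</h3></a>
      </div>
    </div>
    <div class="card">
      <div class="card__label">Health</div>
      <div class="card__headlines">
        <a href="https://example.com/story-2"><h3>Second headline</h3></a>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Same selector as used against the live page: every card inside the
# zone content of the element with id "zone-a".
cards = soup.select('div[id=zone-a] div.zone__content div.card')
labels = [card.find(class_='card__label').text for card in cards]
print(labels)  # ['Politics', 'Health']
```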
The code to parse the cards and display their text is shown below:

from bs4 import BeautifulSoup
import requests
from faker import Faker

fake = Faker()

# Send a browser-like User-Agent; fake.firefox() returns a random Firefox UA string
headers = {'User-Agent': fake.firefox()}

req = requests.get('https://huffpost.com', headers=headers)
bs = BeautifulSoup(req.content, 'html.parser')

# Each news item is a div.card inside the zone content of the zone-a div
cards = bs.select('div[id=zone-a] div.zone__content div.card')

for card in cards:
    section = card.find(class_='card__label').text
    text = card.find(class_='card__headlines').h3.text
    link = card.find(class_='card__headlines').a['href']

    # Print the section, headline, and link, followed by a blank line
    print(f"{section} :\n{text}\n{link}\n")

The results:
Politics :
Carl Bernstein Issues Ominous Warning To GOP Over ‘Amazing’ Trump Support
https://www.huffpost.com/entry/carl-bernstein-gop-senators-warning-trump_n_5dd8de88e4b0d50f32901e31

Politics :
Facebook Helped Bill O’Reilly Shill For A Company Accused Of Scamming Customers
https://www.huffpost.com/entry/bill-oreilly-facebook-advertisement_n_5dd88ccae4b0913e6f6ca254

Politics :
John Bolton Hints He’s Ready To Spill His ‘Backstory’ On White House Departure
https://www.huffpost.com/entry/john-bolton-twitter-white-house-departure_n_5dd82c74e4b00149f71c5f72

Education :
Syracuse Officials: There’s No Evidence That Students Received Racist Manifesto
https://www.huffpost.com/entry/syracuse-no-evidence-racist-manifesto_n_5dd88181e4b00149f71cedae

Health :
40 E. Coli Infections Linked To Romaine Lettuce, CDC Says
https://www.huffpost.com/entry/romaine-e-coli-salinas-valley_n_5dd87179e4b0d50f328fde16

Politics :
Rick Perry Emerges As Fault Line In Closely Watched Texas Race
https://www.huffpost.com/entry/rick-perry-impeachment-henry-cuellar-jessica-cisneros-house-democratic-primary_n_5dd7fc39e4b0913e6f6b4abe
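
The loop above assumes every card contains a label, an h3 headline, and a link. If one of them is missing, find returns None and the script stops with an AttributeError. A hedged sketch of a more forgiving version is below; extract_card is a hypothetical helper name, and the sample markup is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def extract_card(card):
    """Return (section, headline, link) for a card, or None if incomplete."""
    label = card.find(class_='card__label')
    headlines = card.find(class_='card__headlines')
    if label is None or headlines is None:
        return None
    title = headlines.find('h3')
    anchor = headlines.find('a', href=True)
    if title is None or anchor is None:
        return None
    return (label.get_text(strip=True),
            title.get_text(strip=True),
            anchor['href'])


# Invented markup: the second card has no label, so it is skipped.
html = """
<div class="card">
  <div class="card__label">Politics</div>
  <div class="card__headlines"><a href="https://example.com/a"><h3>Headline A</h3></a></div>
</div>
<div class="card">
  <div class="card__headlines"><a href="https://example.com/b"><h3>Headline B</h3></a></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keep only the cards for which all three pieces were found.
items = [item for item in map(extract_card, soup.select('div.card')) if item]
print(items)  # [('Politics', 'Headline A', 'https://example.com/a')]
```

Skipping incomplete cards is one possible policy; logging them instead may be more useful when the goal is to notice that the site layout has changed.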

Times Of India

Let's extract the news from the 'Top Stories' and 'Latest News' sections of Times Of India.

from bs4 import BeautifulSoup
import requests
# The pyfiglet library renders the section titles as ASCII-art banners
from pyfiglet import Figlet

figlet = Figlet(font='univers')

req = requests.get('https://timesofindia.indiatimes.com/')
bs = BeautifulSoup(req.content, 'html.parser')


def print_news(news_items):
    # Print the text of each list item as a bullet point
    for news in news_items:
        print('*', news.text.strip())


print(figlet.renderText('Top Stories'))
print_news(bs.select('div.top-story > ul > li'))

print(figlet.renderText('Latest News'))
print_news(bs.select('div.latestNewContainer > ul > li'))

The results:
888888888888
     88
     88
     88  ,adPPYba,  8b,dPPYba,
     88 a8"     "8a 88P'    "8a
     88 8b       d8 88       d8
     88 "8a,   ,a8" 88b,   ,a8"
     88  `"YbbdP"'  88`YbbdP"'
                    88
                    88

 ad88888ba                              88
d8"     "8b ,d                          ""
Y8,         88
`Y8aaaaa, MM88MMM ,adPPYba,  8b,dPPYba, 88  ,adPPYba, ,adPPYba,
  `"""""8b, 88   a8"     "8a 88P'   "Y8 88 a8P_____88 I8[    ""
        `8b 88   8b       d8 88         88 8PP"""""""  `"Y8ba,
Y8a     a8P 88,  "8a,   ,a8" 88         88 "8b,   ,aa aa    ]8I
 "Y88888P"  "Y888 `"YbbdP"'  88         88  `"Ybbd8"' `"YbbdP"'



* Maharashtra issue: SC to pass order on Tuesday
* Security men manhandled women MPs: Cong
* One of the toughest days of my life: Supriya Sule
* Filmfare Awards 2020 to be hosted in Guwahati
* Sensex at fresh all-time high, Nifty above 12,050
* Sena-NCP-Congress stake claim to form govt
* Democracy murdered in Maharashtra: Rahul

88
88                       ,d                          ,d
88                       88                          88
88          ,adPPYYba, MM88MMM ,adPPYba, ,adPPYba, MM88MMM
88          ""     `Y8   88   a8P_____88 I8[    ""   88
88          ,adPPPPP88   88   8PP"""""""  `"Y8ba,    88
88          88,    ,88   88,  "8b,   ,aa aa    ]8I   88,
88888888888 `"8bbdP"Y8   "Y888 `"Ybbd8"' `"YbbdP"'   "Y888



888b      88
8888b     88
88 `8b    88
88  `8b   88  ,adPPYba, 8b      db      d8 ,adPPYba,
88   `8b  88 a8P_____88 `8b    d88b    d8' I8[    ""
88    `8b 88 8PP"""""""  `8b  d8'`8b  d8'   `"Y8ba,
88     `8888 "8b,   ,aa   `8bd8'  `8bd8'   aa    ]8I
88      `888  `"Ybbd8"'     YP      YP     `"YbbdP"'



* Quick Edit: Early floor test the solution in Maha
* Jyotiraditya removes Cong from his Twitter bio
* Raj's oil boom fuelling a gen of entrepreneurs
* K'taka: BSY's future, HDK's credo, Sidda's image
* Shootout in Delhi’s Govindpuri
* I am NCP: Ajit Pawar's lawyer in SC
* Twitterati support Bhogle after Manjrekar spat
* Should you talk to children on financial issues?
* With a gem like Reno2, OPPO finishes 2019 in style
* Tharoor visits Tihar Jail to meet P Chidambaram
* The rise of alternative holiday homes
* Blog: Sharad Pawar's power politics
* Kunal Kapoor’s take on gaming laptops
* Nithya case: Yoginis speak only Eng, cops clueless
* 19 popular apps that are ‘risky’, claims research
* Taapsee's reply on being asked to speak in Hindi
* Top10: Is Maha a repeat of K'taka 2006 or 2018?
* Maharashtra: Parties test strength in hotel lobbies
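
The > in the two selectors used for Times Of India is the CSS child combinator: it matches only direct children, whereas a plain space matches descendants at any depth. A small sketch of the difference follows, using invented markup (only the class name top-story comes from the code in this post):

```python
from bs4 import BeautifulSoup

# Invented markup with one list nested inside another.
html = """
<div class="top-story">
  <ul>
    <li>Story one</li>
    <li>Story two
      <ul><li>Nested item</li></ul>
    </li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Only li elements whose parent ul is itself a direct child of the div.
direct = soup.select('div.top-story > ul > li')
# Every li anywhere below the div, including the nested one.
anywhere = soup.select('div.top-story li')

print(len(direct))    # 2
print(len(anywhere))  # 3
```

Using the child combinator keeps nested lists, if a page has any, from showing up as separate news items.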
