Clouds, Networking and Technology

[Part 2] A gentle introduction to BeautifulSoup (Web Scraping Library)

In Part-1 of this series, I walked through the object structure that BeautifulSoup generates and the techniques for searching and extracting data from the HTML tree. In this post, I will demonstrate how data can be scraped from live websites. The Python requests library can be used to fetch a page from a website and hand its content to BeautifulSoup for parsing. The code for this part is shown below.

    from bs4 import BeautifulSoup
    import requests

    URL = 'https://www.huffpost.com/'
    req = requests.get(URL)
    bs = BeautifulSoup(req.content, 'html.parser')

Huffington Post

Let's fetch the latest news from the Huffington Post. To do that, let's study the HTML structure used by this site. For the "Latest News" section, there is a div with id="zone-a" that has two elements under it:

- zone title (section)
- zone content (section): this section has cards, each containing one news item

The code to parse the cards and display their text will be as below: ...
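The excerpt is cut off before the parsing code, so what follows is only a minimal sketch of the idea and not the author's listing. It runs against a small hand-written HTML string instead of the live site, and it assumes the cards are div elements with class "card" inside the div with id="zone-a". The class names zone__title, zone__content and card, as well as the helper latest_headlines, are hypothetical; the real page may use different markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page. Only id="zone-a" and the
# two-section layout come from the post; every class name is an assumption.
SAMPLE_HTML = """
<div id="zone-a">
  <section class="zone__title"><h2>Latest News</h2></section>
  <section class="zone__content">
    <div class="card"><a href="/entry/one">First headline</a></div>
    <div class="card"><a href="/entry/two">Second headline</a></div>
  </section>
</div>
"""


def latest_headlines(html):
    """Return the text of every card found under the div with id="zone-a"."""
    soup = BeautifulSoup(html, "html.parser")
    zone = soup.find("div", id="zone-a")
    if zone is None:
        # The page layout changed or the zone is missing: nothing to report.
        return []
    cards = zone.find_all("div", class_="card")
    return [card.get_text(strip=True) for card in cards]


headlines = latest_headlines(SAMPLE_HTML)
for headline in headlines:
    print(headline)
```

To try this against a live page, the string passed to latest_headlines could be replaced with req.content from the requests call shown earlier, after adjusting the tag and class names to match what the browser's inspector shows.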

A gentle introduction to BeautifulSoup (Web Scraping Library)

Table of Contents

- Part-1: Introduction to Beautiful Soup (this blog)
- Part-2: Real world web scraping example (here)

In our day-to-day lives, we often need to extract data from the web for data mining and analytics. It can even be for a simple personal project where you write scripts to get the latest price of your stocks, fetch the weather report, check the confirmation status of your ticket, etc. There are multiple libraries and tools that get this job done, but BeautifulSoup shines in this area with its relatively gentle learning curve and powerful search capabilities.

Beautiful Soup is a Python web scraping library which makes it easy to extract data from web pages. It is meant to parse a single web page, then analyse and extract its data. In this respect it differs from Scrapy, a web crawling library, which has more advanced functionality and is meant to crawl across websites. You can check out the blog I have written on Scrapy here.

Reference HTML

We will be using the below HTML docu...
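The reference HTML from the post is truncated in this excerpt and is not reproduced here. As a rough illustration of the "parse one page, then search it" workflow described above, the sketch below uses a made-up document; the markup, the data-symbol attribute and the quote values are all invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up page, not the post's reference HTML.
SAMPLE = """
<html><head><title>Stock quotes</title></head>
<body>
  <ul id="quotes">
    <li class="quote" data-symbol="AAA">101.5</li>
    <li class="quote" data-symbol="BBB">42.0</li>
  </ul>
</body></html>
"""

# Parse the whole document once with Python's built-in HTML parser.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Attribute-style navigation reaches the first <title> tag directly.
title = soup.title.string

# find_all() searches the tree by tag name and CSS class;
# tag["attr"] reads an attribute and get_text() reads the tag's text.
prices = {
    li["data-symbol"]: float(li.get_text())
    for li in soup.find_all("li", class_="quote")
}

print(title)
print(prices)
```

The same two steps, building a BeautifulSoup object and then querying it with find() or find_all(), apply whether the HTML comes from a string, a file, or an HTTP response.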