In Part 1 of this series, I walked through the object structure that BeautifulSoup generates and the techniques for searching and extracting data from the HTML tree.
In this post, I will demonstrate how data can be scraped from live websites. The Python requests library can be used to fetch a page from a website, and the response content is then handed to BeautifulSoup for parsing. The code for this step is shown below.
    from bs4 import BeautifulSoup
    import requests

    URL = 'https://www.huffpost.com/'
    req = requests.get(URL)
    bs = BeautifulSoup(req.content, 'html.parser')
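A live site can be slow or return an error page, so it may help to wrap the fetch-and-parse step in a small function. The following is only a sketch: the name `fetch_soup` and the 10-second default timeout are my own assumptions, not something the snippet above requires.

```python
from bs4 import BeautifulSoup
import requests


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed; raises on HTTP errors or timeouts.

    fetch_soup is a hypothetical helper name used for illustration only.
    """
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # turn 4xx/5xx responses into requests.HTTPError
    return BeautifulSoup(resp.content, 'html.parser')


# Usage (needs network access):
# soup = fetch_soup('https://www.huffpost.com/')
# print(soup.title)
```

Failing early on a bad response avoids silently parsing an error page as if it were the real content.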
Huffington Post
Let's fetch the latest news from the Huffington Post. To do that, let's first study the HTML structure used by the site.
For the "Latest News" section, there is a div with id="zone-a" that has two elements under it:
- zone title (section)
- zone content (section) : this section has cards, each containing one news item
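To make this structure concrete, here is a minimal sketch that runs a CSS selector against a small hand-written HTML fragment instead of the live page. The fragment is an assumption modelled on the description above and on the class names used in this post (`zone__content`, `card`, `card__label`, `card__headlines`); the real markup may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the "Latest News" zone; not copied from the live site.
SAMPLE_HTML = """
<div id="zone-a">
  <div class="zone__title"><h2>Latest News</h2></div>
  <div class="zone__content">
    <div class="card">
      <div class="card__label">Politics</div>
      <div class="card__headlines">
        <a href="https://example.com/story-1"><h3>First headline</h3></a>
      </div>
    </div>
    <div class="card">
      <div class="card__label">Health</div>
      <div class="card__headlines">
        <a href="https://example.com/story-2"><h3>Second headline</h3></a>
      </div>
    </div>
  </div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Descendant selectors: every div.card somewhere below div.zone__content inside zone-a
sample_cards = sample_soup.select('div[id=zone-a] div.zone__content div.card')
sample_labels = [card.find(class_='card__label').text for card in sample_cards]
print(sample_labels)
```

Trying a selector on a tiny fragment like this is a quick way to check it before pointing it at a full page.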
    from bs4 import BeautifulSoup
    import requests
    from faker import Faker

    # Send a browser-like User-Agent header generated by Faker
    fake = Faker()
    headers = {'User-Agent': fake.firefox()}
    req = requests.get('https://huffpost.com', headers=headers)
    bs = BeautifulSoup(req.content, 'html.parser')

    # Each card under the zone content holds one news item
    cards = bs.select('div[id=zone-a] div.zone__content div.card')
    for card in cards:
        section = card.find(class_='card__label').text
        text = card.find(class_='card__headlines').h3.text
        link = card.find(class_='card__headlines').a['href']
        print('\n%s : %s\n%s\n' % (section, text, link))
The results:
    Politics : Carl Bernstein Issues Ominous Warning To GOP Over ‘Amazing’ Trump Support
    https://www.huffpost.com/entry/carl-bernstein-gop-senators-warning-trump_n_5dd8de88e4b0d50f32901e31

    Politics : Facebook Helped Bill O’Reilly Shill For A Company Accused Of Scamming Customers
    https://www.huffpost.com/entry/bill-oreilly-facebook-advertisement_n_5dd88ccae4b0913e6f6ca254

    Politics : John Bolton Hints He’s Ready To Spill His ‘Backstory’ On White House Departure
    https://www.huffpost.com/entry/john-bolton-twitter-white-house-departure_n_5dd82c74e4b00149f71c5f72

    Education : Syracuse Officials: There’s No Evidence That Students Received Racist Manifesto
    https://www.huffpost.com/entry/syracuse-no-evidence-racist-manifesto_n_5dd88181e4b00149f71cedae

    Health : 40 E. Coli Infections Linked To Romaine Lettuce, CDC Says
    https://www.huffpost.com/entry/romaine-e-coli-salinas-valley_n_5dd87179e4b0d50f328fde16

    Politics : Rick Perry Emerges As Fault Line In Closely Watched Texas Race
    https://www.huffpost.com/entry/rick-perry-impeachment-henry-cuellar-jessica-cisneros-house-democratic-primary_n_5dd7fc39e4b0913e6f6b4abe
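On a live page, some cards may lack a label or a link, and chaining `.text` onto a `find()` call that returned `None` raises an `AttributeError`. Below is a hedged sketch of a more defensive extraction that collects each item into a dict. The helper names `text_or_none` and `parse_card` and the demo HTML are hypothetical and only serve to illustrate the idea.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_card(card):
    """Extract one news item as a dict; parts that are missing become None."""
    headlines = card.find(class_='card__headlines')
    title_tag = headlines.find('h3') if headlines is not None else None
    link_tag = headlines.find('a', href=True) if headlines is not None else None
    return {
        'section': text_or_none(card.find(class_='card__label')),
        'title': text_or_none(title_tag),
        'link': link_tag['href'] if link_tag is not None else None,
    }


# Hypothetical cards: the second one has neither a label nor a link
DEMO_HTML = """
<div class="card">
  <div class="card__label">Politics</div>
  <div class="card__headlines"><a href="/entry/a"><h3> Headline A </h3></a></div>
</div>
<div class="card">
  <div class="card__headlines"><h3>Headline B</h3></div>
</div>
"""

demo_soup = BeautifulSoup(DEMO_HTML, 'html.parser')
demo_items = [parse_card(card) for card in demo_soup.select('div.card')]
for item in demo_items:
    print(item)
```

Returning dicts rather than printing directly also makes it easier to filter the items or save them later.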
Times Of India
Let's extract the news from the 'Top Stories' and 'Latest News' sections of the Times Of India.

    from bs4 import BeautifulSoup
    import requests
    # The Figlet library is used to render the section titles as ASCII art
    from pyfiglet import Figlet

    figlet = Figlet(font='univers')
    req = requests.get('https://timesofindia.indiatimes.com/')
    bs = BeautifulSoup(req.content, 'html.parser')

    def print_news(news_items):
        # Print each list item as one bullet line
        for news in news_items:
            print('*', news.text.strip())

    print(figlet.renderText('Top Stories'))
    print_news(bs.select('div.top-story > ul > li'))
    print(figlet.renderText('Latest News'))
    print_news(bs.select('div.latestNewContainer > ul > li'))
The results:
    [ASCII-art banner rendered by pyfiglet: Top Stories]

    * Maharashtra issue: SC to pass order on Tuesday
    * Security men manhandled women MPs: Cong
    * One of the toughest days of my life: Supriya Sule
    * Filmfare Awards 2020 to be hosted in Guwahati
    * Sensex at fresh all-time high, Nifty above 12,050
    * Sena-NCP-Congress stake claim to form govt
    * Democracy murdered in Maharashtra: Rahul

    [ASCII-art banner rendered by pyfiglet: Latest News]

    * Quick Edit: Early floor test the solution in Maha
    * Jyotiraditya removes Cong from his Twitter bio
    * Raj's oil boom fuelling a gen of entrepreneurs
    * K'taka: BSY's future, HDK's credo, Sidda's image
    * Shootout in Delhi’s Govindpuri
    * I am NCP: Ajit Pawar's lawyer in SC
    * Twitterati support Bhogle after Manjrekar spat
    * Should you talk to children on financial issues?
    * With a gem like Reno2, OPPO finishes 2019 in style
    * Tharoor visits Tihar Jail to meet P Chidambaram
    * The rise of alternative holiday homes
    * Blog: Sharad Pawar's power politics
    * Kunal Kapoor’s take on gaming laptops
    * Nithya case: Yoginis speak only Eng, cops clueless
    * 19 popular apps that are ‘risky’, claims research
    * Taapsee's reply on being asked to speak in Hindi
    * Top10: Is Maha a repeat of K'taka 2006 or 2018?
    * Maharashtra: Parties test strength in hotel lobbies
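The Times Of India selectors use the `>` child combinator, which matches direct children only, whereas a plain space matches descendants at any depth. The sketch below shows the difference on a small hand-written fragment. The HTML is hypothetical and only borrows the `top-story` class name from the selector used in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a top-level list in which one item holds a nested sub-list
NESTED_HTML = """
<div class="top-story">
  <ul>
    <li>Story one</li>
    <li>Story two
      <ul><li>Related link</li></ul>
    </li>
  </ul>
</div>
"""

nested_soup = BeautifulSoup(NESTED_HTML, 'html.parser')

# '>' only follows parent-to-child steps: div -> ul -> li
direct_items = nested_soup.select('div.top-story > ul > li')
# A space matches li elements at any depth below the div
all_items = nested_soup.select('div.top-story li')

print(len(direct_items), len(all_items))
```

Here the child combinator selects the two top-level stories, while the descendant selector also picks up the nested "Related link" item, for three matches in total.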