
Posts

Showing posts from 2019

[Part 2] A gentle introduction to BeautifulSoup (Web Scraping Library)

In Part-1 of this series, I walked through the object structure that BeautifulSoup generates and the techniques for searching and extracting data from the HTML tree. In this blog, I will demonstrate how data can be scraped from live websites. The Python requests library can be used to fetch data from a website, which is then handed to BeautifulSoup for parsing. The code for this part is shown below.

    from bs4 import BeautifulSoup
    import requests

    URL = 'https://www.huffpost.com/'
    req = requests.get(URL)
    bs = BeautifulSoup(req.content, 'html.parser')

Huffington Post

Let's fetch the latest news from the Huffington Post. To do that, let's study the HTML structure used by this site. For the "Latest News" section, there is a div with id="zone-a" which has two elements under it:

* zone title (section)
* zone content (section): this section has cards, each containing one news item

The code to parse the cards and display their text will be as below:
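The original code is cut off in this excerpt, so what follows is only a minimal sketch of the idea and not the author's code. It assumes each news card carries a CSS class named `card` (a hypothetical name; inspect the live page for the real one) and it runs against a small stand-in HTML string instead of the live site, so the result is reproducible.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the live page. The real markup will differ,
# and the class names used here are assumptions made for illustration.
SAMPLE_HTML = """
<div id="zone-a">
  <section class="zone__title">Latest News</section>
  <section class="zone__content">
    <div class="card">First headline</div>
    <div class="card">Second headline</div>
  </section>
</div>
"""


def extract_card_texts(html, zone_id="zone-a", card_class="card"):
    """Return the stripped text of every card found under the given zone."""
    soup = BeautifulSoup(html, "html.parser")
    zone = soup.find("div", id=zone_id)
    if zone is None:
        # The zone is missing, so there is nothing to extract.
        return []
    cards = zone.find_all(class_=card_class)
    return [card.get_text(" ", strip=True) for card in cards]


if __name__ == "__main__":
    for text in extract_card_texts(SAMPLE_HTML):
        print(text)
```

To try this against the live site, pass `req.content` from the requests call above in place of `SAMPLE_HTML`, and adjust `card_class` to whatever the page actually uses.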

A gentle introduction to BeautifulSoup (Web Scraping Library)

Table of Contents

* Part-1 : Introduction to Beautiful Soup (this blog)
* Part-2 : Real world web scraping example (here)

In our day-to-day lives, we often need to extract data from the web for data mining and analytics. It can even be for a simple personal project where you create scripts to get the latest price of your stocks, fetch a weather report, check the confirmation status of your ticket, etc. There are multiple libraries and tools to get this job done, but BeautifulSoup shines in this area with its relatively gentle learning curve and powerful search capabilities.

Beautiful Soup is a Python web scraping library which makes it easy to extract data from web pages. It is meant to parse a single web page and analyse and extract its data. In this respect it differs from Scrapy, a web crawling library, which has more advanced functionality and is meant to crawl across websites. You can check out the blog I have written on Scrapy here.

Reference HTML

We will be using the below HTML document as o

SyntaxHighlighter Test

This blog is part 2 of How to add SyntaxHighlighter to Blogger, where I will be testing the syntax highlighting abilities of this library. SyntaxHighlighter provides various configuration parameters to be used with the <pre> tag.

Java (With Rulers -- not working?)

<pre code="brush: java; ruler: true;">
import java.io.*;
public class HelloWorld{
    public static void main(String[] args){
        System.out.println("Hello World!");
    }
}
</pre>

import java.util.*;
import java.io.*;
public class HelloWorld{
    public static void main(String[] args){
        System.out.println("Hello World!");
    }
}

Python (With the first line number set to 100)

<pre class="brush: python; first-line: 100">
import os
import sys
import time
class Box:
    def method1(self, x, y):
        try:
            print x, y
        except:
            pass
</pre>

import os
import sys
import time
class Box:
    def method1(self, x, y):
        try:

How to add SyntaxHighlighter to Blogger

SyntaxHighlighter is a code syntax highlighter written in JavaScript. It can be used to highlight code snippets in your blogs to improve readability. This blog looks at how SyntaxHighlighter can be installed in Blogger. As per the official website, the installation instructions are as follows:

1. Install the base files:
   * https://alexgorbatchev.com/pub/sh/current/scripts/shCore.js
   * https://alexgorbatchev.com/pub/sh/current/styles/shCore.css
2. Install the theme:
   * https://alexgorbatchev.com/pub/sh/current/styles/shThemeDefault.css
3. Install the brushes you want:
   * https://alexgorbatchev.com/pub/sh/current/scripts/shBrushBash.js
   * https://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js
   * https://alexgorbatchev.com/pub/sh/current/scripts/shBrushPython.js
4. Invoke SyntaxHighlighter during page load and execute the following functions for Blogger:
   * SyntaxHighlighter.config.bloggerMode = true;
   * SyntaxHighlighter.all();

The resulta
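The four installation steps listed in this excerpt amount to a handful of tags in the page. As a rough sketch (not the author's own markup, which is cut off here), the small Python helper below assembles those tags from the URLs and the two JavaScript calls given in the steps. The function name `build_head_snippet` is made up for this example, and pasting the output into the head section of the Blogger template is an assumption about where the tags belong.

```python
# Base URL and file names are taken from the installation steps;
# everything else (function name, tag attributes) is illustrative.
BASE = "https://alexgorbatchev.com/pub/sh/current"
STYLES = ["shCore.css", "shThemeDefault.css"]
SCRIPTS = ["shCore.js", "shBrushBash.js", "shBrushPlain.js", "shBrushPython.js"]


def build_head_snippet(base=BASE, styles=STYLES, scripts=SCRIPTS):
    """Return the <link>/<script> tags for the given styles and scripts."""
    lines = []
    for name in styles:
        lines.append(
            f"<link href='{base}/styles/{name}' rel='stylesheet' type='text/css'/>"
        )
    for name in scripts:
        lines.append(
            f"<script src='{base}/scripts/{name}' type='text/javascript'></script>"
        )
    # Step 4: enable Blogger mode, then highlight every marked-up block.
    lines.append("<script type='text/javascript'>")
    lines.append("  SyntaxHighlighter.config.bloggerMode = true;")
    lines.append("  SyntaxHighlighter.all();")
    lines.append("</script>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_head_snippet())
```

Adding another brush is then a matter of appending its file name to `SCRIPTS` and regenerating the snippet.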

How to setup your own CA and sign your digital certificates

OpenSSL is a cryptographic library that can be used to generate digital certificates. In this blog, I will walk you through the process of creating a Root CA and signing the generated digital certificates with it. For a quick primer on digital certificates, take a look at this article.

To begin with, let's generate a Root CA. This process requires generating a CA private key and a CA certificate.

Generate a 4096 bit long RSA key for Root CA

    $ openssl genrsa -out rootCA.key 4096
    Generating RSA private key, 4096 bit long modulus
    .........++
    .............................++
    e is 65537 (0x010001)

Generate Root CA certificate

    $ openssl req -x509 -new -key rootCA.key -sha256 -days 1825 -out rootCA.crt
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
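If you want to script the two Root CA steps above instead of typing them, one possible approach is to drive the same openssl commands from Python. The sketch below is illustrative only: the function names are made up, it assumes the openssl binary is on your PATH, and it adds the `-subj` option with a placeholder subject ("/CN=Example Root CA") so that openssl does not stop to ask the Distinguished Name questions interactively.

```python
import shutil
import subprocess


def root_ca_commands(key_path="rootCA.key", cert_path="rootCA.crt",
                     bits=4096, days=1825, subject="/CN=Example Root CA"):
    """Build the two openssl invocations as argument lists (nothing is run)."""
    gen_key = ["openssl", "genrsa", "-out", key_path, str(bits)]
    gen_cert = [
        "openssl", "req", "-x509", "-new",
        "-key", key_path,
        "-sha256",
        "-days", str(days),
        "-subj", subject,  # placeholder subject; replace with your own DN
        "-out", cert_path,
    ]
    return [gen_key, gen_cert]


def create_root_ca(**kwargs):
    """Run both commands in order, stopping at the first failure."""
    if shutil.which("openssl") is None:
        raise RuntimeError("openssl was not found on PATH")
    for cmd in root_ca_commands(**kwargs):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    create_root_ca()
```

Keeping the command construction separate from the execution makes it easy to print and review the exact commands before anything touches your key files.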