Scrapy in the words of its creators:
"Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival."A screenshot grabbed from the site shows how concise the working code can be:
At the time of writing, Scrapy works only with Python 2.7. The objective of this blog is to get you started with Scrapy and give you enough information to carry on further on your own. In this post, I will set up a Scrapy project and retrieve some data from my blog site.
Pre-requisites
- Python 2.7
- pip and setuptools Python packages.
- lxml. Most Linux distributions ship prepackaged versions of lxml; otherwise refer to http://lxml.de/installation.html
- OpenSSL. This comes preinstalled on all operating systems except Windows, where the Python installer ships it bundled.
With the prerequisites in place, install Scrapy:

pip install scrapy
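You can verify the installation by asking Scrapy for its version:

scrapy version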
Create a Project
scrapy startproject blog
This command will create a 'blog' directory with the following structure:
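For reference, this is roughly the layout scrapy startproject generates (file names can differ slightly across Scrapy versions):

blog/
    scrapy.cfg
    blog/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The file scrapy.cfg can be used to configure the project, items.py holds the models (items) for the project, and the spiders that crawl the web and fetch data are defined under the spiders folder.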
Let's say our goal is to extract the following information from this blog site:
- Blog Title
- Blog Description
- Posts listed in the Popular Posts section
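We capture these as fields on an item class in blog/items.py: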
import scrapy

class BlogItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()
    pop1 = scrapy.Field()
    pop2 = scrapy.Field()
    pop3 = scrapy.Field()
    pop4 = scrapy.Field()
We can then create the spider that fetches the required data:
import scrapy

from blog.items import BlogItem

class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["blogspot.in"]
    start_urls = [
        "http://sarathblogs.blogspot.in",
    ]

    def parse(self, response):
        # Extract the title, description and the four Popular Posts
        # entries using XPath, and return them as a single BlogItem.
        b = BlogItem()
        b['title'] = response.xpath('//h1[@class="title"]/text()').extract()[0]
        b['desc'] = response.xpath('//p[@class="description"]/span/text()').extract()[0]
        b['pop1'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[1]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop2'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[2]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop3'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[3]/div[1]/div[1]/a/b/text()').extract()[0]
        b['pop4'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[4]/div[1]/div[1]/a/b/text()').extract()[0]
        return b
By default the spider crawls the URLs listed in start_urls and hands each response to the parse method. The response object provides selectors for parsing the HTML via XPath, regular expressions, etc., and response.xpath() is the shorthand for the XPath selector. I use XPath to traverse the HTML and extract the required data, store it in a BlogItem, and return the item.
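A convenient way to work out XPath expressions like these before putting them in the spider is Scrapy's interactive shell. A minimal session (output elided) looks like:

scrapy shell "http://sarathblogs.blogspot.in"
>>> response.xpath('//h1[@class="title"]/text()').extract()
>>> response.xpath('//p[@class="description"]/span/text()').extract()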
Running the Spider
You can run the spider by executing the following command from the project's root directory:
scrapy crawl blog
The following snapshot shows the output:
Saving the results
Scrapy provides various handy options for saving the data: to a database, or as JSON or CSV. Saving to a database goes through an item pipeline (a sketch follows below); for the other two, a single command suffices, with the output format inferred from the file extension:
- scrapy crawl blog -o blog.json
- scrapy crawl blog -o blog.csv
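For the database option, Scrapy uses item pipelines. Below is a minimal sketch assuming SQLite and only the title and desc fields; the BlogPipeline name and the blog.db file are my own illustration, not part of the original post. It would go in blog/pipelines.py:

import sqlite3

class BlogPipeline(object):
    def open_spider(self, spider):
        # Runs once when the crawl starts: open (or create) the database.
        self.conn = sqlite3.connect('blog.db')  # illustrative file name
        self.conn.execute('CREATE TABLE IF NOT EXISTS blog (title TEXT, desc TEXT)')

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item.
        self.conn.execute('INSERT INTO blog VALUES (?, ?)',
                          (item['title'], item['desc']))
        return item

    def close_spider(self, spider):
        # Runs once when the crawl ends: persist and clean up.
        self.conn.commit()
        self.conn.close()

The pipeline is enabled by adding it to ITEM_PIPELINES in the project's settings.py, e.g. ITEM_PIPELINES = {'blog.pipelines.BlogPipeline': 300}.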
The JSON output from the first command is as follows:
[ { "title": "\nSarath Chandra Mekala\n", "pop1": "\nOpenStack Kilo MultiNode VM Installation using Centos 7 on VirtualBox\n", "pop2": "\nHow to run Juniper Firefly (vSRX) on KVM -- SRX in a box setup\n", "pop3": "\nOpenstack : Fixing Failed to create network. No tenant network is available for allocation issue.\n", "pop4": "\nFixing Openstack VM spawning issue: No suitable host found/vif_type=binding_failed error\n", "desc": "Openstack, Cloud Orchestration, Networking, Java & J2EE" } ]
Hope this was helpful and motivates you to give Scrapy a shot (www.scrapy.org).