Scrapy in the words of its creators:
"Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival."A screenshot grabbed from the site shows how concise the working code can be:
At the time of writing, Scrapy works only with Python 2.7. The objective of this blog is to get you started with Scrapy and give you enough information to carry on further on your own. In this post, I will set up a Scrapy project and retrieve some data from my blog site.
Pre-requisites
- Python 2.7
- pip and setuptools Python packages.
- lxml. Most Linux distributions ship prepackaged versions of lxml; otherwise refer to http://lxml.de/installation.html
- OpenSSL. This comes preinstalled on all operating systems except Windows, where the Python installer ships it bundled.
With the prerequisites in place, install Scrapy:

pip install scrapy
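You can verify the installation by asking Scrapy for its version:

scrapy version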
Create a Project
scrapy startproject blog
This command will create a 'blog' directory with the following structure:
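For reference, this is roughly the layout scrapy startproject generates (file names can differ slightly across Scrapy versions):

blog/
    scrapy.cfg
    blog/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The file scrapy.cfg can be used to configure the project, items.py holds the models (items) for the project, and the spiders that crawl the web and fetch data are defined under the spiders folder.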
Let's say our goal is to extract the following information from this blog site:
- Blog Title
- Blog Description
- Posts listed in the Popular Posts section
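We capture these as fields on an item class in blog/items.py: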
import scrapy

class BlogItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()
    pop1 = scrapy.Field()
    pop2 = scrapy.Field()
    pop3 = scrapy.Field()
    pop4 = scrapy.Field()
We can then create the spider that fetches the required data:
import scrapy

from blog.items import BlogItem

class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["blogspot.in"]
    start_urls = [
        "http://sarathblogs.blogspot.in",
    ]

    def parse(self, response):
        # Extract the title, description and the four Popular Posts
        # entries using XPath, and return them as a single BlogItem.
        b = BlogItem()
        b['title'] = response.xpath('//h1[@class="title"]/text()').extract()[0]
        b['desc'] = response.xpath('//p[@class="description"]/span/text()').extract()[0]
        b['pop1'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[1]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop2'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[2]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop3'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[3]/div[1]/div[1]/a/b/text()').extract()[0]
        b['pop4'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[4]/div[1]/div[1]/a/b/text()').extract()[0]
        return b
By default the spider crawls the URLs listed in start_urls and hands each response to the parse method. The response object provides selectors for parsing the HTML via XPath, regular expressions, etc., and response.xpath() is the shorthand for the XPath selector. I use XPath to traverse the HTML and extract the required data, store it in a BlogItem, and return the item.
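A convenient way to work out XPath expressions like these before putting them in the spider is Scrapy's interactive shell. A minimal session (output elided) looks like:

scrapy shell "http://sarathblogs.blogspot.in"
>>> response.xpath('//h1[@class="title"]/text()').extract()
>>> response.xpath('//p[@class="description"]/span/text()').extract()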
Running the Spider
You can run the spider by executing the following command from the project's root directory:
scrapy crawl blog
The following snapshot shows the output:
Saving the results
Scrapy provides various handy options for saving the data: to a database, or as JSON or CSV. Saving to a database goes through an item pipeline (a sketch follows below); for the other two, a single command suffices, with the output format inferred from the file extension:
- scrapy crawl blog -o blog.json
- scrapy crawl blog -o blog.csv
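For the database option, Scrapy uses item pipelines. Below is a minimal sketch assuming SQLite and only the title and desc fields; the BlogPipeline name and the blog.db file are my own illustration, not part of the original post. It would go in blog/pipelines.py:

import sqlite3

class BlogPipeline(object):
    def open_spider(self, spider):
        # Runs once when the crawl starts: open (or create) the database.
        self.conn = sqlite3.connect('blog.db')  # illustrative file name
        self.conn.execute('CREATE TABLE IF NOT EXISTS blog (title TEXT, desc TEXT)')

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item.
        self.conn.execute('INSERT INTO blog VALUES (?, ?)',
                          (item['title'], item['desc']))
        return item

    def close_spider(self, spider):
        # Runs once when the crawl ends: persist and clean up.
        self.conn.commit()
        self.conn.close()

The pipeline is enabled by adding it to ITEM_PIPELINES in the project's settings.py, e.g. ITEM_PIPELINES = {'blog.pipelines.BlogPipeline': 300}.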
The JSON output from the first command is as follows:
[ { "title": "\nSarath Chandra Mekala\n", "pop1": "\nOpenStack Kilo MultiNode VM Installation using Centos 7 on VirtualBox\n", "pop2": "\nHow to run Juniper Firefly (vSRX) on KVM -- SRX in a box setup\n", "pop3": "\nOpenstack : Fixing Failed to create network. No tenant network is available for allocation issue.\n", "pop4": "\nFixing Openstack VM spawning issue: No suitable host found/vif_type=binding_failed error\n", "desc": "Openstack, Cloud Orchestration, Networking, Java & J2EE" } ]
Hope this was helpful and motivates you to give Scrapy a shot (www.scrapy.org).