Introduction
Web scraping is the process of extracting data from websites. It often involves automating the web browsing experience and parsing the HTML of a website to extract useful information. Scrapy is a powerful and flexible web scraping framework built with Python. In this tutorial, we will learn how to create a web crawler with Scrapy and Python.
Prerequisites
To follow along with this tutorial, you will need the following:
- Python 3.5 or above installed on your system.
- The pip package manager, which typically comes with Python.
- Basic understanding of HTML and CSS.
- Familiarity with Python programming.
Install Scrapy
Let’s start by installing Scrapy. Open a terminal or command prompt and run the following command:
pip install scrapy
Scrapy and its dependencies will be installed in your Python environment.
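To confirm the installation, you can check the installed version from the same terminal:

scrapy version

If the command prints a version number, Scrapy is ready to use.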
Set up a New Scrapy Project
To create a new Scrapy project, open a terminal or command prompt and navigate to the directory where you want to create the project. Run the following command:
scrapy startproject tutorial
This will create a new directory named tutorial with the basic structure of a Scrapy project.
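The generated layout typically looks like this (the exact files can vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py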
Define a Spider
A spider is the main component of a Scrapy project. It defines how to follow links and what information to extract from the web pages. In Scrapy, spiders are Python classes that define how requests are made and how responses are processed.
Open the tutorial/spiders directory and create a new Python file named example_spider.py. In this file, define a class named ExampleSpider that extends scrapy.Spider. Here's an example:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract information from the response
        # using XPath or CSS selectors
        pass
In the ExampleSpider class, we give our spider a name using the name attribute. This name is used to identify the spider when running Scrapy commands.
The start_urls attribute is a list of URLs that the spider will start crawling from. In this example, we only have a single URL, but you can add more if needed.
The parse method is called for each response received by the spider. Inside this method, we can extract information from the response using XPath or CSS selectors. For now, we leave its body as a simple pass statement.
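If you need more control over how the initial requests are built (for example, custom headers or URLs computed at runtime), you can override start_requests instead of relying on start_urls. A minimal sketch:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Generate the initial requests explicitly instead of using start_urls
        yield scrapy.Request('https://example.com', callback=self.parse)

    def parse(self, response):
        pass

Both approaches end up calling parse with each downloaded response; start_urls is simply the shorter form.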
Make Requests
To start making requests to the URLs defined in start_urls, we need to run the spider. Open a terminal or command prompt and navigate to the project's directory (tutorial in our case). Run the following command:
scrapy crawl example
This command runs the spider named example. Scrapy will make requests to the URLs defined in start_urls and call the parse method for each response received.
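If you prefer to launch the spider from a Python script rather than the scrapy command, Scrapy's CrawlerProcess can do the same job. A minimal sketch, assuming the project layout above and that the script is run from the directory containing scrapy.cfg:

from scrapy.crawler import CrawlerProcess

from tutorial.spiders.example_spider import ExampleSpider

process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl finishes

Note that scrapy crawl automatically picks up the project's settings.py, while a standalone script only does so if you load the settings yourself (for example with scrapy.utils.project.get_project_settings).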
Extract Data
In the parse method, we can use XPath or CSS selectors to extract data from the web pages. Scrapy provides a convenient way to select elements from the HTML using these selectors.
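A handy way to experiment with selectors before putting them in a spider is the Scrapy shell, which downloads a page and drops you into an interactive session with the response object available:

scrapy shell 'https://example.com'
>>> response.css('h1::text').get()

The selector shown here is only illustrative; try whatever selectors match the page you are scraping.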
For example, let’s say we want to extract the titles of all the articles on a web page. We can use the following CSS selector to accomplish this:
def parse(self, response):
    for article in response.css('article'):
        title = article.css('h2 a::text').get()
        yield {
            'title': title
        }
In this example, we select all the <article> elements and then take the text of the <a> element inside each <h2> element, which represents the title of each article. We use the yield keyword to return a dictionary with the extracted data. Using yield instead of return turns parse into a generator, so a single response can produce any number of items (and, as we will see below, follow-up requests) that Scrapy processes as they are generated.
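The same extraction can also be written with XPath instead of CSS selectors; which one you use is largely a matter of preference. An equivalent sketch of the method above:

def parse(self, response):
    # Same extraction as above, expressed with XPath selectors
    for article in response.xpath('//article'):
        title = article.xpath('.//h2/a/text()').get()
        yield {
            'title': title
        }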
Follow Links
To crawl multiple pages, we can make requests to the links found on the current page. We can achieve this by calling the response.follow method and passing it the URL of the link.
For example, let’s say we have a web page with pagination links at the bottom. We want to follow these links to scrape all the pages. We can modify our spider as follows:
def parse(self, response):
    for article in response.css('article'):
        title = article.css('h2 a::text').get()
        yield {
            'title': title
        }

    next_page_url = response.css('a.next-page-link::attr(href)').get()
    if next_page_url:
        yield response.follow(next_page_url, self.parse)
In this example, we select the URL of the next page using a CSS selector (the a.next-page-link class is specific to the site being scraped; adjust it to match your target pages). If the URL exists, we call response.follow and pass it the URL and the parse method. This yields a new request for the next page, and the parse method is called again with that response. Unlike scrapy.Request, response.follow accepts relative URLs, so the href value can be passed as-is.
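The same mechanism works for following links into detail pages, not just pagination. The sketch below assumes a hypothetical parse_article callback and selectors; adapt them to the site you are scraping:

def parse(self, response):
    # Follow each article link and parse the detail page in a separate callback
    for href in response.css('article h2 a::attr(href)').getall():
        yield response.follow(href, callback=self.parse_article)

    # Keep paginating through the listing pages
    next_page_url = response.css('a.next-page-link::attr(href)').get()
    if next_page_url:
        yield response.follow(next_page_url, self.parse)

def parse_article(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'url': response.url,
    }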
Storing the Scraped Data
By default, Scrapy outputs the scraped data to the terminal. However, we can configure Scrapy to store the data in different formats like JSON, CSV, or a database.
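Before touching any configuration, note that a feed file can also be requested directly on the command line; the -o flag appends the scraped items to the given file (newer Scrapy versions also accept -O to overwrite it instead):

scrapy crawl example -o output.json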
For a more permanent configuration, open the settings.py file inside the tutorial/tutorial directory and add (or uncomment) the following lines, adjusting them as desired:
ITEM_PIPELINES = {
    'tutorial.pipelines.ExamplePipeline': 300,
}

FEED_FORMAT = 'json'
FEED_URI = 'output.json'
In this example, we set the feed output format to JSON and the output file name to output.json. We also register ExamplePipeline as a pipeline to process the items; pipelines in Scrapy are used to process and store the scraped data. Note that the feed export and the custom pipeline shown below are two independent ways of writing the data, so in practice you would point them at different files (or use only one of them).
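As a side note, FEED_FORMAT and FEED_URI still work but are deprecated in recent Scrapy releases (2.1 and later) in favour of the FEEDS setting; an equivalent configuration would be:

FEEDS = {
    'output.json': {'format': 'json'},
}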
To create a pipeline, open the pipelines.py file inside the tutorial/tutorial directory. Define a class named ExamplePipeline and implement the process_item method. Here's an example:
import json


class ExamplePipeline:
    def open_spider(self, spider):
        self.file = open('output.json', 'w')
        self.file.write('[\n')  # Start a JSON list
        self.first_item = True

    def close_spider(self, spider):
        self.file.write('\n]')  # End the JSON list
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item))
        if not self.first_item:
            self.file.write(',\n')  # Separate items with commas
        self.file.write(line)
        self.first_item = False
        return item
In this example, we open the output file in the open_spider method and write the opening bracket of a JSON list. In the close_spider method, we write the closing bracket and close the file. Finally, in the process_item method, we convert the item to a JSON string and write it to the file, adding a comma separator before every item except the first so that the result stays valid JSON.
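If you would rather not manage the JSON formatting by hand, Scrapy ships item exporters that do it for you. Here is a sketch of an alternative pipeline using JsonItemExporter; the class name JsonExportPipeline is just an illustrative choice, and it would need its own entry in ITEM_PIPELINES:

from scrapy.exporters import JsonItemExporter


class JsonExportPipeline:
    def open_spider(self, spider):
        # The exporter writes bytes, so the file is opened in binary mode
        self.file = open('output.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item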
Conclusion
In this tutorial, we learned how to create a web crawler with Scrapy and Python. We covered the basic structure of a Scrapy project, how to define a spider, how to make requests and extract data, how to follow links, and how to store the scraped data. Scrapy provides a robust and easy-to-use framework for web scraping automation, making it an ideal choice for scraping data from websites.