How to Create a Web Crawler with Scrapy and Python

Introduction

Web scraping is the process of extracting data from websites. It typically involves fetching pages programmatically and parsing their HTML to extract useful information. Scrapy is a powerful and flexible web scraping framework built with Python. In this tutorial, we will learn how to create a web crawler with Scrapy and Python.

Prerequisites

To follow along with this tutorial, you will need the following:

  • Python 3.8 or above installed on your system (recent Scrapy releases no longer support older Python versions).
  • The pip package manager, which typically comes with Python.
  • Basic understanding of HTML and CSS.
  • Familiarity with Python programming.

Install Scrapy

Let’s start by installing Scrapy. Open a terminal or command prompt and run the following command:

pip install scrapy

Scrapy and its dependencies will be installed in your Python environment.
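
To confirm the installation, you can ask Scrapy for its version:

scrapy version

If the command prints a version number, Scrapy is available in your environment.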

Set up a New Scrapy Project

To create a new Scrapy project, open a terminal or command prompt and navigate to the directory where you want to create the project. Run the following command:

scrapy startproject tutorial

This will create a new directory named tutorial with the basic structure of a Scrapy project.
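
The generated layout looks roughly like this (exact contents can vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory that will hold your spiders
            __init__.py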

Define a Spider

A spider is the main component of a Scrapy project. It defines how to follow links and what information to extract from the web pages. In Scrapy, spiders are Python classes that define how requests are made and how responses are processed.

Open the tutorial/tutorial/spiders directory and create a new Python file named example_spider.py. In this file, define a class named ExampleSpider that extends scrapy.Spider. Here’s an example:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract information from the response
        # using XPath or CSS selectors
        pass

In the ExampleSpider class, we provide a name for our spider using the name attribute. This name will be used to identify the spider when running Scrapy commands.

The start_urls attribute is a list of URLs that the spider will start crawling from. In this example, we only have a single URL, but you can add more if needed.

The parse method is called for each response received by the spider. Inside this method, we can extract information from the response using XPath or CSS selectors. For now, let’s leave it empty as pass.
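
If you’d like a quick sanity check before moving on, you could temporarily fill in parse with a minimal extraction, for example pulling the text of the page’s <title> tag (the selector is just an illustration):

def parse(self, response):
    # Extract the text of the page's <title> tag as a quick sanity check
    page_title = response.css('title::text').get()
    yield {'page_title': page_title}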

Make Requests

To start making requests to the URLs defined in start_urls, we need to run the spider. Open a terminal or command prompt and navigate to the project’s directory (tutorial in our case). Run the following command:

scrapy crawl example

This command runs the spider named example. Scrapy will make requests to the URLs defined in start_urls and call the parse method for each response received.
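
Scrapy’s output is fairly verbose by default. If you want to hide the per-request debug messages while keeping the crawl summary, you can raise the log level when running the spider:

scrapy crawl example -s LOG_LEVEL=INFO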

Extract Data

In the parse method, we can use XPath or CSS selectors to extract data from the web pages. Scrapy provides a convenient way to select elements from the HTML using these selectors.
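
A handy way to experiment with selectors before writing them into a spider is the interactive Scrapy shell, which downloads a page and drops you into a Python prompt with a ready-made response object:

scrapy shell 'https://example.com'

Inside the shell you can try expressions such as response.css('h1::text').get() or response.xpath('//h1/text()').get() and immediately see what they return.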

For example, let’s say we want to extract the titles of all the articles on a web page. We can use the following CSS selector to accomplish this:

def parse(self, response):
    for article in response.css('article'):
        title = article.css('h2 a::text').get()
        yield {
            'title': title
        }

In this example, we select all the <article> elements and then take the text of the <a> element inside each <h2> element, which represents the title of each article. We use the yield keyword to return a dictionary with the extracted data. We yield rather than return because parse is a generator: it can produce any number of items (and follow-up requests), which Scrapy consumes one by one as the crawl proceeds.
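
The same extraction can be written with XPath instead of CSS selectors; the page structure (an <a> inside an <h2> inside each <article>) is the same assumed markup as above:

def parse(self, response):
    # Equivalent extraction using XPath instead of CSS selectors
    for article in response.xpath('//article'):
        title = article.xpath('.//h2/a/text()').get()
        yield {
            'title': title
        }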

Follow Links

To crawl multiple pages, we can make requests to the links found on the current page. We can achieve this by calling the response.follow method and passing it the URL of the link together with a callback that will handle the resulting page.

For example, let’s say we have a web page with pagination links at the bottom. We want to follow these links to scrape all the pages. We can modify our spider as follows:

def parse(self, response):
    for article in response.css('article'):
        title = article.css('h2 a::text').get()
        yield {
            'title': title
        }

    next_page_url = response.css('a.next-page-link::attr(href)').get()
    if next_page_url:
        yield response.follow(next_page_url, self.parse)

In this example, we select the URL of the next page using a CSS selector. If the URL exists, we call response.follow and pass it the URL and the parse method. This schedules a new request for the next page, and Scrapy calls the parse method again when that response arrives. Conveniently, response.follow also accepts relative URLs, so we don’t need to join them with the page URL ourselves.
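
On Scrapy 2.0 and later, response.follow_all can schedule one request per matching link in a single call, which is a more compact way to express the same pagination logic (the CSS selector is the same assumed markup as above):

def parse(self, response):
    for article in response.css('article'):
        yield {
            'title': article.css('h2 a::text').get()
        }

    # follow_all yields one request per link matched by the CSS selector
    yield from response.follow_all(css='a.next-page-link', callback=self.parse)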

Store the Scraped Data

By default, Scrapy outputs the scraped data to the terminal. However, we can configure Scrapy to store the data in different formats like JSON, CSV, or a database.
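
For quick jobs you don’t need to change any settings: recent Scrapy versions can write a feed file straight from the command line, where -O overwrites the file and -o appends to it:

scrapy crawl example -O output.json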

To make the export part of the project configuration instead, open the settings.py file inside the tutorial/tutorial directory and add a FEEDS setting (on recent Scrapy versions, FEEDS replaces the older FEED_FORMAT and FEED_URI settings):

FEEDS = {
    'output.json': {'format': 'json'},
}

This tells Scrapy to export the scraped items as JSON to a file named output.json.

If you need more control over how items are processed or stored, you can write an item pipeline. Pipelines in Scrapy are used to validate, transform, and store the scraped data. To enable one, uncomment the ITEM_PIPELINES block in settings.py and point it at your pipeline class; the number is the pipeline’s priority (lower numbers run first):

ITEM_PIPELINES = {
   'tutorial.pipelines.ExamplePipeline': 300,
}

To create the pipeline, open the pipelines.py file inside the tutorial/tutorial directory, define a class named ExamplePipeline, and implement the process_item method. The example below writes the items to output.json itself, so if you use it you can drop the FEEDS entry for that file:

import json

class ExamplePipeline:
    def open_spider(self, spider):
        self.file = open('output.json', 'w')
        self.file.write('[\n')  # Start the JSON list
        self.first_item = True

    def close_spider(self, spider):
        self.file.write('\n]')  # End the JSON list
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item))
        if not self.first_item:
            self.file.write(',\n')  # Separate objects with a comma
        self.first_item = False
        self.file.write(line)
        return item

In this example, we open the output file in the open_spider method and write the opening bracket of a JSON list. In the close_spider method, we write the closing bracket and close the file. In the process_item method, we convert the item to a JSON string and write it to the file, inserting a comma before every item except the first so the result is valid JSON.
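
Pipelines are not limited to writing files. As a small sketch of item validation (TitleValidationPipeline is a hypothetical name, not something the project template generates), a pipeline can drop items that are missing a field:

from scrapy.exceptions import DropItem

class TitleValidationPipeline:
    def process_item(self, item, spider):
        # Discard any item that arrived without a title
        if not item.get('title'):
            raise DropItem('Missing title')
        return item

To enable it, add it to ITEM_PIPELINES next to ExamplePipeline with a lower number so it runs first.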

Conclusion

In this tutorial, we learned how to create a web crawler with Scrapy and Python. We covered the basic structure of a Scrapy project, how to define a spider, how to make requests and extract data, how to follow links, and how to store the scraped data. Scrapy provides a robust and easy-to-use framework for web scraping automation, making it an ideal choice for scraping data from websites.
