๐ŸŒ Web Scraper

A configurable async web scraper demonstrating Python's asyncio, the aiohttp HTTP client, BeautifulSoup HTML parsing, and common data extraction patterns.

🎯 Learning Objectives

After working through this project, you'll understand:

  • Async/Await Patterns - Non-blocking I/O with asyncio
  • HTTP Clients - Making requests with aiohttp
  • HTML Parsing - Extracting data with BeautifulSoup and CSS selectors
  • Configuration Management - YAML configuration with dataclasses
  • Data Export - Saving data in multiple formats (JSON, CSV)
  • Rate Limiting - Polite scraping with delays
  • Caching - Avoiding redundant requests
  • Context Managers - Async context managers for resource management

Features

  • Async HTTP requests with connection pooling
  • CSS selector-based data extraction
  • Rate limiting and exponential backoff
  • Response caching to reduce network load
  • Multiple export formats (JSON, CSV)
  • YAML configuration for easy customization
  • Table and link extraction utilities
  • Comprehensive error handling

🚀 Quick Start

Installation

cd 03_web_scraper

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Usage

# Run with configuration file
python -m scraper config.yaml

# Quick mode - single URL
python -m scraper --url "https://example.com" --selector "h1"

# With custom output file
python -m scraper config.yaml --output results.json

๐Ÿ“ Project Structure

03_web_scraper/
├── scraper/
│   ├── __init__.py       # Package marker
│   ├── main.py           # CLI entry point
│   ├── config.py         # Configuration dataclasses
│   ├── crawler.py        # Async HTTP client with rate limiting
│   ├── parser.py         # HTML parsing and data extraction
│   └── exporters.py      # JSON and CSV export
├── tests/
│   ├── __init__.py
│   └── test_scraper.py   # Unit tests
├── config.yaml           # Example configuration
├── requirements.txt      # Python dependencies
└── README.md

📋 Configuration

YAML Configuration File

# config.yaml
targets:
  - name: 'example_site'
    url: 'https://example.com'
    selectors:
      title: 'h1'
      paragraphs: 'p'
      links: 'a[href]'
    follow_links: false
    max_pages: 1

  - name: 'news_site'
    url: 'https://news.ycombinator.com'
    selectors:
      headlines: '.titleline > a'
      scores: '.score'

settings:
  rate_limit: 1.0 # Seconds between requests
  timeout: 30 # Request timeout in seconds
  user_agent: 'MyScraper/1.0'
  output_format: 'json' # 'json' or 'csv'
  output_file: 'data.json'
  max_retries: 3 # Retry failed requests
  cache_enabled: true # Cache responses locally
  cache_dir: '.cache'

Configuration Options

Setting          Default               Description
rate_limit       1.0                   Seconds to wait between requests
timeout          30                    Request timeout in seconds
user_agent       'PythonScraper/1.0'   HTTP User-Agent header
output_format    'json'                Output format: json or csv
output_file      'output.json'         Path to output file
max_retries      3                     Number of retry attempts
cache_enabled    true                  Enable response caching
cache_dir        '.cache'              Directory for cached responses
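
Loading the Configuration

The config module itself isn't reproduced in this README. As a rough sketch, and reusing the Target, Settings, and ScraperConfig names that appear in the usage example below, the options above could map onto dataclasses populated from the YAML file roughly like this (the from_yaml helper is illustrative, not necessarily the project's API):

# Hypothetical sketch of config.py: dataclasses mirroring the options above.
from dataclasses import dataclass
from pathlib import Path

import yaml  # PyYAML


@dataclass
class Target:
    name: str
    url: str
    selectors: dict[str, str]
    follow_links: bool = False
    max_pages: int = 1


@dataclass
class Settings:
    rate_limit: float = 1.0
    timeout: int = 30
    user_agent: str = 'PythonScraper/1.0'
    output_format: str = 'json'
    output_file: str = 'output.json'
    max_retries: int = 3
    cache_enabled: bool = True
    cache_dir: str = '.cache'


@dataclass
class ScraperConfig:
    targets: list[Target]
    settings: Settings

    @classmethod
    def from_yaml(cls, path) -> 'ScraperConfig':
        """Build a ScraperConfig from a YAML file like config.yaml above."""
        raw = yaml.safe_load(Path(path).read_text())
        targets = [Target(**t) for t in raw.get('targets', [])]
        settings = Settings(**raw.get('settings', {}))
        return cls(targets=targets, settings=settings)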

🔑 Key Concepts Explained

Async Context Manager

import aiohttp
from aiohttp import ClientTimeout

class Crawler:
    """Async crawler with proper resource management."""

    def __init__(self, settings):
        self.settings = settings
        self.session = None

    async def __aenter__(self):
        """Setup on entering 'async with'."""
        self.session = aiohttp.ClientSession(
            timeout=ClientTimeout(total=self.settings.timeout),
            headers={'User-Agent': self.settings.user_agent}
        )
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Cleanup on exiting 'async with'."""
        await self.session.close()

# Usage:
async with Crawler(settings) as crawler:
    result = await crawler.scrape(target)
# Session is automatically closed

Rate Limiting

async def _rate_limit(self):
    """Wait between requests to be polite."""
    elapsed = time.time() - self.last_request_time
    if elapsed < self.settings.rate_limit:
        await asyncio.sleep(self.settings.rate_limit - elapsed)
    self.last_request_time = time.time()

CSS Selector Extraction

from bs4 import BeautifulSoup

def parse_html(html: str, selectors: dict) -> dict:
    """Extract data using CSS selectors."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}

    for field, selector in selectors.items():
        elements = soup.select(selector)
        if len(elements) == 1:
            result[field] = elements[0].get_text(strip=True)
        else:
            result[field] = [el.get_text(strip=True) for el in elements]

    return result

Retry with Exponential Backoff

async def _fetch(self, url: str) -> str:
    """Fetch with retries and exponential backoff."""
    for attempt in range(self.settings.max_retries):
        try:
            async with self.session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except aiohttp.ClientError as e:
            if attempt == self.settings.max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s...
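
Response Caching

The cache_enabled and cache_dir settings imply simple on-disk caching of fetched pages. The crawler's actual caching code isn't shown in this README; the sketch below (with illustrative method names _cache_path and _fetch_cached) shows the basic idea of keying cache files by a hash of the URL:

# Hypothetical sketch: file-based response cache keyed by a hash of the URL.
import hashlib
from pathlib import Path

def _cache_path(self, url: str) -> Path:
    """Map a URL to a file inside the configured cache directory."""
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    return Path(self.settings.cache_dir) / f"{digest}.html"

async def _fetch_cached(self, url: str) -> str:
    """Serve from the cache when possible, otherwise fetch and store."""
    path = self._cache_path(url)
    if self.settings.cache_enabled and path.exists():
        return path.read_text(encoding='utf-8')
    html = await self._fetch(url)
    if self.settings.cache_enabled:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding='utf-8')
    return html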

💡 Usage Examples

Basic Scraping

import asyncio
from scraper.config import ScraperConfig, Target, Settings
from scraper.crawler import Crawler

async def main():
    target = Target(
        name='example',
        url='https://example.com',
        selectors={'title': 'h1', 'text': 'p'}
    )

    settings = Settings()

    async with Crawler(settings) as crawler:
        result = await crawler.scrape(target)
        print(result)

asyncio.run(main())
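
Exporting Results

exporters.py handles the JSON and CSV output, but its interface isn't shown above. A minimal standard-library sketch (the function names export_json and export_csv are illustrative) could look like this:

# Hypothetical sketch of exporters.py using only the standard library.
import csv
import json
from pathlib import Path

def export_json(records: list[dict], path: str) -> None:
    """Write the scraped records as pretty-printed JSON."""
    Path(path).write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding='utf-8')

def export_csv(records: list[dict], path: str) -> None:
    """Write the records as CSV, using the union of all keys as the header."""
    if not records:
        return
    fieldnames = sorted({key for record in records for key in record})
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)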

Extracting Tables

from scraper.parser import extract_table

html = """
<table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Item A</td><td>$10</td></tr>
    <tr><td>Item B</td><td>$20</td></tr>
</table>
"""

rows = extract_table(html)
# [{'Name': 'Item A', 'Price': '$10'}, {'Name': 'Item B', 'Price': '$20'}]
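
One way extract_table could work (the project's parser.py may differ) is to treat the first row's cells as headers and zip every following row against them:

# Hypothetical sketch of extract_table: first row as header, remaining rows as records.
from bs4 import BeautifulSoup

def extract_table(html: str) -> list[dict]:
    """Parse the first <table> into a list of row dicts keyed by the header row."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    rows = table.find_all('tr') if table else []
    if not rows:
        return []
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
    return [
        dict(zip(headers, (cell.get_text(strip=True) for cell in row.find_all('td'))))
        for row in rows[1:]
    ]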

Extracting All Links

from scraper.parser import extract_links

links = extract_links(html, base_url='https://example.com')
# Converts relative URLs to absolute:
# ['/page' -> 'https://example.com/page']

🧪 Running Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run tests
pytest tests/ -v

# Run specific test
pytest tests/test_scraper.py::TestParser -v
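
Parser tests are plain synchronous tests; crawler tests need pytest-asyncio. Assuming parse_html lives in scraper.parser as sketched earlier, a minimal pair of tests (names are illustrative) might look like:

# Hypothetical test sketch; the async test requires pytest-asyncio.
import pytest

from scraper.config import Settings
from scraper.crawler import Crawler
from scraper.parser import parse_html

def test_parse_html_single_element():
    html = '<html><body><h1>Hello</h1></body></html>'
    assert parse_html(html, {'title': 'h1'})['title'] == 'Hello'

@pytest.mark.asyncio
async def test_crawler_opens_and_closes_session():
    async with Crawler(Settings()) as crawler:
        assert crawler.session is not None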

โš–๏ธ Ethical Scraping Guidelines

IMPORTANT: Always scrape responsibly!

Do's ✅

  • Check robots.txt before scraping any site
  • Identify your scraper with a descriptive User-Agent
  • Respect rate limits - don't overwhelm servers
  • Cache responses to avoid redundant requests
  • Handle errors gracefully without retrying infinitely
  • Only scrape public data you have permission to access

Don'ts โŒ

  • Don't ignore robots.txt directives
  • Don't make rapid-fire requests (use rate limiting)
  • Don't scrape login-protected pages without permission
  • Don't redistribute copyrighted content
  • Don't pretend to be a browser if you're a bot

Example robots.txt Check

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = '*') -> bool:
    """Check if URL is allowed by the site's robots.txt (always at the site root)."""
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

📈 Extending the Scraper

Add JavaScript Rendering

For JavaScript-heavy sites, you might need a browser:

# Using playwright (async browser automation)
from playwright.async_api import async_playwright

async def render_js(url: str) -> str:
    """Render JavaScript and return HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_load_state('networkidle')
        html = await page.content()
        await browser.close()
        return html

Add Proxy Support

async def _fetch_with_proxy(self, url: str, proxy: str) -> str:
    """Fetch through a proxy server."""
    async with self.session.get(url, proxy=proxy) as response:
        return await response.text()

Add Progress Tracking

from tqdm.asyncio import tqdm

async def scrape_all(self, targets: list[Target]) -> list:
    results = []
    async for target in tqdm(targets, desc="Scraping"):
        result = await self.scrape(target)
        results.append(result)
    return results

📚 Related Learning

✅ Project Checklist

  • Run the scraper with the example config
  • Create a custom config for a site you want to scrape
  • Try different CSS selectors
  • Export to both JSON and CSV formats
  • Implement caching and verify it works
  • Add a new extraction function (e.g., for images)
  • Write tests for your custom extractors
  • Add robots.txt checking to the crawler

โš ๏ธ Disclaimer

This tool is for educational purposes only. Always:

  • Respect website terms of service
  • Check and follow robots.txt
  • Obtain permission when required
  • Use scraped data ethically