๐ŸŒ Web Scraper

A configurable async web scraper demonstrating Python's asyncio, the aiohttp HTTP client, BeautifulSoup HTML parsing, and common data extraction patterns.

🎯 Learning Objectives

After working through this project, you'll understand:

  • Async/Await Patterns - Non-blocking I/O with asyncio
  • HTTP Clients - Making requests with aiohttp
  • HTML Parsing - Extracting data with BeautifulSoup and CSS selectors
  • Configuration Management - YAML configuration with dataclasses
  • Data Export - Saving data in multiple formats (JSON, CSV)
  • Rate Limiting - Polite scraping with delays
  • Caching - Avoiding redundant requests
  • Context Managers - Async context managers for resource management

Features

  • Async HTTP requests with connection pooling
  • CSS selector-based data extraction
  • Rate limiting and exponential backoff
  • Response caching to reduce network load
  • Multiple export formats (JSON, CSV)
  • YAML configuration for easy customization
  • Table and link extraction utilities
  • Comprehensive error handling

🚀 Quick Start

Installation

cd 03_web_scraper

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Usage

# Run with configuration file
python -m scraper config.yaml

# Quick mode - single URL
python -m scraper --url "https://example.com" --selector "h1"

# With custom output file
python -m scraper config.yaml --output results.json

๐Ÿ“ Project Structure

03_web_scraper/
├── scraper/
│   ├── __init__.py       # Package marker
│   ├── main.py           # CLI entry point
│   ├── config.py         # Configuration dataclasses
│   ├── crawler.py        # Async HTTP client with rate limiting
│   ├── parser.py         # HTML parsing and data extraction
│   └── exporters.py      # JSON and CSV export
├── tests/
│   ├── __init__.py
│   └── test_scraper.py   # Unit tests
├── config.yaml           # Example configuration
├── requirements.txt      # Python dependencies
└── README.md

📋 Configuration

YAML Configuration File

# config.yaml
targets:
  - name: 'example_site'
    url: 'https://example.com'
    selectors:
      title: 'h1'
      paragraphs: 'p'
      links: 'a[href]'
    follow_links: false
    max_pages: 1

  - name: 'news_site'
    url: 'https://news.ycombinator.com'
    selectors:
      headlines: '.titleline > a'
      scores: '.score'

settings:
  rate_limit: 1.0 # Seconds between requests
  timeout: 30 # Request timeout in seconds
  user_agent: 'MyScraper/1.0'
  output_format: 'json' # 'json' or 'csv'
  output_file: 'data.json'
  max_retries: 3 # Retry failed requests
  cache_enabled: true # Cache responses locally
  cache_dir: '.cache'

Configuration Options

Setting          Default               Description
rate_limit       1.0                   Seconds to wait between requests
timeout          30                    Request timeout in seconds
user_agent       'PythonScraper/1.0'   HTTP User-Agent header
output_format    'json'                Output format: json or csv
output_file      'output.json'         Path to output file
max_retries      3                     Number of retry attempts
cache_enabled    true                  Enable response caching
cache_dir        '.cache'              Directory for cached responses
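
Loading the Configuration

The config module itself isn't reproduced in this README. As a rough sketch, and reusing the Target, Settings, and ScraperConfig names that appear in the usage example below, the options above could map onto dataclasses populated from the YAML file roughly like this (the from_yaml helper is illustrative, not necessarily the project's API):

# Hypothetical sketch of config.py: dataclasses mirroring the options above.
from dataclasses import dataclass
from pathlib import Path

import yaml  # PyYAML


@dataclass
class Target:
    name: str
    url: str
    selectors: dict[str, str]
    follow_links: bool = False
    max_pages: int = 1


@dataclass
class Settings:
    rate_limit: float = 1.0
    timeout: int = 30
    user_agent: str = 'PythonScraper/1.0'
    output_format: str = 'json'
    output_file: str = 'output.json'
    max_retries: int = 3
    cache_enabled: bool = True
    cache_dir: str = '.cache'


@dataclass
class ScraperConfig:
    targets: list[Target]
    settings: Settings

    @classmethod
    def from_yaml(cls, path) -> 'ScraperConfig':
        """Build a ScraperConfig from a YAML file like config.yaml above."""
        raw = yaml.safe_load(Path(path).read_text())
        targets = [Target(**t) for t in raw.get('targets', [])]
        settings = Settings(**raw.get('settings', {}))
        return cls(targets=targets, settings=settings)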

🔑 Key Concepts Explained

Async Context Manager

import aiohttp
from aiohttp import ClientTimeout

class Crawler:
    """Async crawler with proper resource management."""

    def __init__(self, settings):
        self.settings = settings
        self.session = None

    async def __aenter__(self):
        """Setup on entering 'async with'."""
        self.session = aiohttp.ClientSession(
            timeout=ClientTimeout(total=self.settings.timeout),
            headers={'User-Agent': self.settings.user_agent}
        )
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Cleanup on exiting 'async with'."""
        await self.session.close()

# Usage:
async with Crawler(settings) as crawler:
    result = await crawler.scrape(target)
# Session is automatically closed

Rate Limiting

async def _rate_limit(self):
    """Wait between requests to be polite."""
    elapsed = time.time() - self.last_request_time
    if elapsed < self.settings.rate_limit:
        await asyncio.sleep(self.settings.rate_limit - elapsed)
    self.last_request_time = time.time()

CSS Selector Extraction

from bs4 import BeautifulSoup

def parse_html(html: str, selectors: dict) -> dict:
    """Extract data using CSS selectors."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}

    for field, selector in selectors.items():
        elements = soup.select(selector)
        if len(elements) == 1:
            result[field] = elements[0].get_text(strip=True)
        else:
            result[field] = [el.get_text(strip=True) for el in elements]

    return result

Retry with Exponential Backoff

async def _fetch(self, url: str) -> str:
    """Fetch with retries and exponential backoff."""
    for attempt in range(self.settings.max_retries):
        try:
            async with self.session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except aiohttp.ClientError as e:
            if attempt == self.settings.max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s...
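
Response Caching

The cache_enabled and cache_dir settings imply simple on-disk caching of fetched pages. The crawler's actual caching code isn't shown in this README; the sketch below (with illustrative method names _cache_path and _fetch_cached) shows the basic idea of keying cache files by a hash of the URL:

# Hypothetical sketch: file-based response cache keyed by a hash of the URL.
import hashlib
from pathlib import Path

def _cache_path(self, url: str) -> Path:
    """Map a URL to a file inside the configured cache directory."""
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    return Path(self.settings.cache_dir) / f"{digest}.html"

async def _fetch_cached(self, url: str) -> str:
    """Serve from the cache when possible, otherwise fetch and store."""
    path = self._cache_path(url)
    if self.settings.cache_enabled and path.exists():
        return path.read_text(encoding='utf-8')
    html = await self._fetch(url)
    if self.settings.cache_enabled:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding='utf-8')
    return html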

💡 Usage Examples

Basic Scraping

import asyncio
from scraper.config import ScraperConfig, Target, Settings
from scraper.crawler import Crawler

async def main():
    target = Target(
        name='example',
        url='https://example.com',
        selectors={'title': 'h1', 'text': 'p'}
    )

    settings = Settings()

    async with Crawler(settings) as crawler:
        result = await crawler.scrape(target)
        print(result)

asyncio.run(main())
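
Exporting Results

exporters.py handles the JSON and CSV output, but its interface isn't shown above. A minimal standard-library sketch (the function names export_json and export_csv are illustrative) could look like this:

# Hypothetical sketch of exporters.py using only the standard library.
import csv
import json
from pathlib import Path

def export_json(records: list[dict], path: str) -> None:
    """Write the scraped records as pretty-printed JSON."""
    Path(path).write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding='utf-8')

def export_csv(records: list[dict], path: str) -> None:
    """Write the records as CSV, using the union of all keys as the header."""
    if not records:
        return
    fieldnames = sorted({key for record in records for key in record})
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)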

Extracting Tables

from scraper.parser import extract_table

html = """
<table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Item A</td><td>$10</td></tr>
    <tr><td>Item B</td><td>$20</td></tr>
</table>
"""

rows = extract_table(html)
# [{'Name': 'Item A', 'Price': '$10'}, {'Name': 'Item B', 'Price': '$20'}]
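
One way extract_table could work (the project's parser.py may differ) is to treat the first row's cells as headers and zip every following row against them:

# Hypothetical sketch of extract_table: first row as header, remaining rows as records.
from bs4 import BeautifulSoup

def extract_table(html: str) -> list[dict]:
    """Parse the first <table> into a list of row dicts keyed by the header row."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    rows = table.find_all('tr') if table else []
    if not rows:
        return []
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
    return [
        dict(zip(headers, (cell.get_text(strip=True) for cell in row.find_all('td'))))
        for row in rows[1:]
    ]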

Extracting All Links

from scraper.parser import extract_links

links = extract_links(html, base_url='https://example.com')
# Converts relative URLs to absolute:
# ['/page' -> 'https://example.com/page']

🧪 Running Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run tests
pytest tests/ -v

# Run specific test
pytest tests/test_scraper.py::TestParser -v
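
Parser tests are plain synchronous tests; crawler tests need pytest-asyncio. Assuming parse_html lives in scraper.parser as sketched earlier, a minimal pair of tests (names are illustrative) might look like:

# Hypothetical test sketch; the async test requires pytest-asyncio.
import pytest

from scraper.config import Settings
from scraper.crawler import Crawler
from scraper.parser import parse_html

def test_parse_html_single_element():
    html = '<html><body><h1>Hello</h1></body></html>'
    assert parse_html(html, {'title': 'h1'})['title'] == 'Hello'

@pytest.mark.asyncio
async def test_crawler_opens_and_closes_session():
    async with Crawler(Settings()) as crawler:
        assert crawler.session is not None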

โš–๏ธ Ethical Scraping Guidelines

IMPORTANT: Always scrape responsibly!

Do's ✅

  • Check robots.txt before scraping any site
  • Identify your scraper with a descriptive User-Agent
  • Respect rate limits - don't overwhelm servers
  • Cache responses to avoid redundant requests
  • Handle errors gracefully without retrying infinitely
  • Only scrape public data you have permission to access

Don'ts โŒ

  • Don't ignore robots.txt directives
  • Don't make rapid-fire requests (use rate limiting)
  • Don't scrape login-protected pages without permission
  • Don't redistribute copyrighted content
  • Don't pretend to be a browser if you're a bot

Example robots.txt Check

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = '*') -> bool:
    """Check if URL is allowed by the site's robots.txt (always at the site root)."""
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

📈 Extending the Scraper

Add JavaScript Rendering

For JavaScript-heavy sites, you might need a browser:

# Using playwright (async browser automation)
from playwright.async_api import async_playwright

async def render_js(url: str) -> str:
    """Render JavaScript and return HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_load_state('networkidle')
        html = await page.content()
        await browser.close()
        return html

Add Proxy Support

async def _fetch_with_proxy(self, url: str, proxy: str) -> str:
    """Fetch through a proxy server."""
    async with self.session.get(url, proxy=proxy) as response:
        return await response.text()

Add Progress Tracking

from tqdm.asyncio import tqdm

async def scrape_all(self, targets: list[Target]) -> list:
    results = []
    async for target in tqdm(targets, desc="Scraping"):
        result = await self.scrape(target)
        results.append(result)
    return results

📚 Related Learning

✅ Project Checklist

  • Run the scraper with the example config
  • Create a custom config for a site you want to scrape
  • Try different CSS selectors
  • Export to both JSON and CSV formats
  • Implement caching and verify it works
  • Add a new extraction function (e.g., for images)
  • Write tests for your custom extractors
  • Add robots.txt checking to the crawler

โš ๏ธ Disclaimer

This tool is for educational purposes only. Always:

  • Respect website terms of service
  • Check and follow robots.txt
  • Obtain permission when required
  • Use scraped data ethically