# Web Scraper
A configurable async web scraper demonstrating Python's asyncio, aiohttp, BeautifulSoup, and data extraction patterns.
## Learning Objectives
After working through this project, you'll understand:

- Async/Await Patterns - Non-blocking I/O with asyncio
- HTTP Clients - Making requests with aiohttp
- HTML Parsing - Extracting data with BeautifulSoup and CSS selectors
- Configuration Management - YAML configuration with dataclasses
- Data Export - Saving data in multiple formats (JSON, CSV)
- Rate Limiting - Polite scraping with delays
- Caching - Avoiding redundant requests
- Context Managers - Async context managers for resource management
## Features

- Async HTTP requests with connection pooling
- CSS selector-based data extraction
- Rate limiting and exponential backoff
- Response caching to reduce network load
- Multiple export formats (JSON, CSV)
- YAML configuration for easy customization
- Table and link extraction utilities
- Comprehensive error handling
## Quick Start

### Installation

```bash
cd 03_web_scraper

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
### Usage

```bash
# Run with configuration file
python -m scraper config.yaml

# Quick mode - single URL
python -m scraper --url "https://example.com" --selector "h1"

# With custom output file
python -m scraper config.yaml --output results.json
```
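
These flags are handled by `scraper/main.py`. Below is a minimal sketch of what that entry point might look like with `argparse`; the exact flag handling is an assumption for illustration, not the project's actual code:

```python
# Hypothetical sketch of scraper/main.py -- for illustration only.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog='scraper', description='Async web scraper')
    parser.add_argument('config', nargs='?', help='Path to a YAML configuration file')
    parser.add_argument('--url', help='Quick mode: scrape a single URL')
    parser.add_argument('--selector', help='Quick mode: CSS selector to extract')
    parser.add_argument('--output', help='Override the configured output file')
    return parser


def main() -> None:
    parser = build_parser()
    args = parser.parse_args()
    if not args.config and not args.url:
        parser.error('provide a config file or --url')
    # Build the configuration from args here and run the async crawl.


if __name__ == '__main__':
    main()
```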
## Project Structure

```
03_web_scraper/
├── scraper/
│   ├── __init__.py        # Package marker
│   ├── main.py            # CLI entry point
│   ├── config.py          # Configuration dataclasses
│   ├── crawler.py         # Async HTTP client with rate limiting
│   ├── parser.py          # HTML parsing and data extraction
│   └── exporters.py       # JSON and CSV export
├── tests/
│   ├── __init__.py
│   └── test_scraper.py    # Unit tests
├── config.yaml            # Example configuration
├── requirements.txt       # Python dependencies
└── README.md
```
## Configuration

### YAML Configuration File

```yaml
# config.yaml
targets:
  - name: 'example_site'
    url: 'https://example.com'
    selectors:
      title: 'h1'
      paragraphs: 'p'
      links: 'a[href]'
    follow_links: false
    max_pages: 1

  - name: 'news_site'
    url: 'https://news.ycombinator.com'
    selectors:
      headlines: '.titleline > a'
      scores: '.score'

settings:
  rate_limit: 1.0           # Seconds between requests
  timeout: 30               # Request timeout in seconds
  user_agent: 'MyScraper/1.0'
  output_format: 'json'     # 'json' or 'csv'
  output_file: 'data.json'
  max_retries: 3            # Retry failed requests
  cache_enabled: true       # Cache responses locally
  cache_dir: '.cache'
```
### Configuration Options

| Setting | Default | Description |
|---|---|---|
| rate_limit | 1.0 | Seconds to wait between requests |
| timeout | 30 | Request timeout in seconds |
| user_agent | 'PythonScraper/1.0' | HTTP User-Agent header |
| output_format | 'json' | Output format: json or csv |
| output_file | 'output.json' | Path to output file |
| max_retries | 3 | Number of retry attempts |
| cache_enabled | true | Enable response caching |
| cache_dir | '.cache' | Directory for cached responses |
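
These settings map naturally onto dataclasses. Below is a minimal sketch of how `scraper/config.py` could load the YAML file; the field layout mirrors the table above, but the class and method names are assumptions rather than the module's confirmed API:

```python
from dataclasses import dataclass
from pathlib import Path

import yaml  # PyYAML


@dataclass
class Target:
    name: str
    url: str
    selectors: dict[str, str]
    follow_links: bool = False
    max_pages: int = 1


@dataclass
class Settings:
    rate_limit: float = 1.0
    timeout: int = 30
    user_agent: str = 'PythonScraper/1.0'
    output_format: str = 'json'
    output_file: str = 'output.json'
    max_retries: int = 3
    cache_enabled: bool = True
    cache_dir: str = '.cache'


@dataclass
class ScraperConfig:
    targets: list[Target]
    settings: Settings

    @classmethod
    def from_yaml(cls, path: str | Path) -> 'ScraperConfig':
        """Build targets and settings from a YAML file like config.yaml."""
        raw = yaml.safe_load(Path(path).read_text())
        targets = [Target(**t) for t in raw.get('targets', [])]
        settings = Settings(**raw.get('settings', {}))
        return cls(targets=targets, settings=settings)
```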
## Key Concepts Explained

### Async Context Manager

```python
class Crawler:
    """Async crawler with proper resource management."""

    async def __aenter__(self):
        """Setup on entering 'async with'."""
        self.session = aiohttp.ClientSession(
            timeout=ClientTimeout(total=30),
            headers={'User-Agent': self.settings.user_agent}
        )
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Cleanup on exiting 'async with'."""
        await self.session.close()


# Usage:
async with Crawler(settings) as crawler:
    result = await crawler.scrape(target)
# Session is automatically closed
```
### Rate Limiting

```python
async def _rate_limit(self):
    """Wait between requests to be polite."""
    elapsed = time.time() - self.last_request_time
    if elapsed < self.settings.rate_limit:
        await asyncio.sleep(self.settings.rate_limit - elapsed)
    self.last_request_time = time.time()
```
### CSS Selector Extraction

```python
from bs4 import BeautifulSoup


def parse_html(html: str, selectors: dict) -> dict:
    """Extract data using CSS selectors."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for field, selector in selectors.items():
        elements = soup.select(selector)
        if len(elements) == 1:
            result[field] = elements[0].get_text(strip=True)
        else:
            result[field] = [el.get_text(strip=True) for el in elements]
    return result
```
### Retry with Exponential Backoff

```python
async def _fetch(self, url: str) -> str:
    """Fetch with retries and exponential backoff."""
    for attempt in range(self.settings.max_retries):
        try:
            async with self.session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except aiohttp.ClientError:
            if attempt == self.settings.max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s...
```
## Usage Examples

### Basic Scraping

```python
import asyncio
from scraper.config import ScraperConfig, Target, Settings
from scraper.crawler import Crawler


async def main():
    target = Target(
        name='example',
        url='https://example.com',
        selectors={'title': 'h1', 'text': 'p'}
    )
    settings = Settings()

    async with Crawler(settings) as crawler:
        result = await crawler.scrape(target)
        print(result)


asyncio.run(main())
```
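
### Exporting Results

`scraper/exporters.py` handles saving results as JSON or CSV. The standalone sketch below shows the general idea using only the standard library; the function names are illustrative, not necessarily the module's real signatures:

```python
import csv
import json


def export_json(rows: list[dict], path: str) -> None:
    """Write scraped records to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)


def export_csv(rows: list[dict], path: str) -> None:
    """Write scraped records to CSV, one column per field."""
    if not rows:
        return
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


export_json([{'title': 'Example Domain'}], 'data.json')
```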
### Extracting Tables

```python
from scraper.parser import extract_table

html = """
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Item A</td><td>$10</td></tr>
<tr><td>Item B</td><td>$20</td></tr>
</table>
"""

rows = extract_table(html)
# [{'Name': 'Item A', 'Price': '$10'}, {'Name': 'Item B', 'Price': '$20'}]
```
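
`extract_table` lives in `scraper/parser.py`. A possible implementation is sketched below, assuming the first row of the table contains the headers; the real function may differ:

```python
from bs4 import BeautifulSoup


def extract_table(html: str) -> list[dict]:
    """Parse the first <table> into a list of dicts keyed by header text."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return []
    rows = table.find_all('tr')
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
    return [
        dict(zip(headers, [td.get_text(strip=True) for td in row.find_all('td')]))
        for row in rows[1:]
    ]
```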
### Extracting All Links

```python
from scraper.parser import extract_links

links = extract_links(html, base_url='https://example.com')
# Converts relative URLs to absolute:
# ['/page' -> 'https://example.com/page']
```
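
A sketch of how `extract_links` could be built on `urllib.parse.urljoin` (illustrative; the actual `parser.py` implementation may differ):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str = '') -> list[str]:
    """Collect href values from all <a> tags, resolved against base_url."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, a['href']) for a in soup.select('a[href]')]
```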
## Running Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run tests
pytest tests/ -v

# Run specific test
pytest tests/test_scraper.py::TestParser -v
```
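
Coroutines need `pytest-asyncio` to run as tests. A minimal example of both styles (the parser test assumes the `parse_html` helper shown earlier; the test names are illustrative):

```python
import asyncio

import pytest

from scraper.parser import parse_html


def test_parse_html_returns_list_for_multiple_matches():
    html = '<html><body><h1>Hello</h1><p>a</p><p>b</p></body></html>'
    result = parse_html(html, {'title': 'h1', 'paragraphs': 'p'})
    assert result['title'] == 'Hello'
    assert result['paragraphs'] == ['a', 'b']


@pytest.mark.asyncio
async def test_async_functions_run_in_an_event_loop():
    # pytest-asyncio executes 'async def' tests inside an event loop.
    result = await asyncio.sleep(0, result='done')
    assert result == 'done'
```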
## Ethical Scraping Guidelines
IMPORTANT: Always scrape responsibly!
### Do's

- Check robots.txt before scraping any site
- Identify your scraper with a descriptive User-Agent
- Respect rate limits - don't overwhelm servers
- Cache responses to avoid redundant requests
- Handle errors gracefully without retrying infinitely
- Only scrape public data you have permission to access
### Don'ts

- Don't ignore robots.txt directives
- Don't make rapid-fire requests (use rate limiting)
- Don't scrape login-protected pages without permission
- Don't redistribute copyrighted content
- Don't pretend to be a browser if you're a bot
### Example robots.txt Check

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def can_scrape(url: str, user_agent: str = '*') -> bool:
    """Check if URL is allowed by the site's robots.txt."""
    parsed = urlparse(url)
    rp = RobotFileParser()
    # robots.txt always lives at the site root, not under the page's path
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)
```
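
If you extend the crawler to honor robots.txt (as the project checklist suggests), one hedged sketch of how `can_scrape` might be wired into a `scrape` method; the return shape here is an assumption about your own extension, not the existing crawler's behavior:

```python
async def scrape(self, target: Target) -> dict:
    """Skip targets that robots.txt disallows before fetching them."""
    if not can_scrape(target.url, self.settings.user_agent):
        return {'name': target.name, 'error': 'disallowed by robots.txt'}
    html = await self._fetch(target.url)
    return parse_html(html, target.selectors)
```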
## Extending the Scraper

### Add JavaScript Rendering
For JavaScript-heavy sites, you might need a browser:

```python
# Using playwright (async browser automation)
from playwright.async_api import async_playwright


async def render_js(url: str) -> str:
    """Render JavaScript and return HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_load_state('networkidle')
        html = await page.content()
        await browser.close()
        return html
```
### Add Proxy Support

```python
async def _fetch_with_proxy(self, url: str, proxy: str) -> str:
    """Fetch through a proxy server."""
    async with self.session.get(url, proxy=proxy) as response:
        return await response.text()
```
### Add Progress Tracking

```python
from tqdm.asyncio import tqdm


async def scrape_all(self, targets: list[Target]) -> list:
    results = []
    async for target in tqdm(targets, desc="Scraping"):
        result = await self.scrape(target)
        results.append(result)
    return results
```
## Related Learning

- aiohttp Documentation
- BeautifulSoup Documentation
- asyncio Tutorial
- Web Scraping Best Practices
## Project Checklist

- [ ] Run the scraper with the example config
- [ ] Create a custom config for a site you want to scrape
- [ ] Try different CSS selectors
- [ ] Export to both JSON and CSV formats
- [ ] Implement caching and verify it works
- [ ] Add a new extraction function (e.g., for images)
- [ ] Write tests for your custom extractors
- [ ] Add robots.txt checking to the crawler
## Disclaimer
This tool is for educational purposes only. Always:

- Respect website terms of service
- Check and follow robots.txt
- Obtain permission when required
- Use scraped data ethically