## Overview
A flexible and efficient web scraping framework built with Python, designed to make web data extraction simple and maintainable. The framework handles common web scraping challenges like pagination, authentication, and rate limiting, while providing a clean API for data extraction and storage.
## Key Features

- Automatic pagination handling
- Built-in rate limiting and retry mechanisms (see the configuration sketch below)
- Authentication and session management
- Configurable data storage (SQLite, CSV, JSON)
- Proxy support and rotation
- Extensible spider architecture
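Several of these features are typically switched on through spider-level settings rather than code in `parse_page`. The exact configuration surface isn't documented in this README, so the sketch below uses illustrative attribute names (`rate_limit`, `max_retries`, `proxies`, `storage`) that are assumptions, not confirmed framework API:

```python
from web_scraper import Spider, Field

class ThrottledSpider(Spider):
    start_url = "https://bookstore.com/catalog"

    # Assumed settings: throttle to one request per second and retry
    # failed requests up to three times (illustrative names, not confirmed API)
    rate_limit = 1.0
    max_retries = 3

    # Assumed proxy pool rotated across requests (illustrative)
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    # Assumed storage backend selector; SQLite, CSV, and JSON are the
    # documented options, but this connection-string format is illustrative
    storage = "sqlite:///books.db"

    title = Field(css="h2.book-title")
```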
## Example Usage
```python
from web_scraper import Spider, Field

class BookSpider(Spider):
    start_url = "https://bookstore.com/catalog"

    # Define fields to extract
    title = Field(css="h2.book-title")
    author = Field(css="span.author")
    price = Field(css="div.price", type=float)

    def parse_page(self, response):
        # Extract all book items on the page
        books = response.css("div.book-item")
        for book in books:
            yield self.extract(book)

        # Handle pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield self.follow(next_page)

# Run the spider
spider = BookSpider()
results = spider.run()
```
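What `run()` returns isn't specified above; assuming it yields the extracted items as dicts, the results can be persisted with the standard library:

```python
import json

# Assumes each item looks like {"title": ..., "author": ..., "price": ...}
with open("books.json", "w", encoding="utf-8") as f:
    json.dump(list(results), f, ensure_ascii=False, indent=2)
```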
## Installation

```bash
pip install web-scraping-framework
```

For development setup:

```bash
git clone https://github.com/spideynolove/web-scraping-framework.git
cd web-scraping-framework
pip install -r requirements.txt
```
## Documentation
Comprehensive documentation is available at web-scraping-framework-docs.netlify.app, including:
- Getting started guide
- API reference
- Advanced usage examples
- Best practices
- Troubleshooting guide