## Overview
A flexible and efficient web scraping framework built with Python, designed to make web data extraction simple and maintainable. The framework handles common web scraping challenges like pagination, authentication, and rate limiting, while providing a clean API for data extraction and storage.
## Key Features

- Automatic pagination handling
- Built-in rate limiting and retry mechanisms (see the configuration sketch below)
- Authentication and session management
- Configurable data storage (SQLite, CSV, JSON)
- Proxy support and rotation
- Extensible spider architecture
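Several of these features are typically switched on through spider-level settings rather than code in `parse_page`. The exact configuration surface isn't documented in this README, so the sketch below uses illustrative attribute names (`rate_limit`, `max_retries`, `proxies`, `storage`) that are assumptions, not confirmed framework API:

```python
from web_scraper import Spider, Field

class ThrottledSpider(Spider):
    start_url = "https://bookstore.com/catalog"

    # Assumed settings: throttle to one request per second and retry
    # failed requests up to three times (illustrative names, not confirmed API)
    rate_limit = 1.0
    max_retries = 3

    # Assumed proxy pool rotated across requests (illustrative)
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    # Assumed storage backend selector; SQLite, CSV, and JSON are the
    # documented options, but this connection-string format is illustrative
    storage = "sqlite:///books.db"

    title = Field(css="h2.book-title")
```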
## Example Usage
```python
from web_scraper import Spider, Field

class BookSpider(Spider):
    start_url = "https://bookstore.com/catalog"

    # Define fields to extract
    title = Field(css="h2.book-title")
    author = Field(css="span.author")
    price = Field(css="div.price", type=float)

    def parse_page(self, response):
        # Extract all book items on the page
        books = response.css("div.book-item")
        for book in books:
            yield self.extract(book)

        # Handle pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield self.follow(next_page)

# Run the spider
spider = BookSpider()
results = spider.run()
```
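What `run()` returns isn't specified above; assuming it yields the extracted items as dicts, the results can be persisted with the standard library:

```python
import json

# Assumes each item looks like {"title": ..., "author": ..., "price": ...}
with open("books.json", "w", encoding="utf-8") as f:
    json.dump(list(results), f, ensure_ascii=False, indent=2)
```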
## Installation

```bash
pip install web-scraping-framework
```

For development setup:

```bash
git clone https://github.com/spideynolove/web-scraping-framework.git
cd web-scraping-framework
pip install -r requirements.txt
```
## Documentation
Comprehensive documentation is available at web-scraping-framework-docs.netlify.app, including:
- Getting started guide
- API reference
- Advanced usage examples
- Best practices
- Troubleshooting guide