Web Scraping Framework

Built with Python, Beautiful Soup, Requests, and SQLite.

Overview

A flexible and efficient web scraping framework built with Python, designed to make web data extraction simple and maintainable. The framework handles common web scraping challenges like pagination, authentication, and rate limiting, while providing a clean API for data extraction and storage.
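The rate-limiting idea mentioned above can be sketched as a simple throttle that enforces a minimum delay between requests. This is an illustrative stdlib-only sketch, not the framework's actual implementation; the class name and API here are hypothetical.

```python
import time

class RateLimiter:
    """Hypothetical sketch: enforce a minimum delay between requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval  # seconds to wait between calls
        self._last_call = 0.0             # timestamp of the previous call

    def wait(self) -> float:
        """Sleep if the last call was too recent; return seconds slept."""
        now = time.monotonic()
        elapsed = now - self._last_call
        slept = 0.0
        if elapsed < self.min_interval:
            slept = self.min_interval - elapsed
            time.sleep(slept)
        self._last_call = time.monotonic()
        return slept

# Usage: throttle a fetch loop (the fetch itself is omitted here)
limiter = RateLimiter(min_interval=0.1)
for _ in range(3):
    limiter.wait()
    # fetch_page(...) would go here
```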

Key Features

- Declarative data extraction via a Field-based spider API
- Automatic pagination handling
- Authentication support
- Built-in rate limiting
- SQLite-backed data storage
Example Usage

from web_scraper import Spider, Field

class BookSpider(Spider):
    start_url = "https://bookstore.com/catalog"
    
    # Define fields to extract
    title = Field(css="h2.book-title")
    author = Field(css="span.author")
    price = Field(css="div.price", type=float)
    
    def parse_page(self, response):
        # Extract all book items on the page
        books = response.css("div.book-item")
        for book in books:
            yield self.extract(book)
        
        # Handle pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield self.follow(next_page)

# Run the spider
spider = BookSpider()
results = spider.run()
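To make the declarative Field/Spider pattern above concrete, here is a stdlib-only sketch of how it could work internally. This is a hypothetical simplification: the real framework uses CSS selectors, but this version substitutes regular expressions so it runs with no dependencies.

```python
import re

class Field:
    """Hypothetical sketch: a regex pattern plus an optional type converter."""

    def __init__(self, pattern: str, type=str):
        self.pattern = re.compile(pattern, re.S)
        self.type = type

    def extract(self, html: str):
        # Return the first capture group, converted, or None if no match
        match = self.pattern.search(html)
        return self.type(match.group(1)) if match else None


class Spider:
    """Collect Field attributes declared on the subclass and extract each."""

    def extract(self, html: str) -> dict:
        fields = {
            name: value
            for name, value in vars(type(self)).items()
            if isinstance(value, Field)
        }
        return {name: field.extract(html) for name, field in fields.items()}


class BookSpider(Spider):
    title = Field(r'<h2 class="book-title">(.*?)</h2>')
    price = Field(r'<div class="price">(.*?)</div>', type=float)


snippet = '<h2 class="book-title">Dune</h2><div class="price">9.99</div>'
print(BookSpider().extract(snippet))
# → {'title': 'Dune', 'price': 9.99}
```

The key design idea is that field definitions live as class attributes, so a subclass reads like a schema while the base class supplies the extraction machinery.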

Installation

pip install web-scraping-framework

For development setup:

git clone https://github.com/spideynolove/web-scraping-framework.git
cd web-scraping-framework
pip install -r requirements.txt

Documentation

Comprehensive documentation is available at web-scraping-framework-docs.netlify.app.