AI-assisted code generation

When using LLMs to write Python code for web scraping, these are the most reasonable approaches to consider:

Plain Python functions or classes

Pros: Simple, dependency-free, and easy for LLMs to produce.

Cons: You must define your own conventions and testing practices; integration across teams and tools can be ad-hoc.

Use when you need a quick extractor or the logic is small and unlikely to be reused.

Scrapy spiders

Pros: Built-in crawling, request scheduling, retries and many utilities.

Cons: Large surface area for AI generation. Spiders mix crawling, error handling and extraction, which makes testing extraction in isolation difficult.

Avoid generating full spiders with an LLM; prefer generating extraction logic separately.

web-poet page objects

Pros: Small, standard contract for extraction with field-level decomposition, first-class testing support, and framework integration.

Cons: Requires adopting web-poet idioms and a small framework cost, which can be unnecessary for trivial scripts.

Use when you want maintainability, testability, and a predictable contract that can be used by tools and teams.

Note

scrapy-poet provides a great way to use web-poet page objects within Scrapy spiders, giving you the benefits of both approaches.