Built-in framework
web_poet.framework is a built-in web-poet framework
for simple use cases.
It is designed to be easy to use for quick proofs of concept, simple scripts, and generating test fixtures. It can also serve as a reference implementation for framework authors.
Limitations
The main limitation of the built-in framework is that it is not a complete scraping framework like Scrapy, which can support web-poet thanks to scrapy-poet.
As a web-poet framework, the built-in framework also lacks support for
custom input classes, Retry
and UseFallback.
Also, browser inputs only support plain GET
requests. Requests with a non-GET method, headers or a body raise
HttpRequestError.
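To make the plain-GET restriction concrete, the rule can be sketched as a small predicate. This only illustrates the constraint described above and is not the framework's actual check; `is_plain_get` and its parameters are hypothetical names that do not exist in web-poet.

```python
from typing import Mapping, Optional


def is_plain_get(
    method: str = "GET",
    headers: Optional[Mapping[str, str]] = None,
    body: Optional[bytes] = None,
) -> bool:
    """Return True only for a GET request with no headers and no body.

    Hypothetical helper that mirrors the documented restriction on
    browser inputs; it is not part of web-poet.
    """
    return method.upper() == "GET" and not headers and not body


# A bare GET is the only request shape that browser inputs accept.
plain = is_plain_get("GET")

# Per the limitation above, each of these request shapes is rejected.
with_method = is_plain_get("POST")
with_headers = is_plain_get("GET", headers={"Accept": "text/html"})
with_body = is_plain_get("GET", body=b"payload")
```

A check like this can be handy in your own code to decide up front whether a request can be served through a browser input.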
Installation
To use web_poet.framework, install the framework extra:
pip install web-poet[framework]
For browser support, you also need to install at least one browser with Playwright. For example, to install the main browsers:
playwright install
Basic use
from dataclasses import dataclass

from web_poet import WebPage, field
from web_poet.framework import Framework
from web_poet.utils import ensure_awaitable


@dataclass
class Book:
    title: str


class BookPage(WebPage[Book]):
    @field
    def title(self) -> str:
        return self.response.css("h1::text").get()


framework = Framework()
item = await framework.get_item("https://books.example.com/book/1", BookPage)

# Or, if you prefer, get a page object instance first.
page = await framework.get_page("https://books.example.com/book/1", BookPage)
item = await ensure_awaitable(page.to_item())
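The snippets on this page use top-level `await`, which works in an asyncio-aware REPL such as `python -m asyncio`. In a regular script, the calls need to run inside a coroutine started with `asyncio.run()`. A minimal sketch of that wrapping follows; `fetch_item` is a hypothetical stand-in for `framework.get_item(...)` so that the example runs without network access.

```python
import asyncio


async def fetch_item(url: str) -> dict:
    # Hypothetical stand-in for `await framework.get_item(url, BookPage)`.
    await asyncio.sleep(0)
    return {"url": url, "title": "Example title"}


async def main() -> dict:
    # In real code, create the Framework here and await its methods.
    return await fetch_item("https://books.example.com/book/1")


# asyncio.run() starts an event loop, runs main() to completion, and
# returns its result.
item = asyncio.run(main())
```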
Choosing a page object class automatically
If you decorate your page object classes with handle_urls() and
make sure they are imported, e.g. with consume_modules(), you
can pass get_item() an item class, and let
it determine which page object class to use:
from dataclasses import dataclass

from web_poet import WebPage, field, handle_urls
from web_poet.framework import Framework


@dataclass
class Book:
    title: str


@handle_urls("books.example.com")
class BookPage(WebPage[Book]):
    @field
    def title(self) -> str:
        return self.response.css("h1::text").get()


framework = Framework()
item = await framework.get_item("https://books.example.com/book/1", Book)
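Conceptually, the lookup maps a URL and an item class to a page object class. The toy registry below sketches that idea with the standard library only. It is a simplification under stated assumptions: `RULES`, `register` and `resolve` are hypothetical names, the classes are empty stand-ins, and web-poet's actual rule matching is richer than an exact hostname lookup.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse


class Book:
    """Stand-in item class."""


class BookPage:
    """Stand-in page object class that would produce Book items."""


# (hostname, item class) -> page object class. Hypothetical toy registry.
RULES: Dict[Tuple[str, type], type] = {}


def register(hostname: str, item_cls: type, page_cls: type) -> None:
    RULES[(hostname, item_cls)] = page_cls


def resolve(url: str, item_cls: type) -> Optional[type]:
    # Look up the page object class registered for this URL's hostname.
    hostname = urlparse(url).hostname or ""
    return RULES.get((hostname, item_cls))


register("books.example.com", Book, BookPage)

matched = resolve("https://books.example.com/book/1", Book)
unmatched = resolve("https://other.example.com/book/1", Book)
```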
Browser
The built-in framework can use Playwright to resolve browser dependencies
like BrowserHtml or
BrowserResponse.
Chromium is used by default. You can override that by passing
default_playwright_engine to Framework. Page
objects can also annotate their Playwright engine dependencies with
playwright_engine() to specify which engine they
require. For example:
from typing import Annotated

import attrs
from web_poet import WebPage, Item
from web_poet.framework import playwright_engine
from web_poet.page_inputs.browser import BrowserResponse


@attrs.define
class MyPageObject(WebPage[Item]):
    response: Annotated[BrowserResponse, playwright_engine("firefox")]
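For background, `typing.Annotated` attaches arbitrary metadata to a type without changing the type itself, and a framework can read that metadata back at run time. The sketch below shows the mechanism with the standard library only; `Engine` is a hypothetical marker standing in for whatever `playwright_engine()` returns, and the classes are stand-ins rather than the real web-poet ones.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_type_hints


@dataclass(frozen=True)
class Engine:
    # Hypothetical marker; not the object that playwright_engine() returns.
    name: str


class BrowserResponse:
    """Stand-in for the real input class."""


class MyPageObject:
    response: Annotated[BrowserResponse, Engine("firefox")]


# include_extras=True keeps the Annotated wrapper instead of stripping it.
hints = get_type_hints(MyPageObject, include_extras=True)

# get_args() on an Annotated type yields the base type, then the metadata.
base_type, *metadata = get_args(hints["response"])
```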
Stats
The built-in framework supports Stats.
By default, Framework creates a
DictStatCollector object, exposes it to
any page object that requests Stats, and
exposes that object as the stats
attribute of the framework:
from web_poet.framework import Framework
framework = Framework()
item1 = await framework.get_item("http://example.com/book/1", BookPage)
item2 = await framework.get_item("http://example.com/book/2", BookPage)
all_stats = framework.stats
Framework also supports passing a custom stats
collector:
from web_poet.page_inputs.stats import StatCollector
class MyStatCollector(StatCollector): ...
framework = Framework(stats=MyStatCollector())
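As an illustration of what a custom collector might do, the standalone class below keeps stats in a plain dictionary. It assumes a collector interface made of `set(key, value)` and `inc(key, value=1)` methods; that interface is an assumption here, so check the `StatCollector` base class before subclassing it. `InMemoryStats` is a hypothetical name and does not import web-poet.

```python
from typing import Any, Dict, Union


class InMemoryStats:
    """Hypothetical collector that stores stats in a dictionary."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        # Overwrite any previous value stored under this key.
        self.data[key] = value

    def inc(self, key: str, value: Union[int, float] = 1) -> None:
        # Treat a missing key as zero, then add the increment.
        self.data[key] = self.data.get(key, 0) + value


stats = InMemoryStats()
stats.inc("pages")
stats.inc("pages", 2)
stats.set("last_url", "http://example.com/book/2")
```

A collector along these lines could instead forward values to a logging or metrics backend, which is the kind of case a custom stats collector is useful for.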