Commit 33503cb

add docs and fingerprint headers
1 parent bd45915 commit 33503cb

9 files changed

Lines changed: 385 additions & 73 deletions

docs/guides/architecture_overview.mdx

Lines changed: 34 additions & 1 deletion
@@ -53,6 +53,8 @@ class PlaywrightCrawler
 
 class AdaptivePlaywrightCrawler
 
+class StagehandCrawler
+
 %% ========================
 %% Inheritance arrows
 %% ========================

@@ -63,6 +65,7 @@ BasicCrawler --|> AdaptivePlaywrightCrawler
 AbstractHttpCrawler --|> HttpCrawler
 AbstractHttpCrawler --|> ParselCrawler
 AbstractHttpCrawler --|> BeautifulSoupCrawler
+PlaywrightCrawler --|> StagehandCrawler
 ```
 
 ### HTTP crawlers

@@ -79,7 +82,19 @@ You can learn more about HTTP crawlers in the [HTTP crawlers guide](./http-crawl
 
 ### Browser crawlers
 
-Browser crawlers use a real browser to render pages, enabling scraping of sites that require JavaScript. They manage browser instances, pages, and context lifecycles. Currently, the only browser crawler is <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, which utilizes the [Playwright](https://playwright.dev/) library. Playwright provides a high-level API for controlling and navigating browsers. You can learn more about <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, its features, and how it internally manages browser instances in the [Playwright crawler guide](./playwright-crawler).
+Browser crawlers use a real browser to render pages, enabling scraping of sites that require
+JavaScript. They manage browser instances, pages, and context lifecycles. Crawlee provides
+two browser crawlers:
+
+- <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> utilizes the
+  [Playwright](https://playwright.dev/) library and provides a high-level API for controlling
+  and navigating browsers. You can learn more about it in the
+  [Playwright crawler guide](./playwright-crawler).
+- <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends
+  `PlaywrightCrawler` with AI-powered browser automation via
+  [Stagehand](https://github.com/browserbase/stagehand). It adds natural-language methods
+  (`act`, `extract`, `observe`, `execute`) directly on the page object. You can learn more
+  about it in the [Stagehand crawler guide](./stagehand-crawler).
 
 ### Adaptive crawler

@@ -122,6 +137,12 @@ class AdaptivePlaywrightPreNavCrawlingContext
 
 class AdaptivePlaywrightCrawlingContext
 
+class StagehandPreNavCrawlingContext
+
+class StagehandPostNavCrawlingContext
+
+class StagehandCrawlingContext
+
 %% ========================
 %% Inheritance arrows
 %% ========================

@@ -143,6 +164,12 @@ PlaywrightPreNavCrawlingContext --|> PlaywrightCrawlingContext
 BasicCrawlingContext --|> AdaptivePlaywrightPreNavCrawlingContext
 
 ParsedHttpCrawlingContext --|> AdaptivePlaywrightCrawlingContext
+
+PlaywrightPreNavCrawlingContext --|> StagehandPreNavCrawlingContext
+
+StagehandPreNavCrawlingContext --|> StagehandPostNavCrawlingContext
+
+StagehandPostNavCrawlingContext --|> StagehandCrawlingContext
 ```
 
 They have a similar inheritance structure as the crawlers, with the base class being <ApiLink to="class/BasicCrawlingContext">`BasicCrawlingContext`</ApiLink>. The specific crawling contexts are:

@@ -154,6 +181,12 @@ They have a similar inheritance structure as the crawlers, with the base class b
 - <ApiLink to="class/PlaywrightCrawlingContext">`PlaywrightCrawlingContext`</ApiLink> for Playwright crawlers.
 - <ApiLink to="class/AdaptivePlaywrightPreNavCrawlingContext">`AdaptivePlaywrightPreNavCrawlingContext`</ApiLink> for Adaptive Playwright crawlers before the page is navigated.
 - <ApiLink to="class/AdaptivePlaywrightCrawlingContext">`AdaptivePlaywrightCrawlingContext`</ApiLink> for Adaptive Playwright crawlers.
+- <ApiLink to="class/StagehandPreNavCrawlingContext">`StagehandPreNavCrawlingContext`</ApiLink>
+  for Stagehand crawlers before the page is navigated.
+- <ApiLink to="class/StagehandPostNavCrawlingContext">`StagehandPostNavCrawlingContext`</ApiLink>
+  for Stagehand crawlers after the page is navigated.
+- <ApiLink to="class/StagehandCrawlingContext">`StagehandCrawlingContext`</ApiLink>
+  for Stagehand crawlers.
 
 ## Storages
 
docs/guides/code_examples/stagehand_crawler/basic_example.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
import asyncio
from typing import cast

from crawlee.browsers import StagehandOptions
from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext


async def main() -> None:
    crawler = StagehandCrawler(
        stagehand_options=StagehandOptions(
            model_api_key='your-openai-api-key',
            model='openai/gpt-4.1-mini',
        ),
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: StagehandCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Dismiss overlays or interact with the page using natural language.
        await context.page.act(instruction='Click the accept cookies button if present')

        # Extract data from the page using AI.
        extracted = await context.page.extract(
            instruction='Get the page title and the main heading text',
            schema={
                'type': 'object',
                'properties': {
                    'title': {'type': 'string'},
                    'heading': {'type': 'string'},
                },
            },
        )

        extract_result = extracted.data.result

        if isinstance(extract_result, dict):
            # Push extracted data to the dataset.
            # Use `cast()` to provide a more specific type hint for the extracted data.
            await context.push_data(cast('dict[str, str | None]', extract_result))

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
docs/guides/code_examples/stagehand_crawler/browserbase_example.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
import asyncio
from typing import cast

from crawlee.browsers import StagehandOptions
from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext


async def main() -> None:
    # Use Browserbase cloud browser instead of a local Chromium instance.
    crawler = StagehandCrawler(
        stagehand_options=StagehandOptions(
            env='BROWSERBASE',
            browserbase_api_key='your-browserbase-api-key',
            project_id='your-project-id',
            model_api_key='your-openai-api-key',
            model='openai/gpt-4.1-mini',
        ),
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: StagehandCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        extracted = await context.page.extract(
            instruction='Get the main content of the page',
        )

        extract_result = extracted.data.result

        await context.push_data(cast('dict[str, str | None]', extract_result))

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())

docs/guides/stagehand_crawler.mdx

Lines changed: 124 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,124 @@
1+
---
2+
id: stagehand-crawler
3+
title: Stagehand crawler
4+
description: Learn how to use StagehandCrawler for AI-powered browser automation and data extraction.
5+
---
6+
7+
import ApiLink from '@site/src/components/ApiLink';
8+
import CodeBlock from '@theme/CodeBlock';
9+
10+
import BasicExample from '!!raw-loader!./code_examples/stagehand_crawler/basic_example.py';
11+
import BrowserbaseExample from '!!raw-loader!./code_examples/stagehand_crawler/browserbase_example.py';
12+
13+
A <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> with AI-powered browser automation via [Stagehand](https://www.browserbase.com/stagehand). Instead of writing CSS selectors or XPath expressions, you describe what you want in plain English and the AI model takes care of the rest.
14+
15+
Each page in the crawling context is a <ApiLink to="class/StagehandPage">`StagehandPage`</ApiLink> - a drop-in replacement for the standard Playwright `Page` that adds four AI methods:
16+
17+
- `page.act(**kwargs)` - perform an action using a natural language instruction
18+
- `page.extract(**kwargs)` - extract structured data from the page using AI
19+
- `page.observe(**kwargs)` - get a list of AI-suggested actions available on the page
20+
- `page.execute(**kwargs)` - run an autonomous multi-step agent on the page
21+
22+
All standard Playwright methods remain available alongside these AI methods.
23+
24+
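For example, a single request handler can mix both freely. A minimal sketch (the handler registration mirrors the basic example below; the instruction text is illustrative):

```python
@crawler.router.default_handler
async def handler(context: StagehandCrawlingContext) -> None:
    # Standard Playwright API: read the document title directly.
    title = await context.page.title()
    context.log.info(f'Title: {title}')

    # Stagehand AI method on the same page object.
    await context.page.act(instruction='Close the newsletter popup if it appears')
```
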
## When to use StagehandCrawler

Use <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> when:

- **Selectors are brittle or unknown** - the AI can locate elements by their visual role or label rather than a specific CSS class.
- **Interactions are complex** - multi-step forms, dynamic menus, or context-dependent flows that are hard to script.
- **Rapid prototyping** - you want to build a scraper quickly without spending time reverse-engineering the page structure.

For straightforward scraping tasks where the page structure is stable and well-known, <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> is more efficient; you can read more in the [Playwright crawler guide](./playwright-crawler).

## Installation

`StagehandCrawler` requires the `stagehand` optional dependency group:

```bash
pip install 'crawlee[stagehand]'
```

or with uv:

```bash
uv add 'crawlee[stagehand]'
```

## Basic usage

The example below demonstrates the typical usage pattern: dismiss cookie banners with `act()` and extract structured data with `extract()`.

<CodeBlock className="language-python">
  {BasicExample}
</CodeBlock>

## StagehandOptions configuration

Stagehand-specific settings are provided via <ApiLink to="class/StagehandOptions">`StagehandOptions`</ApiLink>. Pass the instance to the `stagehand_options` argument of <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink>.

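A minimal sketch, using only the options that appear in this guide's examples (`model_api_key` and `model` for a local run):

```python
from crawlee.browsers import StagehandOptions
from crawlee.crawlers import StagehandCrawler

# Configure the AI model, then hand the options to the crawler.
crawler = StagehandCrawler(
    stagehand_options=StagehandOptions(
        model_api_key='your-openai-api-key',
        model='openai/gpt-4.1-mini',
    ),
)
```
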
## AI page operations

### `act` - perform actions

Use `act()` to interact with the page using a natural language instruction:

```python
await context.page.act(instruction='Click the "Sign in" button')
```

### `extract` - structured data extraction

Use `extract()` to pull structured data from the page. Pass a JSON Schema via the `schema` parameter to enforce the output shape:

```python
data = await context.page.extract(
    instruction='Extract the top comment on this page',
    schema={
        'type': 'object',
        'properties': {
            'comment_text': {'type': 'string'},
            'author': {'type': 'string'},
        },
        'required': ['comment_text'],
    },
)
```

### `observe` - inspect available actions

Use `observe()` to get AI-suggested actions currently available on the page. Useful for debugging or building adaptive workflows:

```python
actions = await context.page.observe(
    instruction='What actions are available in the navigation menu?'
)
```
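
The shape of the returned value isn't shown in this guide; as a hypothetical follow-up, assuming `observe()` wraps its results in the same `.data.result` envelope that `extract()` uses in the examples above:

```python
# Assumption: observe() results use the same envelope as extract() results here.
suggested = actions.data.result
context.log.info(f'AI-suggested actions: {suggested}')
```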

### `execute` - autonomous multi-step agent

Use `execute()` for longer autonomous tasks that span multiple interactions:

```python
result = await context.page.execute(
    instruction='Search for "web scraping" and return the titles of the first five results',
)
```

## Browserbase integration

By default, Stagehand launches a local Chromium browser. To use [Browserbase](https://www.browserbase.com/) - a managed cloud browser service - set `env='BROWSERBASE'` in <ApiLink to="class/StagehandOptions">`StagehandOptions`</ApiLink> and supply the required credentials:

<CodeBlock className="language-python">
  {BrowserbaseExample}
</CodeBlock>

Browserbase credentials (`browserbase_api_key`, `project_id`) can also be provided via the `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID` environment variables.
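
For example, in your shell (placeholder values):

```bash
export BROWSERBASE_API_KEY='your-browserbase-api-key'
export BROWSERBASE_PROJECT_ID='your-project-id'
```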

## Browser configuration limitations

Because Stagehand manages the browser session internally via CDP, only Chromium is supported. Browser settings are limited to the subset accepted by Stagehand's `BrowserLaunchOptions` - `headless`, `args`, `viewport`, `proxy`, `locale`, `executable_path`, and a few others. Features like fingerprint generation and incognito pages are not supported.
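
As an illustrative, hypothetical sketch of that subset (the setting names come from the list above; how they are wired into the crawler is an assumption to verify against the API reference, not confirmed here):

```python
# Hypothetical values for the supported launch settings listed above.
# The exact parameter path into StagehandCrawler is an assumption, not confirmed API.
browser_launch_settings = {
    'headless': True,                            # run without a visible window
    'args': ['--disable-dev-shm-usage'],         # extra Chromium flags
    'viewport': {'width': 1280, 'height': 720},  # page viewport size
    'locale': 'en-US',                           # browser locale
}
```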

## Conclusion

This guide introduced <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> and its AI page operations: `act()`, `extract()`, `observe()`, and `execute()`. You learned how to configure Stagehand via <ApiLink to="class/StagehandOptions">`StagehandOptions`</ApiLink> and switch to Browserbase for cloud browser execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -80,6 +80,8 @@ sql_postgres = [
 stagehand = [
     "stagehand>=3.19.0",
     "playwright>=1.27.0",
+    "apify_fingerprint_datapoints>=0.0.2",
+    "browserforge>=1.2.3",
 ]
 sql_sqlite = [
     "sqlalchemy[asyncio]>=2.0.0,<3.0.0",
