| 1 | +--- |
| 2 | +id: stagehand-crawler |
| 3 | +title: Stagehand crawler |
| 4 | +description: Learn how to use StagehandCrawler for AI-powered browser automation and data extraction. |
| 5 | +--- |
| 6 | + |
| 7 | +import ApiLink from '@site/src/components/ApiLink'; |
| 8 | +import CodeBlock from '@theme/CodeBlock'; |
| 9 | + |
| 10 | +import BasicExample from '!!raw-loader!./code_examples/stagehand_crawler/basic_example.py'; |
| 11 | +import BrowserbaseExample from '!!raw-loader!./code_examples/stagehand_crawler/browserbase_example.py'; |
| 12 | + |
| 13 | +A <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> with AI-powered browser automation via [Stagehand](https://www.browserbase.com/stagehand). Instead of writing CSS selectors or XPath expressions, you describe what you want in plain English and the AI model takes care of the rest. |
| 14 | + |
| 15 | +Each page in the crawling context is a <ApiLink to="class/StagehandPage">`StagehandPage`</ApiLink> - a drop-in replacement for the standard Playwright `Page` that adds four AI methods: |
| 16 | + |
| 17 | +- `page.act(**kwargs)` - perform an action using a natural language instruction |
| 18 | +- `page.extract(**kwargs)` - extract structured data from the page using AI |
| 19 | +- `page.observe(**kwargs)` - get a list of AI-suggested actions available on the page |
| 20 | +- `page.execute(**kwargs)` - run an autonomous multi-step agent on the page |
| 21 | + |
| 22 | +All standard Playwright methods remain available alongside these AI methods. |
| 23 | + |
| 24 | +## When to use StagehandCrawler |
| 25 | + |
| 26 | +Use <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> when: |
| 27 | + |
| 28 | +- **Selectors are brittle or unknown** - the AI can locate elements by their visual role or label rather than a specific CSS class. |
| 29 | +- **Interactions are complex** - multi-step forms, dynamic menus, or context-dependent flows that are hard to script. |
| 30 | +- **Rapid prototyping** - you want to build a scraper quickly without spending time reverse-engineering the page structure. |
| 31 | + |
| 32 | +For straightforward scraping tasks where the page structure is stable and well-known, <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> is more efficient. See the [Playwright crawler guide](./playwright_crawler) for details. |
| 33 | + |
| 34 | +## Installation |
| 35 | + |
| 36 | +`StagehandCrawler` requires the `stagehand` optional dependency group: |
| 37 | + |
| 38 | +```bash |
| 39 | +pip install 'crawlee[stagehand]' |
| 40 | +``` |
| 41 | + |
| 42 | +or with uv: |
| 43 | + |
| 44 | +```bash |
| 45 | +uv add 'crawlee[stagehand]' |
| 46 | +``` |
| 47 | + |
| 48 | +## Basic usage |
| 49 | + |
| 50 | +The example below demonstrates the typical usage pattern: dismiss cookie banners with `act()` and extract structured data with `extract()`. |
| 51 | + |
| 52 | +<CodeBlock className="language-python"> |
| 53 | + {BasicExample} |
| 54 | +</CodeBlock> |
| 55 | + |
| 56 | +## StagehandOptions configuration |
| 57 | + |
| 58 | +Stagehand-specific settings are provided via <ApiLink to="class/StagehandOptions">`StagehandOptions`</ApiLink>. Pass the instance to the `stagehand_options` argument of <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink>. |
| 59 | + |
| 60 | +## AI page operations |
| 61 | + |
| 62 | +### `act` - perform actions |
| 63 | + |
| 64 | +Use `act()` to interact with the page using a natural language instruction: |
| 65 | + |
| 66 | +```python |
| 67 | +await context.page.act(instruction='Click the "Sign in" button') |
| 68 | +``` |
| 69 | + |
| 70 | +### `extract` - structured data extraction |
| 71 | + |
| 72 | +Use `extract()` to pull structured data from the page. Pass a JSON Schema via the `schema` argument to enforce the output shape: |
| 73 | + |
| 74 | +```python |
| 75 | +data = await context.page.extract( |
| 76 | +    instruction='Extract the top comment on this page', |
| 77 | +    schema={ |
| 78 | +        'type': 'object', |
| 79 | +        'properties': { |
| 80 | +            'comment_text': {'type': 'string'}, |
| 81 | +            'author': {'type': 'string'}, |
| 82 | +        }, |
| 83 | +        'required': ['comment_text'], |
| 84 | +    }, |
| 85 | +) |
| 86 | +``` |
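Because the extracted values come from an AI model, it can be worth checking the result before storing it. The following is a minimal sketch using only the standard library. It assumes the result of `extract()` has already been converted to a plain `dict`, which this guide does not confirm; the helper `missing_required` and the sample results are hypothetical and written by hand for illustration.

```python
from typing import Any

# The same JSON Schema as in the `extract()` example above.
COMMENT_SCHEMA: dict[str, Any] = {
    'type': 'object',
    'properties': {
        'comment_text': {'type': 'string'},
        'author': {'type': 'string'},
    },
    'required': ['comment_text'],
}


def missing_required(data: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Hypothetical helper: list required top-level keys that are absent or empty."""
    required = schema.get('required', [])
    return [key for key in required if data.get(key) in (None, '')]


# Hand-written stand-ins for extraction results (assumption: plain dicts).
complete = {'comment_text': 'Great article!', 'author': 'alice'}
incomplete = {'author': 'bob'}

print(missing_required(complete, COMMENT_SCHEMA))
print(missing_required(incomplete, COMMENT_SCHEMA))
```

This check only covers the top-level `required` list; it is not a full JSON Schema validator.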
| 87 | + |
| 88 | +### `observe` - inspect available actions |
| 89 | + |
| 90 | +Use `observe()` to get the AI-suggested actions currently available on the page. This is useful for debugging or for building adaptive workflows: |
| 91 | + |
| 92 | +```python |
| 93 | +actions = await context.page.observe( |
| 94 | +    instruction='What actions are available in the navigation menu?' |
| 95 | +) |
| 96 | +``` |
| 97 | + |
| 98 | +### `execute` - autonomous multi-step agent |
| 99 | + |
| 100 | +Use `execute()` for longer autonomous tasks that span multiple interactions: |
| 101 | + |
| 102 | +```python |
| 103 | +result = await context.page.execute( |
| 104 | +    instruction='Search for "web scraping" and return the titles of the first five results', |
| 105 | +) |
| 106 | +``` |
| 107 | + |
| 108 | +## Browserbase integration |
| 109 | + |
| 110 | +By default, Stagehand launches a local Chromium browser. To use [Browserbase](https://www.browserbase.com/) - a managed cloud browser service - set `env='BROWSERBASE'` in <ApiLink to="class/StagehandOptions">`StagehandOptions`</ApiLink> and supply the required credentials: |
| 111 | + |
| 112 | +<CodeBlock className="language-python"> |
| 113 | + {BrowserbaseExample} |
| 114 | +</CodeBlock> |
| 115 | + |
| 116 | +Browserbase credentials (`browserbase_api_key`, `project_id`) can also be provided via the `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID` environment variables. |
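When relying on environment variables, it may help to fail fast if one of them is unset before the crawler starts. The following is a minimal sketch using only the standard library; the helper `read_browserbase_env` is hypothetical and not part of Crawlee or Stagehand. Only the two variable names come from the text above.

```python
import os
from collections.abc import Mapping

# Variable names taken from the guide text above.
REQUIRED_VARS = ('BROWSERBASE_API_KEY', 'BROWSERBASE_PROJECT_ID')


def read_browserbase_env(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper: return the Browserbase variables, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f'Missing environment variables: {", ".join(missing)}')
    return {name: environ[name] for name in REQUIRED_VARS}


# Demonstration with an explicit mapping, so the real environment is not touched.
fake_env = {'BROWSERBASE_API_KEY': 'key-123', 'BROWSERBASE_PROJECT_ID': 'proj-456'}
credentials = read_browserbase_env(fake_env)
print(sorted(credentials))
```

Calling `read_browserbase_env()` with no argument reads from the real process environment instead.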
| 117 | + |
| 118 | +## Browser configuration limitations |
| 119 | + |
| 120 | +Because Stagehand manages the browser session internally via CDP, only Chromium is supported. Browser settings are limited to the subset accepted by Stagehand's `BrowserLaunchOptions` - `headless`, `args`, `viewport`, `proxy`, `locale`, `executable_path`, and a few others. Features like fingerprint generation and incognito pages are not supported. |
| 121 | + |
| 122 | +## Conclusion |
| 123 | + |
| 124 | +This guide introduced <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> and its AI page operations: `act()`, `extract()`, `observe()`, and `execute()`. You learned how to configure Stagehand via <ApiLink to="class/StagehandOptions">`StagehandOptions`</ApiLink> and switch to Browserbase for cloud browser execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! |