
Commit 78404c2

docs
1 parent a0cb91d commit 78404c2

2 files changed: 2 additions & 6 deletions

docs/guides/request_loaders.mdx

Lines changed: 1 addition & 1 deletion
@@ -136,7 +136,7 @@ The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is
 The `SitemapRequestLoader` is designed specifically for sitemaps that follow the standard Sitemaps protocol. HTML pages containing links are not supported by this loader - those should be handled by regular crawlers using the `enqueue_links` functionality.
 :::
 
-The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
+The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. By default, the loader also keeps only URLs whose host matches their parent sitemap (`enqueue_strategy='same-hostname'`), matching the `enqueue_links` default. Pass `enqueue_strategy='all'` to disable this filter, or `'same-domain'` / `'same-origin'` for other scopes. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
 
 <RunnableCodeBlock className="language-python" language="python">
 {SitemapExample}
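The scoping rules named in the new paragraph ('same-hostname', 'same-domain', 'same-origin', 'all') can be sketched with the standard library alone. The `matches_strategy` function below is a hypothetical stand-in for illustration, not crawlee's actual `matches_enqueue_strategy`; in particular, the 'same-domain' branch naively takes the last two host labels, whereas a real implementation would consult the Public Suffix List.

```python
from urllib.parse import urlparse


def matches_strategy(strategy: str, *, target_url: str, origin_url: str) -> bool:
    """Sketch of the scoping rules: compare a candidate URL against its origin."""
    if strategy == 'all':
        return True
    target, origin = urlparse(target_url), urlparse(origin_url)
    if strategy == 'same-origin':
        # Scheme, host, and port must all match.
        return (target.scheme, target.hostname, target.port) == (
            origin.scheme,
            origin.hostname,
            origin.port,
        )
    if strategy == 'same-hostname':
        # Exact host match; sub.example.com does not match example.com.
        return target.hostname == origin.hostname
    if strategy == 'same-domain':
        # Naive registrable-domain check: keep the last two host labels.
        def base(host: str) -> str:
            return '.'.join(host.split('.')[-2:])

        return base(target.hostname or '') == base(origin.hostname or '')
    raise ValueError(f'Unknown strategy: {strategy}')
```

Under these rules, a sitemap entry on `cdn.example.com` passes 'same-domain' but fails 'same-hostname' against an origin of `example.com`.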

src/crawlee/_utils/robots.py

Lines changed: 1 addition & 5 deletions
@@ -91,11 +91,7 @@ def is_allowed(self, url: str, user_agent: str = '*') -> bool:
         return bool(self._robots.can_fetch(str(check_url), user_agent))
 
     def get_sitemaps(self) -> list[str]:
-        """Get the list of same-host sitemap URLs from the robots.txt file.
-
-        Sitemap entries pointing to a different host than the robots.txt file are filtered out, as required by the
-        robots.txt specification.
-        """
+        """Get the list of same-host sitemap URLs from the robots.txt file."""
         same_host_sitemaps: list[str] = []
         for sitemap_url in self._robots.sitemaps:
             if matches_enqueue_strategy('same-hostname', target_url=sitemap_url, origin_url=self._original_url):
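The behavior `get_sitemaps` keeps after this change, filtering robots.txt sitemap entries down to the same host, can be sketched as a standalone helper. `same_host_sitemaps` below is a hypothetical illustration built on `urllib.parse`, not the method's actual code.

```python
from urllib.parse import urlparse


def same_host_sitemaps(sitemap_urls: list[str], robots_url: str) -> list[str]:
    """Keep only sitemap URLs whose host matches the robots.txt URL's host."""
    robots_host = urlparse(robots_url).hostname
    return [url for url in sitemap_urls if urlparse(url).hostname == robots_host]
```

For example, given a robots.txt at `https://example.com/robots.txt`, a `Sitemap:` entry pointing at `https://cdn.other.com/sitemap.xml` would be dropped while `https://example.com/sitemap.xml` is kept.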
