Web Scraping to AI Knowledge Base

Not all knowledge lives in SaaS tools. Competitor docs, industry publications, internal wikis behind HTTP auth, documentation sites -- the web scraper integration lets you crawl any website and import its content into REM Labs as searchable AI memory.

What Gets Extracted

The web scraper processes HTML pages into clean, structured content. There are three ways to bring pages in:

Single Page Import

Import a single URL as a memory. Useful for adding specific articles, blog posts, or documentation pages.

curl -X POST https://api.remlabs.ai/v1/memory/sync/web/import \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/api/authentication",
    "extract_links": true,
    "namespace": "competitor-docs"
  }'
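The same request can be built from any HTTP client. A minimal Python sketch, assuming only the endpoint shape and fields shown in the curl example above (the `import_page` helper name is ours):

```python
import json
import urllib.request

API_BASE = "https://api.remlabs.ai/v1"

def import_page(url, namespace, api_key, extract_links=True):
    """Build the single-page import request; the caller passes the
    returned Request object to urllib.request.urlopen to send it."""
    payload = {
        "url": url,
        "extract_links": extract_links,
        "namespace": namespace,
    }
    return urllib.request.Request(
        f"{API_BASE}/memory/sync/web/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Separating request construction from sending keeps the payload easy to inspect or log before it leaves your machine.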

Site Crawl

Crawl an entire site or subdomain. The crawler follows internal links up to a configurable depth, respects robots.txt, and deduplicates pages automatically.

curl -X POST https://api.remlabs.ai/v1/memory/sync/web/crawl \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "start_url": "https://docs.example.com/",
    "max_pages": 500,
    "max_depth": 3,
    "url_pattern": "https://docs.example.com/*",
    "respect_robots": true,
    "namespace": "example-docs"
  }'

The url_pattern keeps the crawler scoped to a specific path prefix. Without it, the crawler would follow all internal links on the domain. The max_depth parameter limits how many link hops from the start URL the crawler will traverse.
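The interplay of these parameters can be sketched as a breadth-first traversal. This is an illustration of the crawl semantics over an in-memory link graph, not the service's actual implementation; `crawl_order` and the `links` map are our own constructs:

```python
from collections import deque
from fnmatch import fnmatch

def crawl_order(start_url, links, url_pattern, max_depth, max_pages):
    """Breadth-first crawl sketch. `links` maps each URL to the internal
    links found on that page; pages are deduplicated by URL."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # max_depth bounds link hops from the start URL
        for link in links.get(url, []):
            # url_pattern scopes the crawl to a path prefix
            if link not in seen and fnmatch(link, url_pattern):
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

With `max_depth: 1`, only pages one hop from the start URL are visited; links pointing outside the `url_pattern` (for example to a blog subdomain) are never enqueued.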

Sitemap Import

For sites with a sitemap.xml, you can import all listed URLs directly without crawling. This is faster and more predictable than link-following.

curl -X POST https://api.remlabs.ai/v1/memory/sync/web/sitemap \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "sitemap_url": "https://docs.example.com/sitemap.xml",
    "namespace": "example-docs"
  }'
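What the sitemap importer consumes is just the list of `<loc>` entries in the XML. A minimal sketch of that extraction step, using the standard sitemap namespace (the `sitemap_urls` helper is ours, for illustration):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap.xml document -- the exact
    set of pages a sitemap import would fetch, with no link-following."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Because the URL set is declared up front, a sitemap import is deterministic: the page count is known before the first fetch, unlike a link-following crawl.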

Search Scraped Content

curl -X POST https://api.remlabs.ai/v1/integrations/web/search \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "rate limiting best practices for REST APIs",
    "filters": {"namespace": "competitor-docs"},
    "limit": 10
  }'

Semantic search finds the competitor's page on "Throttling and Quota Management" and their blog post about "API Gateway Configuration" -- neither uses the phrase "rate limiting" but both are directly relevant to your query.

Scheduled Re-Crawl

Configure a schedule to re-crawl sites periodically. This keeps your knowledge base current as documentation sites update. Pages are deduplicated by URL, so unchanged pages are skipped and only pages whose content has changed are re-indexed.
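One common way to detect "only changed content" is to hash each page body and compare hashes keyed by URL on the next pass. A sketch of that change-detection idea (our illustration, not necessarily how the service implements it):

```python
import hashlib

def snapshot(pages):
    """Hash each page body, keyed by URL, so the next re-crawl can
    detect changes without storing the full content."""
    return {url: hashlib.sha256(body.encode()).hexdigest()
            for url, body in pages.items()}

def changed_pages(previous, current):
    """Return URLs that are new or whose content hash differs from the
    previous snapshot -- the only pages worth re-indexing."""
    return [url for url, body in current.items()
            if previous.get(url) != hashlib.sha256(body.encode()).hexdigest()]
```

Comparing hashes rather than bodies keeps the per-URL state tiny, which matters when a scheduled re-crawl covers hundreds of pages.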

API Endpoints Reference

Endpoint                     | Method | Description
/v1/memory/sync/web/import   | POST   | Import a single URL
/v1/memory/sync/web/crawl    | POST   | Crawl a site following internal links
/v1/memory/sync/web/sitemap  | POST   | Import all URLs from a sitemap.xml
/v1/integrations/web/status  | GET    | Check crawl status and page count
/v1/integrations/web/search  | POST   | Search within the web namespace

Respectful crawling: The scraper respects robots.txt directives, enforces a configurable crawl delay (default 1 second between requests), and identifies itself with a REM-Labs-Bot user agent. For authenticated sites, pass HTTP headers via the headers parameter. See Web scraper docs.
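The robots.txt check described above is exactly what Python's standard library `urllib.robotparser` does. A sketch of how a well-behaved crawler would gate each fetch, using the bot name from the docs (the `allowed` wrapper is ours):

```python
import urllib.robotparser

def allowed(robots_txt, url, agent="REM-Labs-Bot"):
    """Check whether `agent` may fetch `url` according to the given
    robots.txt content, per the Robots Exclusion Protocol."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

The same parser also exposes `crawl_delay(agent)` when the site declares a `Crawl-delay` directive, which a polite crawler would honor instead of its default delay.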

Turn any website into AI memory

Free tier. Crawl, import, and search web content semantically.

Get started free →