Web Scraping to AI Knowledge Base

Not all knowledge lives in SaaS tools. Competitor docs, industry publications, internal wikis behind HTTP auth, documentation sites -- the web scraper integration lets you crawl any website and import its content into REM Labs as searchable AI memory.

What Gets Extracted

The web scraper processes HTML pages into clean, structured content. There are three ways to bring pages in:

Single Page Import

Import a single URL as a memory. Useful for adding specific articles, blog posts, or documentation pages.

curl -X POST https://api.remlabs.ai/v1/memory/sync/web/import \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/api/authentication",
    "extract_links": true,
    "namespace": "competitor-docs"
  }'
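The same request can be built from any HTTP client. A minimal Python sketch, assuming only the endpoint shape and fields shown in the curl example above (the `import_page` helper name is ours):

```python
import json
import urllib.request

API_BASE = "https://api.remlabs.ai/v1"

def import_page(url, namespace, api_key, extract_links=True):
    """Build the single-page import request; the caller passes the
    returned Request object to urllib.request.urlopen to send it."""
    payload = {
        "url": url,
        "extract_links": extract_links,
        "namespace": namespace,
    }
    return urllib.request.Request(
        f"{API_BASE}/memory/sync/web/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Separating request construction from sending keeps the payload easy to inspect or log before it leaves your machine.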

Site Crawl

Crawl an entire site or subdomain. The crawler follows internal links up to a configurable depth, respects robots.txt, and deduplicates pages automatically.

curl -X POST https://api.remlabs.ai/v1/memory/sync/web/crawl \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "start_url": "https://docs.example.com/",
    "max_pages": 500,
    "max_depth": 3,
    "url_pattern": "https://docs.example.com/*",
    "respect_robots": true,
    "namespace": "example-docs"
  }'

The url_pattern keeps the crawler scoped to a specific path prefix. Without it, the crawler would follow all internal links on the domain. The max_depth parameter limits how many link hops from the start URL the crawler will traverse.
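The interplay of these parameters can be sketched as a breadth-first traversal. This is an illustration of the crawl semantics over an in-memory link graph, not the service's actual implementation; `crawl_order` and the `links` map are our own constructs:

```python
from collections import deque
from fnmatch import fnmatch

def crawl_order(start_url, links, url_pattern, max_depth, max_pages):
    """Breadth-first crawl sketch. `links` maps each URL to the internal
    links found on that page; pages are deduplicated by URL."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # max_depth bounds link hops from the start URL
        for link in links.get(url, []):
            # url_pattern scopes the crawl to a path prefix
            if link not in seen and fnmatch(link, url_pattern):
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

With `max_depth: 1`, only pages one hop from the start URL are visited; links pointing outside the `url_pattern` (for example to a blog subdomain) are never enqueued.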

Sitemap Import

For sites with a sitemap.xml, you can import all listed URLs directly without crawling. This is faster and more predictable than link-following.

curl -X POST https://api.remlabs.ai/v1/memory/sync/web/sitemap \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "sitemap_url": "https://docs.example.com/sitemap.xml",
    "namespace": "example-docs"
  }'
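What the sitemap importer consumes is just the list of `<loc>` entries in the XML. A minimal sketch of that extraction step, using the standard sitemap namespace (the `sitemap_urls` helper is ours, for illustration):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap.xml document -- the exact
    set of pages a sitemap import would fetch, with no link-following."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Because the URL set is declared up front, a sitemap import is deterministic: the page count is known before the first fetch, unlike a link-following crawl.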

Search Scraped Content

curl -X POST https://api.remlabs.ai/v1/integrations/web/search \
  -H "Authorization: Bearer sk-slop-..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "rate limiting best practices for REST APIs",
    "filters": {"namespace": "competitor-docs"},
    "limit": 10
  }'

Semantic search finds the competitor's page on "Throttling and Quota Management" and their blog post about "API Gateway Configuration" -- neither uses the phrase "rate limiting" but both are directly relevant to your query.

Scheduled Re-Crawl

Configure a schedule to re-crawl sites periodically. This keeps your knowledge base current as documentation sites update. Pages are deduplicated by URL, so unchanged pages are skipped and only pages whose content has changed are re-indexed.
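One common way to detect "only changed content" is to hash each page body and compare hashes keyed by URL on the next pass. A sketch of that change-detection idea (our illustration, not necessarily how the service implements it):

```python
import hashlib

def snapshot(pages):
    """Hash each page body, keyed by URL, so the next re-crawl can
    detect changes without storing the full content."""
    return {url: hashlib.sha256(body.encode()).hexdigest()
            for url, body in pages.items()}

def changed_pages(previous, current):
    """Return URLs that are new or whose content hash differs from the
    previous snapshot -- the only pages worth re-indexing."""
    return [url for url, body in current.items()
            if previous.get(url) != hashlib.sha256(body.encode()).hexdigest()]
```

Comparing hashes rather than bodies keeps the per-URL state tiny, which matters when a scheduled re-crawl covers hundreds of pages.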

API Endpoints Reference

Endpoint                     | Method | Description
/v1/memory/sync/web/import   | POST   | Import a single URL
/v1/memory/sync/web/crawl    | POST   | Crawl a site following internal links
/v1/memory/sync/web/sitemap  | POST   | Import all URLs from a sitemap.xml
/v1/integrations/web/status  | GET    | Check crawl status and page count
/v1/integrations/web/search  | POST   | Search within the web namespace

Respectful crawling: The scraper respects robots.txt directives, enforces a configurable crawl delay (default 1 second between requests), and identifies itself with a REM-Labs-Bot user agent. For authenticated sites, pass HTTP headers via the headers parameter. See Web scraper docs.
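The robots.txt check described above is exactly what Python's standard library `urllib.robotparser` does. A sketch of how a well-behaved crawler would gate each fetch, using the bot name from the docs (the `allowed` wrapper is ours):

```python
import urllib.robotparser

def allowed(robots_txt, url, agent="REM-Labs-Bot"):
    """Check whether `agent` may fetch `url` according to the given
    robots.txt content, per the Robots Exclusion Protocol."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

The same parser also exposes `crawl_delay(agent)` when the site declares a `Crawl-delay` directive, which a polite crawler would honor instead of its default delay.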

Turn any website into AI memory

Free tier. Crawl, import, and search web content semantically.

Get started free →