Web Scraping to AI Knowledge Base
Not all knowledge lives in SaaS tools. Competitor docs, industry publications, internal wikis behind HTTP auth, documentation sites -- the web scraper integration lets you crawl any website and import its content into REM Labs as searchable AI memory.
What Gets Extracted
The web scraper processes HTML pages into clean, structured content:
- Main content -- article body extracted with boilerplate removal (navigation, footers, sidebars stripped automatically)
- Headings and structure -- H1-H6 hierarchy preserved for context during retrieval
- Links -- internal links between pages become knowledge graph edges for relationship traversal
- Metadata -- page title, URL, meta description, and Open Graph tags stored as searchable fields
- Tables and lists -- structured data in HTML tables and lists preserved with formatting
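To make the extraction concrete, here is a sketch of what one scraped page might look like as a stored memory record. The actual schema is not documented here, so every field name below is an assumption chosen to mirror the bullet list above:

```python
# Illustrative shape of one scraped page as a memory record.
# All field names are assumptions based on the extraction list above.
page_record = {
    "title": "Throttling and Quota Management",          # metadata
    "url": "https://competitor.example/docs/throttling",
    "meta_description": "How request quotas are enforced.",
    "headings": ["Throttling and Quota Management", "Quota tiers"],  # H1-H6
    "links": ["https://competitor.example/docs/api-gateway"],  # graph edges
    "content": "Clean article body with navigation, footers, "
               "and sidebars stripped.",
}
print(page_record["title"])
```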
Single Page Import
Import a single URL as a memory. Useful for adding specific articles, blog posts, or documentation pages.
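A minimal sketch of a single-page import using Python's standard library. The endpoint path comes from the API reference in this article; the base URL, bearer-token auth, and the "url" body field are assumptions for illustration:

```python
import json
import urllib.request

API_BASE = "https://api.remlabs.example/v1"  # placeholder; use your instance

def build_import_request(page_url: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) POST /v1/memory/sync/web/import."""
    body = json.dumps({"url": page_url}).encode()
    return urllib.request.Request(
        f"{API_BASE}/memory/sync/web/import",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_import_request("https://example.com/docs/getting-started",
                           "YOUR_API_KEY")
# urllib.request.urlopen(req) would perform the import; shown unsent here.
```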
Site Crawl
Crawl an entire site or subdomain. The crawler follows internal links up to a configurable depth, respects robots.txt, and deduplicates pages automatically.
The url_pattern keeps the crawler scoped to a specific path prefix. Without it, the crawler would follow all internal links on the domain. The max_depth parameter limits how many link hops from the start URL the crawler will traverse.
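A crawl request body might look like the following. The url_pattern and max_depth parameters are named above; the start_url field name and the glob-style pattern syntax are assumptions for illustration:

```python
import json

# Body for POST /v1/memory/sync/web/crawl. Field names other than
# url_pattern and max_depth are illustrative assumptions.
crawl_request = {
    "start_url": "https://competitor.example/docs/",
    "url_pattern": "https://competitor.example/docs/*",  # scope to path prefix
    "max_depth": 3,  # follow links at most 3 hops from start_url
}
print(json.dumps(crawl_request, indent=2))
```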
Sitemap Import
For sites with a sitemap.xml, you can import all listed URLs directly without crawling. This is faster and more predictable than link-following.
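A sitemap import is a single request pointing at the sitemap file. The endpoint path is from the API reference in this article; the sitemap_url field name is an assumption:

```python
import json

# Body for POST /v1/memory/sync/web/sitemap; the field name is assumed.
sitemap_request = {"sitemap_url": "https://example.com/sitemap.xml"}
print(json.dumps(sitemap_request))
```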
Search Scraped Content
Semantic search matches on meaning, not keywords: a query about rate limiting surfaces the competitor's page on "Throttling and Quota Management" and their blog post about "API Gateway Configuration" -- neither uses the phrase "rate limiting", but both are directly relevant.
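Such a search targets the /v1/integrations/web/search endpoint from the API reference in this article. The query and limit field names are assumptions for illustration:

```python
import json

# Body for POST /v1/integrations/web/search. "query" and "limit" are
# assumed field names; results are ranked by semantic similarity.
search_request = {"query": "rate limiting best practices", "limit": 5}
print(json.dumps(search_request))
```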
Scheduled Re-Crawl
Configure a schedule to re-crawl sites periodically. This keeps your knowledge base current as documentation sites update. Pages are keyed by URL, so unchanged pages are skipped and only changed content is re-indexed.
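One way this could be expressed is a schedule field on the crawl request body. This is purely hypothetical -- the article does not document a scheduling endpoint or field, so treat the shape below as a sketch and consult the web scraper docs for the real mechanism:

```python
# Hypothetical: a "schedule" field on the crawl request body.
# The real scheduling mechanism may differ.
scheduled_crawl = {
    "start_url": "https://competitor.example/docs/",
    "schedule": "0 3 * * 1",  # cron syntax: every Monday at 03:00
}
print(scheduled_crawl["schedule"])
```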
API Endpoints Reference
| Endpoint | Method | Description |
|---|---|---|
| /v1/memory/sync/web/import | POST | Import a single URL |
| /v1/memory/sync/web/crawl | POST | Crawl a site following internal links |
| /v1/memory/sync/web/sitemap | POST | Import all URLs from a sitemap.xml |
| /v1/integrations/web/status | GET | Check crawl status and page count |
| /v1/integrations/web/search | POST | Search within the web namespace |
Respectful crawling: The scraper respects robots.txt directives, enforces a configurable crawl delay (default 1 second between requests), and identifies itself with a REM-Labs-Bot user agent. For authenticated sites, pass HTTP headers via the headers parameter. See Web scraper docs.
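For an authenticated site, the headers parameter mentioned above carries the HTTP headers the crawler should send. A sketch, assuming the same crawl body shape as earlier (only the headers field name itself is documented here):

```python
# Crawl an internal wiki behind HTTP auth by passing request headers
# via the documented "headers" parameter (other field names assumed).
authed_crawl = {
    "start_url": "https://wiki.internal.example/",
    "headers": {"Authorization": "Basic dXNlcjpwYXNz"},  # "user:pass", base64
}
print(sorted(authed_crawl["headers"]))
```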
Turn any website into AI memory
Free tier. Crawl, import, and search web content semantically.
Get started free →