Clean content from any webpage
Extract article content, remove clutter, convert to markdown. Perfect for RAG pipelines, knowledge bases, and AI applications.
import { Stack0 } from '@stack0/sdk'

const stack0 = new Stack0({ apiKey: process.env.STACK0_API_KEY })

// Extract clean content from a webpage
export async function extractContent(url: string) {
  const { extractedData } = await stack0.extraction.extractAndWait({
    url,
    mode: 'content',
    options: {
      format: 'markdown',
      includeImages: true,
      includeLinks: true,
      removeNavigation: true,
      removeAds: true,
    },
  })

  return {
    title: extractedData.title,
    content: extractedData.markdown,
    wordCount: extractedData.wordCount,
    images: extractedData.images,
  }
}

// Batch extract for a knowledge base
export async function buildKnowledgeBase(urls: string[]) {
  const { results } = await stack0.extraction.batchAndWait({
    urls,
    config: {
      mode: 'content',
      options: { format: 'markdown' },
    },
  })

  // Chunk for a vector database.
  // Note: chunkMarkdown is not defined in this snippet; supply your own chunking helper.
  return results.flatMap(r =>
    chunkMarkdown(r.extractedData.markdown, 500).map(chunk => ({
      source: r.url,
      content: chunk,
    }))
  )
}
Features
What's included
AI-Powered
Semantic understanding identifies main content across any site design.
Markdown Output
Clean, structured markdown. Preserves headings, lists, and code blocks.
Content Focus
Main article content extracted. Navigation, ads, footers removed.
Batch Processing
Extract from many pages at once. Build knowledge bases fast.
RAG-Ready
Clean markdown output that chunks easily for vector databases and retrieval systems.
Fast Extraction
Pages processed in seconds. Async API for large batches.
Why Stack0
Built for production
Clean extraction
AI identifies main content. Navigation, ads, and sidebars removed automatically.
Markdown output
Structured markdown preserves headings, lists, and formatting. Ready for LLMs.
RAG-ready
Clean content splits on headings or paragraphs into chunks for vector databases. Build knowledge bases fast.
Batch processing
Extract up to 100 URLs per batch request, and submit multiple batches for larger datasets. Build datasets efficiently.
Preserve structure
Headings, lists, code blocks, and images maintained in output.
Simple pricing
$2 per 1,000 extractions. No per-page or per-word fees.
Applications
Common implementations
RAG Pipelines
Extract content from documentation and articles for retrieval-augmented generation.
Knowledge Bases
Build searchable knowledge bases from web content.
AI Training Data
Extract clean text for fine-tuning and training datasets.
Content Archival
Archive webpage content in a clean, readable format.
FAQ
Frequently asked questions
Our AI identifies the main article content on each page, filtering out navigation, sidebars, ads, footers, and other non-content elements. The extracted text preserves headings, lists, code blocks, and images.
We recommend chunks of 300-500 words for most vector databases. The markdown output preserves structure, so you can split on headings or paragraphs. Include some overlap between chunks for better context.
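As a rough illustration of the chunking advice above, the following sketch splits markdown on blank lines and headings, packs the resulting blocks into chunks of up to about 500 words, and carries a few trailing words into the next chunk as overlap. The name `chunkMarkdown` mirrors the helper called in the sample at the top of this page, but the body, the `overlapWords` parameter, and its default of 50 are assumptions made for illustration, not Stack0's implementation.

```typescript
// Hypothetical helper (not part of the Stack0 SDK): word-based markdown chunking.
// Blocks are split on blank lines and before ATX headings ("# ", "## ", ...).
// Caveats: a single block longer than maxWords is kept whole, and fenced code
// blocks are not special-cased, so a "# comment" line inside code starts a new block.
function chunkMarkdown(markdown: string, maxWords = 500, overlapWords = 50): string[] {
  const toWords = (s: string): string[] => s.split(/\s+/).filter(Boolean)

  const blocks = markdown
    .split(/\n\s*\n|\n(?=#{1,6} )/)
    .map(block => block.trim())
    .filter(block => block.length > 0)

  const chunks: string[] = []
  let current: string[] = []
  let currentWords = 0

  for (const block of blocks) {
    const blockWords = toWords(block).length

    // Flush the current chunk if adding this block would exceed the budget.
    if (currentWords > 0 && currentWords + blockWords > maxWords) {
      const text = current.join('\n\n')
      chunks.push(text)

      // Seed the next chunk with the last few words of the previous one.
      // The overlap is flattened to a single line of plain words.
      const tail = overlapWords > 0 ? toWords(text).slice(-overlapWords) : []
      current = tail.length > 0 ? [tail.join(' ')] : []
      currentWords = tail.length
    }

    current.push(block)
    currentWords += blockWords
  }

  if (current.length > 0) chunks.push(current.join('\n\n'))
  return chunks
}
```

Splitting on block boundaries rather than at a fixed word offset keeps headings attached to the text that follows them, at the cost of chunks that vary in size. Tune `maxWords` and `overlapWords` to your embedding model and retrieval setup.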
Yes. Our extraction service renders JavaScript before extracting content, so single-page applications and dynamically loaded content are fully supported.
You can extract up to 100 URLs per batch request. For larger datasets, submit multiple batch requests. Processing happens in parallel for fast results.
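For lists longer than the 100-URL limit described above, one simple approach is to split the list into batches client-side before submitting. The sketch below is plain TypeScript with no SDK calls; `splitIntoBatches` and `MAX_URLS_PER_BATCH` are hypothetical names chosen for this example.

```typescript
// Per-request URL limit as stated in the FAQ answer above.
const MAX_URLS_PER_BATCH = 100

// Hypothetical helper: split a list into consecutive batches of at most batchSize items.
function splitIntoBatches<T>(items: T[], batchSize: number = MAX_URLS_PER_BATCH): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }

  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    // slice clamps the end index, so the final batch may be shorter.
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}
```

Each batch can then be passed to `batchAndWait` as in the sample at the top of this page. Whether to submit batches one after another or concurrently depends on your plan's rate limits, which are not covered here.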
Our AI achieves 95%+ accuracy on standard article pages. For unusual layouts, you can provide hints about which content areas to target. Failed extractions don't count against your quota.
Ready to build?
Plans start at $5/month.