Clean content from any webpage

Extract article content, remove clutter, convert to markdown. Perfect for RAG pipelines, knowledge bases, and AI applications.

Markdown · RAG Ready · Batch Processing · AI Powered
import { Stack0 } from '@stack0/sdk'

const stack0 = new Stack0({ apiKey: process.env.STACK0_API_KEY })

// Extract clean content from a webpage
export async function extractContent(url: string) {
  const { extractedData } = await stack0.extraction.extractAndWait({
    url,
    mode: 'content',
    options: {
      format: 'markdown',
      includeImages: true,
      includeLinks: true,
      removeNavigation: true,
      removeAds: true,
    },
  })

  return {
    title: extractedData.title,
    content: extractedData.markdown,
    wordCount: extractedData.wordCount,
    images: extractedData.images,
  }
}

// Batch extract for a knowledge base
export async function buildKnowledgeBase(urls: string[]) {
  const { results } = await stack0.extraction.batchAndWait({
    urls,
    config: {
      mode: 'content',
      options: { format: 'markdown' },
    },
  })

  // Chunk for a vector database.
  // chunkMarkdown is not defined in this snippet; supply your own helper.
  return results.flatMap(r =>
    chunkMarkdown(r.extractedData.markdown, 500)
      .map(chunk => ({
        source: r.url,
        content: chunk,
      }))
  )
}
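
The batch example calls `chunkMarkdown` without defining it. One possible shape for that helper is sketched below. It is an assumption, not part of the Stack0 SDK: it treats the second argument as a word budget and packs whole paragraphs into each chunk. A single paragraph longer than the budget becomes its own oversized chunk, and fenced code blocks that contain blank lines would be split, so treat this as a starting point.

```typescript
// Hypothetical helper (not provided by the SDK): pack whole paragraphs
// into chunks of at most `maxWords` words each.
function chunkMarkdown(markdown: string, maxWords: number): string[] {
  // Split on blank lines so headings, paragraphs, and lists stay intact.
  const blocks = markdown
    .split(/\n\s*\n/)
    .map(block => block.trim())
    .filter(block => block.length > 0)

  const chunks: string[] = []
  let current: string[] = []
  let currentWords = 0

  for (const block of blocks) {
    const words = block.split(/\s+/).length
    // Close the current chunk if this block would push it over the budget.
    if (current.length > 0 && currentWords + words > maxWords) {
      chunks.push(current.join('\n\n'))
      current = []
      currentWords = 0
    }
    current.push(block)
    currentWords += words
  }

  if (current.length > 0) chunks.push(current.join('\n\n'))
  return chunks
}
```

Packing by paragraph rather than by raw word count keeps each chunk readable on its own, which tends to matter for retrieval quality.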

Features

What's included

AI-Powered

Semantic understanding identifies main content across any site design.

Markdown Output

Clean, structured markdown. Preserves headings, lists, and code blocks.

Content Focus

Main article content extracted. Navigation, ads, footers removed.

Batch Processing

Extract from many pages at once. Build knowledge bases fast.

RAG-Ready

Output splits cleanly into chunks for vector databases and retrieval systems.

Fast Extraction

Pages processed in seconds. Async API for large batches.

Why Stack0

Built for production

Clean extraction

AI identifies main content. Navigation, ads, and sidebars removed automatically.

Markdown output

Structured markdown preserves headings, lists, and formatting. Ready for LLMs.

RAG-ready

Clean content splits into well-formed chunks for vector databases. Build knowledge bases fast.

Batch processing

Extract from hundreds of pages at once. Build datasets efficiently.

Preserve structure

Headings, lists, code blocks, and images maintained in output.

Simple pricing

$2 per 1,000 extractions. No per-page or per-word fees.
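
As a rough worked example of the per-extraction price, the sketch below assumes cost scales linearly with the number of successful extractions and ignores any plan minimum or rounding, which this page does not specify. `estimateCostUsd` is an illustrative name, not an SDK function.

```typescript
// Price stated on this page: $2 per 1,000 extractions.
const PRICE_PER_1000_USD = 2

// Assumption: linear proration, no plan minimums, no rounding rules.
function estimateCostUsd(successfulExtractions: number): number {
  return (successfulExtractions / 1000) * PRICE_PER_1000_USD
}

// Under these assumptions, a 25,000-page knowledge base costs $50.
const knowledgeBaseCost = estimateCostUsd(25000)
```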

Applications

Common implementations

RAG Pipelines

Extract content from documentation and articles for retrieval-augmented generation.

Knowledge Bases

Build searchable knowledge bases from web content.

AI Training Data

Extract clean text for fine-tuning and training datasets.

Content Archival

Archive webpage content in a clean, readable format.
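
For the training-data use case above, extracted pages are often stored as JSON Lines. The sketch below assumes results shaped like those in the batch example on this page (`url` and `extractedData.markdown`). The `ExtractionResult` interface, the `toJsonl` name, and the `source`/`text` field names are illustrative choices, not SDK definitions.

```typescript
// Assumed result shape, mirroring the fields used in the batch example.
interface ExtractionResult {
  url: string
  extractedData: { markdown: string }
}

// Convert extraction results to JSON Lines, one record per page.
function toJsonl(results: ExtractionResult[]): string {
  return results
    // Skip pages where nothing useful was extracted.
    .filter(r => r.extractedData.markdown.trim().length > 0)
    .map(r => JSON.stringify({ source: r.url, text: r.extractedData.markdown }))
    .join('\n')
}
```

`JSON.stringify` escapes newlines inside the markdown, so each record stays on a single line.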

FAQ

Frequently asked questions

How does content extraction work?

Our AI identifies the main article content on each page, filtering out navigation, sidebars, ads, footers, and other non-content elements. The extracted text preserves headings, lists, code blocks, and images.

What chunk size should I use for RAG?

We recommend chunks of 300-500 words for most vector databases. The markdown output preserves structure, so you can split on headings or paragraphs. Include some overlap between chunks for better context.
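
The chunking advice above can be sketched as follows: split on headings first, keep sections that fit the budget intact, and apply an overlapping word window only to sections that are too long. All function names and the default values (400 words, 50 words of overlap) are illustrative assumptions. The heading check is a simple line-prefix test, so a `#` line inside a fenced code block would be misread as a heading.

```typescript
// Split markdown into sections, starting a new one at each ATX heading.
function splitOnHeadings(markdown: string): string[] {
  const sections: string[] = []
  let current: string[] = []
  for (const line of markdown.split('\n')) {
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join('\n').trim())
      current = []
    }
    current.push(line)
  }
  if (current.length > 0) sections.push(current.join('\n').trim())
  return sections.filter(s => s.length > 0)
}

// Sliding word window: each chunk repeats the last `overlap` words
// of the previous one. Note that this flattens whitespace.
function windowWords(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size')
  const words = text.split(/\s+/).filter(Boolean)
  const step = size - overlap
  const chunks: string[] = []
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '))
    // Stop once the window has reached the end of the text.
    if (start + size >= words.length) break
  }
  return chunks
}

// Keep short sections intact; window only the ones over budget.
function chunkSections(markdown: string, size = 400, overlap = 50): string[] {
  return splitOnHeadings(markdown).flatMap(section => {
    const wordCount = section.split(/\s+/).filter(Boolean).length
    return wordCount <= size ? [section] : windowWords(section, size, overlap)
  })
}
```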

Does it work with JavaScript-rendered pages?

Yes. Our extraction service renders JavaScript before extracting content, so single-page applications and dynamically loaded content are fully supported.

How many URLs can I extract per batch?

You can extract up to 100 URLs per batch request. For larger datasets, submit multiple batch requests. Processing happens in parallel for fast results.
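
For lists longer than the 100-URL limit, one approach is to split the list into batches before submitting. The helper below is a minimal sketch. `toBatches` and `MAX_URLS_PER_BATCH` are illustrative names, not part of the SDK.

```typescript
// Limit stated in the answer above.
const MAX_URLS_PER_BATCH = 100

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number = MAX_URLS_PER_BATCH): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('size must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}
```

Each batch can then be passed as `urls` to `stack0.extraction.batchAndWait`, as in the knowledge base example at the top of this page.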

How accurate is the extraction?

Our AI achieves 95%+ accuracy on standard article pages. For unusual layouts, you can provide hints about which content areas to target. Failed extractions don't count against your quota.

Ready to build?

Plans start at $5/month.
