Web Scraper MCP Server Recipe
An MCP server that fetches and extracts content from web pages — HTML parsing, text extraction, and metadata scraping for AI assistants.
---
title: "Web Scraper MCP Server Recipe"
description: "An MCP server that fetches and extracts content from web pages — HTML parsing, text extraction, and metadata scraping for AI assistants."
order: 5
keywords:
  - mcp web scraper
  - mcp-framework scraper
  - web scraping mcp
  - ai web content
date: "2026-04-01"
difficulty: "Intermediate"
prepTime: "5 min"
cookTime: "20 min"
serves: "Web research and content extraction via AI"
---
Prep Time
5 minutes — scaffold project and install cheerio.
Cook Time
20 minutes — implement fetch and extract tools.
Serves
Any AI assistant that needs to read web page content. Perfect for research, link summarization, and data extraction.
Ingredients
- Node.js 18+
- mcp-framework (3.3M+ downloads)
- cheerio — fast HTML parsing
- TypeScript
Instructions
Step 1: Scaffold and Install
npx mcp-framework create web-scraper
cd web-scraper
npm install cheerio
Recent versions of cheerio ship their own TypeScript types, so a separate @types/cheerio package is not needed.
Step 2: Create the Fetch Page Tool
Create src/tools/FetchPageTool.ts:
import { MCPTool } from "mcp-framework";
import { z } from "zod";
import * as cheerio from "cheerio";
class FetchPageTool extends MCPTool<z.infer<typeof inputSchema>> {
name = "fetch_page";
description = "Fetch a web page and extract its text content";
schema = {
url: {
type: z.string().url(),
description: "URL of the web page to fetch",
},
selector: {
type: z.string().optional(),
description: "CSS selector to extract specific content (default: body)",
},
};
async execute(input: z.infer<typeof inputSchema>): Promise<string> {
try {
const response = await fetch(input.url, {
headers: {
"User-Agent": "MCP-WebScraper/1.0 (mcp-framework)",
},
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const html = await response.text();
const $ = cheerio.load(html);
// Remove scripts, styles, and nav elements
$("script, style, nav, footer, header").remove();
const selector = input.selector || "body";
const text = $(selector).text().replace(/\s+/g, " ").trim();
return JSON.stringify({
url: input.url,
title: $("title").text().trim(),
content: text.slice(0, 5000),
truncated: text.length > 5000,
contentLength: text.length,
}, null, 2);
} catch (error) {
const msg = error instanceof Error ? error.message : "Unknown error";
return JSON.stringify({ error: msg });
}
}
}
const inputSchema = z.object({
url: z.string().url(),
selector: z.string().optional(),
});
export default FetchPageTool;
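The whitespace normalization and truncation in `execute` can be checked standalone, without a network call. A quick sketch of that transform (the sample string and the small limit are illustrative only; the tool itself uses 5,000):

```typescript
// Collapse all runs of whitespace into single spaces, as the tool does
// before truncating the extracted text.
const normalize = (raw: string): string => raw.replace(/\s+/g, " ").trim();

const messy = "  Heading\n\n   Some   body\ttext  ";
const clean = normalize(messy); // "Heading Some body text"

// The tool also reports whether truncation occurred; same logic,
// small limit here for illustration.
const LIMIT = 10;
const content = clean.slice(0, LIMIT);
const truncated = clean.length > LIMIT;
console.log({ content, truncated });
```

Reporting `truncated` and `contentLength` alongside the clipped content lets the AI decide whether to request a narrower CSS selector instead of the whole body.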
Step 3: Create the Extract Links Tool
Create src/tools/ExtractLinksTool.ts:
import { MCPTool } from "mcp-framework";
import { z } from "zod";
import * as cheerio from "cheerio";
class ExtractLinksTool extends MCPTool<z.infer<typeof inputSchema>> {
name = "extract_links";
description = "Extract all links from a web page";
schema = {
url: {
type: z.string().url(),
description: "URL of the web page",
},
filter: {
type: z.string().optional(),
description: "Optional regex filter for link URLs",
},
};
async execute(input: z.infer<typeof inputSchema>): Promise<string> {
try {
const response = await fetch(input.url, {
headers: { "User-Agent": "MCP-WebScraper/1.0 (mcp-framework)" },
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const html = await response.text();
const $ = cheerio.load(html);
const links: { text: string; href: string }[] = [];
$("a[href]").each((_, el) => {
const href = $(el).attr("href") || "";
const text = $(el).text().trim();
if (href && text) {
const absolute = href.startsWith("http")
? href
: new URL(href, input.url).toString();
links.push({ text, href: absolute });
}
});
let filtered = links;
if (input.filter) {
const regex = new RegExp(input.filter);
filtered = links.filter((l) => regex.test(l.href));
}
return JSON.stringify({
url: input.url,
totalLinks: filtered.length,
links: filtered.slice(0, 50),
}, null, 2);
} catch (error) {
const msg = error instanceof Error ? error.message : "Unknown error";
return JSON.stringify({ error: msg });
}
}
}
const inputSchema = z.object({
url: z.string().url(),
filter: z.string().optional(),
});
export default ExtractLinksTool;
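The relative-to-absolute conversion above relies on the WHATWG `URL` constructor's two-argument form, which resolves an href against a base URL. A minimal check of how it behaves (the example URLs are made up):

```typescript
// new URL(href, base) resolves relative hrefs against the page URL,
// mirroring the logic in ExtractLinksTool.
const base = "https://example.com/docs/page.html";

const resolve = (href: string): string =>
  href.startsWith("http") ? href : new URL(href, base).toString();

console.log(resolve("/about"));              // https://example.com/about
console.log(resolve("section2.html"));       // https://example.com/docs/section2.html
console.log(resolve("https://other.org/x")); // absolute URLs pass through unchanged
```

Root-relative paths (`/about`) resolve against the origin, while bare paths (`section2.html`) resolve against the page's directory, so both link styles come back as usable absolute URLs.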
Step 4: Build and Connect
npm run build
Then register the server with your MCP client. For Claude Desktop, add an entry to claude_desktop_config.json:
{
"mcpServers": {
"scraper": {
"command": "node",
"args": ["/absolute/path/to/web-scraper/dist/index.js"]
}
}
}
Chef's Notes
- Content is truncated to 5,000 characters by default to avoid overwhelming the AI context window.
- Always set a proper User-Agent header to be a good web citizen.
- For JavaScript-rendered pages, consider using Playwright instead of fetch.
- Add rate limiting if scraping multiple pages in sequence.
- Respect robots.txt in production deployments.
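The rate-limiting note can be as simple as enforcing a fixed pause between sequential requests. A minimal sketch; the helper name and the default interval are my own choices, not framework features:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Run an async function over items one at a time, waiting at least
// `intervalMs` between successive calls.
async function mapWithDelay<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  intervalMs = 1000,
): Promise<R[]> {
  const out: R[] = [];
  for (const [i, item] of items.entries()) {
    if (i > 0) await sleep(intervalMs); // no pause before the first call
    out.push(await fn(item));
  }
  return out;
}

// Usage with the page fetcher: one request per second.
// const pages = await mapWithDelay(urls, (u) => fetch(u).then((r) => r.text()), 1000);
```

Running requests strictly in sequence with a delay is deliberately conservative; a token-bucket limiter would allow bursts, but for an AI-driven scraper a steady one-request-per-interval pace is usually polite enough.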
More recipes: GitHub Integration | Slack Bot | All Recipes
Built with mcp-framework by @QuantGeekDev — 3.3M+ downloads, validated by Anthropic.