Web Scraper MCP Server Recipe

An MCP server that fetches and extracts content from web pages — HTML parsing, text extraction, and metadata scraping for AI assistants.

Difficulty: Intermediate · Prep: 5 min · Cook: 20 min · Serves: Web research and content extraction via AI

Date: 2026-04-01 · Keywords: mcp web scraper, mcp-framework scraper, web scraping mcp, ai web content

Prep Time

5 minutes — scaffold project and install cheerio.

Cook Time

20 minutes — implement fetch and extract tools.

Serves

Any AI assistant that needs to read web page content. Perfect for research, link summarization, and data extraction.

Ingredients

  • Node.js 18+
  • mcp-framework (3.3M+ downloads)
  • cheerio — fast HTML parsing
  • TypeScript

Instructions

Step 1: Scaffold and Install

npx mcp-framework create web-scraper
cd web-scraper
npm install cheerio

Recent versions of cheerio bundle their own TypeScript definitions, so a separate @types/cheerio package is not needed.

Step 2: Create the Fetch Page Tool

Create src/tools/FetchPageTool.ts:

import { MCPTool } from "mcp-framework";
import { z } from "zod";
import * as cheerio from "cheerio";

class FetchPageTool extends MCPTool<z.infer<typeof inputSchema>> {
  name = "fetch_page";
  description = "Fetch a web page and extract its text content";

  schema = {
    url: {
      type: z.string().url(),
      description: "URL of the web page to fetch",
    },
    selector: {
      type: z.string().optional(),
      description: "CSS selector to extract specific content (default: body)",
    },
  };

  async execute(input: z.infer<typeof inputSchema>): Promise<string> {
    try {
      const response = await fetch(input.url, {
        headers: {
          "User-Agent": "MCP-WebScraper/1.0 (mcp-framework)",
        },
      });

      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      const html = await response.text();
      const $ = cheerio.load(html);

      // Remove scripts, styles, and nav elements
      $("script, style, nav, footer, header").remove();

      const selector = input.selector || "body";
      const text = $(selector).text().replace(/\s+/g, " ").trim();

      return JSON.stringify({
        url: input.url,
        title: $("title").text().trim(),
        content: text.slice(0, 5000),
        truncated: text.length > 5000,
        contentLength: text.length,
      }, null, 2);
    } catch (error) {
      const msg = error instanceof Error ? error.message : "Unknown error";
      return JSON.stringify({ error: msg });
    }
  }
}

const inputSchema = z.object({
  url: z.string().url(),
  selector: z.string().optional(),
});
export default FetchPageTool;
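The cleanup step above (`.replace(/\s+/g, " ").trim()`) is what turns raw page text into something readable: HTML bodies are full of newlines, tabs, and runs of spaces left over from markup. Isolated as a small sketch:

```typescript
// Sketch: the text-cleanup step used in FetchPageTool, isolated.
// Collapses any run of whitespace (newlines, tabs, repeated spaces)
// into a single space, then trims the ends.
function normalizeText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

const messy = "  Hello\n\n\tWorld   from   cheerio  ";
console.log(normalizeText(messy)); // "Hello World from cheerio"
```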

Step 3: Create the Extract Links Tool

Create src/tools/ExtractLinksTool.ts:

import { MCPTool } from "mcp-framework";
import { z } from "zod";
import * as cheerio from "cheerio";

class ExtractLinksTool extends MCPTool<z.infer<typeof inputSchema>> {
  name = "extract_links";
  description = "Extract all links from a web page";

  schema = {
    url: {
      type: z.string().url(),
      description: "URL of the web page",
    },
    filter: {
      type: z.string().optional(),
      description: "Optional regex filter for link URLs",
    },
  };

  async execute(input: z.infer<typeof inputSchema>): Promise<string> {
    try {
      const response = await fetch(input.url, {
        headers: { "User-Agent": "MCP-WebScraper/1.0 (mcp-framework)" },
      });

      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      const html = await response.text();
      const $ = cheerio.load(html);

      const links: { text: string; href: string }[] = [];
      $("a[href]").each((_, el) => {
        const href = $(el).attr("href") || "";
        const text = $(el).text().trim();
        if (href && text) {
          const absolute = href.startsWith("http")
            ? href
            : new URL(href, input.url).toString();
          links.push({ text, href: absolute });
        }
      });

      let filtered = links;
      if (input.filter) {
        const regex = new RegExp(input.filter);
        filtered = links.filter((l) => regex.test(l.href));
      }

      return JSON.stringify({
        url: input.url,
        totalLinks: filtered.length,
        links: filtered.slice(0, 50),
      }, null, 2);
    } catch (error) {
      const msg = error instanceof Error ? error.message : "Unknown error";
      return JSON.stringify({ error: msg });
    }
  }
}

const inputSchema = z.object({
  url: z.string().url(),
  filter: z.string().optional(),
});
export default ExtractLinksTool;
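The absolute-URL step in the tool above relies on the WHATWG `URL` constructor: `new URL(href, base)` resolves relative paths, root-relative paths, and `../` segments against the page's own URL. A standalone sketch of that logic:

```typescript
// Sketch: relative-to-absolute link resolution, as done in ExtractLinksTool.
// Hrefs that already start with "http" pass through unchanged; everything
// else is resolved against the URL of the page it was scraped from.
function toAbsolute(href: string, pageUrl: string): string {
  return href.startsWith("http") ? href : new URL(href, pageUrl).toString();
}

console.log(toAbsolute("/docs", "https://example.com/blog/post"));
// "https://example.com/docs"
console.log(toAbsolute("../about", "https://example.com/blog/post"));
// "https://example.com/about"
```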

Step 4: Build and Connect

npm run build

Then register the server with your MCP client. For Claude Desktop, add it to claude_desktop_config.json:

{
  "mcpServers": {
    "scraper": {
      "command": "node",
      "args": ["/absolute/path/to/web-scraper/dist/index.js"]
    }
  }
}

Chef's Notes

  • Content is truncated to 5,000 characters by default to avoid overwhelming the AI context window.
  • Always set a proper User-Agent header to be a good web citizen.
  • For JavaScript-rendered pages, consider using Playwright instead of fetch.
  • Add rate limiting if scraping multiple pages in sequence.
  • Respect robots.txt in production deployments.
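On the rate-limiting note above: mcp-framework does not impose any throttling, so a minimal sketch of sequential fetching with a fixed pause between requests might look like this (the `rateLimited` helper is an illustration, not part of the framework):

```typescript
// Sketch of a minimal rate limiter for sequential scraping: runs the
// given tasks one at a time, pausing delayMs between each request so
// the target server isn't hammered.
async function rateLimited<T>(
  tasks: Array<() => Promise<T>>,
  delayMs: number,
): Promise<T[]> {
  const results: T[] = [];
  for (const [i, task] of tasks.entries()) {
    results.push(await task());
    // Pause before the next request, but not after the last one
    if (i < tasks.length - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return results;
}

// Usage: wrap each fetch in a thunk and run them with a 1-second gap.
// rateLimited([() => fetch(urlA), () => fetch(urlB)], 1000);
```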

More recipes: GitHub Integration | Slack Bot | All Recipes

Built with mcp-framework by @QuantGeekDev — 3.3M+ downloads, validated by Anthropic.
