How to Convert HTML to Markdown
HTML is the standard markup language for web pages, but Markdown provides a more readable and writable format for content creation. Converting HTML to Markdown can be useful for content extraction, documentation, and migrating web content to platforms that use Markdown. In this guide, we'll show you how to convert HTML to Markdown using JavaScript, and then demonstrate how the CaptureKit API offers a simpler alternative.
Method 1: Converting HTML to Markdown with JavaScript
To convert HTML to Markdown, we'll use the popular turndown library, which handles most of the heavy lifting. We'll need to:
- Set up the Turndown service with appropriate options
- Remove unwanted HTML elements
- Convert relative URLs to absolute URLs
- Process the HTML content
Here's a basic solution using JavaScript with turndown. It covers every step except URL conversion, which is added in the next section:
    import TurndownService from 'turndown';

    function convertHtmlToMarkdown(htmlString) {
      try {
        // Initialize Turndown with custom options
        const turndownService = new TurndownService({
          headingStyle: 'atx',
          codeBlockStyle: 'fenced',
          emDelimiter: '_',
          hr: '---',
          bulletListMarker: '-',
          strongDelimiter: '**',
        });

        // Remove scripts, styles, and other unwanted elements
        turndownService.remove(['script', 'style', 'noscript', 'iframe']);

        // Convert the HTML to Markdown
        const markdown = turndownService.turndown(htmlString);
        return markdown;
      } catch (error) {
        console.error('Error converting HTML to Markdown:', error);
        return '';
      }
    }

    // Usage example
    function main() {
      // Sample HTML; the link target is a placeholder
      const htmlString = `
        <h1>Hello World</h1>
        <p>This is a <strong>bold</strong> statement.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
        </ul>
        <a href="https://example.com">Example Link</a>
      `;
      const markdown = convertHtmlToMarkdown(htmlString);
      console.log('Markdown content:', markdown);
    }

    main();
Handling Relative URLs
One common issue when converting HTML to Markdown is dealing with relative URLs in links and images. To ensure all links and images work correctly, we need to convert relative URLs to absolute URLs:
    import TurndownService from 'turndown';

    function convertHtmlToMarkdownWithAbsoluteUrls(htmlString, baseUrl) {
      try {
        // Initialize Turndown with custom options
        const turndownService = new TurndownService({
          headingStyle: 'atx',
          codeBlockStyle: 'fenced',
          emDelimiter: '_',
          hr: '---',
          bulletListMarker: '-',
          strongDelimiter: '**',
        });

        // Remove scripts, styles, and other unwanted elements
        turndownService.remove(['script', 'style', 'noscript', 'iframe']);

        // Parse the base URL once; an invalid value throws and is handled below
        const base = new URL(baseUrl);

        // Resolve a URL against the base using standard URL resolution.
        // In-page anchors (#...) are kept as is; mailto:, data:, and other
        // absolute URLs are not rebased.
        const toAbsoluteUrl = (url) => {
          if (url.startsWith('#')) return url;
          try {
            return new URL(url, base).href;
          } catch {
            return url;
          }
        };

        // Convert relative URLs to absolute URLs in links
        turndownService.addRule('links', {
          filter: 'a',
          replacement: function (content, node) {
            const href = node.getAttribute('href');
            if (!href) return content;
            return '[' + content + '](' + toAbsoluteUrl(href) + ')';
          },
        });

        // Convert relative URLs to absolute URLs in images
        turndownService.addRule('images', {
          filter: 'img',
          replacement: function (content, node) {
            const src = node.getAttribute('src');
            const alt = node.getAttribute('alt') || '';
            if (!src) return '';
            return '![' + alt + '](' + toAbsoluteUrl(src) + ')';
          },
        });

        // Convert the HTML to Markdown
        const markdown = turndownService.turndown(htmlString);
        return markdown;
      } catch (error) {
        console.error('Error converting HTML to Markdown:', error);
        return '';
      }
    }

    // Usage example
    function main() {
      // Sample HTML; the relative paths are placeholders
      const htmlString = `
        <h1>Example Website</h1>
        <p>Check out our <a href="/about">About</a> page.</p>
        <img src="images/products.jpg" alt="New Products">
      `;
      const baseUrl = 'https://example.com';
      const markdown = convertHtmlToMarkdownWithAbsoluteUrls(htmlString, baseUrl);
      console.log('Markdown content:', markdown);
    }

    main();
How It Works
This code handles several important aspects of HTML to Markdown conversion:
- Custom Formatting: Configures the output Markdown style (headings, code blocks, etc.)
- Element Filtering: Removes unwanted elements like scripts and styles
- URL Handling: Converts relative URLs to absolute URLs in both links and images
- Content Processing: Converts the full HTML document to clean Markdown
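To make the URL Handling step more concrete, here is a small sketch of what resolution produces for different kinds of link targets. It relies only on the standard `URL` constructor available in Node.js and browsers; the page URL and the sample targets are made up for illustration.

```javascript
// Page the HTML was taken from (placeholder URL for illustration).
const pageUrl = 'https://example.com/blog/post.html';

// Sample link targets of different kinds (all hypothetical).
const targets = [
  '/about',                  // root-relative
  'images/a.png',            // relative to the page's directory
  '../pricing',              // parent directory
  '//cdn.example.net/a.png', // protocol-relative
  'https://other.example/x', // already absolute
  'mailto:hi@example.com',   // non-HTTP scheme
];

// new URL(target, base) applies the same resolution rules a browser uses.
const resolved = targets.map((target) => new URL(target, pageUrl).href);

resolved.forEach((href, i) => console.log(`${targets[i]} -> ${href}`));
```

Note that `images/a.png` resolves against the page's directory (`/blog/`) rather than the site root, and that `mailto:` targets are left alone. Both are cases that simple prefix checks plus string concatenation can get wrong.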
Method 2: Using CaptureKit API (Recommended)
While the JavaScript approach is flexible, it requires handling HTTP requests, HTML parsing, and URL transformations. CaptureKit API offers a simpler solution that handles all these complexities for you.
Here's how to use CaptureKit API to convert HTML to Markdown:
    curl "https://api.capturekit.dev/content?url=https://example.com&access_key=YOUR_ACCESS_KEY&include_markdown=true"
The API response includes the Markdown content along with other useful website information:
    {
      "success": true,
      "data": {
        "metadata": {
          "title": "CaptureKit - Turn any website into a screenshot with our powerful Screenshot API",
          "description": "CaptureKit is a powerful API for capturing screenshots, extracting HTML, gathering links, and summarizing content—all with a simple request.",
          "favicon": "https://capturekit.dev/favicon.ico",
          "ogImage": "https://capturekit-assets.s3.amazonaws.com/capturekit-og+(1).png"
        },
        "links": {
          "internal": [
            "https://capturekit.dev/",
            "https://capturekit.dev/dashboard",
            "https://capturekit.dev/pricing",
            "https://capturekit.dev/blog"
          ],
          "external": [
            "https://docs.capturekit.dev",
            "https://zapier.com/apps/capturekit-website-screenshots-p/integrations",
            "https://www.nextupkit.com"
          ],
          "social": [
            "https://github.com/CaptureKit-Web-Scraping-API",
            "https://x.com/capturekit"
          ]
        },
        "markdown": "# CaptureKit - Turn any website into a screenshot with our powerful Screenshot API\n\nCaptureKit is a powerful API for capturing screenshots, extracting HTML, gathering links, and summarizing content—all with a simple request.\n\n..."
      }
    }
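Once the response body is parsed as JSON, the Markdown sits under `data.markdown`. A minimal sketch of reading it defensively follows. The function name `extractMarkdown` is illustrative, and treating a non-true `success` or a missing `markdown` field as an error is an assumption based only on the response shape shown above.

```javascript
// Hypothetical helper: pull the Markdown string out of a parsed API response.
// Assumes the shape shown above: { success, data: { markdown, ... } }.
function extractMarkdown(responseBody) {
  if (!responseBody || responseBody.success !== true) {
    throw new Error('Request was not successful');
  }
  const markdown = responseBody.data?.markdown;
  if (typeof markdown !== 'string') {
    // Presumably absent unless include_markdown=true was requested.
    throw new Error('Response does not contain a "markdown" field');
  }
  return markdown;
}

// Trimmed-down sample in the same shape as the response above.
const sample = {
  success: true,
  data: {
    metadata: { title: 'Example page' },
    markdown: '# Example page\n\nHello.',
  },
};

console.log(extractMarkdown(sample));
```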
API Parameters
- url (string, Required): The URL of the webpage to capture.
- access_key (string, Required): Your API access key. Can be provided via the access_key query parameter, the x-access-key header, or the request body.
- include_html (boolean, Optional, Default: false): Include the raw HTML of the webpage in the response.
- include_markdown (boolean, Optional, Default: false): Include the Markdown of the webpage in the response.
- include_sitemap (boolean, Optional, Default: false): Include sitemap data of the webpage in the response.
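The parameters above map directly onto a request. The sketch below builds the query string with `URLSearchParams` and sends the key in the `x-access-key` header instead of the URL. `buildContentRequest` and `fetchContent` are illustrative names, not part of any CaptureKit SDK, and the global `fetch` assumes Node.js 18+ or a browser.

```javascript
const API_ENDPOINT = 'https://api.capturekit.dev/content';

// Hypothetical helper: turn options into a request URL plus headers.
function buildContentRequest({
  url,
  accessKey,
  includeHtml = false,
  includeMarkdown = false,
  includeSitemap = false,
}) {
  // URLSearchParams takes care of percent-encoding the target URL.
  const params = new URLSearchParams({ url });
  // Only send optional flags that differ from their documented default (false).
  if (includeHtml) params.set('include_html', 'true');
  if (includeMarkdown) params.set('include_markdown', 'true');
  if (includeSitemap) params.set('include_sitemap', 'true');
  return {
    requestUrl: `${API_ENDPOINT}?${params}`,
    headers: { 'x-access-key': accessKey },
  };
}

// Hypothetical wrapper: perform the request and return the parsed JSON body.
async function fetchContent(options) {
  const { requestUrl, headers } = buildContentRequest(options);
  const response = await fetch(requestUrl, { headers });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP status ${response.status}`);
  }
  return response.json();
}

// Build (but do not send) a request for the Markdown of a page.
const { requestUrl } = buildContentRequest({
  url: 'https://example.com',
  accessKey: 'YOUR_ACCESS_KEY',
  includeMarkdown: true,
});
console.log(requestUrl);
```

Keeping the key in a header means it does not end up in URLs that get logged or shared, which is the main reason to prefer it over the query parameter.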
Benefits of Using CaptureKit API
- Simplicity: One API call instead of dozens of lines of code
- Reliability: Built to handle edge cases and complex HTML structures
- Performance: Optimized for speed and efficiency
- Additional Data: Get website metadata alongside the Markdown content
- No Maintenance: No need to update your code when HTML standards change
Conclusion
Converting HTML to Markdown is essential for content extraction, documentation, and migration tasks. While our JavaScript solution provides a comprehensive approach with full control, CaptureKit API offers a more convenient alternative that handles the complexities for you.
Choose the method that best fits your needs:
- Use the JavaScript solution if you need full control over the conversion process
- Use CaptureKit API if you want a quick, reliable solution with minimal code
By converting HTML to Markdown, you can transform complex web content into a clean, readable format that's easy to work with and integrate into various platforms.