Steps for html-to-md script
Step 1 Download the website
To avoid getting blocked or sending excessive requests to the server, you’ll want to download the website to your local computer. This is useful in case you need to retry the html-to-md script repeatedly. The easiest tool to use is wget:
wget -r -m -p -x --user-agent="Mozilla/5.0" https://www.websiteurlhere.com
Step 2 Copy script and git init
Next, add the html-to-md script to the downloaded folder and run git init there. This lets you see changes with git status and roll back failed scrapes with git add . and git stash. The repository isn’t connected to anything online; it’s only used for local tracking.
So open a terminal and run:
git init
git add .
git commit -m 'init commit'
Step 3 Configure the html-to-md script
From here you’ll want to configure the html-to-md script. By default it’s set up to generate both the _pages collection and the _blogposts collection. The script looks for a unique selector, specific to the blog, to determine which pages are blog posts and which are not.
So you’ll need to work through the script and fill in the blanks for any selectors it needs.
Step 4 run and verify
The script can then be run with ruby html-to-md.rb, but it’s important to check that pages were actually created. You can look through the generated _pages and _blogposts folders to verify the output, or use git status to see what changed.
If you can verify everything was brought over and created correctly, move to Step 5; otherwise, reset with git add . and git stash and try Step 3 again.
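A quick way to spot an obviously failed run is to count the Markdown files the script produced. This sketch assumes the script’s default output folders, _pages and _blogposts:

```ruby
# Count the generated Markdown files under _pages and _blogposts.
def converted_counts(root = '.')
  {
    pages: Dir.glob(File.join(root, '_pages', '**', '*.md')).size,
    posts: Dir.glob(File.join(root, '_blogposts', '**', '*.md')).size
  }
end

puts converted_counts.inspect
```

If both counts are zero, the content selector in CONFIG almost certainly needs adjusting.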
Step 5 verify and compare sitemaps
From here you’ll want to make sure all the pages came over and match the live site. Find the sitemap on the live site and use find-and-replace to strip out the XML markup so you have a plain list of the site’s URLs, then sort them a-z.
Do the same for the html-to-md converted content: add the pages into Jekyll, get all the links from the generated sitemap, and use find-and-replace so you’re left with only the links. Add the URL prefix (https://www) back in so it matches the live website’s sitemap, then sort the links a-z.
If everything converted, you should now have two identical files sorted a-z. Name one new-pages.txt and the other live-pages.txt.
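The find-and-replace step above can also be automated with a few lines of Ruby. This is a sketch; the sitemap.xml and live-pages.txt filenames are examples:

```ruby
# Pull every <loc> URL out of a sitemap's XML and return them sorted a-z.
def sitemap_urls(xml)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten.sort
end

# Usage (hypothetical paths): save the live sitemap as sitemap.xml, then:
# File.write('live-pages.txt', sitemap_urls(File.read('sitemap.xml')).join("\n") + "\n")
```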
Then we’re going to use Vim to generate an HTML diff comparison. There are other ways to diff two files, but Vim has been the easiest for me; running the diff live in Vim has had great results, and the HTML output works well too.
vimdiff new-pages.txt live-pages.txt -c TOhtml -c 'w! diff.html' -c 'qa!'
Any pages that were missed will show up in the diff. Find them and convert them manually with this online converter (https://codebeautify.org/html-to-markdown), or go back to Step 3 and reconfigure the html-to-md script.
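If you’d rather not use vimdiff, a plain-Ruby comparison of the two URL lists works too. This is a sketch assuming the two files from Step 5:

```ruby
# URLs in the live sitemap but missing from the converted site, and vice versa.
def sitemap_diff(live_lines, new_lines)
  { missing: live_lines - new_lines, extra: new_lines - live_lines }
end

# Usage (assumes the files created in Step 5):
# diff = sitemap_diff(File.readlines('live-pages.txt', chomp: true),
#                     File.readlines('new-pages.txt', chomp: true))
# puts "Missing: #{diff[:missing].join(', ')}" unless diff[:missing].empty?
```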
HTML-TO-MD.rb conversion script
##############################
# READ ME
###
# This script loops through every HTML file in the directory
# (except the files and folders listed in CONFIG's ignore lists),
# pulling the title tag and description from the head
# and the page content from the body, then wrapping the
# converted content in front matter.
# **CAUTION**
# This does require a few edits, which will be listed below.
# Read the outputs to follow along with the script.
# Let's see how it's done!
# Requiring dependencies
puts "Requiring dependencies"
require 'nokogiri'
require 'reverse_markdown'
require 'find'
require 'date'
require 'fileutils'
require 'json'
# Define CSS selectors and other configuration settings at the top for easy modification
CONFIG = {
ignore_files: ['./index.html', './indexNEW.html'],
ignore_folders: ['/assets/', '.git', '.htaccess'],
blog_selector: ".main-content-container.blog",
title: 'title',
description: 'meta[name="description"]',
category: 'a[rel="category tag"]',
h1: 'h1',
json_ld: "script[type='application/ld+json']",
content: 'main.main-content-container',
remove_selectors: [
'.extra-selector-1', # Add your extra selectors here
'.extra-selector-2' # Add more selectors as needed
],
# Both of these need to be changed if scraping categories from the blog snippets
# Path to main blog page
snippet_file_path: './blog/index.html',
# Base domain
domain_url: 'https://www.EXAMPLE.com'
}
# Read and Parse the Snippet File to Build the Blog Categories Map
snippet_file = File.read(CONFIG[:snippet_file_path])
puts "Snippet file read successfully"
snippet_doc = Nokogiri::HTML(snippet_file)
puts "Snippet file parsed successfully"
blog_categories = {}
snippet_doc.css('article.post').each do |post|
link = post.at_css('a')
next unless link
# Normalize URL by removing domain, trailing slashes, and whitespace
url = link["href"].gsub(CONFIG[:domain_url], '').sub(%r{/*$}, '').strip + '/'
category = post["category"]
blog_categories[url] = category
end
puts "Blog categories mapped successfully"
# Method to process each file
def process_file(f, config, blog_categories)
puts "---"
puts "Processing file: #{f}"
file = File.read(f)
puts "File read successfully"
doc = Nokogiri::HTML(file)
puts "File parsed successfully"
puts "Beginning DOM interaction"
# Extract title, ensuring it handles nil gracefully
title = doc.at_css(config[:title])&.text&.strip || ""
puts "Found title tag: #{title}"
# Extract description
description_tag = doc.at_css(config[:description])
description = description_tag ? description_tag["content"].strip : ""
puts "Found header metadata: title=#{title}, description=#{description}"
# Determine if this is a blog page
is_blog = doc.at_css(config[:blog_selector])
if is_blog
puts 'Processing as BLOG PAGE'
# Generate permalink with consistent trailing slash and strip whitespace
permalink = f.gsub("./", "/").gsub("/index.html", "").sub(%r{/*$}, '').strip + '/'
puts "Generated permalink for post: #{permalink}"
# Primary category check using config selector
category = doc.at_css(config[:category])&.text&.strip
if category.nil? || category.empty?
# Fallback to blog snippet if primary selector didn’t find anything
puts "No category set, looking for category in blog snippet..."
category = blog_categories[permalink] || "Uncategorized"
puts "Category found from blog snippet: #{category}"
else
puts "Category found on page: #{category}"
end
# Extract h1
h1 = doc.at_css(config[:h1])&.text&.strip || ""
doc.at_css(config[:h1])&.remove
puts "Found H1: #{h1}"
# Directly attempt to extract datePublished
date = Date.today # Default if datePublished isn’t found
json_ld_scripts = doc.css(config[:json_ld])
date_found = false # Track whether a date was found
json_ld_scripts.each do |script|
begin
json_content = JSON.parse(script.text)
if json_content && json_content["@type"] == "BlogPosting" && json_content["datePublished"]
date = json_content["datePublished"]
date_found = true
puts "Found datePublished in schema: #{date}" # Message confirming date was found
break
end
rescue JSON::ParserError => e
puts "JSON parsing error: #{e.message}"
end
end
# If no date was found in the schema, confirm it’s using today’s date
puts "No datePublished found in schema. Using today's date: #{date}" unless date_found
# Remove additional unwanted selectors from the content
config[:remove_selectors].each do |selector|
doc.css(selector).each(&:remove)
end
puts "Removed additional unwanted content"
content_node = doc.at_css(config[:content])
return if content_node.nil?
content = content_node.inner_html
markdown = ReverseMarkdown.convert(content.to_s)
puts "Converted content to Markdown"
# Extract and create the front matter
frontMatter = <<~FRONTMATTER
  ---
  title: >
    #{h1}
  layout: post
  date: >
    #{date}
  titletag: >
    #{title}
  description: >
    #{description}
  permalink: >
    #{permalink}
  sitemap: true
  categories:
  - #{category}
  ---
FRONTMATTER
puts "Saved front matter"
newPage = frontMatter + markdown
puts "Combined front matter and content"
# Generate file paths and write the Markdown file
filename = f.split('/')[-2]
mdF = "./_blogposts/#{filename}.md"
puts "Markdown file path: #{mdF}"
directory = mdF.gsub(mdF.split('/').last, "")
puts "Making directory: #{directory}"
FileUtils.mkdir_p(directory)
puts "Directory created"
File.write(mdF, newPage)
puts "Written to file: #{mdF}"
FileUtils.rm_rf(f)
puts "Removed old file: #{f}"
else
puts 'Processing as regular PAGE'
# Remove additional unwanted selectors from the content
config[:remove_selectors].each do |selector|
doc.css(selector).each(&:remove)
end
puts "Removed additional unwanted content"
# Convert content to Markdown and create the Markdown file
content_node = doc.at_css(config[:content])
if content_node.nil?
puts "Content not found using selector: #{config[:content]}"
return
else
puts "Found content"
end
content = content_node.inner_html
markdown = ReverseMarkdown.convert(content.to_s)
puts "Converted content to Markdown"
# Extract and create the front matter
frontMatter = <<~FRONTMATTER
  ---
  layout: page
  title: >
    #{f.gsub('./', '').gsub('.html', '').gsub('-', ' ').split('/').map(&:capitalize).join(' ')}
  titletag: >
    #{title}
  description: >
    #{description}
  permalink: >
    #{f.gsub("./","/").gsub("/index.html","/")}
  titlebar: >
  sitemap: true
  ---
FRONTMATTER
puts "Saved front matter"
newPage = frontMatter + markdown
puts "Combined front matter and content"
# Generate filename for pages by replacing slashes with dashes and removing index.html
filename = f.gsub('./', '').gsub('/index.html', '').gsub('.html', '').gsub('/', '-')
mdF = "./_pages/#{filename}.md"
puts "Markdown file path: #{mdF}"
directory = mdF.gsub(mdF.split('/').last, "")
puts "Making directory: #{directory}"
FileUtils.mkdir_p(directory)
puts "Directory created"
File.write(mdF, newPage)
puts "Written to file: #{mdF}"
FileUtils.rm_rf(f)
puts "Removed old file: #{f}"
end
rescue => e
puts "Error processing file #{f}: #{e.message}"
puts e.backtrace.join("\n")
end
# Loop through each HTML file and process it
puts "Beginning loop of every file"
Find.find("./") do |f|
if CONFIG[:ignore_files].include?(f)
puts "Ignored file: #{f}"
elsif CONFIG[:ignore_folders].any? { |folder| f.include?(folder) }
puts "Ignored folder content: #{f}"
elsif f.include?('.html')
process_file(f, CONFIG, blog_categories)
else
puts "Not an HTML file: #{f}"
end
end
system "find ./ -empty -type d -delete"
puts "Removed all empty directories"