Html To Md

Steps for the html-to-md script

Step 1: Download the website

To avoid getting blocked or sending excessive requests to the server, you’ll want to download the website to your local computer. This is useful in case you need to retry the html-to-md script. The easiest way is wget:

wget -r -m -p -x --user-agent="Mozilla/5.0" https://www.websiteurlhere.com

Step 2: Copy the script and git init

Next, add the html-to-md script to the downloaded folder and run git init there. This lets you see changes with git status and roll back failed scrapes with git add . followed by git stash. The repository is not connected to a remote; it’s only used for local tracking.

So open a terminal and run:

git init
git add .
git commit -m 'init commit'

Step 3: Configure the html-to-md script

From here you’ll want to configure the html-to-md script. By default it’s set up to generate both the _pages collection and the _blogposts collection. The script looks for a unique CSS selector, specific to the blog, to determine which pages are blog posts and which are not.

So you’ll need to work through the script and fill in the blanks for any selectors it needs.

Step 4: Run and verify

The script can then be run with ruby html-to-md.rb, but it’s important to check that pages were actually created. You can look through the generated _pages and _blogposts folders to verify the changes, or use git status to see what’s new.

If everything has been brought over and created appropriately, move on to Step 5; otherwise, reset with git add . and git stash, then try Step 3 again.

Step 5: Verify and compare sitemaps

From here you’ll want to make sure all the pages came over and the converted site matches the live site. Find the sitemap on the live site and find-and-replace the XML markup away so you’re left with a list of all the website’s URLs. Then sort the list a–z.

Do the same for the html-to-md converted content: add the pages to Jekyll, get all the links from the generated sitemap, and find-and-replace until only the links remain. Add the URL prefix (https://www) back in so it matches the live website’s sitemap, then sort the links a–z.

You should now have two identical files, both sorted a–z. Name one new-pages.txt and the other live-pages.txt.
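The find-and-replace steps above can also be scripted. A minimal sketch in Ruby (the sitemap content here is a hypothetical example; in practice you’d paste or fetch the real sitemap.xml):

```ruby
# Stand-in for the live site's sitemap XML
sitemap = <<~XML
  <urlset>
    <url><loc>https://www.example.com/blog/</loc></url>
    <url><loc>https://www.example.com/about/</loc></url>
  </urlset>
XML

# Strip the XML wrapper so only the URLs remain, then sort a-z
urls = sitemap.scan(%r{<loc>(.*?)</loc>}).flatten.sort
File.write('live-pages.txt', urls.join("\n") + "\n")
```

Repeat the same idea on the Jekyll-generated sitemap to produce new-pages.txt.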

Then we’re going to use Vim to generate an HTML diff comparison. There are other ways to run a diff between two files, but Vim has been the easiest for me. Running the diff interactively has had great results, but the HTML output works well too.

vimdiff new-pages.txt live-pages.txt -c TOhtml -c 'w! diff.html' -c 'qa!'

Any pages that were missed will show up here. Convert those manually with this online converter (https://codebeautify.org/html-to-markdown), or go back to Step 3 and reconfigure the html-to-md script.
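As a lighter alternative to the Vim diff, a few lines of Ruby can list URLs that exist in the live sitemap but not in the converted site. The file contents below are hypothetical stand-ins; in practice both files come from Step 5:

```ruby
# Hypothetical stand-ins for the two sorted files from Step 5
File.write('live-pages.txt', "https://www.example.com/about/\nhttps://www.example.com/blog/\n")
File.write('new-pages.txt',  "https://www.example.com/about/\n")

# Array subtraction: URLs on the live site but missing from the conversion
missing = File.readlines('live-pages.txt', chomp: true) -
          File.readlines('new-pages.txt', chomp: true)
puts missing  # each printed URL still needs manual conversion
```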

html-to-md.rb conversion script

##############################
#         READ ME
##############################

# This script loops through every HTML page in the directory
# (except the files and folders listed in CONFIG below),
# pulling the title tag and description from the head and the
# page content from the body, then wrapping them in front matter.

# **CAUTION**
# This does require a few edits, which will be listed below.

# Read the outputs to follow along with the script.

# Let's see how it's done!

# Requiring dependencies
puts "Requiring dependencies"
require 'nokogiri'
require 'reverse_markdown'
require 'find'
require 'date'
require 'fileutils'
require 'json'

# Define CSS selectors and other configuration settings at the top for easy modification
CONFIG = {
  ignore_files: ['./index.html', './indexNEW.html'],
  ignore_folders: ['/assets/', '.git', '.htaccess'],
  blog_selector: ".main-content-container.blog",
  title: 'title',
  description: 'meta[name="description"]',
  category: 'a[rel="category tag"]',
  h1: 'h1',
  json_ld: "script[type='application/ld+json']",
  content: 'main.main-content-container',
  remove_selectors: [
    '.extra-selector-1',  # Add your extra selectors here
    '.extra-selector-2'   # Add more selectors as needed
  ],
  # Both of these need to be changed if scraping categories from the blog snippets
  # Path to main blog page
  snippet_file_path: './blog/index.html',  
  # Base domain 
  domain_url: 'https://www.EXAMPLE.com'
}

# Read and Parse the Snippet File to Build the Blog Categories Map
snippet_file = File.read(CONFIG[:snippet_file_path])
puts "Snippet file read successfully"
snippet_doc = Nokogiri::HTML(snippet_file)
puts "Snippet file parsed successfully"

blog_categories = {}
snippet_doc.css('article.post').each do |post|
  # Normalize URL by removing domain, trailing slashes, and whitespace
  url = post.at_css('a')["href"].gsub(CONFIG[:domain_url], '').sub(%r{/*$}, '').strip + '/'
  category = post["category"]
  blog_categories[url] = category
end
puts "Blog categories mapped successfully"

# Method to process each file
def process_file(f, config, blog_categories)
  puts "---"
  puts "Processing file: #{f}"

  file = File.read(f)
  puts "File read successfully"

  doc = Nokogiri::HTML(file)
  puts "File parsed successfully"
  puts "Beginning DOM interaction"

  # Extract title, ensuring it handles nil gracefully
  title = doc.at_css(config[:title])&.text&.strip || ""
  puts "Found title tag: #{title}"

  # Extract description
  description_tag = doc.at_css(config[:description])
  description = description_tag ? description_tag["content"].strip : ""
  puts "Found header metadata: title=#{title}, description=#{description}"

  # Determine if this is a blog page
  is_blog = doc.at_css(config[:blog_selector])
  if is_blog
    puts 'Processing as BLOG PAGE'

    # Generate permalink with consistent trailing slash and strip whitespace
    permalink = f.gsub("./", "/").gsub("/index.html", "").sub(%r{/*$}, '').strip + '/'
    puts "Generated permalink for post: #{permalink}"

    # Primary category check using config selector
    category = doc.at_css(config[:category])&.text&.strip
    if category.nil? || category.empty?
      # Fallback to blog snippet if primary selector didn’t find anything
      puts "No category set, looking for category in blog snippet..."
      category = blog_categories[permalink] || "Uncategorized"
      puts "Category found from blog snippet: #{category}"
    else
      puts "Category found on page: #{category}"
    end

    # Extract h1
    h1 = doc.at_css(config[:h1])&.text&.strip || ""
    doc.at_css(config[:h1])&.remove
    puts "Found H1: #{h1}"

    # Directly attempt to extract datePublished
    date = Date.today  # Default if datePublished isn’t found
    json_ld_scripts = doc.css(config[:json_ld])

    date_found = false  # Track whether a date was found

    json_ld_scripts.each do |script|
      begin
        # Don't swallow parse errors inline; let the rescue below report them
        json_content = JSON.parse(script.text)
        if json_content.is_a?(Hash) && json_content["@type"] == "BlogPosting" && json_content["datePublished"]
          date = json_content["datePublished"]
          date_found = true
          puts "Found datePublished in schema: #{date}"  # Message confirming date was found
          break
        end
      rescue JSON::ParserError => e
        puts "JSON parsing error: #{e.message}"
      end
    end

    # If no date was found in the schema, confirm it’s using today’s date
    puts "No datePublished found in schema. Using today's date: #{date}" unless date_found

    # Remove additional unwanted selectors from the content
    config[:remove_selectors].each do |selector|
      doc.css(selector).each(&:remove)
    end
    puts "Removed additional unwanted content"

    # Safe navigation so a missing content selector returns nil instead of raising
    content = doc.at_css(config[:content])&.inner_html
    return if content.nil?

    markdown = ReverseMarkdown.convert(content.to_s)
    puts "Converted content to Markdown"

    # Extract and create the front matter
    frontMatter = <<~FRONTMATTER
      ---
      title: >
        #{h1}
      layout: post
      date: >
        #{date}
      titletag: >
        #{title}
      description: >
        #{description}
      permalink: >
        #{permalink}
      sitemap: true
      categories:
        - #{category}
      ---
    FRONTMATTER
    puts "Saved front matter"

    newPage = frontMatter + markdown
    puts "Combined front matter and content"

    # Generate file paths and write the Markdown file
    filename = f.split('/')[-2]
    mdF = "./_blogposts/#{filename}.md"
    puts "Markdown file path: #{mdF}"

    directory = mdF.gsub(mdF.split('/').last, "")
    puts "Making directory: #{directory}"

    FileUtils.mkdir_p(directory)
    puts "Directory created"

    File.write(mdF, newPage)
    puts "Written to file: #{mdF}"

    FileUtils.rm_rf(f)
    puts "Removed old file: #{f}"
  else
    puts 'Processing as regular PAGE'

    # Remove additional unwanted selectors from the content
    config[:remove_selectors].each do |selector|
      doc.css(selector).each(&:remove)
    end
    puts "Removed additional unwanted content"

    # Convert content to Markdown and create the Markdown file
    # Safe navigation so a missing content selector returns nil instead of raising
    content = doc.at_css(config[:content])&.inner_html
    if content.nil?
      puts "Content not found using selector: #{config[:content]}"
      return
    else
      puts "Found content"
    end

    markdown = ReverseMarkdown.convert(content.to_s)
    puts "Converted content to Markdown"

    # Extract and create the front matter
    frontMatter = <<~FRONTMATTER
      ---
      layout: page
      title: >
        #{f.gsub('./', '').gsub('.html', '').gsub('-', ' ').split('/').map(&:capitalize).join(' ')}
      titletag: >
        #{title}
      description: >
        #{description}
      permalink: >
        #{f.gsub("./","/").gsub("/index.html","/")}
      titlebar: >
      sitemap: true
      ---
    FRONTMATTER
    puts "Saved front matter"

    newPage = frontMatter + markdown
    puts "Combined front matter and content"

    # Generate filename for pages by replacing slashes with dashes and removing index.html
    filename = f.gsub('./', '').gsub('/index.html', '').gsub('.html', '').gsub('/', '-')
    mdF = "./_pages/#{filename}.md"
    puts "Markdown file path: #{mdF}"

    directory = mdF.gsub(mdF.split('/').last, "")
    puts "Making directory: #{directory}"

    FileUtils.mkdir_p(directory)
    puts "Directory created"

    File.write(mdF, newPage)
    puts "Written to file: #{mdF}"

    FileUtils.rm_rf(f)
    puts "Removed old file: #{f}"
  end
rescue => e
  puts "Error processing file #{f}: #{e.message}"
  puts e.backtrace.join("\n")
end

# Loop through each HTML file and process it
puts "Beginning loop of every file"
Find.find("./") do |f|
  if CONFIG[:ignore_files].include?(f)
    puts "Ignored file: #{f}"
  elsif CONFIG[:ignore_folders].any? { |folder| f.include?(folder) }
    puts "Ignored folder content: #{f}"
  elsif f.include?('.html')
    process_file(f, CONFIG, blog_categories)
  else
    puts "Not an HTML file: #{f}"
  end
end

# Use system (not exec) so the script continues to the final message
system "find ./ -empty -type d -delete"
puts "Removed all empty directories"