Using Pandoc to convert HTML to Markdown for Jekyll


Sometimes you want to archive a page from a website by downloading the raw HTML and re-uploading it to your own site. But the layout and styling you built for your Jekyll site aren’t applied to static files such as .html files. Here’s a quick-ish fix using Pandoc.

$ pandoc yourfile.html -f html -t gfm-raw_html -s -o newfile.md

Here’s what the auto-generated front matter looks like for Jekyll’s “Static Files” documentation page:

---
description: A static file is a file that does not contain any front
  matter. These include images, PDFs, and other un-rendered content.
generator: Jekyll v4.3.2
lang: en-US
title: Static Files \| Jekyll • Simple, blog-aware, static sites
twitter:card: summary_large_image
twitter:site: "@jekyllrb"
viewport: width=device-width,initial-scale=1
---

As you can see, it automatically grabs the title of the page, which makes it easy to reference with Liquid tags.
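
If you want to check from the command line which title Pandoc picked up, a small shell helper can read it back out of the generated front matter. This is only a sketch and not part of the original workflow: front_matter_title is a hypothetical name, and it assumes the front matter is the first block delimited by --- lines and that the title fits on a single line.

```shell
# Hypothetical helper: print the title from the front matter of a converted file.
# Assumes the file starts with a '---' delimited block and a one-line 'title:' key.
front_matter_title() {
  awk '
    # Count the --- delimiters; stop once the front matter block has ended.
    /^---[[:space:]]*$/ { fences++; if (fences == 2) exit; next }
    # Inside the block, print the value of the first title: key and stop.
    fences == 1 && /^title:/ { sub(/^title:[[:space:]]*/, ""); print; exit }
  ' "$1"
}
```

Calling front_matter_title newfile.md prints the title value exactly as Pandoc wrote it, including any escaping such as \|.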

It’s not a perfect solution, but it’s better than manually editing a bunch of HTML files to add your front matter and/or stylesheets. And if you’re grabbing a lot of pages from the same site, you could always add that site’s classes to your own CSS so the result looks a little neater.

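Note that this solution doesn’t support tables. The gfm-raw_html output format (GitHub-Flavored Markdown with the raw_html extension turned off) is needed to strip weird CSS remnants, but it also drops any table that Pandoc would otherwise have written out as raw HTML.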
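
Because tables can get lost, it may help to find the pages that contain one before you convert anything. The sketch below is one way to do that. has_table and list_table_pages are hypothetical helper names, and the check assumes that a case-insensitive search for the text <table is a good enough signal.

```shell
# Hypothetical helper: succeed if the file appears to contain an HTML table.
has_table() {
  grep -qi '<table' "$1"
}

# Print the name of every given file that appears to contain a table.
list_table_pages() {
  local f
  for f in "$@"; do
    [ -f "$f" ] || continue   # skip anything that is not a regular file
    if has_table "$f"; then
      printf '%s\n' "$f"
    fi
  done
}
```

Running list_table_pages *.html in the download directory lists the pages you may want to convert separately or keep as HTML.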

How to convert an entire directory:

cd into the directory, then:

On Linux or macOS:

for f in *.html; do pandoc "$f" -f html -t gfm-raw_html -s -o "${f%.html}.md"; done

In Windows PowerShell (untested; note that, unlike the loop above, this one recurses into subdirectories):

gci -r -i *.html | foreach { $md = $_.DirectoryName + "\" + $_.BaseName + ".md"; pandoc -f html -t gfm-raw_html -s $_.FullName -o $md }

Note that Jekyll renders each .md file to an .html file. If the Markdown file has the same base name as the original, as it does with the commands above, the rendered page takes the place of the original unformatted file on the published site. To avoid this, simply add a character to the new file name.
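
One way to avoid the name clash is to build the output name with a suffix. This is a minimal sketch under stated assumptions: md_name and convert_dir are hypothetical helpers, the -md suffix is an arbitrary choice, and pandoc is assumed to be on your PATH.

```shell
# Hypothetical helper: map page.html to page-md.md so that the rendered
# Markdown page does not collide with the original page.html.
md_name() {
  local f="$1"
  printf '%s\n' "${f%.html}-md.md"
}

# Convert every .html file in a directory (default: the current one).
convert_dir() {
  local dir="${1:-.}" f
  for f in "$dir"/*.html; do
    [ -e "$f" ] || continue   # the glob matched nothing
    pandoc "$f" -f html -t gfm-raw_html -s -o "$(md_name "$f")"
  done
}
```

With these defined, convert_dir . writes about-md.md next to about.html and leaves the original file reachable at its old URL.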

To access your original files:

{% assign html_files = site.static_files | where: "html", true %}
<!-- Set 'html: true' (or any key you like) in the front matter defaults in _config.yml so you can filter on it. -->
<ul>
    {%- for myfile in html_files -%}
    {%- if myfile.extname contains 'html' -%}
    <li>
        <h2><a href="{{ myfile.path }}">{{ myfile.name }}</a></h2>
    </li>
    {%- endif -%}
    {%- endfor -%}
</ul>

And the new Markdown files:

<ul>
    {% for misc in site.miscs %}
    <li>
        <h2><a href="{{ misc.url }}">{{ misc.title }}</a></h2>
    </li>
    {% endfor %}
</ul>