Using Pandoc to convert HTML to Markdown for Jekyll
Sometimes you want to archive a page from a website by downloading the raw HTML and re-uploading it. But the layout and styling you set up for your Jekyll site don’t apply to static files such as .html files. Here’s a quick-ish fix using Pandoc.
$ pandoc yourfile.html -f html -t gfm-raw_html -s -o newfile.md
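For reference, here is a sketch of the same command spelled with Pandoc’s long option names, which makes each flag easier to read. The convert_page function name is made up for this example. The -raw_html suffix on the output format turns off Pandoc’s raw_html extension, and --standalone (the -s above) is what makes Pandoc write the metadata block at the top of the output.

```shell
# Hypothetical wrapper around the one-liner above; convert_page is not a Pandoc command.
#   $1: input HTML file
#   $2: output Markdown file
# --from html        read the input as HTML (same as -f html)
# --to gfm-raw_html  write GitHub-Flavored Markdown with the raw_html extension disabled
# --standalone       include the document metadata as a header block (same as -s)
# --output           write to a file instead of stdout (same as -o)
convert_page() {
  pandoc "$1" \
    --from html \
    --to gfm-raw_html \
    --standalone \
    --output "$2"
}

# Usage:
#   convert_page yourfile.html newfile.md
```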
Here’s what the auto-generated front matter looks like for this webpage (the Jekyll docs page on static files):
---
description: A static file is a file that does not contain any front
matter. These include images, PDFs, and other un-rendered content.
generator: Jekyll v4.3.2
lang: en-US
title: Static Files \| Jekyll • Simple, blog-aware, static sites
twitter:card: summary_large_image
twitter:site: "@jekyllrb"
viewport: width=device-width,initial-scale=1
---
As you can see, it automatically grabs the title of the page, which makes it easy to reference with Liquid tags.
It’s not a perfect solution, but it’s better than manually editing a bunch of HTML files to add your front matter and/or stylesheets. And if you’re grabbing a lot of pages from the same site, you could always add their classes to your CSS so the converted pages look a little neater.
Note that this solution doesn’t support tables. The gfm-raw_html output format (GitHub-Flavored Markdown with the raw_html extension turned off) is needed to remove weird CSS remnants, but it also removes tables.
How to convert an entire directory: cd into the directory, then:
On Linux or macOS:
for f in *.html; do pandoc "$f" -f html -t gfm-raw_html -s -o "${f%.html}.md"; done
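In Windows PowerShell (untested; unlike the loop above, this one also recurses into subdirectories):
Get-ChildItem -Recurse -Include *.html | ForEach-Object { $md = Join-Path $_.DirectoryName ($_.BaseName + ".md"); pandoc -f html -t gfm-raw_html -s $_.FullName -o $md }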
Note that Jekyll renders the .md files to HTML, so a converted page.md becomes page.html and ‘overwrites’ the original unformatted page.html on the built site whenever the two share a base name, as they do in the commands above. To avoid this, add a character to the output file name.
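One way to do that on Linux or macOS is sketched below. The md_name helper and the -md suffix are arbitrary choices made for this example; any suffix that makes the output name differ from the source name should work.

```shell
# Hypothetical helper: derive an output name that cannot collide with the source page.
# page.html -> page-md.md, which Jekyll then renders as page-md.html
md_name() {
  printf '%s-md.md\n' "${1%.html}"
}

for f in *.html; do
  [ -e "$f" ] || continue   # skip the literal pattern when no .html files exist
  pandoc "$f" -f html -t gfm-raw_html -s -o "$(md_name "$f")"
done
```

With this naming, the original page.html and the converted page both stay reachable on the built site.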
To access your original files:
{% assign html_files = site.static_files | where: "html", true %}
<!-- Add 'html: true' (or whatever you want) to the front matter defaults in _config.yml so you can filter on it. -->
<ul>
{%- for myfile in html_files -%}
{%- if myfile.extname contains 'html' -%}
<li>
<h2><a href="/miscs/{{ myfile.name }}">{{ myfile.basename }}</a></h2>
</li>
{%- endif -%}
{% endfor %}
</ul>
And the new Markdown files:
<ul>
{% for misc in site.miscs %}
<li>
<h2><a href="{{ misc.url }}">{{ misc.title }}</a></h2>
</li>
{% endfor %}
</ul>