Extracting AO3 HTML metadata into YAML via shell (and using Eleventy for tag sorting)
This site is built with Eleventy. I wanted to extract the tags from each downloaded AO3 fic and prepend them to each file as YAML, so that Eleventy could read the frontmatter. I figured doing this via the command line was easiest. However, bash is not my forte, so I used ChatGPT to assist me. (Yes, AI is broadly unethical dataset-wise; however, code snippets are not nearly the same as art or writing, and they're generally treated as CC0.)
My experience with ChatGPT #
It was able to output fairly competent code, at least at first. It worked best when I boiled each problem down to one or two steps and then asked it to add another feature. Sometimes it seemed like it didn't know what to do without broader context, but when I pasted in my existing code, it just spat it back out verbatim. There was definitely a tipping point where it stopped being useful and I had to figure the rest out myself, but that point came a lot later than I expected. It mostly had issues with regex (but who doesn't?) and with not understanding the output format I wanted.
I think that AI will only be good for snippets/simple functions for a long while yet. Nobody's feeding whole open-source programs into it for it to be able to guess what many interlocking parts look like.
Code #
Bash #
This is the script I ended up with. It takes a file or a folder, and writes the same file(s) with YAML frontmatter prepended into an output folder. It compiles the fandom, relationships, characters, and additional_tags into one 'tags' key, which was just easier to display with the existing Eleventy architecture because I didn't want to write javascript. <- certified javascript hater. I didn't include the category or warnings because I don't think those are useful to filter on, since there's not really a way to combine 'searches'.
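For reference, the frontmatter it prepends is shaped roughly like this (the values here are invented, just to show the shape):

---
title: Some Fic Title
author: someauthor
published: 2023-01-15
words: 12345
chapters: 1/1
rating: Teen And Up Audiences
category: F/M
archive_warning:
 - No Archive Warnings Apply
fandom:
 - Some Fandom
relationships:
 - Character A/Character B
characters:
 - Character A
 - Character B
additional_tags:
 - Fluff
language: English
summary: A one-line summary with colons and parentheses stripped.
tags:
 - Some Fandom
 - Character A/Character B
 - Character A
 - Character B
 - Fluff
---

And here's the script itself: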
#!/bin/bash
# This removes every closing ')' because it can mess with YAML. Not only smileys but tags are affected. Remove at your own peril.
# sed 's/)//g'
# main fn
process_html_file() {
    # setup
    input_file=$1
    output_folder="output"
    output_file="$output_folder/${input_file##*/}"
    output_file="${output_file%.*}-output.html"

    # Create the output folder if it doesn't exist
    mkdir -p "$output_folder"

    # Read the input HTML content
    input=$(<"$input_file")

    # Extract content within the <div id="preface"> tag
    content=$(awk '/<div id="preface">/,/<\/div>/' "$input_file")

    # Extract data from the content using regular expressions
    title=$(echo "$content" | grep -o '<h1>[^<]*' | sed 's/<h1>//')
    author=$(echo "$content" | grep -o 'by <a rel="author" href="[^"]*">[^<]*' | sed 's/by <a rel="author" href="[^"]*">//')
    published=$(echo "$content" | grep -o 'Published: [0-9-]*' | sed 's/Published: //')
    words=$(echo "$content" | grep -o 'Words: [0-9]*' | sed 's/Words: //')
    chapters=$(echo "$content" | grep -o 'Chapters: [^<]*' | sed 's/Chapters: //')
    #fandom=$(echo "$content" | awk -F'<[^>]*>' '/<dt>Fandom:<\/dt>/{getline; gsub(/<\/a>/,""); print $3}')
    fandom=$(echo "$content" | awk -F'<[^>]*>' '/<dt>Fandoms?:<\/dt>/{getline; gsub(/<\/a>/,""); while (match($0, />([^<]+)</)) {fandoms = fandoms substr($0, RSTART+1, RLENGTH-2); $0 = substr($0, RSTART+RLENGTH)} gsub(/, $/, "", fandoms); gsub(/, /, "\n - ", fandoms); printf "\n - %s\n", fandoms}')
    relationships=$(echo "$content" | awk -F'<[^>]*>' '/<dt>Relationships?:<\/dt>/{getline; gsub(/<\/a>/,""); while (match($0, />([^<]+)</)) {relationships = relationships substr($0, RSTART+1, RLENGTH-2); $0 = substr($0, RSTART+RLENGTH)} gsub(/, $/, "", relationships); gsub(/, /, "\n - ", relationships); printf "\n - %s\n", relationships}')
    characters=$(echo -e "\n$(echo "$content" | sed -n '/<dt>Characters:<\/dt>/,/<dt>/p' | sed 's/<[^>]*>//g; s/Characters://; s/, /\n- /g; s/^/- /' | sed '/^$/d; /^- $/d; /- Additional Tags:/d' | sed 's/^/ /')")
    additional_tags=$(echo -e "\n$(echo "$content" | sed -n '/<dt>Additional Tags:<\/dt>/,/<dt>/p' | sed 's/<[^>]*>//g; s/Additional Tags://; s/, /\n/g' | sed '/^$/d; /^- $/d; /- Additional Tags:/d; /<dt>Language:<\/dt>/d' | sed '/^$/d' | sed '$d' | sed 's/://g' | sed 's/)//g')")
    language=$(echo "$content" | awk -F'[<>]' '/<dt>Language:<\/dt>/{getline; print $3}')
    summary=$(echo "$input" | sed -n 's/.*<blockquote class="userstuff"><p>\(.*\)<\/p><\/blockquote>.*/\1/p' | sed 's/://g' | sed 's/)//g' | tr '\n' ' ')
    rating=$(echo "$content" | sed -n '/<dt>Rating:<\/dt>/,/<dt>/p' | sed 's/<[^>]*>//g; s/Rating://; s/, /\n/g' | sed '/^$/d; /^- $/d; /- Rating:/d; /<dt>Archive Warning:<\/dt>/d' | sed '/^$/d' | sed '$d')
    archive_warning=$(echo "$content" | awk -F'<[^>]*>' '/<dt>Archive Warnings?:<\/dt>/{getline; gsub(/<\/a>/,""); while (match($0, />([^<]+)</)) {warnings = warnings substr($0, RSTART+1, RLENGTH-2); $0 = substr($0, RSTART+RLENGTH)} gsub(/, $/, "", warnings); gsub(/, /, "\n - ", warnings); printf "\n - %s\n", warnings}')
    category=$(echo "$content" | sed -n '/<dt>Category:<\/dt>/,/<dt>/p' | sed 's/<[^>]*>//g; s/Category://; s/, /\n/g' | sed '/^$/d; /^- $/d; /- Category:/d' | sed '/^$/d' | sed '$d')

    # all tags, combined into one list
    tags=$(echo -e "${fandom}\n${relationships}\n${characters}\n${additional_tags}" | sed '/^$/d; s/:/-/g; s/^ - //')
    tags=$(echo "$tags" | sed 's/^/ - /')

    # replace &quot; with " and &amp; with &
    tags=$(echo "$tags" | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g')
    additional_tags=$(echo "$additional_tags" | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g')
    characters=$(echo "$characters" | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g')
    relationships=$(echo "$relationships" | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g')

    # Create YAML content with proper indentation and replace colons with dashes
    # (the YAML string is deliberately not indented, so no stray whitespace ends up in it)
    yaml_content="---
title: ${title//:/-}
author: ${author//:/-}
published: ${published//:/-}
words: ${words//:/-}
chapters: ${chapters//:/-}
rating: ${rating//:/-}
category: ${category//:/-}
archive_warning: ${archive_warning//:/-}
fandom: ${fandom//:/-}
relationships: ${relationships//:/-}
characters: ${characters//:/-}
additional_tags:
$(echo "$additional_tags" | sed '/^$/d' | sed 's/^/ - /' | sed 's/:/-/g')
language: ${language//:/-}
summary: ${summary//:/-}
tags:
${tags//:/-}
---"

    # Prepend YAML content to the input HTML file
    echo "$yaml_content" | cat - "$input_file" > temp.html
    # Rename the temp file to the output file
    mv temp.html "$output_file"
    # Removes weird duplicated YAML that appends at random?? (Keeps everything up to </html>, drops the rest.)
    awk '1;/<\/html>/{exit}' "$output_file" > temp.html && mv temp.html "$output_file"

    echo "Output file generated: $output_file"
}
# Check if the input is a folder or a single file
if [ -d "$1" ]; then
# Input is a folder
input_folder=$1
# Process each HTML file in the folder
for input_file in "$input_folder"/*.html; do
if [ -f "$input_file" ]; then
process_html_file "$input_file"
fi
done
elif [ -f "$1" ]; then
# Input is a single file
process_html_file "$1"
else
echo "Error: Invalid input. Please provide a folder or a single file."
exit 1
fi
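To run it, assuming you've saved it as extract-meta.sh (that filename is just my example, call it whatever):

chmod +x extract-meta.sh
./extract-meta.sh my-fic.html      # process a single file
./extract-meta.sh fic-downloads    # process every .html file in a folder

Either way, the processed copies land in output/, each with -output tacked onto the filename.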
Eleventy #
post.njk looks like this:
<h1>{{ title }}</h1>
<h2>{{ author }}</h2>
<h3>{% if published %}
<time datetime="{{ published | htmlDateString }}">{{ published | readableDate }}</time>
{% else %}
<time datetime="{{ date | htmlDateString }}">{{ date | readableDate }}</time>
{% endif %}</h3>
<p>{{ summary }}</p>
{%- set isFic = true %}
<h3>Tags</h3>
<ul class="post-metadata">
{%- for tag in tags | filterTagList %}
{%- set tagUrl %}/tags/{{ tag | slugify }}/{% endset %}
{%- if tag == "blog" %}{%- set isFic = false %}{% endif %}
<li><a href="{{ tagUrl }}" class="post-tag">{{ tag }}</a>{%- if not loop.last %}, {% endif %}</li>
{%- endfor %}
</ul>
{%- css %}{% include "public/css/message-box.css" %}{% endcss %}
{%- if isFic == true %}
<p class="message-box">The following is the entire original page. Any links below this line will lead offsite.</p>
{% endif %}
<hr>
{{ content | safe }}
You can ignore the isFic thing if you want, but I wanted that message box to help cement the divide between my in-site tags and the outgoing ones, which link straight to AO3.
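A note on filterTagList, since I didn't write it: it comes stock with eleventy-base-blog, and from memory it's roughly this. All it does is hide Eleventy's internal housekeeping tags:

eleventyConfig.addFilter("filterTagList", function filterTagList(tags) {
	// strip out the collection-machinery tags so only real content tags display
	return (tags || []).filter(tag => ["all", "nav", "post", "posts"].indexOf(tag) === -1);
});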
postslist.njk looks like this:
{%- css %}.postlist { counter-reset: start-from {{ (postslistCounter or postslist.length) + 1 }} }{% endcss %}
<ol reversed class="postlist">
{% for post in postslist | reverse %}
<li class="postlist-item{% if post.url == url %} postlist-item-active{% endif %}">
<a href="{{ post.url }}" class="postlist-link">{% if post.data.title %}{{ post.data.title }}{% else %}<code>{{ post.url }}</code>{% endif %}</a>
{% if post.data.published %}
<time datetime="{{ post.data.published | htmlDateString }}">{{ post.data.published | readableDate }}</time>
{% else %}
<time datetime="{{ post.date | htmlDateString }}">{{ post.date | readableDate }}</time>
{% endif %}
</li>
{% endfor %}
</ol>
The change here is so blog posts have their created date while fics have their original published date (same as in post.njk).
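(Likewise, readableDate and htmlDateString are the stock eleventy-base-blog filters, thin Luxon wrappers along these lines:)

const { DateTime } = require("luxon");

// e.g. "15 January 2023"
eleventyConfig.addFilter("readableDate", (dateObj, format, zone) => {
	return DateTime.fromJSDate(dateObj, { zone: zone || "utc" }).toFormat(format || "dd LLLL yyyy");
});

// e.g. "2023-01-15", for <time datetime="...">
eleventyConfig.addFilter("htmlDateString", (dateObj) => {
	return DateTime.fromJSDate(dateObj, { zone: "utc" }).toFormat("yyyy-LL-dd");
});

These expect Date objects, which works out because an unquoted date like published: 2023-01-15 in the frontmatter gets parsed into a Date by the YAML parser (as far as I can tell, anyway).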
In eleventy.config.js (aka .eleventy.js, but if you copy from the eleventy-base-blog repo it'll be called eleventy.config.js) I added this:
const HTMLMinifier = require('@bhavingajjar/html-minify');
const minifier = new HTMLMinifier();
// ...
eleventyConfig.addTransform("minifier", function(content, outputPath) {
	// outputPath can be false (e.g. permalink: false), so guard before calling endsWith
	if (outputPath && outputPath.endsWith(".html")) {
		// skip pages containing code blocks so they don't get squished onto one line
		if (!content.includes("<pre")) {
			return minifier.htmlMinify(content);
		}
	}
	return content;
});
You'll need to npm install @bhavingajjar/html-minify and restart the Eleventy server. It strips extra whitespace so all the fics are spaced out the same, but doesn't condense your nice code blocks into one line.
And that's it!
Vaguely Related Griping #
Why is the eleventy-base-blog repo so bad? I ended up cloning eleventy-high-performance-blog instead, because it actually lets you have animated images (although there are weird issues with that repo too. Why wouldn't you want to see every post's tags by default?). It does take a few seconds to build each time if you have a lot of posts with a lot of tags, though, which the base-blog doesn't. I really recommend separate dev and prod data.
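If you want that dev/prod split, the simplest approach I know of is a global data file keyed off an environment variable (the file and key names here are just my picks, not a convention):

// _data/env.js
module.exports = {
	// true only when building for production
	isProd: process.env.ELEVENTY_ENV === "production"
};

Anything in _data/ becomes globally available to templates, so you can gate prod-only bits with {% if env.isProd %}...{% endif %} and build with ELEVENTY_ENV=production npx @11ty/eleventy.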