This article will explain how the blog is organized at a technical level, and show how I implemented various IndieWeb features.
Motivation
Earlier this year I migrated this blog off WordPress to the Hugo static site generator. Over the last six months I’ve been gradually sculpting and shaping this digital place to feel more like my cozy corner of the web. This was partly motivated by withering social media, partly inspired by the IndieWeb POSSE philosophy (Publish on your Own Site, Syndicate Elsewhere), and partly driven by my desire to do more with the space, such as embedding JavaScript demos and having footnotes and endnotes.
So one of the big new features is that I have a “shortform” section on the blog, which corresponds roughly to threads I would post on Twitter or Mastodon. Syndication automatically converts those to full threads on all platforms, with a link back to the website.
The main constraint I had for myself when doing all this was to avoid managing a continuously running server, and certainly not deal with self-hosting. As the rest of the article shows, this means I rely heavily on GitHub Actions to run automation, as well as Netlify to trigger deployment webhooks and various third-party services for webmentions.
Depending so heavily on GitHub Actions mildly triggers my reliability brain, but because the custom features are implemented in standard Python scripts, and the “databases” are also stored in flat files in the repo, if something goes awry, migrating off GH Actions should be relatively straightforward.
As a standard disclaimer, the scripts in this article are very hacky and specific to my setup. I will likely never take the time to make them into something general-purpose. It’s my cozy corner and it can be a little messy.
Structure and Deployment
The blog is a relatively standard Hugo site using the Paperesque theme with various customizations. The blog lives in a private GitHub repository (with the theme fully vendored within), and deploys to Netlify on every push.
Here is my hugo.toml
baseURL = 'https://www.jeremykun.com/'
languageCode = 'en-us'
title = 'Math ∩ Programming'
theme = 'paperesque'
# This parameter is branched on in templates to determine if MathJAX should be
# served when the page is loaded.
[params]
math = true
[[params.topmenu]]
name = "Main Content"
url = "main-content/"
[[params.topmenu]]
name = "Primers"
url = "primers/"
[[params.topmenu]]
name = "All articles"
url = "posts/"
[[params.topmenu]]
name = "About"
url = "about/"
[[params.topmenu]]
name = "rss"
url = "rss/"
[markup.goldmark.extensions.passthrough]
enable = true
[markup.goldmark.extensions.passthrough.delimiters]
inline = [['$', '$'], ['\(', '\)']]
block = [['$$', '$$'], ['\[', '\]']]
Note the passthrough extension that allows Hugo to work well with standard TeX delimiters; I actually added that feature to Hugo in v0.122.0. See the docs here, and note there is still one bug I haven’t figured out how to fix.
And my netlify.toml:
[build]
# npx -y pagefind ... runs pagefind to create a static search index
command = "hugo --gc --minify && npx -y pagefind --site public"
publish = "public"
# exclude flat files in scripts/ that contain database mappings for published
# social media posts
ignore = "git diff --quiet $CACHED_COMMIT_REF $COMMIT_REF -- . ':(exclude)scripts/*.txt'"
[build.environment]
HUGO_VERSION = "v0.122.0"
TZ = "America/Los_Angeles"
[[redirects]]
from = "/feed/"
to = "/index.xml"
status = 301
[context.deploy-preview]
command = "hugo --gc --minify --buildFuture -b $DEPLOY_PRIME_URL && npx -y pagefind --site public"
Static search index
The build command above has an extra step to initialize a static search index using Pagefind. When this runs, it creates index files under public/pagefind/ such that the following snippet just works.
<div style="margin: 20px 0px 20px 0px;">
<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
window.addEventListener('DOMContentLoaded', (event) => {
new PagefindUI({ element: "#search", showSubResults: true });
});
</script>
</div>
I slapped it in themes/paperesque/layouts/partials/homepage_display_section.html
and it creates the search bar and dynamic experience you see on the blog homepage.
I have not noticed it producing significant additional bandwidth usage,
especially not compared to serving images.
Running scripts via GitHub Actions
Because the site is static, all of the automation on the blog is done through GitHub Actions, modifying static files in the repository and triggering a new deployment if any files change.
To accomplish this,
I utilize the workflow_dispatch
trigger in GitHub Actions,
which allows one to remotely trigger a workflow via an HTTP POST request,
optionally with arguments passed in the URL.
For example, my social media syndication script starts like this:
#.github/workflows/syndicate.yml
name: Syndicate to social media
permissions:
contents: write
on:
workflow_dispatch:
schedule:
# https://crontab.guru/once-a-day
- cron: '0 0 * * *'
To trigger this action remotely, I can POST to
https://api.github.com/repos/j2kun/math-intersect-programming/actions/workflows/syndicate.yml/dispatches
with the following headers
{
"Accept": "application/vnd.github+json",
"Authorization": "Bearer <API_KEY>",
"X-GitHub-Api-Version": "2022-11-28",
"Content-Type": "application/x-www-form-urlencoded"
}
where <API_KEY>
is a GitHub personal access token with permissions to trigger
workflows. (I have it set to “Read access to actions variables, metadata, and
secrets” and “Read and Write access to actions”)
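To make this concrete, here is a rough sketch of that request in Python (a hypothetical helper, assuming the requests library and the token in an environment variable; it’s not part of the repo, which triggers this from JavaScript and a Chrome extension as described later):

```python
import os
import requests

# Hypothetical helper: fire the workflow_dispatch event for syndicate.yml.
WORKFLOW_DISPATCH_URL = (
    "https://api.github.com/repos/j2kun/math-intersect-programming"
    "/actions/workflows/syndicate.yml/dispatches"
)


def trigger_workflow(inputs: dict | None = None) -> None:
    body = {"ref": "main"}
    if inputs:
        body["inputs"] = inputs
    response = requests.post(
        WORKFLOW_DISPATCH_URL,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TRIGGER_ACTION_PAT']}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        json=body,
    )
    response.raise_for_status()  # GitHub responds 204 No Content on success
```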
I will show how this is set up for the various instances in which it’s invoked
in the later sections.
As a bonus, workflow_dispatch
gives a nice button in the GitHub Actions UI
to trigger the workflow, which is helpful for debugging.
Once I can trigger a workflow remotely,
the rest of the workflow involves
checking out the repo, installing the script’s dependencies,
running a script that does whatever the action is supposed to do,
and committing and pushing any changes to main.
Running the script is simple, but this was the first time
I tried actually pushing commits from a GH Actions workflow.
To do this, I set up a GITHUB_TOKEN
secret that has write permissions,
and then each script has a set of steps that looks like this:
- name: Run my_script.py
run: |
python -m scripts.my_script
env:
SOME_SECRET: ${{ secrets.SOME_SECRET }}
- name: Commit changes
run: |
git config --local user.name ${{ github.actor }}
git config --local user.email "${{ github.actor }}@users.noreply.github.com"
test -z "$(git status --porcelain)" || git add <FILE_THAT_COULD_BE_CHANGED>
test -z "$(git status --porcelain)" || git commit -m "<SOME MESSAGE>"
- name: Push changes
uses: ad-m/github-push-action@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
branch: main
The git status --porcelain
command returns an empty string if there are no changes to commit,
and test -z
returns true if the argument string is empty,
so this skips the git add
and git commit
commands if there is nothing to commit.
This boilerplate is repeated often enough in the next few sections that I will omit it and just show the script that is being run.
Social media syndication and the “shortform” section
The big one is social media syndication. Here I do a few things:
- For long articles, post a single social media post with the title of the article and a link to it.
- For a new special “shortform” section of the blog, split the article into posts and publish them on each social media platform as a thread.
- For each syndicated post, save the URL of the syndicated post in a text file (“database”), so I can know it’s been processed in the future.
- For each syndicated post, add a link at the end of the blog post pointing to the syndicated copy, so readers can “discover” that I syndicate stuff and see any associated discussion.
Each social media platform I want to syndicate to gets its own script,
but they all have a common structure.
Each has a flat file stored in the repository
where each line of the file contains a pair of the blog post URL
and the syndicated post URL.
E.g., published_mastodon.txt
has as its first line
https://www.jeremykun.com/shortform/2024-05-06-1018/ https://mathstodon.xyz/@j2kun/112426452236288901
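For reference, the load/dump helpers for this flat-file format could look roughly like the following (a sketch of the idea only; the real versions are in the utils.py gist mentioned below):

```python
from pathlib import Path


def load_database(path: Path) -> dict[str, str]:
    """Parse lines of "<blog post URL> <syndicated post URL>" into a dict."""
    mapping = {}
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                blog_url, syndicated_url = line.split()
                mapping[blog_url] = syndicated_url
    return mapping


def dump_database(mapping: dict[str, str], path: Path) -> None:
    """Write the mapping back out, one pair per line."""
    path.write_text(
        "\n".join(f"{blog} {synd}" for blog, synd in mapping.items()) + "\n"
    )
```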
Then each syndication script loads the markdown files for all relevant posts from the repo, determines their canonical URL, looks them up in the database to see if they’ve been syndicated, and syndicates each one that hasn’t. Finally, it writes the database with any new entries back to disk.
I have a set of helpers in a “utils” module
that I use to parse markdown and convert articles to threads, among other things.
I copied the current utils.py
into a GitHub gist for full reference,
and will highlight some of the more relevant parts shortly.
Then there is a common syndication.py
that handles the common logic
across platforms:
import pprint
from scripts import utils as utils
def publish_single_post(
publish_callback, abspath: str, published_posts, dry_run=False, **kwargs
):
"""
publish_callback takes as input the post and any additional args passed in
kwargs, and returns the URI to the post that should be stored in the
database.
"""
print(f"Processing {abspath} as single post")
blog_post_permalink = utils.canonical_url(abspath, post_type="")
with open(abspath, "r") as infile:
post = utils.title_and_link_as_post(infile.read(), blog_post_permalink)
# a debug print of the posts about to be posted
print(f"Printing post for {abspath}:\n----------------------")
print(f"\n{post}")
print("\n----------------------\n")
if dry_run:
print("Dry run enabled, skipping post creation")
return
print(f"Publishing post for {abspath}")
output_uri = publish_callback(post, **kwargs)
print(f"Successfully posted post {output_uri}")
published_posts[blog_post_permalink] = output_uri
def publish_thread(
publish_callback,
post_adjuster,
abspath: str,
published_posts,
dry_run=False,
**kwargs,
):
"""
publish_callback takes as input the list of posts and any additional args
passed in kwargs, and returns the URI to the root post that should be
stored in the database.
post_adjuster takes as input the list of posts and returns the list of
posts that should be published, adjusting the length and/or splitting them
as needed for the platform.
"""
print(f"Processing {abspath} as thread")
blog_post_permalink = utils.canonical_url(abspath)
convert_math = kwargs.pop("convert_math", True)
with open(abspath, "r") as infile:
posts = utils.convert_post_to_thread(
infile.read(),
blog_post_permalink,
convert_math=convert_math,
)
posts = post_adjuster(posts, blog_post_permalink=blog_post_permalink, **kwargs)
print(f"Printing post thread for {abspath}:\n----------------------")
for i, post in enumerate(posts):
print(f"\n{i}.\t{post}")
print("\n----------------------\n")
if dry_run:
print("Dry run enabled, skipping post creation")
return
print(f"Publishing post thread for {abspath}")
root_uri = publish_callback(posts, **kwargs)
published_posts[blog_post_permalink] = root_uri
def syndicate_to_service(
name: str,
database_filepath,
thread_publisher,
thread_adjuster,
post_publisher,
since_days=1,
dry_run=False,
**kwargs,
):
print("Syndicating to", name)
# dict mapping Blog URL to first post url in published thread.
git_root = utils.get_git_root()
database_path = git_root / database_filepath
published_posts = utils.load_database(database_path)
print(f"Existing {name} posts from {database_path}:")
pprint.pp(published_posts)
posts_to_try = utils.get_blog_posts()
posts_to_publish = utils.get_posts_without_mapping(
posts_to_try, published_posts, since_days=since_days
)
try:
for abspath in posts_to_publish["shortform"]:
publish_thread(
thread_publisher,
thread_adjuster,
abspath,
published_posts,
dry_run=dry_run,
**kwargs,
)
for abspath in posts_to_publish["posts"]:
publish_single_post(
post_publisher,
abspath,
published_posts,
dry_run=dry_run,
**kwargs,
)
finally:
print("Writing successful post URLs to disk")
utils.dump_database(published_posts, database_path)
The main part is syndicate_to_service,
which loads the database file, extracts the posts,
finds the ones that haven’t been syndicated yet,
and then decides whether to publish a thread or a single post.
The publish_thread
function is the tricky part.
It takes two callbacks,
publish_callback and post_adjuster,
which respectively implement the platform-specific ways
to publish all the posts in a thread
and to “adjust” a thread to fit the platform’s constraints.
By “adjust” I mean that platforms differ
in the allowed length of a post,
as well as how to count the character contributions
of links and such.
It’s also responsible for adding the link back to the original post.
The slightly devious part is that the post_adjuster
and publish_callback
receive **kwargs
passed through from syndicate_to_service,
which allow the original caller to propagate platform-specific options
(mainly the client for the platform’s API).
The Mastodon client is the simplest:
import os
import fire
from mastodon import Mastodon
from scripts import syndication as syndication
# A simple text file with two urls per line
DATABASE_FILE = "scripts/published_mastodon.txt"
def mastodon_post_publisher(post: str, mastodon_client=None, **kwargs):
if not mastodon_client:
raise ValueError("mastodon_client must be provided")
status_dict = mastodon_client.status_post(post)
return status_dict['url']
def mastodon_thread_adjuster(posts, blog_post_permalink=None, **kwargs):
if not blog_post_permalink:
raise ValueError("blog_post_permalink must be provided")
posts[0] += f"\n\nArchived at: {blog_post_permalink}"
return posts
def mastodon_thread_publisher(posts, mastodon_client=None, **kwargs):
if not mastodon_client:
raise ValueError("mastodon_client must be provided")
toots_for_post = []
for i, toot in enumerate(posts):
reply_id = toots_for_post[-1]["id"] if len(toots_for_post) > 0 else None
status_dict = mastodon_client.status_post(toot, in_reply_to_id=reply_id)
print(
f"Successfully posted toot {i} of the thread: "
f"{status_dict['id']} -> {status_dict['url']}"
)
toots_for_post.append(status_dict)
return toots_for_post[0]["url"]
def publish_to_mastodon(since_days=1, dry_run=False):
"""Idempotently publish shortform and regular posts to mastodon."""
# File generated by scripts/login_with_mastodon.py or else set in
# environment for headless usage in GH actions.
mastodon_client = Mastodon(
api_base_url="https://mathstodon.xyz",
access_token=os.getenv(
"MASTODON_TOKEN", "scripts/jeremykun_tootbot_usercred.secret"
),
)
syndication.syndicate_to_service(
"mastodon",
database_filepath=DATABASE_FILE,
thread_publisher=mastodon_thread_publisher,
thread_adjuster=mastodon_thread_adjuster,
post_publisher=mastodon_post_publisher,
since_days=since_days,
dry_run=dry_run,
mastodon_client=mastodon_client,
)
if __name__ == "__main__":
fire.Fire(publish_to_mastodon)
Twitter is similar,
but requires more “thread adjusting” because of the shorter character limits,
and I have a split_post
function that handles that.
It tries to split posts at sentence boundaries close to the character limit,
then at comma boundaries.
import re
from collections import deque
from itertools import zip_longest

def split_post(post, max_char_len=300):
if len(post) < max_char_len:
return [post]
# weird because re.split keeps the separators as list items
# re_joined rejoins them together
re_split = [p.strip() for p in re.split(r"(\. |, )", post)]
re_joined = [
i + j for i, j in zip_longest(re_split[::2], re_split[1::2], fillvalue="")
]
subposts = deque(re_joined)
for subpost in subposts:
if len(subpost) > max_char_len:
raise ValueError(f"Sentence is too long: {subpost}")
accumulated_subposts = []
while subposts:
next_subpost = subposts.popleft()
if not accumulated_subposts:
accumulated_subposts.append(next_subpost)
continue
merged = accumulated_subposts[-1] + " " + next_subpost
if len(merged) > max_char_len:
accumulated_subposts.append(next_subpost)
else:
accumulated_subposts[-1] = merged
return accumulated_subposts
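As a quick illustration (on hypothetical input, not a test from the repo), a long paragraph comes back as a handful of chunks, each under the limit:

```python
# Fifteen identical sentences make a roughly 730-character paragraph.
text = ". ".join(["This sentence pads out an overly long paragraph"] * 15) + "."
for i, chunk in enumerate(split_post(text, max_char_len=280)):
    print(i, len(chunk))  # each chunk is at most 280 characters
```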
And Bluesky is the weirdest,
because the API is more complicated (create_strong_ref??),
mainly because you have to provide things like links
in terms of something called “facets,”
which have a lot of extra structure.
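For a sense of what that structure looks like, here is a hand-built sketch of a post record with a single link facet, following the app.bsky.richtext.facet schema (my actual script goes through the client library rather than raw dicts):

```python
def link_facet(post_text: str, url: str) -> dict:
    # Facets locate the link by *byte* offsets into the UTF-8 encoded text.
    encoded = post_text.encode("utf-8")
    start = encoded.find(url.encode("utf-8"))
    return {
        "index": {"byteStart": start, "byteEnd": start + len(url.encode("utf-8"))},
        "features": [{"$type": "app.bsky.richtext.facet#link", "uri": url}],
    }


text = "Archived at: https://www.jeremykun.com/shortform/2024-05-06-1018/"
record = {
    "$type": "app.bsky.feed.post",
    "text": text,
    "facets": [link_facet(text, "https://www.jeremykun.com/shortform/2024-05-06-1018/")],
    "createdAt": "2024-05-06T10:18:00Z",
}
```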
Links to syndicated versions at the end of each post
The GH Actions workflow runs the three scripts sequentially,
making a separate commit for each database update,
and then at the end runs a final script add_links_on_posts
that adds the syndication links to the end of each article.
The script itself is quite simple, except for Bluesky:
from scripts import utils as utils
SYNDICATION_FILES = {
"mastodon": "scripts/published_mastodon.txt",
"twitter": "scripts/published_twitter.txt",
"bluesky": "scripts/published_bluesky.txt",
}
def add_links_on_posts():
git_root = utils.get_git_root()
shortform_path = git_root / "content" / "shortform"
for service, file in SYNDICATION_FILES.items():
print(f"Adding {service} links to shortform posts.")
database_path = git_root / file
syndicated_posts = utils.load_database(database_path)
for blog_url, syndicated_url in syndicated_posts.items():
# bluesky is weird, have to transform
#
# at://did:plc:6st2p3o4niwz5olbxkuimxlk/app.bsky.feed.post/3ksggi2tfnk2t
#
# to
#
# https://bsky.app/profile/jeremykun.com/post/3ksggi2tfnk2t
#
if service == "bluesky":
key = syndicated_url.strip("/").split("/")[-1]
syndicated_url = f"https://bsky.app/profile/jeremykun.com/post/{key}"
blog_filename = blog_url.strip("/").split("/")[-1] + ".md"
post_path = shortform_path / blog_filename
with open(post_path, "r") as infile:
post_lines = infile.readlines()
output = utils.add_link(post_lines, f"{service}_url", syndicated_url)
with open(post_path, "w") as outfile:
outfile.write(output)
if __name__ == "__main__":
add_links_on_posts()
The add_link
function (at the end of utils.py)
hides how the actual link is displayed.
What I do is put the link in the yaml frontmatter of the markdown file,
and then add some Hugo templating to display it if it’s present.
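A sketch of the idea behind add_link (not the real implementation, which is at the end of the gist): splice a new key into the frontmatter, just before its closing --- delimiter, and skip posts that already have it.

```python
def add_link(post_lines: list[str], key: str, url: str) -> str:
    """Sketch: insert "<key>: <url>" into the YAML frontmatter of a post."""
    if any(line.startswith(f"{key}:") for line in post_lines):
        return "".join(post_lines)  # link already recorded, nothing to do
    closing = post_lines.index("---\n", 1)  # frontmatter ends at the second ---
    return "".join(post_lines[:closing] + [f"{key}: {url}\n"] + post_lines[closing:])
```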
For example, the script would change the yaml frontmatter to look like this:
# content/shortform/2024-05-06-1018.md
---
title: "Remez and function approximations"
date: 2024-05-06T10:18:29-07:00
shortform: true
...
mastodon_url: https://mathstodon.xyz/@j2kun/112426452236288901
twitter_url: https://x.com/jeremyjkun/status/1798788845896339460
---
And then in my themes/paperesque/layouts/partials/single-article.html
template I have the following code to display the links:
{{ if .Params.shortform }}
<p>This article is syndicated on:</p>
<ul>
{{ if .Params.mastodon_url }}
<li><a href="{{ .Params.mastodon_url }}" rel="syndication">Mastodon</a></li>
{{ end }}
{{ if .Params.bluesky_url }}
<li><a href="{{ .Params.bluesky_url }}" rel="syndication">Bluesky</a></li>
{{ end }}
{{ if .Params.twitter_url }}
<li><a href="{{ .Params.twitter_url }}" rel="syndication">Twitter</a></li>
{{ end }}
</ul>
{{ end }}
Warning for a too-long first paragraph
The first paragraph of a shortform post becomes the first post in a social media thread. I want to make sure that first post is not awkwardly cut into pieces, so one thing I do is bake a rendering error into my Hugo templates that says, “shortform articles can’t have a first paragraph that’s too long.”
This gets put into single-article.html before {{ .Content }}:
{{ if .Params.shortform }}
{{ $maxLength := 299 }}
{{ $collapsed := (strings.Replace .Content "\n" " ")}}
{{ $firstParagraph := index (index (strings.FindRESubmatch `<p.*?>(.*?)</p>` $collapsed 1) 0) 1}}
{{ if gt (len $firstParagraph) $maxLength }}
{{ errorf "The length of the first paragraph is %d, exceeds %d characters.\n\nFirst paragraph is:\n\n%s\n\n" (len $firstParagraph) $maxLength $firstParagraph }}
{{ end }}
{{ end }}
Triggering this workflow automatically after deployment
Netlify supports webhooks after a build is completed,
and you can configure this by adding a netlify/functions/deploy-succeeded.mjs
file to the repo. This file contains some JavaScript to fire a POST request
with some secrets that you have to give to Netlify.
import fetch from "node-fetch";
const syndicate_url = 'https://api.github.com/repos/j2kun/math-intersect-programming/actions/workflows/syndicate.yml/dispatches';
export default async (req, context) => {
const apiKey = Netlify.env.get("GITHUB_TRIGGER_ACTION_PAT");
if (apiKey == null) {
return new Response("Need env GITHUB_TRIGGER_ACTION_PAT", { status: 401 });
}
const response = await fetch(syndicate_url, {
method: 'POST',
headers: {
'Accept': 'application/vnd.github+json',
'Authorization': 'Bearer ' + apiKey,
'X-GitHub-Api-Version': '2022-11-28',
'Content-Type': 'application/x-www-form-urlencoded'
},
body: '{"ref":"main"}'
});
if (response.ok) {
return new Response("Successfully triggered action", { status: 200 });
} else {
return new Response("Failed to trigger action", { status: 500 });
}
};
So with this, pushing a new article to main kicks off the Netlify build, which publishes the new post and then calls the syndication script. The script syndicates the posts and further adds new content (the syndication links), and when that is committed, Netlify deploys again.
Blogroll (the “What I’m Reading” page)
The next most complicated thing I implemented is a twist on the idea of a blogroll. I don’t like blogrolls because they typically get stale. Most blogrolls point to dead blogs. Instead, I wanted to make a blogroll that points to specific articles I enjoyed, with some sense of recency so that people who visit my homepage get a sense of where my head is at these days. I also wanted to have an easy way to add to the list without having to fire up a text editor.
So here’s how I did that, culminating in my “What I’m Reading” page and a homepage sidebar that shows the latest three entries I added to the blogroll.
Internally, it’s driven by a flat file called blogroll.txt
that contains entries with a URL and title on alternating lines:
https://vickiboykis.com/2024/09/19/dead-internet-souls/
Dead Internet Souls
https://blog.nelhage.com/post/fuzzy-dedup/
Finding near-duplicates with Jaccard similarity and MinHash
https://mathenchant.wordpress.com/2023/09/18/when-five-isnt-prime/
When Five Isn’t Prime
...
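Reading that file back is just a matter of pairing up alternating lines; a sketch (the real code, linked in a gist below, is kludgier):

```python
def load_blogroll(path: str) -> list[tuple[str, str]]:
    """Sketch: return (url, title) pairs in the order they were added."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return list(zip(lines[::2], lines[1::2]))
```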
Then there is a script add_url_to_blogroll.py
that takes a URL as a CLI input,
updates the blogroll.txt
file,
and then updates the two template pages that have the blogroll data in them.
The hard part here is actually determining from the URL what the page title is.
The <title>
tag often has extra stuff in it (like the title of the blog).
The <meta property="og:title">
tag is often more reliable, but not always present.
Otherwise I will typically dig through <h1>
tags and use hints about the CSS class names
or parent tags
to figure out if it’s the actual title, but it breaks in stupid ways.
For one, IACR preprint pages
will detect you’re not using javascript, and then
insert an <h1>
that says “What a lovely hat”.
Thankfully IACR also uses og:title
, but still, don’t use h1
for this.
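The fallback chain boils down to something like this (a simplified sketch assuming BeautifulSoup; the real version has more special cases and heuristics):

```python
import requests
from bs4 import BeautifulSoup


def guess_title(url: str) -> str:
    # Prefer og:title, fall back to <title>, then to the first <h1>.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    og = soup.find("meta", property="og:title")
    if og and og.get("content"):
        return og["content"].strip()
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else url
```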
Once the database is updated, the script updates content/blogroll/_index.md
with a simple blogroll shortcode whose url, title, and domain parameters
render, via the template below, as an entry like
“Data scientists work alone and that's bad (www.ethanrosenthal.com)”:
# layouts/shortcodes/blogroll.html
<a class="u-like-of" href="{{ .Get "url" }}" target="_blank">{{ .Get "title" }}</a> ({{ .Get "domain" }})
And it updates some more manual HTML in layouts/partials/sidebar.html
so that it looks like this:
<div class="sidebarblogroll">
<a href="/blogroll" class="sidebar-large">What I'm Reading</a>
<ul>
<li><a href="https://www.ethanrosenthal.com/2023/01/10/data-scientists-alone/">Data scientists work alone and that's bad</a> (www.ethanrosenthal.com)</li>
<li><a href="https://billwear.github.io/art-of-attention.html">the quiet art of attention</a> (billwear.github.io)</li>
<li><a href="https://blog.jreyesr.com/posts/typst/">Exploring Typst, a new typesetting system similar to LaTeX</a> (blog.jreyesr.com)</li>
</ul>
</div>
This is all done with very manual string processing kludgery as you can see for yourself in the linked GitHub gist.
Chrome extension
To make adding articles to my blogroll easy, I made a Chrome Extension that, when clicking the extension icon, triggers the GH Action workflow to add a URL, using the URL of the current page as the parameter.
This works for any workflow, and I have it loaded into my browser
and configured to trigger my add_url_to_blogroll
workflow and script.
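Under the hood, clicking the extension icon fires the same workflow_dispatch POST shown earlier, with the current page URL passed in the inputs field of the body. In terms of the hypothetical Python helper sketched above, the equivalent call is:

```python
# Equivalent of clicking the extension on a page I want to add to the blogroll.
trigger_workflow(inputs={"url": "https://example.com/some-article-i-liked/"})
```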
My workflow looks like this. Just like all the others,
it runs the script and commits; the only difference is the workflow_dispatch
inputs block
and how that input is referenced as a CLI arg.
name: Add URL to blogroll
permissions:
contents: write
on:
workflow_dispatch:
inputs:
url:
description: 'URL to add to blogroll'
required: true
jobs:
add_url:
runs-on: ubuntu-latest
steps:
- name: Echo URL
run: echo "${{ github.event.inputs.url }}"
- uses: actions/checkout@v4
- name: Set up Python 3.11
uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
# Run script
- name: Run add_url_to_blogroll.py
run: |
python -m scripts.add_url_to_blogroll --url="${{ github.event.inputs.url }}"
- name: Commit changes to blogroll pages
run: |
git config --local user.name ${{ github.actor }}
git config --local user.email "${{ github.actor }}@users.noreply.github.com"
test -z "$(git status --porcelain)" || git add scripts/blogroll.txt content/blogroll/_index.md layouts/partials/sidebar.html
test -z "$(git status --porcelain)" || git commit -m "blogroll: add ${{ github.event.inputs.url }}"
- name: Push changes
uses: ad-m/github-push-action@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
branch: ${{ github.ref }}
Webmentions and referencing external discussion threads
Next we have webmentions. I start by adding https://webmention.io support
via the layouts/partials/webmentions.html
containing
<link rel="webmention" href="https://webmention.io/www.jeremykun.com/webmention" />
And added {{ partial "webmentions" . }}
to the header section of
themes/paperesque/layouts/_default/baseof.html.
This allows people who want to post webmentions to find the right endpoint.
Finally, I added a webmention.js
to static/
which contains a minified version of my fork of the webmention.js
project.
This queries webmention.io
for my webmentions at page load time.
I forked it basically because I didn’t like how the webmentions were displayed by default,
so I hacked it up to make everything look like the “comment” style,
and put special cases
for Hacker News and Reddit.
I ran the minifier from the README and copied the output to my static directory, and it was easy to tweak it to my liking.
Bridgy
Bridgy is a nice service that lets you find folks linking to your site on places like Reddit and Twitter, and it sends you webmentions.
It’s easy to set up, but it has one fatal flaw: it detects each and every post in my syndicated threads as a webmention. The dev who maintains Bridgy basically said, “yeah, that sucks.” But now, a few months later, when I test it, the behavior seems slightly different. I don’t get any webmentions from anything I post myself, but when other people favorite my posts, I get webmentions for those.
So maybe I just need some more client-side filtering or something else to get that part right. I’ll keep working on it.
At least, what’s nice here is that I don’t need to add anything to my blog, since the webmentions are handled externally.
Hacker News backlinks
My blog gets on Hacker News regularly, but Bridgy doesn’t support it.
So I added a simple script that doesn’t touch my blog itself, but queries the Hacker News API for links to my blog, and then sends webmentions to my blog for each one.
The script
is relatively simple,
and runs on a schedule via GitHub Actions.
I ran it once with a huge since_days
value
to get the entire HN history queried once,
and now it just checks the last week of stories when it runs.
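The query side is roughly the following sketch, assuming the Algolia-backed HN search API at hn.algolia.com (the real script does more filtering before handing results to the webmention-sending util):

```python
import time

import requests


def recent_hn_stories(domain: str, since_days: int = 7) -> list[tuple[str, str]]:
    """Sketch: find recent HN stories linking to the domain.

    Returns (source, target) pairs: the HN discussion URL and the linked blog post.
    """
    cutoff = int(time.time()) - since_days * 24 * 60 * 60
    resp = requests.get(
        "https://hn.algolia.com/api/v1/search",
        params={
            "query": domain,
            "tags": "story",
            "numericFilters": f"created_at_i>{cutoff}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [
        (f"https://news.ycombinator.com/item?id={hit['objectID']}", hit["url"])
        for hit in resp.json()["hits"]
        if hit.get("url") and domain in hit["url"]
    ]
```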
This uses the very nice indieweb_utils
package
in a util file to send the webmentions.
Outgoing webmentions
To participate in the IndieWeb ecosystem, I also want to send outgoing webmentions whenever I link to someone else’s blog in an article.
I have another script, outgoing.py, which handles outgoing webmentions. In the same workflow as the one that looks for Hacker News articles, I scan my own blog for posts that haven’t been processed (with special handling for the blogroll), parse them looking for links, and send webmentions to those pages.
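The sending itself goes through the same indieweb_utils helpers as the Hacker News script, but the underlying protocol is small enough to sketch: discover the target’s webmention endpoint, then POST the source and target URLs to it (a simplified version that only checks for a rel="webmention" link, not the HTTP Link header):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def send_webmention(source: str, target: str) -> None:
    """Sketch of the webmention protocol: notify `target` that `source` links to it."""
    html = requests.get(target, timeout=30).text
    link = BeautifulSoup(html, "html.parser").find(["link", "a"], rel="webmention")
    if link is None or link.get("href") is None:
        return  # the target doesn't advertise a webmention endpoint
    endpoint = urljoin(target, link["href"])
    requests.post(endpoint, data={"source": source, "target": target}, timeout=30)
```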
That said, I find very few webmentions are actually sent. I may be doing something wrong, or maybe I’m just not linking to enough IndieWeb people.
DOIs
Rogue Scholar is a service that provides DOIs for articles on scientific blogs. It automates the whole process of getting DOIs and provides an API for querying them.
So I have a script, again run in a workflow on a schedule, that queries Rogue Scholar for DOIs for my blog posts and, similarly to the syndication script, adds them to the frontmatter of posts and renders them in the template if present.
{{ if .Params.doi }}
<p>DOI: <a href="{{ .Params.doi }}" rel="doi">{{ .Params.doi }}</a></p>
{{ end }}
Dead link checker
I have yet another workflow that runs the Lychee dead link checker on my blog twice a month.
Here’s the workflow:
name: Check for dead links
on:
repository_dispatch:
workflow_dispatch:
schedule:
- cron: "0 6 7,21 * *"
jobs:
linkChecker:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Restore lychee cache
uses: actions/cache@v4
with:
path: .lycheecache
key: cache-lychee-${{ github.sha }}
restore-keys: cache-lychee-
# check for existing issue
- name: Find Link Checker Issue
  id: link-checker-issue
  uses: micalevisk/last-issue-action@v2
with:
state: open
labels: |
link-checker
- name: Setup Hugo
uses: peaceiris/actions-hugo@v3
with:
hugo-version: '0.119.0'
extended: true
- name: Build
run: hugo
- name: Link Checker
id: lychee
uses: lycheeverse/lychee-action@v1
with:
args: --accept 200,429 --verbose --max-concurrency 1 --cache --max-cache-age 7d --exclude 'linkedin.com' --exclude 'fonts.googleapis.com' --exclude 'pnas.org' --exclude 'tandfonline.com' --exclude 'ogp.me' --exclude 'fonts.gstatic.com' --exclude 'dl.acm.org/doi' --exclude 'sciencemag.org' --exclude 'web.archive.org' --exclude 'doi.org' --exclude 'gmplib.org' --exclude 'github.com/j2kun/mlir-tutorial' -r 5 -t 50 --archive wayback --suggest --base https://www.jeremykun.com public
- name: Update Issue
uses: peter-evans/create-issue-from-file@v5
if: env.lychee_exit_code != 0
with:
title: Broken links detected in docs 🔗
issue-number: "${{ steps.link-checker-issue.outputs.issue_number }}"
content-filepath: ./lychee/out.md
token: ${{secrets.GITHUB_TOKEN}}
labels: |
link-checker
Note: there are so many --exclude
flags because the tool just fails to handle a bunch of domains
in ways I can’t quite understand. So for sites that seem like they have longevity
(mainly journal websites that were constantly giving false positives),
I just ignore the failures.
The last step creates an issue, so I get an email when stuff breaks.