In February, I moved my Bookmark/Link content to Markdown (CommonMark) as the canonical stored format. Today, I updated that implementation.
My Bookmarks have three “rich” fields: Title, Quote, and Commentary. Title should only allow inline elements, while Quote and Commentary can contain any HTML. For all of these fields, I wanted to store Markdown and be able to retrieve HTML and plain text versions.
The first attempt
I wanted my plain text version to be somewhat pleasant to look at, kind of like raw markdown but with some modifications. I never found a presentation format that satisfied me, so I went with pandoc/pypandoc and a custom filter.
This turned out to be noticeably slow when parsing on the fly for multiple pieces of content on a single page, so I begrudgingly opted to parse the markdown when it’s saved and store each output format in the database.
Unfortunately, my custom filter still didn’t satisfy me. One of the things I wanted was to convert anchor elements from <a href="someurl">link text</a> to link text (someurl). I also wanted cite elements rendered wrapped with underscore characters, but sometimes this resulted in output like Started _East of Eden (https://en.wikipedia.org/wiki/East_of_Eden_(novel))_ instead of the preferred Started _East of Eden_ (https://en.wikipedia.org/wiki/East_of_Eden_(novel)).
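To make the desired output concrete, here is a minimal sketch of that conversion. It is not the pandoc filter I was using: it swaps in Python's standard-library html.parser, and the names ReadableText and readable_text are made up for illustration. The idea is to hold back a link's URL while inside a cite element and only emit it after the closing underscore.

```python
from html.parser import HTMLParser


class ReadableText(HTMLParser):
    """Hypothetical helper: render links as "text (url)" and cite as _text_."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.hrefs = []      # hrefs of currently open <a> elements
        self.cite_depth = 0  # > 0 while inside a <cite>
        self.pending = []    # URLs waiting for the enclosing <cite> to close

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))
        elif tag == "cite":
            self.cite_depth += 1
            self.out.append("_")

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href and self.cite_depth:
                # Defer the URL so it lands outside the underscores.
                self.pending.append(href)
            elif href:
                self.out.append(f" ({href})")
        elif tag == "cite" and self.cite_depth:
            self.cite_depth -= 1
            self.out.append("_")
            if self.cite_depth == 0:
                self.out.extend(f" ({url})" for url in self.pending)
                self.pending.clear()

    def handle_data(self, data):
        self.out.append(data)


def readable_text(html):
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

This only covers the two cases described above; anything else (emphasis, lists, blockquotes) would pass through as bare text.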
Over the months, I got tired of these little errors to the point where they were affecting what I was writing.
My current attempt
I decided to largely give up on being particular about this output. Microformats2 has an algorithm for rendering embedded content to a value property:
value: the textContent of the element after:
- dropping any nested <script> & <style> elements;
- replacing any nested <img> elements with their alt attribute, if present; otherwise their src attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative;
- removing all leading/trailing spaces
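The quoted steps can be sketched with only the Python standard library. This is an illustration of the rules as quoted, not mf2py's implementation; TextContentParser, plain_text, and the base_url parameter are names invented for this example, and I am reading "adding a space at the beginning and end" as applying to the src fallback.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextContentParser(HTMLParser):
    """Hypothetical helper that collects text following the quoted rules."""

    def __init__(self, base_url=""):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "img" and self.skip_depth == 0:
            attrs = dict(attrs)
            if attrs.get("alt") is not None:
                # Prefer the alt text when it is present.
                self.parts.append(attrs["alt"])
            elif attrs.get("src") is not None:
                # Otherwise use the resolved src, padded with spaces.
                url = urljoin(self.base_url, attrs["src"])
                self.parts.append(f" {url} ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Drop anything inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def plain_text(html, base_url=""):
    parser = TextContentParser(base_url)
    parser.feed(html)
    parser.close()
    # Final step: remove leading/trailing whitespace.
    return "".join(parser.parts).strip()
```

For example, plain_text('<p>Hi <img alt="cat" src="c.png"> there<style>p{}</style></p>') gives "Hi cat there": the alt text replaces the image and the style contents are dropped.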
This is nowhere close to my original desire, but it’s at least predictable. I’m already using mf2py. It isn’t part of its documented API, but dom_helpers has a get_textContent method that does this conversion.
Since I was no longer using pandoc for the plain text, I also replaced it for the markdown-to-HTML conversion with mistune, which I was already using elsewhere.
These changes sped up the just-in-time parsing to the point where I don’t need to pre-compile the HTML and plain-text output anymore. I never liked having content that had the potential to get out of sync.
While I was there: updated plain-text-count custom web component
While I was working on that, I also added some features to the plain-text-count component I added in January. I didn’t like having to save and reload other pages to see the rendered output from my markup content.
The endpoint I had configured for the character count was already returning the rendered output content, so I updated the component to include the rendered output alongside character counts for each one.
I plan to live with this iteration for a while, but my end goal is to move all of the post types to this kind of markdown authored/stored format with HTML and plain text output formats.