Changelog: Updated Plain Text Format and Added Additional Markdown Support

Last week I deployed a change to how I generate plain text versions of content on my website. This week I changed it again and moved additional post types to Markdown as their editing and storage format.

I deployed some changes to my website. I updated the plain text rendering of my content and moved additional post types to using Markdown as their editing and storage format.

Updated Plain Text Format

Last week I deployed a change to how I generated plain text content by using the microformats2 spec for parsing embedded content into a value property.

That quickly showed its limitations: meaningful white space was being lost in the Markdown -> HTML -> plain text conversion. The plainest offender was lists. Markdown like

1. Item one
2. Item two

got rendered as

Item one Item two
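
To make the failure concrete, here is a small sketch of how flattening HTML to its text nodes and collapsing white space erases list structure. It is a simplified illustration using only the Python standard library, not the microformats2 algorithm itself. The TextCollector class, the flatten helper, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten(html):
    # Hypothetical helper: join all text nodes, then collapse runs of white space.
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.chunks).split())


# Roughly the HTML a Markdown renderer produces for the two-item list.
list_html = "<ol>\n<li>Item one</li>\n<li>Item two</li>\n</ol>"
print(flatten(list_html))  # prints "Item one Item two"
```

The numbering lives in the ol and li tags rather than in any text node, so once the tags are gone there is nothing left to rebuild it from.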

After renewed internet floundering and continuing disbelief that there’s not a package for rendering markdown or HTML to an “aesthetic” plain text format, I’m trying a new approach.

Content is stored as Markdown, which may also have HTML elements. So first the content is converted to HTML. This gets everything into the same format. I’m using mistune for this.

html = mistune.html(input)

Then the HTML is processed with Beautiful Soup to apply the microformats2 parsing steps of dropping script, style, and template elements and replacing img elements with their alt or src attribute values.

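from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Drop elements whose contents should never appear in the text.
for el in soup.find_all(["script", "style", "template"]):
    el.decompose()

# Replace each image with its alt text, falling back to its src.
for el in soup.find_all("img"):
    alt = el.get("alt")
    if alt is not None:
        el.replace_with(" " + alt + " ")
        continue

    src = el.get("src")
    if src is not None:
        el.replace_with(" " + src + " ")
        continue

    # Neither alt nor src: there is nothing useful to keep.
    el.decompose()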
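
As a quick check of this step, the sketch below wraps the same logic in a function and runs it on a small sample. The function name strip_and_replace and the sample HTML are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup


def strip_and_replace(html):
    # Hypothetical wrapper around the cleanup step described above.
    soup = BeautifulSoup(html, "html.parser")

    for el in soup.find_all(["script", "style", "template"]):
        el.decompose()

    for el in soup.find_all("img"):
        # Prefer alt, fall back to src, and drop the image if it has neither.
        replacement = el.get("alt")
        if replacement is None:
            replacement = el.get("src")
        if replacement is None:
            el.decompose()
        else:
            el.replace_with(" " + replacement + " ")

    return str(soup)


sample = '<p>A photo:<img src="/cat.jpg" alt="my cat"></p><script>track()</script>'
print(strip_and_replace(sample))  # prints "<p>A photo: my cat </p>"
```

The padding spaces keep the replacement text from running into neighboring words. Any doubled spaces that result are left for a later step to tidy up.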

Then I convert link (a) elements to strings. I think Markdown link syntax is too “markup language”-y. I’m opting for the href in parentheses after the element text.

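for el in soup.find_all("a"):
    href = el.get("href")
    text = el.get_text()

    # Append the href, unless the link text already is the URL.
    if href is not None and href != text:
        el.append(" (" + href + ")")

    # Remove the a tag itself but keep its contents in place.
    el.unwrap()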
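
Here is a small, self-contained sketch of the link step in action. The inline_links name and the example.com URLs are placeholders for illustration.

```python
from bs4 import BeautifulSoup


def inline_links(html):
    # Hypothetical wrapper around the link step described above.
    soup = BeautifulSoup(html, "html.parser")

    for el in soup.find_all("a"):
        href = el.get("href")
        # Skip the suffix when the visible text is already the URL.
        if href is not None and href != el.get_text():
            el.append(" (" + href + ")")
        el.unwrap()

    return str(soup)


print(inline_links('<p>Read <a href="https://example.com/post">my post</a>.</p>'))
# prints "<p>Read my post (https://example.com/post).</p>"

print(inline_links('<a href="https://example.com">https://example.com</a>'))
# prints "https://example.com"
```

Comparing href to the link text avoids output like "https://example.com (https://example.com)" for bare URLs.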

And finally I change any header tags to paragraph tags. I’m fine if they end up similar to paragraphs in the output.

for el in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    el.name = "p"

This massaged HTML then gets converted back to Markdown with html2text. Crucially, this strips out HTML elements that are not supported in Markdown and takes care of white space, list numbering, and other details that make the output look good. I’m using several options to get the look I want.

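import html2text

html = str(soup)

parser = html2text.HTML2Text()
parser.strong_mark = "*"       # bold as *text* rather than **text**
parser.unicode_snob = True     # keep Unicode characters instead of ASCII stand-ins
parser.body_width = 0          # do not hard-wrap lines
parser.wrap_links = True
parser.images_to_alt = True    # any remaining images become their alt text
parser.ul_item_mark = "-"      # bullet lists use "-"

return parser.handle(html)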

This does mean that the output is no longer correct Markdown, but that’s exactly what I want. I want good looking plain text.

Additional Post Types Updated to Markdown Storage

I also got Notes, Photos, and Posts moved over to using Markdown as the storage format. That leaves Reposts as the only type yet to move over, but I hardly use that one. I’m going to give what I have some time to see how it does.