I deployed some changes to my website. I updated the plain text rendering of my content and moved additional post types to using Markdown as their editing and storage format.
Updated Plain Text Format
Last week I deployed a change to how I generate plain text content, using the microformats2 spec for parsing embedded content into a value property.
That quickly proved its limitations: meaningful white space was being lost in the Markdown -> HTML -> plain text conversion. The most obvious offender was lists. Markdown like
1. Item one
2. Item two
got rendered as
Item one Item two
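To see why the list markers vanish, here is a rough sketch using only the standard library. It is not the microformats2 algorithm, just a simplified stand-in that keeps text nodes and collapses white space. `TextOnly` and `naive_text` are names made up for this illustration, and the sample HTML is an assumption about what a Markdown renderer typically emits for that list.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def naive_text(html):
    # Hypothetical helper: join the text nodes, then collapse white space runs.
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Assumed renderer output for the two-item ordered list.
list_html = "<ol>\n<li>Item one</li>\n<li>Item two</li>\n</ol>\n"
flattened = naive_text(list_html)
print(flattened)
```

The numbering only ever existed in the `ol` and `li` tags, so once the tags are dropped nothing is left to tell the items apart.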
After renewed internet floundering and continuing disbelief that there’s not a package for rendering markdown or HTML to an “aesthetic” plain text format, I’m trying a new approach.
Content is stored as Markdown, which may also have HTML elements. So first the content is converted to HTML. This gets everything into the same format. I’m using mistune for this.
html = mistune.html(input)
Then the HTML is processed with Beautiful Soup to apply the microformats2 parsing steps of dropping script, style, and template elements and replacing img elements with their alt or src attribute values.
soup = BeautifulSoup(html, "html.parser")
for el in soup.find_all(["script", "style", "template"]):
    el.decompose()
for el in soup.find_all("img"):
    alt = el.get("alt")
    if alt is not None:
        el.replace_with(" " + alt + " ")
        continue
    src = el.get("src")
    if src is not None:
        el.replace_with(" " + src + " ")
        continue
    el.decompose()
Then I convert a elements to strings. I think Markdown link syntax is too “markup language”-y. I’m opting for the href in parentheses after the element text.
for el in soup.find_all("a"):
    href = el.get("href")
    text = el.get_text()
    if href is not None and href != text:
        el.append(" (" + href + ")")
    el.unwrap()
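As a quick check of that loop, here it is wrapped in a small function and run on two snippets. `links_to_text` is a hypothetical name used only for this example, and the URLs are placeholders.

```python
from bs4 import BeautifulSoup


def links_to_text(html):
    # Hypothetical wrapper around the loop above, for trying it on a snippet.
    soup = BeautifulSoup(html, "html.parser")
    for el in soup.find_all("a"):
        href = el.get("href")
        text = el.get_text()
        # Only append the URL when it adds something the link text lacks.
        if href is not None and href != text:
            el.append(" (" + href + ")")
        # Replace the <a> tag with its own contents.
        el.unwrap()
    return str(soup)


named = links_to_text('<p>Read <a href="https://example.com/">my site</a>.</p>')
bare = links_to_text('<p><a href="https://example.com/">https://example.com/</a></p>')
print(named)
print(bare)
```

The `href != text` check is what keeps a bare URL link from being printed twice.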
And finally I change any header tags to paragraph tags. I’m fine if they end up similar to paragraphs in the output.
for el in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    el.name = "p"
This massaged HTML then gets converted back to Markdown with html2text. Crucially, this strips any HTML elements that are not supported in Markdown and takes care of white space, list numbering, and other details that make the output look good. I’m using several options to get the look I want.
html = str(soup)
parser = html2text.HTML2Text()
parser.strong_mark = "*"
parser.unicode_snob = True
parser.body_width = 0
parser.wrap_links = True
parser.images_to_alt = True
parser.ul_item_mark = "-"
return parser.handle(html)
This does mean that the output is no longer correct Markdown, but that’s exactly what I want. I want good looking plain text.
Additional post types updated to Markdown storage
I also got Notes, Photos, and Posts moved over to using Markdown as the storage format. That leaves Reposts as the only type yet to move over, but I hardly use that one. I’m going to give what I have some time to see how it does.