HTML All the Things

I’m thinking about converting all of the fields on my website to store HTML, and then converting to alternative formats as needed for the destination. But I still have some unanswered questions.

When I first set up this site, I had some fields that took HTML and some fields that took plain text. For example, for articles (like this one), the Title and Summary fields were plain text and the Content field was HTML.

The idea was that Title and Summary would be used for syndication to Twitter and Mastodon, so plain text was a better fit. But for use on my site, I was missing the ability to usefully mark up this content. So I converted several fields to accept Markdown, thinking that Markdown makes an ok plain text output and can be converted to HTML for uses where that’s available.

Now I’m feeling the limitations of pure Markdown fields, and I’m thinking everything should be stored as HTML and converted to plain text as needed for the destination. In theory, HTML is valid in Markdown, so I could allow all fields to accept Markdown and convert to HTML on save. Then I could still author with the simpler Markdown syntax and drop into HTML where I need it, and the stored value would always be HTML.

For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.

A couple of unresolved questions still stand: what’s a good Python library for converting HTML to plain text, and how do I handle character limits for specific fields?

For converting to plain text, it’s not just a matter of stripping tags. I’d like Markdown-ish syntax for things like strong and em tags, plus special formatting for cite tags. And I’d like paragraphs to be separated by double spaces. I’m hopeful that a package exists that handles the conversion and some customizations, but I haven’t found it yet.
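In the meantime, a rough sketch of this kind of conversion is possible with only the standard library’s `html.parser`. Everything specific here is an assumption for illustration: the `PlainTextRenderer` and `html_to_text` names, the choice of `_` as the marker for cite tags, the list of block tags, and a blank line as the paragraph separator (exposed as a parameter so it can be changed).

```python
from html.parser import HTMLParser

# Inline tags mapped to the marker wrapped around their text.
# The cite marker is an arbitrary choice for illustration.
INLINE_MARKERS = {"strong": "**", "b": "**", "em": "*", "i": "*", "cite": "_"}

# Tags treated as paragraph boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "blockquote", "li", "h1", "h2", "h3"}


class PlainTextRenderer(HTMLParser):
    """Hypothetical renderer: HTML in, Markdown-ish plain text out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished paragraphs
        self.current = []  # pieces of the paragraph being built

    def flush(self):
        # Collapse runs of whitespace and store the paragraph if non-empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "br":
            self.current.append(" ")  # treat a line break as a space
        elif tag in INLINE_MARKERS:
            self.current.append(INLINE_MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag in INLINE_MARKERS:
            self.current.append(INLINE_MARKERS[tag])

    def handle_data(self, data):
        self.current.append(data)


def html_to_text(html, paragraph_sep="\n\n"):
    renderer = PlainTextRenderer()
    renderer.feed(html)
    renderer.close()
    renderer.flush()  # pick up any trailing text outside a block tag
    return paragraph_sep.join(renderer.blocks)


print(html_to_text("<p>Read <cite>Dune</cite>, it is <strong>very</strong> good.</p>"))
```

Tags not listed in either table (links, spans, and so on) are dropped and only their text is kept, so this is a starting point for customization and not a replacement for a real conversion package.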

For handling character limits, I don’t have an answer yet. I can’t rely on database configuration for this, because the field has to hold the HTML, which is longer than the text it represents. I’d want to enforce character limits on the plain text version instead.
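One possible shape for this is a validation step at save time that measures the plain text and not the stored HTML. The sketch below is only that: `FIELD_LIMITS`, its numbers, and the function names are hypothetical, and it measures length by simply stripping tags. If the syndicated version adds Markdown-ish markers, the limit should be checked against that rendering instead.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text content and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text_length(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so markup indentation does not count.
    return len(" ".join("".join(parser.parts).split()))


# Hypothetical per-field limits, counted in plain text characters.
FIELD_LIMITS = {"title": 100, "summary": 280}


def validate_field(name, html):
    """Raise ValueError if the field's plain text exceeds its limit."""
    limit = FIELD_LIMITS.get(name)
    if limit is None:
        return  # no limit configured for this field
    length = plain_text_length(html)
    if length > limit:
        raise ValueError(
            f"{name} is {length} characters as plain text; the limit is {limit}"
        )


validate_field("title", "<em>HTML</em> All the Things")  # passes silently
```

The database column can then stay unbounded (or generously sized) for the HTML, with the limit living in application code next to the conversion.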

I’m still thinking this one through, but I think making this conversion would give me more flexibility on the authoring, presenting, and syndication sides of things.