Jan 27, 2016
As you may have already guessed, I’m currently in the process of converting Culttt from a WordPress blog into a Ruby on Rails web application.
I love that I’ve got this far with WordPress, but sometimes you just need to scratch your own itch. Fortunately this itch is easy to scratch for me because I’m a developer and I enjoy doing this stuff anyway.
One of the decisions I’ve made is I want to write all my articles in Markdown, and then let the application generate the HTML for me.
But the problem I face is that I’ve already got nearly 700 posts in WordPress flavoured HTML.
I need to convert my existing WordPress HTML into normal HTML and to Markdown, and I definitely do not want to do that by hand!
Fortunately as we are programmers, we can let the computer do the hard work for us. I didn’t get into this game to give myself boring and repetitive work.
In today’s tutorial I will walk you through how I implemented this conversion process.
So before I get into the actual implementation, first I will describe the problem that I faced.
I currently have nearly 700 articles that I need to migrate to the next version of Culttt. These existing posts are stored in a database as WordPress flavoured HTML.
When I say “WordPress HTML”, I mean the HTML doesn’t have p tags and I’m using WordPress tags for code blocks.
So I need to convert all of my existing articles into regular HTML to remove the WordPress flavouring.
Going forward I also want to start writing my articles in Markdown. This means I will also need to generate a Markdown version for each of the existing articles in case I need to go back and make a change.
So I’m going to need a way to convert my WordPress HTML into Markdown, and a more general purpose way of converting Markdown to HTML.
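To make the problem concrete: WordPress stores paragraphs as plain text separated by blank lines and only adds the p tags when the page is displayed. Here is a minimal sketch of what "adding the missing p tags" could look like. The helper name wrap_paragraphs and the list of block-level tags are my own assumptions for illustration, not part of the actual migration.

```ruby
# Hypothetical helper: wrap blank-line-separated chunks in <p> tags,
# leaving chunks that already start with a block-level tag untouched.
BLOCK_TAG = /\A<(?:h[1-6]|pre|ul|ol|blockquote|table|div)\b/i

def wrap_paragraphs(wordpress_html)
  wordpress_html
    .split(/\n{2,}/)      # chunks are separated by one or more blank lines
    .map(&:strip)
    .reject(&:empty?)
    .map { |chunk| chunk.match?(BLOCK_TAG) ? chunk : "<p>#{chunk}</p>" }
    .join("\n")
end

puts wrap_paragraphs("<h2>Intro</h2>\n\nFirst paragraph.\n\nSecond paragraph.")
```

This is deliberately naive: a code block that contains blank lines would be split into several chunks, which is one reason to lean on an existing library for the real conversion.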
When I need a service such as converting one format to another, I usually stick it in the lib directory.
My general rule is, if it is related to the domain of the application, it should go in app, but if it is a general purpose tool, it should go in lib.
By default the lib directory won’t be autoloaded, so we can add that path in the application.rb file under the config directory:
config.autoload_paths << Rails.root.join('lib')
Next, under the lib directory I’m going to create a render directory to group this code together under a namespace.
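For autoloading to find a class, the file name has to match the constant name. The sketch below shows roughly how a constant maps to a path under lib. The function file_path_for is a hypothetical stand-in that mimics the underscore convention from ActiveSupport so the example runs without Rails, and the resulting file names are my assumption about the layout.

```ruby
# Hypothetical helper: turn a namespaced constant name into the relative
# file path an autoloader would look for under lib/.
def file_path_for(constant_name)
  segments = constant_name.split('::').map do |part|
    part
      .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2') # HTMLWith -> HTML_With
      .gsub(/([a-z\d])([A-Z])/, '\1_\2')     # WithPygments -> With_Pygments
      .downcase
  end
  segments.join('/') + '.rb'
end

puts file_path_for('Render::Markdown')         # render/markdown.rb
puts file_path_for('Render::HTMLWithPygments') # render/html_with_pygments.rb
```

So a class named Render::Markdown would live in lib/render/markdown.rb under this convention.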
Over the years I’ve been pretty consistent with my WordPress HTML authoring and so for me this job isn’t too difficult.
In order to have a consistent output, I’m going to first convert my WordPress HTML into Markdown, and then generate each article from the Markdown, rather than converting it to straight up HTML, and then generating the Markdown.
I found it was actually easier to go from WordPress HTML to Markdown, and then to normal HTML.
Instead of reinventing the wheel, I’m going to be using the html2markdown gem.
Add the following line to your Gemfile:
gem 'html2markdown'
And run the following command in Terminal:
bundle install
Next we can create the class for generating the Markdown. I’m going to be wrapping the html2markdown gem in some customisations, so it makes sense to encapsulate this in a class:
module Render
  class Markdown
    def render(content)
      page = HTMLPage.new(contents: content)
      page.h1 { |node, contents| "# #{contents}" }
      page.h2 { |node, contents| "## #{contents}\n" }
      page.h3 { |node, contents| "### #{contents}\n" }
      page.code { |node, contents| "`#{contents}`" }
      page.markdown
    end
  end
end
First I create a new instance of HTMLPage and pass it the contents I want to convert.
Next I’m defining the customisations I want. Depending on the HTML you are converting or the Markdown format you want, your customisations may differ.
Finally I will return the converted Markdown.
There isn’t a whole lot of value in writing tests for a class like this to be honest. The responsibility for actually converting is not in our hands and so writing tests to make sure it works is going to be a waste of time.
If you want to test the wrapping class is working correctly, you could write something like this:
require 'test_helper'

class MarkdownTest < ActiveSupport::TestCase
  def setup
    @markdown = Render::Markdown.new
  end

  test 'should render markdown' do
    assert_equal(
      '# Hello World',
      @markdown.render('<h1>Hello World</h1>').chomp
    )
  end
end
But like I say, there isn’t a great deal of value here. Either the output looks right or it doesn’t.
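With the converter in place, the remaining step is to run it over every existing post. The following is only a sketch of what such a script might look like: Post, its html and markdown fields, and convert_posts are hypothetical names, and FakeConverter is a stand-in for anything that responds to render (such as the Render::Markdown class above) so the example runs without the gem or a database.

```ruby
# Hypothetical record standing in for a row of WordPress post data.
Post = Struct.new(:id, :html, :markdown)

# Converts every post and collects failures instead of aborting the run.
# `converter` is any object that responds to `render(html)`.
def convert_posts(posts, converter)
  failures = []
  posts.each do |post|
    begin
      post.markdown = converter.render(post.html)
    rescue StandardError => e
      failures << [post.id, e.message]
    end
  end
  failures
end

# Stand-in converter for the example: strips tags, rejects empty input.
class FakeConverter
  def render(html)
    raise ArgumentError, 'empty post' if html.to_s.empty?
    html.gsub(/<[^>]+>/, '')
  end
end

posts = [Post.new(1, '<p>Hello</p>'), Post.new(2, '')]
failures = convert_posts(posts, FakeConverter.new)
puts posts.first.markdown
puts failures.inspect
```

Collecting failures rather than raising means one malformed post out of several hundred does not stop the whole migration, and the list of ids tells you which articles need a manual look.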
Now that I’ve got all of my posts in consistent Markdown, I can convert them into the final HTML format that will be rendered when an article loads.
Once again, instead of writing my own Markdown parser, I’m just going to use an off-the-shelf solution. I will be using the redcarpet gem.
Add the following to your Gemfile:
gem 'redcarpet'
And run the following command:
bundle install
I’m going to be converting Markdown into HTML whenever I write an article, or whenever someone leaves a comment. To ensure the HTML that is generated is consistent, I can encapsulate this process as a class:
module Render
  class HTML
  end
end
I use code blocks quite a lot on Culttt and so I want a nice way of styling these chunks of code with syntax highlighting.
I’m going to use Pygments and so I will need the pygments.rb gem.
Add the following line to your Gemfile:
gem 'pygments.rb'
And run the following command in Terminal:
bundle install
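Next I can create my own HTML Renderer by extending the Redcarpet HTML renderer and defining the block_code method:
module Render
  class HTMLWithPygments < Redcarpet::Render::HTML
    # Redcarpet calls this for every code block; language is nil when
    # the block has no language tag.
    def block_code(code, language)
      if language
        Pygments.highlight(code, lexer: language)
      else
        # Escape the raw code so characters like < and & display literally
        "<pre>#{ERB::Util.html_escape(code)}</pre>"
      end
    end
  end
end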
Next I can finish off my HTML class:
module Render
  class HTML
    MARKDOWN_OPTIONS = {
      no_intra_emphasis: true,
      tables: true,
      fenced_code_blocks: true,
      autolink: true, # Redcarpet spells this extension without an underscore
      strikethrough: true,
      space_after_headers: true,
      superscript: true,
      # with_toc_data is documented as a renderer option rather than a
      # parser extension, so it may have no effect when passed here
      with_toc_data: true,
      underline: true,
      highlight: true
    }.freeze

    def initialize
      @renderer =
        Redcarpet::Markdown.new(Render::HTMLWithPygments, MARKDOWN_OPTIONS)
    end

    def render(content)
      @renderer.render(content)
    end
  end
end
First I define some options for how I want the HTML to be rendered.
In the initialize method I create a new instance of Redcarpet::Markdown and pass it my HTMLWithPygments class and the MARKDOWN_OPTIONS hash.
Finally I can define the render method which simply delegates to the Redcarpet renderer.
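One way to keep the stored HTML in step with the Markdown is to regenerate it whenever the Markdown changes. The following plain-Ruby sketch shows the idea. Article, body_markdown and body_html are hypothetical names, and ParagraphRenderer is a stand-in for any object with a render method (such as Render::HTML) so the example runs without Redcarpet. In a Rails model this would more likely be a callback that runs before saving.

```ruby
# Hypothetical plain-Ruby model that keeps HTML in sync with Markdown.
class Article
  attr_reader :body_markdown, :body_html

  # `renderer` is any object that responds to `render(markdown)`.
  def initialize(renderer)
    @renderer = renderer
  end

  # Regenerate the HTML every time the Markdown source is assigned.
  def body_markdown=(markdown)
    @body_markdown = markdown
    @body_html = @renderer.render(markdown)
  end
end

# Stand-in renderer for the example; not a real Markdown parser.
class ParagraphRenderer
  def render(content)
    "<p>#{content}</p>"
  end
end

article = Article.new(ParagraphRenderer.new)
article.body_markdown = 'Hello World'
puts article.body_html
```

Passing the renderer in rather than hard-coding it means articles and comments can share the same rendering rules, and a test can swap in a trivial renderer like the one above.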
As I mentioned earlier, you could write a test to make sure your wrapper is working correctly:
require 'test_helper'

class HTMLTest < ActiveSupport::TestCase
  def setup
    @markdown = Render::HTML.new
  end

  test 'should render html' do
    assert_equal(
      '<h1>Hello World</h1>',
      @markdown.render('# Hello World').chomp
    )
  end
end
But there isn’t much point in going nuts and testing that the conversion process is working correctly as that is not our responsibility.
I’m sure every programmer in their career will be tasked with converting one format of something into another format.
HTML is a particularly awkward format to convert because you can get away with murder when writing HTML, and so trying to convert inconsistent HTML can be a nightmare.
Fortunately in my case, this process wasn’t too bad because I’ve been pretty strict with how I write my WordPress posts.
One of the beautiful things about what we do is that we can avoid the long and tedious job of manually converting hundreds of articles by writing a simple script to do the job for us.