How to export Substack posts to Jekyll

Hi everyone,

TL;DR: I wrote a post on how to export Substack posts to Jekyll using a Ruby script. I hope you find it useful!

https://www.santiago-martins.com/title-how-to-export-substack-posts-to-jekyll.html

I’ve replicated the post below:


A golden rule of the internet is that you should own your content and not have it subject to a third party’s arbitrary decisions. Substack is one such third party, but it is also a great place to syndicate content to. So, how do we get the best of both worlds and get our Substack content into Jekyll?

Substack doesn’t have an API, but you can export all of your Substack content fairly easily; see their help article.

However, Substack exports everything in HTML, whereas Jekyll uses Markdown. Luckily there is a Ruby gem built to convert HTML to Markdown called reverse_markdown, and it does a pretty good job of it. Using some simple Ruby scripting, we can add our usual front matter, and because Substack provides a CSV file with data on our posts, we can retrieve the date and time that each Substack post was published as well.
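
To illustrate the CSV part, here is a minimal sketch of looking up publication dates by post id. The sample data and the header names (post_id, post_date, is_published) are assumptions for illustration; the only thing relied on is that the first column identifies the post and the second holds its date, which is how the script below reads it.

```ruby
require 'csv'

# Hypothetical stand-in for the contents of the export's posts.csv;
# only the first two columns (post id, publication date) are used.
SAMPLE_POSTS = <<~DATA
  post_id,post_date,is_published
  12345.my-first-post,2022-09-01T08:30:00.000Z,true
  67890.an-unpublished-draft,,false
DATA

# Map each post id to its publication date, skipping rows without a date.
def post_dates(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object({}) do |row, dates|
    dates[row[0]] = row[1] if row[1]
  end
end

dates = post_dates(SAMPLE_POSTS)
puts dates['12345.my-first-post']
puts dates.key?('67890.an-unpublished-draft')
```

Building the hash once means the CSV only has to be read a single time, however many posts there are.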

The following is a script I created to that effect. To use it, you’ll need to run gem install reverse_markdown in your terminal/shell, and then run ruby your_script_name.rb in the directory you’ve saved the script in. The way it’s currently written assumes that you’ve saved the script in the folder that the Substack export created.

require 'reverse_markdown'
require 'csv'
require 'date'

def front_matter(date,title)
return %(---
layout: post
title: #{title}
description:
date: #{DateTime.parse(date).strftime('%Y-%m-%d %H:%M:%S')} +0100
published: true
categories:
tags:
lang:
---

)
end

Dir.foreach(Dir.pwd + '/posts/') do |filename|
    next if filename == '.' || filename == '..' || File.extname(filename) != '.html' # skip if file does not end in .html
    @date = nil # reset so that a date found for a previous file is not reused
    CSV.foreach(Dir.pwd + '/posts.csv', headers: true) do |row|
        @date = row[1] if row[0].to_s == File.basename(filename.chomp, File.extname(filename)) && row[1]
    end
    # gets post date if in posts.csv file
    file = File.open(Dir.pwd + "/posts/#{filename.chomp}").read
    result = ReverseMarkdown.convert file
    title = File.basename(filename.chomp, File.extname(filename)).split('.').last
    date = !@date.nil? ? @date : '2022-09-21 16:16:38'
    # get post date if it has been published, otherwise use a set date and time
    # Jekyll expects post file names of the form YYYY-MM-DD-title.markdown
    File.open(Dir.pwd + "/posts/#{DateTime.parse(date).strftime('%Y-%m-%d-') + title}.markdown", 'w+') do |f|
        f.write front_matter(date,title) + result
    end
    # Create new markdown file
end
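
In case the file name handling isn’t obvious, here is a small sketch of what the script does with each name. The file name itself is made up; I’m only assuming the "<id>.<slug>.html" shape that the split('.') call in the script relies on.

```ruby
# Hypothetical file name in the shape the script expects: "<id>.<slug>.html".
filename = '12345.my-first-post.html'

# File name without the extension; this is what gets compared with the first CSV column.
post_id = File.basename(filename, File.extname(filename))

# Last dot-separated part; used as the title and in the new Markdown file name.
title = post_id.split('.').last

puts post_id
puts title
```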

I hope that’s useful. Obligatory Substack newsletter plug: https://interessant3.substack.com/

In addition, if you’re looking for jobs in data with an effective social impact check out https://www.gooddatajobs.com

If you’re looking for data analysis work for your organisation, feel free to DM/email me. See details below.

From the post I can understand that you’re not well-versed with Ruby, but as an FYI, Jekyll does not need posts to be in Markdown. An HTML post (with html extension) will work equally well.
Markdown is advertised as such because of the significant ease of use in comparison with HTML.
Also, there’s an in-house plugin named jekyll-import. A pull request is welcome if you’re interested in fleshing out your implementation to be in sync with the rest of the plugin codebase.
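
On the first point, a rough sketch of keeping the exported HTML as-is and only prepending front matter could look like this. The html_post helper and the sample values are hypothetical, purely for illustration.

```ruby
require 'yaml'

# Build the contents of an HTML post: YAML front matter followed by the untouched HTML body.
# (html_post is a made-up helper name, not part of Jekyll.)
def html_post(html, title:, date:)
  YAML.dump('layout' => 'post', 'title' => title, 'date' => date) + "---\n\n" + html
end

# Hypothetical values for illustration.
post = html_post('<p>Hello from Substack</p>',
                 title: 'my-first-post',
                 date: '2022-09-01 08:30:00 +0100')
puts post
```

Saved in _posts under a name such as 2022-09-01-my-first-post.html, a file like this is treated as a regular post, with no Markdown conversion step needed.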

Thanks.

From the post I can understand that you’re not well-versed with Ruby

Would you like to provide some feedback? I’m always open to improving my ruby code.

First of all, I apologise if my comment on your script offended you. I said so because your script doesn’t use the conventional idioms of contemporary Ruby devs. But Ruby, being flexible, is convention-agnostic. Secondly, I am happy that you’re open to my feedback and willing to improve your knowledge based on it.
To avoid going back-n-forth with alterations and further feedback (esp. on a platform not suited for code review), I am going to dump all of my observations at once.

  • Front matter is typically YAML. Instead of using a hand-crafted string, one would be better off letting the YAML library handle the formatting:

    require 'yaml'
    require 'date'
    
    # Return a YAML front matter string from given title string, string date and an optional data hash.
    def front_matter(date, title, **additional_data)
      hsh = {
        'layout' => 'post',
        'date' => DateTime.parse(date).strftime('%Y-%m-%d %H:%M:%S %z'),
        'title' => title,
      }.merge(additional_data.transform_keys(&:to_s)) # string keys, so YAML emits "key:" rather than ":key:"
      YAML.dump(hsh) + "---\n\n"
    end
    
  • You’re already testing for files with .html extension, so the checks for dot filenames are redundant.

  • Dir.foreach {} is an iteration block, so everything you do in the block is repeated for every filename encountered. So when seemingly minor calls (for example, filename.chomp) appear multiple times within the block, that work is needlessly redone for each file, wasting resources.

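  • File.read('path/to/file') is better than File.open('path/to/file').read: the result is the same, but File.read also closes the file for you, whereas File.open without a block leaves it open until it is closed explicitly or garbage-collected.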

  • !@date.nil? ? is a double-test. A human reading the code has to spend more energy to comprehend it than the equivalent @date.nil? ? 'default' : @date. In Ruby, nil and false are the only falsey values, so it can be simplified to:

    date = @date || 'default'
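
Taken together, the loop-related points above could look roughly like the sketch below. It is only an illustration, not a drop-in replacement: each_post, DEFAULT_DATE and the dates hash (post id to date string, built once from posts.csv) are names I made up, and the conversion and file-writing steps are left out.

```ruby
require 'tmpdir'

DEFAULT_DATE = '2022-09-21 16:16:38' # fallback date and time taken from the original script

# Yield the post id, raw HTML and date for every .html file in posts_dir.
# `dates` is a hypothetical hash of post id => date string, built beforehand.
def each_post(posts_dir, dates)
  # Globbing for *.html makes the '.' and '..' checks unnecessary.
  Dir.glob(File.join(posts_dir, '*.html')).sort.each do |path|
    post_id = File.basename(path, '.html') # computed once, reused below
    html    = File.read(path)              # opens, reads and closes the file
    date    = dates[post_id] || DEFAULT_DATE
    yield post_id, html, date
  end
end

# Usage with a throwaway directory and made-up file names.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, '1.first-post.html'), '<p>one</p>')
  File.write(File.join(dir, '2.second-post.html'), '<p>two</p>')
  File.write(File.join(dir, 'notes.txt'), 'ignored')

  dates = { '1.first-post' => '2022-09-01 08:30:00' }
  each_post(dir, dates) do |post_id, _html, date|
    puts "#{post_id}: #{date}"
  end
end
```

Passing the dates in as a hash also means posts.csv is read once rather than once per file.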