Replacing a slow include with a custom Ruby Tag

On most pages on my website, I have a footer which recommends similar items in the archive. Since I have very little text about these items, and lots of rich metadata, I decided to implement my own content similarity engine and, since I was using GitHub Pages, I did this entirely in Liquid.

Fast forward a few months, and the site now has over a thousand pieces of content. Since the similarity algorithm is O(n^2), the cost blew up, and the other day my build exceeded the GitHub Pages timeout of 10 minutes per build.

In case it’s useful for anyone else, here is what I did to fix it:

  1. I switched my site from the default GitHub Pages deploy to a custom GitHub workflow. There were a few gotchas along the way; look at the linked workflow file for what I eventually landed on. Not only does this overcome the 10 min build timeout, but it also (finally!) let me upgrade Jekyll and use (and write) custom plugins.
  2. This overcame the timeout problem, but the extra overhead of the custom workflow actually increased my build times. I upgraded to Jekyll 4.2 and replaced all my site.foos | where: "slug", bar | first with the more sensible site.foos | find: "slug", bar, which stops at the first match instead of filtering the whole collection. This did make my build slightly faster, but I was honestly underwhelmed by the improvement, so then…
  3. I rewrote the offending _includes code in Ruby as a custom Tag plugin. This was relatively straightforward, but came with a few gotchas:
    1. My original include called another include. In order to render a vanilla include from Ruby-land, I had to copy-paste (!) the IncludeTag’s load_cached_partial function into my Tag (:grimacing:) to be able to load the include I wanted to render. (Perhaps that Jekyll method could be made public / static?)
    2. Sometimes when you fetch a page object from context it comes back as a Document and sometimes it comes back as a DocumentDrop? Eventually, I figured out that the best way to get data off these things consistently and reliably was to call .to_liquid.to_h on any such object first, converting it to a plain old Ruby Hash.
    3. Lastly, this hacking and debugging was quite slow at first, as I had to stop and restart the server each time to reload the Ruby file (unlike Liquid changes which can reload on the fly). Eventually I figured out a hack :see_no_evil: A custom tag can accept arbitrary text after its name. So, I passed that @text into an eval() in my render function, thus allowing me to run arbitrary code from the proper context without having to restart the server. Not sure if there is a nicer way to drop a debugger in, but this worked for me :smiling_imp:
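
The normalization from the second gotcha can be sketched as a small helper. This is a minimal sketch rather than the plugin's actual code: the name plain_hash is hypothetical, and it assumes only what is described above, namely that calling .to_liquid.to_h on whatever comes out of the context yields a plain Ruby Hash.

```ruby
# Hypothetical helper (the name `plain_hash` is not from the original plugin).
# Accepts whatever the Liquid context handed back (a Document, a drop, or
# already a Hash) and returns a plain Hash, so the rest of the tag can read
# fields the same way every time.
def plain_hash(obj)
  # Objects that do not define to_liquid are used as they are.
  liquid = obj.respond_to?(:to_liquid) ? obj.to_liquid : obj
  liquid.to_h
end
```

With something like this in place, a lookup such as plain_hash(item)["slug"] should read the same whether item arrived as a Document or as a DocumentDrop.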

At the end of the day, I managed to speed up this particular O(n^2) path by about 10x and cut my overall build time roughly in half: a result I’m quite happy with! If you have any thoughts or feedback, please comment below, and thanks for reading! :grinning_face_with_smiling_eyes:

Fast-forward a year and a half and my site has grown large enough that even this nativized O(n²) algorithm is too slow.

Thankfully, k-nearest-neighbors (k-NN) search is a well-studied problem, and by precomputing some binary space partitions you can narrow down the search space dramatically.

In my case, that meant simply keeping around one set per tag (N sets in all), each holding all the posts with that tag, and then merging the sets for a post's tags to find the posts with the highest tag overlap.
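
A minimal sketch of that idea in plain Ruby, under stated assumptions: each post is reduced to an id and a list of tags, similarity is just the number of shared tags, and the names build_tag_index and most_similar are hypothetical rather than taken from my plugin. The point is that only posts sharing at least one tag with the current post are ever visited, instead of comparing against every post on the site.

```ruby
require "set"

# Build the per-tag sets once per build: tag => Set of post ids carrying it.
# `posts` is assumed to be a Hash of post id => Array of tags.
def build_tag_index(posts)
  index = Hash.new { |hash, tag| hash[tag] = Set.new }
  posts.each do |id, tags|
    tags.each { |tag| index[tag] << id }
  end
  index
end

# Rank the other posts by how many tags they share with post `id`.
def most_similar(id, posts, index, limit = 3)
  overlap = Hash.new(0)
  posts.fetch(id).each do |tag|
    # Merge the set for each of this post's tags, counting repeat appearances.
    index.fetch(tag, []).each do |other|
      overlap[other] += 1 unless other == id
    end
  end
  # Highest overlap first; ties broken by id so the output is stable.
  overlap.sort_by { |other, count| [-count, other] }.first(limit).map(&:first)
end

# Made-up example data, purely for illustration.
posts = {
  "a" => ["ruby", "jekyll"],
  "b" => ["ruby", "jekyll", "liquid"],
  "c" => ["jekyll"],
  "d" => ["cooking"],
}
index = build_tag_index(posts)
p most_similar("a", posts, index) # => ["b", "c"]
```

The index is built once and reused for every page, so the per-page cost depends on how many posts share tags with it rather than on the total number of posts.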

My build is now ~7x faster :smiling_imp: Happy Jekylling!