Forgive me for a longer post. I want to give a bit of the background before launching into the technical part(s). Skip to TL;DR if that helps…
I have been using Jekyll since about 2012 for my personal sites and proof-of-concepts for work. It’s always been good to use. I’ve built a theme from scratch and I’ve used some that I’ve pulled from various repos. I have been using Minimal Mistakes for at least 5 years and it’s an easy go-to for getting something looking good quickly. So, hats off to Michael Rose and all the contributors to the theme. I’ve also gotten good ideas and feedback here from Michael Currin and others, so thank you all.
Current Project
I’m working with our historical society, a non-profit. We had a local weekly newspaper from 1901 until about 1997. We have the publisher’s blessing to do with the content from those issues as we wish.
Our state historical society gave our society president a 16GB archive of PDFs that contained scanned images of the newspapers from 1903 to 1947. He’s been giving copies of these on a flash drive to people. When I heard this, my mind immediately went “this should be online”. I’ll skip the parts where the archives consisted of 13 PDFs containing between 900 and 1400 pages each with no metadata. Three nights of flipping through each PDF and editing a spreadsheet while watching baseball…
Fast forward…
I now have a folder hierarchy of files that contain…
- A PDF consisting of each dated issue in YYYY-MM-DD.pdf format
- Each individual page as a JPEG in YYYY-MM-DD-XX.jpg format
- A 320px square thumbnail, from the top of each page, similarly named
With this, I built a Python program to generate pages into a Jekyll project structure using Minimal Mistakes and the wonderful remote_theme/local_theme gems. The pages have nice navigation between the issues and the pages within each issue. There’s a sidebar for easy browsing, and I make use of tags and categories to build those indexes.
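For anyone picturing the generator step, here is a minimal sketch of turning one page image name into a post path and front matter. It is not the actual script: the `_posts/<year>/` layout, the `-page-NN` slug, the `image` key, the `/assets/issues/` path, and the category and tag values are all assumptions for illustration, as is reading `XX` as a two-digit page number.

```python
import re
from datetime import date

# Page images are named YYYY-MM-DD-XX.jpg; XX is assumed to be a two-digit page number.
PAGE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{2})\.jpg$")


def parse_page_name(filename):
    """Return (issue_date, page_number) for a page image filename."""
    match = PAGE_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected page filename: {filename!r}")
    year, month, day, page = (int(group) for group in match.groups())
    return date(year, month, day), page


def post_for_page(filename):
    """Build a (path, front_matter) pair for one scanned page.

    The path layout and front matter keys are hypothetical.
    """
    issue_date, page = parse_page_name(filename)
    stamp = issue_date.isoformat()
    path = f"_posts/{issue_date.year}/{stamp}-page-{page:02d}.md"
    front_matter = "\n".join([
        "---",
        f'title: "{stamp}, page {page}"',
        f"date: {stamp}",
        "categories: [pages]",
        f'tags: ["{issue_date.year}"]',
        f"image: /assets/issues/{issue_date.year}/{filename}",
        "---",
        "",
    ])
    return path, front_matter
```

Calling `post_for_page("1903-05-14-02.jpg")` gives a path under `_posts/1903/` plus a front matter block that a script could write to disk before running the Jekyll build.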
It looks nice on my tablet as well as a desktop. It took me about a week of evenings from the point I got the flash drive to having a working prototype that I can show at the next historical society meeting. After that, we’re going back to the state historical society to see about getting the remainder of the pages scanned, which will require funding. That may take months or even years.
I did not create custom layouts or add plugins to build indexes. I don’t know Ruby well enough to go messing with plugins. I handled a lot of the heavy work that a plugin would do in my Python script to handle the index building. I know the jekyll processor will do a lot of that, but I didn’t want to go down the rabbit hole too far. As it was, I was able to start with a subset of the issues from the first year. Once things looked nice, I added a few more years. Finally, I have the entire set of issues with a post for the issue and individual post pages for each page.
In all, there are 15,481 individual .md files in _posts, arranged in a folder hierarchy by year. It takes about 8 minutes to build the full site on my workstation after doing a jekyll clean. But, it looks nice. Certainly beats passing around a flash drive.
Search
While showing this to my wife and others, I knew that the real user scenario for this would be to have a usable search index. Sure, people will look at the issues and browse, but the #1 use case to me is someone looking for information about a relative. I need a good search index and to eventually get search engines to pick up the content.
Each post contains the Jekyll scaffolding and navigation, but very little actual text. I knew that search would need the text. While working on the Jekyll project, I devoted my spare workstation at home to running OCR on all those pages, which took several days. I ended up with one text file for each page. I fed the OCR’ed text into some bash scripts to generate a sorted, unique word list for each file.
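The word-list step amounts to roughly the following, shown in Python rather than bash to match the rest of the tooling. This is an illustrative sketch, not the scripts described above: the regex, the lowercasing, and the three-letter minimum are assumptions.

```python
import re

# A "word" here starts with a letter and may contain apostrophes or hyphens.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def unique_words(text, min_length=3):
    """Return the sorted, de-duplicated, lowercased words found in OCR text."""
    words = {word.strip("'-").lower() for word in WORD_RE.findall(text)}
    return sorted(word for word in words if len(word) >= min_length)
```

Run once per OCR text file, this yields one compact list per page and drops the short fragments that OCR noise tends to produce.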
I looked at my options for search for Jekyll projects. I want this in Jekyll for all the reasons we love it. Basically, I don’t want to administer a WordPress or other CMS. Other than hosting, I want to build it and mostly forget it. Minimal Mistakes provides support for lunr.js out of the box. I realized from the start that it wasn’t up to snuff for this. I opted to try the Algolia community version and the algolia-plugin.
With each of those word lists, I ran them against several dictionaries containing given names, surnames, place names in our state, roads, businesses, etc. For each individual page, I now have a file containing the likely words that people will want to search for. The largest of these is about 7kb.
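As a sketch of that filtering step: load each dictionary into a set, then keep only the page words that appear in at least one of them. The one-entry-per-line file format and the function names are assumptions, not a description of the real dictionaries.

```python
def load_wordlist(path):
    """Read a dictionary file with one entry per line (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def search_terms(page_words, dictionaries):
    """Keep the page words that appear in any of the dictionaries.

    page_words is an iterable of strings; dictionaries is a list of sets.
    """
    known = set().union(*dictionaries)
    return sorted({word.lower() for word in page_words} & known)
```

Set intersection keeps the work per page cheap no matter how large the dictionaries are, which matters across thousands of pages.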
I set up an Algolia project for 6 months of the paper, with the search words for each page in a collapsible div at the bottom of the page, set in very small text. When I ran bundle exec jekyll algolia, it created the database, and I can search when I’m running on localhost. The Algolia dashboard shows that I have 62,000 individual searchable terms for those 6 months of pages.
TL;DR
OK, question(s) time…
- Am I crazy? Yes, next question.
- Are there practical limits on the Algolia community edition? They limit the size of individual records, so I make sure I structure my data to stay under that. But if I have 250,000 or more searchable terms, will that put this into another category? I’ve used Algolia at work before, but the bills went to Finance, so I have no idea what the costs are. Given that this is for a non-profit, there won’t be any funding for subscriptions.
- I know the algolia plugin is basically abandonware at this point, but will it scale? It seems to work, but I don’t know if it will choke under the strain when I throw the full site at it. Especially if we get the full number of issues up to 1997 at some point. That will effectively triple the number of pages, as the number of pages in each weekly issue grew over time from 4 to 6 to 8 to 10.
- How does one configure the algolia plugin using _config.yml or otherwise? I’m just letting it run in its default manner, but I wonder if I need to tune it. It’s hinted at in some of their developer blog posts, but it really seems like they don’t want to support the community now. They pulled their Discourse in favor of Discord, which doesn’t seem to have any traffic.
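On the record-size point in the second question, one way to stay under a per-record limit is to split a page's term list across several records. This is only a sketch: the 8000-byte default is a placeholder rather than Algolia's actual limit, and the `url` and `terms` field names are hypothetical.

```python
import json


def record_size(page_url, terms):
    """Size in bytes of one record when serialized as UTF-8 JSON."""
    payload = {"url": page_url, "terms": terms}
    return len(json.dumps(payload).encode("utf-8"))


def split_into_records(page_url, terms, max_bytes=8000):
    """Greedily pack terms into records that each stay within max_bytes.

    max_bytes is an assumed budget; check the real limit for your plan.
    A single term that exceeds the budget on its own still gets its own record.
    """
    records, current = [], []
    for term in terms:
        candidate = current + [term]
        if current and record_size(page_url, candidate) > max_bytes:
            # Adding this term would overflow, so close the current record.
            records.append({"url": page_url, "terms": current})
            current = [term]
        else:
            current = candidate
    if current:
        records.append({"url": page_url, "terms": current})
    return records
```

Re-serializing on every term is wasteful but easy to reason about. For lists of a few kilobytes per page it should be tolerable, and a running byte count could replace it if indexing time becomes a problem.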
Two more, less about Jekyll than hosting.
- All told, the built site comes to about 40GB. Does anyone have any best practices on how to constrain AWS costs on a large static site? I haven’t played with CloudFront regional restrictions on my personal sites but they have way less content. For my personal sites, I set a budget and I’ve never hit $50/month. I’ll keep an eye on my hosting bill with AWS for this, but my effort and hosting will be my donation to the cause. If it gets where I’m spending $100s/month on bandwidth, I’ll have to add auth to the site or else pull the plug.
- I’m hoping someone has a search term that I can use to find the needle in AWS’ tech docs haystack. Since I copy all the image files into the Jekyll framework using an external Python script, I’d like to keep those files out of the checked-in repo so that the GitHub CI/CD doesn’t take hours to build and deploy to S3. Plus, they’re static at this point and not likely to change. Is it possible to have a second S3 bucket that acts as a virtual part of the filesystem when deployed? I seem to recall a work project that did this, but I haven’t found the right combination of search terms that gets me from Google’s shitAI results into AWS’ convoluted documentation.
Thanks for reading and giving your ideas.
Eric