Questions about a site with a large (ish) number of pages [LONG]

Forgive me for a longer post. I want to give a bit of the background before launching into the technical part(s). Skip to TL;DR if that helps…

I have been using Jekyll since about 2012 for my personal sites and proof-of-concepts for work. It’s always been good to use. I’ve built a theme from scratch and I’ve used some that I’ve pulled from various repos. I have been using Minimal Mistakes for at least 5 years and it’s an easy go-to for getting something looking good quickly. So, hats off to Michael Rose and all the contributors to the theme. I’ve also gotten good ideas and feedback here from Michael Currin and others, so thank you all.

Current Project

I’m working with our historical society, a non-profit. We had a local weekly newspaper from 1901 until about 1997. We have blessing from the publisher to do with the content from those issues as we wish.

Our state historical society gave our society president a 16GB archive of PDFs that contained scanned images of the newspapers from 1903 to 1947. He’s been giving copies of these on a flash drive to people. When I heard this, my mind immediately went “this should be online”. I’ll skip the parts where the archives consisted of 13 PDFs containing between 900 and 1400 pages each with no metadata. Three nights of flipping through each PDF and editing a spreadsheet while watching baseball…

Fast forward…

I now have a folder hierarchy of files that contain…

  1. A PDF consisting of each dated issue in YYYY-MM-DD.pdf format
  2. Each individual page as a JPEG in YYYY-MM-DD-XX.jpg format
  3. A 320px square thumbnail, from the top of each page, similarly named

With this, I built a Python program to generate pages into a Jekyll project structure using Minimal Mistakes and the wonderful remote_theme/local_theme gems. The pages have nice navigation between the issues and the pages within each. There’s a sidebar for easy browsing, and I make use of tags and categories to build those indexes.
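The generator boils down to writing front matter plus a list of page links for each issue. A minimal sketch of that idea — the function name, layout choice, and link paths here are assumptions for illustration, not my actual script:

```python
from pathlib import Path

def write_issue_post(out_dir, date_str, page_count):
    """Write a Minimal Mistakes-style post for one issue (hypothetical sketch)."""
    front_matter = "\n".join([
        "---",
        f'title: "Issue of {date_str}"',
        f"date: {date_str}",
        "layout: single",                 # a stock Minimal Mistakes layout
        f"categories: [{date_str[:4]}]",  # index issues by year
        "---",
        "",
    ])
    # One link per scanned page, matching the YYYY-MM-DD-XX.jpg naming scheme
    body = "\n".join(
        f"[Page {n:02d}](/pages/{date_str}-{n:02d}/)"
        for n in range(1, page_count + 1)
    )
    path = Path(out_dir) / f"{date_str}-issue.md"
    path.write_text(front_matter + body + "\n")
    return path
```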

It looks nice on my tablet as well as a desktop. It took me about a week of evenings from the point I got the flash drive to having a working prototype that I can show at the next historical society meeting. After that, we’re going back to the state historical society to see about getting the remainder of the pages scanned, which will require funding. That may take months or even years.

I did not create custom layouts or add plugins to build indexes. I don’t know Ruby well enough to go messing with plugins. Instead, my Python script handles a lot of the heavy lifting that a plugin would normally do, including building the indexes. I know the Jekyll processor can do a lot of that, but I didn’t want to go down the rabbit hole too far. As it was, I was able to start with a subset of the issues from the first year. Once things looked nice, I added a few more years. Finally, I have the entire set of issues, with a post for each issue and individual post pages for each page.

In all, there are 15,481 individual .md files in _posts, arranged in a folder hierarchy by year. It takes about 8 minutes to build the full site on my workstation after doing a jekyll clean. But, it looks nice. Certainly beats passing around a flash drive.

Search

While showing this to my wife and others, I knew that the real user scenario for this would be to have a usable search index. Sure, people will look at the issues and browse, but the #1 use case to me is someone looking for information about a relative. I need a good search index and to eventually get search engines to pick up the content.

Each post contains the Jekyll scaffolding and navigation, but very little actual text. I knew for search that I would need the text. While working on the Jekyll project, I devoted my spare workstation at home to running OCR on all those pages, which took several days. I ended up with 1 text file for each page. I fed the OCR’ed text into some bash scripts to generate files containing a sorted unique word list for each file.
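The word-list step is essentially tokenize, lowercase, de-duplicate, sort. A rough Python equivalent of what those bash scripts do (the minimum-length filter is my own choice, not gospel):

```python
import re

def unique_words(ocr_text):
    """Reduce one page's OCR output to a sorted, unique word list."""
    # Lowercase, split on anything that isn't a letter, drop very short tokens
    words = re.findall(r"[a-z]+", ocr_text.lower())
    return sorted({w for w in words if len(w) > 2})
```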

I looked at my options for search for Jekyll projects. I want this in Jekyll for all the reasons we love it. Basically, I don’t want to administer a WordPress or other CMS. Other than hosting, I want to build it and mostly forget it. Minimal Mistakes provides support for lunr.js out of the box. I realized from the start that it wasn’t up to snuff for this. I opted to try the Algolia community version and the algolia-plugin.

I ran each of those word lists against several dictionaries containing given names, surnames, place names in our state, roads, businesses, etc. For each individual page, I now have a file containing the likely words that people will want to search for. The largest of these is about 7 KB.
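The dictionary matching is just set intersection. A sketch, with placeholder dictionary contents:

```python
def searchable_terms(page_words, *dictionaries):
    """Keep only the page's words that appear in at least one reference dictionary."""
    keep = set()
    for d in dictionaries:
        keep |= set(page_words) & set(d)
    return sorted(keep)
```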

I set up an Algolia project for 6 months of the paper, with the search words for each page placed in a collapsible div at the bottom of the page in very small text. When I ran bundle exec jekyll algolia, it created the index, and I can search when running on localhost. The Algolia dashboard shows that I have 62,000 individual searchable terms for those 6 months of pages.

TL;DR

OK, question(s) time…

  1. Am I crazy? Yes, next question.

  2. Are there practical limits on the Algolia community edition? They have length limits on the size of the individual records, so I do make sure I structure my data to stay under that. But, if I have 250,000 or more searchable terms, will that put this into another category? I’ve used Algolia at work before, but the bills went to Finance so I have no idea what the costs are. Given that this is for a non-profit, there won’t be any funding for subscriptions.

  3. I know the algolia plugin is basically abandonware at this point, but will it scale? It seems to work, but I don’t know if it will choke under the strain when I throw the full site at it. Especially if we get the full number of issues up to 1997 at some point. That will effectively triple the number of pages, as the number of pages in each weekly issue grew over time from 4 to 6 to 8 to 10.

  4. How does one configure the algolia plugin using _config.yml or otherwise? I’m just letting it run in its default manner, but I wonder if I need to tune it. Configuration is hinted at in some of their developer blog posts, but it really seems like they don’t want to support the community now. They pulled their Discourse in favor of Discord, which doesn’t seem to have any traffic.
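For what it’s worth, the closest I’ve found is a _config.yml block along these lines, pieced together from the jekyll-algolia README. Treat the values, and the div.search-terms selector for my word-list div, as placeholders:

```yaml
# _config.yml: keys per the jekyll-algolia README; values are placeholders.
# The admin (write) key does NOT go here; it goes in the ALGOLIA_API_KEY
# environment variable when running `bundle exec jekyll algolia`.
algolia:
  application_id: YOUR_APP_ID
  index_name: newspaper_pages
  search_only_api_key: YOUR_SEARCH_ONLY_KEY
  extensions_to_index: [html, md]
  files_to_exclude:
    - index.html
  # CSS selector controlling which nodes become records; div.search-terms
  # is my hypothetical class for the collapsible word-list div
  nodes_to_index: 'p,div.search-terms'
```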

Two more, less about Jekyll than hosting.

  1. All told, the built site comes to about 40GB. Does anyone have any best practices on how to constrain AWS costs on a large static site? I haven’t played with CloudFront regional restrictions on my personal sites but they have way less content. For my personal sites, I set a budget and I’ve never hit $50/month. I’ll keep an eye on my hosting bill with AWS for this, but my effort and hosting will be my donation to the cause. If it gets where I’m spending $100s/month on bandwidth, I’ll have to add auth to the site or else pull the plug.

  2. I’m hoping someone has a search term that I can use to find the needle in AWS’ tech docs haystack. Since I copy all the image files into the Jekyll project using an external Python script, I’d like to keep those files out of the checked-in repo so that the GitHub CI/CD doesn’t take hours to build and deploy to S3. Plus, they’re static at this point and not likely to change. Is it possible to have a second S3 bucket that acts as a virtual part of the filesystem when deployed? I seem to recall a work project that did this, but I haven’t found the right combination of search terms that gets me past Google’s shit AI results and into AWS’ convoluted documentation.

Thanks for reading and giving your ideas.

Eric

All told, the built site comes to about 40GB. Does anyone have any best practices on how to constrain AWS costs on a large static site?

Cache, cache, cache. Especially your PDFs and images: they should be set to cache for years, since you don’t expect them to change.
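As a sketch of what I mean, assuming you upload with a script: pick a Cache-Control value per file type and send it as object metadata at upload time, e.g. via boto3’s ExtraArgs={"CacheControl": ...} on upload_file, or aws s3 sync --cache-control. The extensions and TTLs below are just a starting point:

```python
def cache_control_for(key):
    """Pick a Cache-Control header for an S3 object key (sketch; tune to taste)."""
    if key.endswith((".pdf", ".jpg")):
        # Scans never change: cache for a year and mark immutable
        return "public, max-age=31536000, immutable"
    if key.endswith(".html"):
        # HTML pages get rebuilt: short TTL so the CDN revalidates often
        return "public, max-age=300"
    # Everything else: one day
    return "public, max-age=86400"
```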

Is it possible to have a second S3 bucket that acts as a virtual part of the filesystem when deployed?

Yes, but there might be faster and better alternatives. I’d look at building a YAML/CSV/SQLite file from a listing of the second S3 bucket and using that in your site as a data source. You can then run the two scripts at different cadences: keep uploading to the assets bucket as the scans happen, and let your website builds happen alongside (ignoring the new assets) until you update your asset sheet, at which point you just need to add the markdown for the extra assets.
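A sketch of the data-file idea, assuming your YYYY-MM-DD-XX.jpg naming: list the bucket’s object keys (with aws s3 ls or boto3), then fold them into a structure you can dump to something like _data/issues.yml; the function name is made up:

```python
from collections import defaultdict

def group_pages_by_issue(keys):
    """Turn 'YYYY-MM-DD-XX.jpg' object keys into an issue -> pages mapping."""
    issues = defaultdict(list)
    for key in keys:
        name = key.rsplit("/", 1)[-1]      # drop any bucket prefix
        if not name.endswith(".jpg"):
            continue                       # skip PDFs and anything else
        issue, page = name[:-4].rsplit("-", 1)
        issues[issue].append(page)
    # Sorted page lists, ready to serialize as YAML for Jekyll's _data dir
    return {issue: sorted(pages) for issue, pages in issues.items()}
```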

For search, I don’t have experience with Algolia, but if it breaks down, you can look at lunr.js or just embedding a Google custom search page for your site.

By this, do you mean an AWS setting in the S3 bucket? I don’t see cache control options in the bucket configuration. Or, I could be dense.

You are correct that once these files go into the bucket, they probably won’t change unless I do something to rearrange the file locations.

On reflection, I think I’ll put a variable for the image file prefix in the _config.yml for the dev vs. production builds. I keep separate .yml for production (cloud) and dev (localhost). The second bucket will hold all the images.
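Something like this is what I have in mind; image_base is a made-up key name and the bucket URL is a placeholder:

```yaml
# _config.yml (dev default, served from the local site):
image_base: /assets/scans

# _config-prod.yml, layered on top for the production build with
#   bundle exec jekyll build --config _config.yml,_config-prod.yml
# image_base: https://assets-bucket.s3.amazonaws.com
```

Templates then reference images as {{ site.image_base }}/1923/1923-05-17-01.jpg, so the same markdown works against localhost and the second bucket.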

I just ran the production build on my local system and I was slightly incorrect on the size. The full production build comes to 32.2GB, with the build HTML and other content being less than 200MB.

Thanks!

After some reflection on my data architecture, I decided to not build a post for each individual page. I am only generating a post for each issue now, which reduces the number of posts from 15K down to about 2,200. It makes processing much less onerous now. It still takes a couple minutes to build the full site, but the categories and tags pages are much more manageable.

I have the image content serving from S3 for a subset of the data. It seems to be working correctly, but I’m going to have to put some budgetary constraints on the bandwidth for those buckets. I’ve also decided to pare some of the image assets down to a fixed size, again to shrink the bucket and save on bandwidth.

If you are serving objects straight off S3 (with an s3.amazonaws.com domain), then you need to set object-level metadata on each of your files. You’ll need to pair this with a CDN like CloudFront to get any benefit; otherwise the only entity maintaining a cache is the browser, which is not what you want.

Once you put a CDN (like CloudFront) in front of a bucket configured this way, it will pick up the cache headers and start caching content, reducing your S3 costs (and switching you to CloudFront data-egress pricing, which should be much cheaper). You can also limit your CloudFront distribution to a subset of edge locations via its price class, avoiding the higher-priced regions if that matches your userbase.