Htmlproofer fails on encoding error?

Hi,

I typically use this Makefile target in my Jekyll sites:

checklinks:
        htmlproofer --check-html --empty-alt-ignore ./_site

This helps verify that I did not include any broken links, etc.

It works fine on my Linux machines (Debian 11 and Debian 12).

However, I recently set up a dev environment on a macOS machine. It comes with htmlproofer v4.x, whereas the Debian machines use htmlproofer v3.x.

It seems the command-line usage for htmlproofer changed significantly from v3 to v4.

I now try to run it with these options:

htmlproofer -t --extensions .html ./_site

This gives me the errors below, which do not make much sense to me. It seems to complain about a non-ASCII character, but it does not tell me which file contains the problematic byte. Also, my HTML files are UTF-8, and I have never had problems with non-ASCII characters before.

Am I using the new htmlproofer wrong?


Running 3 checks (Images, Scripts, Links) in ["./_site"] on *.html files...


/Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5.rb:386:in `encode': "\xE2" on US-ASCII (Encoding::InvalidByteSequenceError)
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5.rb:386:in `reencode'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5.rb:279:in `read_and_encode'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5/document.rb:119:in `do_parse'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5/document.rb:95:in `parse'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5.rb:31:in `HTML5'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/lib/html_proofer/utils.rb:22:in `create_nokogiri'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/lib/html_proofer/runner.rb:110:in `load_file'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/lib/html_proofer/runner.rb:101:in `block in process_files'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:627:in `call_with_index'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:597:in `process_incoming_jobs'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:577:in `block in worker'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:568:in `fork'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:568:in `worker'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:559:in `block in create_workers'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:558:in `each'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:558:in `each_with_index'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:558:in `create_workers'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:497:in `work_in_processes'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/parallel-1.23.0/lib/parallel.rb:291:in `map'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/lib/html_proofer/runner.rb:101:in `process_files'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/lib/html_proofer/runner.rb:75:in `check_files'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/lib/html_proofer/runner.rb:46:in `run'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/bin/htmlproofer:97:in `block (2 levels) in <top (required)>'
        from /Users/user/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
        from /Users/user/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
        from /Users/user/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
        from /Users/user/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
        from /Users/user/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
        from /Users/user/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/html-proofer-4.4.3/bin/htmlproofer:11:in `<top (required)>'
        from /Users/user/.rbenv/versions/3.0.0/bin/htmlproofer:23:in `load'
        from /Users/user/.rbenv/versions/3.0.0/bin/htmlproofer:23:in `<main>'

It seems related to the change of environment. This Stack Overflow question (tagged jenkins) may be useful: How to fix Ruby script which fails with encoding error: "\xD8" on US-ASCII?
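
Since the trace does not name a file, it may help to list which files in the build output contain non-ASCII bytes at all. The following is only a sketch: `non_ascii_report` is a made-up helper name, and the `_site` directory and `*.html` pattern are taken from the commands above. It reads raw bytes, so it does not depend on the locale.

```ruby
# List files under a directory that contain non-ASCII bytes, and note
# whether each of them is valid UTF-8.
# NOTE: non_ascii_report is a hypothetical helper, not part of html-proofer.
def non_ascii_report(dir, pattern = "**/*.html")
  Dir.glob(File.join(dir, pattern)).sort.each_with_object([]) do |path, report|
    bytes = File.binread(path)        # raw bytes, no transcoding
    next if bytes.ascii_only?         # nothing above 0x7F in this file

    report << {
      path: path,
      # Offset of the first byte outside the ASCII range.
      offset: bytes.bytes.index { |b| b >= 0x80 },
      # True when the whole file decodes cleanly as UTF-8.
      valid_utf8: bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
    }
  end
end
```

Calling `non_ascii_report("./_site")` returns one entry per affected file. If every entry has `valid_utf8: true`, the files themselves are probably fine and the environment is the more likely culprit.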

I tried setting:

export LC_ALL=en_US.UTF-8

This works. Thanks.

However, I don’t think my locale is actually en_US.UTF-8. I think it is en_DK.UTF-8.

But if I run:

export LC_ALL=en_DK.UTF-8

I get the same error:
.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/nokogiri-1.15.5-arm64-darwin/lib/nokogiri/html5.rb:386:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)

Is it safe to put this in my .zshrc:
export LC_ALL=en_US.UTF-8

when I actually have LC_ALL=en_DK.UTF-8?

The next difference between htmlproofer 4.x and 3.x that I ran into:
The new htmlproofer seems to treat relative links in my Jekyll site as external links. When it encounters a link like /foo/bar.html, it tries to reach https://mysite.example.com/foo/bar.html instead of just looking for ./_site/foo/bar.html.

How do I fix this behaviour?

I am not sure whether it is a problem, but the point seems to be that your Jekyll site was built as UTF-8, while Ruby on the Mac assumes US-ASCII from the current locale, so Nokogiri fails when it re-encodes the files.
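
The failure in the trace can be reproduced in isolation, which may make the cause easier to see. This is a minimal sketch of the failure mode, not Nokogiri's actual code: `reencode` and `try_reencode` are hypothetical helpers, and the sample text is made up. The same UTF-8 bytes convert cleanly when labelled as UTF-8 and raise when labelled as US-ASCII.

```ruby
# The UTF-8 bytes of "it’s" (with U+2019), as they would sit on disk.
raw = "it\u2019s".b

# Label the bytes with an assumed source encoding, then convert to UTF-8.
def reencode(bytes, assumed)
  bytes.dup.force_encoding(assumed).encode(Encoding::UTF_8)
end

# Same as reencode, but return the error text instead of raising.
def try_reencode(bytes, assumed)
  reencode(bytes, assumed)
rescue Encoding::InvalidByteSequenceError => e
  "#{e.class}: #{e.message}"
end

puts try_reencode(raw, Encoding::UTF_8)     # bytes labelled correctly
puts try_reencode(raw, Encoding::US_ASCII)  # bytes labelled as plain ASCII
```

The second call should report an `Encoding::InvalidByteSequenceError` that mentions US-ASCII, much like the message at the top of the trace.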

To get a better idea of what happened, also check the LANG variable on both of your systems to see the difference.
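
To compare the two machines from Ruby's point of view, something like the following could be run on each of them. It is a sketch only: `locale_report` is a made-up name, and which of the variables are set will differ per system.

```ruby
# Collect the locale-related environment variables together with the
# encodings Ruby derived from them at startup.
# NOTE: locale_report is a hypothetical helper for illustration.
def locale_report
  {
    "LANG" => ENV["LANG"],
    "LC_ALL" => ENV["LC_ALL"],
    "LC_CTYPE" => ENV["LC_CTYPE"],
    # Character map of the process locale, as reported to Ruby.
    "locale_charmap" => Encoding.locale_charmap,
    # Encoding Ruby assumes for data read from outside, such as files.
    "default_external" => Encoding.default_external.name
  }
end

locale_report.each do |key, value|
  puts format("%-17s %s", key, value.inspect)
end
```

If `default_external` shows US-ASCII on the Mac and UTF-8 on Debian, that would match the error in the trace.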

However, if you think that storing it in your .zshrc might lead to trouble, you can set the variable for a single command by prepending the assignment to it (without export):

LC_ALL=en_US.UTF-8 htmlproofer ...
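
The same per-command override can also be sketched in Ruby, for example from a Rake task or a small wrapper script. `run_with_locale` is a hypothetical helper. It relies on `Kernel#system` accepting an environment hash as its first argument, so only the child process sees the changed `LC_ALL`.

```ruby
# Run a command with LC_ALL set for that child process only.
# Returns true on exit status 0, false on a non-zero status, and nil if
# the command could not be started (the return values of Kernel#system).
# NOTE: run_with_locale is a hypothetical helper for illustration.
def run_with_locale(*command, locale: "en_US.UTF-8")
  system({ "LC_ALL" => locale }, *command)
end
```

With the options from the question, the call would look like `run_with_locale("htmlproofer", "-t", "--extensions", ".html", "./_site")`. The shell that started the script keeps its own locale settings.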

Also, with zsh you should be able to create an alias for that command in your .zshrc so you don’t have to type it again and again.

Regarding the internal link check: I just downloaded the latest html-proofer gem and ran it on my _site folder. It handles relative links correctly for me, and it even spotted a hard-to-notice error in a relative link that also had a permalink. This is the command I used:

htmlproofer --ignore-empty-mailto=true --disable-external=true --checks Links _site/

As you can see, I disabled the external link checks to focus only on the internal ones. If you try that and the errors go away, the cause might be a page that exists in your local folder but has not been uploaded yet, linked through a hardcoded full URL of the live site.
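
To make the internal/external distinction concrete, here is a rough sketch of how an href could be mapped to a file under the build directory. This is not html-proofer's actual logic: `local_target` is a hypothetical helper and the paths are the ones from the question.

```ruby
# Map an href found in a page to a path under the build directory.
# Returns nil for links that carry a scheme or host (treated as external).
# NOTE: local_target is a hypothetical helper, not html-proofer's resolver.
def local_target(href, site_root:, page:)
  # "https://...", "mailto:..." and protocol-relative "//host/..." links.
  return nil if href.match?(%r{\A(?:[a-z][a-z0-9+.-]*:|//)}i)

  path = href.sub(/[?#].*\z/, "")   # drop query string and fragment
  return page if path.empty?        # "#top" points at the page itself

  if path.start_with?("/")
    File.join(site_root, path)             # root-relative link
  else
    File.join(File.dirname(page), path)    # document-relative link
  end
end
```

Under these assumptions, `/foo/bar.html` resolves to `./_site/foo/bar.html`, while `https://mysite.example.com/foo/bar.html` counts as external. If the generated HTML contains the full URL rather than the short path, a checker would be expected to go out to the network for it.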

An alternative approach is to include html-proofer in your site’s Gemfile and pin it to version 3, for example:

gem "html-proofer", "~> 3.0"

This is what I’ve done with many of my sites, because v4 removes features (in particular, HTML proofing!).