Discussion:
[hakyll] Static site implementation of redirects (for fixing broken URLs)?
Gwern Branwen
2016-12-27 15:55:47 UTC
Permalink
When a website becomes relatively large and old, it's considered good
practice to set up logging or Google Webmaster Tools and look for
traffic to broken URLs on your site (misspelled URLs, files or pages
which have been moved or renamed, etc.) and set up redirects to the
right page. This saves time for all those visitors, and it also
effectively increases your traffic, since many of those visitors would
otherwise have given up, unable to find the right page; search engines
also apparently give a little bonus to 'clean' sites.

(My interest in this is prompted by recently looking at an
institution's Google Analytics & Webmaster Tools and noting that
perhaps 1-5% of their site traffic is to broken URLs, and that a
decent fraction of their search engine traffic is driven by visitors
being unable to find a particular page. I have a fair number of broken
links on gwern.net myself, which I have tolerated mostly because I
didn't know how to fix them inside the static site approach.)

The usual way to do this is to look at the broken URLs and set up a
mapping from broken to correct URLs in your Apache rewrite rules (or
the equivalent in your web server), so that any visitor loading a
broken URL gets a 301 redirect. This doesn't require any additional
changes to your website and works well. Along the lines of:

Redirect 301 /old-page.html http://www.mydomain.com/new-page.html

But most Hakyll users will be using Amazon S3 or GitHub or another
static file host precisely to avoid things like running their own
Apache. How can we fix broken links?

Amazon S3 has one supported method, but last I checked, it's weird -
it requires a file at the broken URL/filename and it also needs on-S3
metadata to be toggled. So it clutters up your Hakyll directory,
requires additional manual intervention for each URL, and could break
at any time if a sync tool changes the metadata (perhaps by
accidentally resetting it).

The most common non-server-based method seems to be HTML redirection:
writing a mostly-empty HTML file at the broken URL with a special META
tag, like this:

<meta http-equiv="refresh" content="0;
url=http://www.mydomain.com/new-page.html">

Possibly augmented with a 'canonical' link:

<link rel="canonical" href="http://www.mydomain.com/new-page.html">

(There are some other forms of HTML redirection using JS and iframes,
but they are worse. The META tag method doesn't require JS and is
widely used and understood by all search engines.)

This also requires cluttering up your source directory with lots of
repetitive HTML files... unless you generate them.
It would be straightforward to write a function `createRedirects`
which takes a target directory ("_site") and a `Map broken working` of
URLs, and for each broken/working pair, writes to
`target/relativeLink(broken)` an HTML template of

<html><head><meta http-equiv="refresh" content="0;
url=$working"><link rel="canonical" href="$working"></head>
<body><p>The page has moved to: <a href="$working">this
page</a></p></body></html>

Then this function could be added somewhere in `main` like
`createRedirects "_site" linkMap`, and it'd generate an arbitrary
number of redirects. So the user only needs to set up the mapping
inside their `hakyll.hs`, without cluttering the source code directory
or needing to run a web server - the clutter only exists inside the
compiled site on the host.

This would be general enough that it'd be worth adding to Hakyll, I think.

How do other Hakyll users solve this problem? Has anyone done
something similar to my `createRedirects` suggestion?
--
gwern
https://www.gwern.net
Kyle Marek-Spartz
2016-12-29 00:21:10 UTC
Permalink
I did something like this using Disallows in a robots.txt file, but
that's not quite a redirect.
Gwern Branwen
2016-12-31 18:51:43 UTC
Permalink
Here is a draft implementation of what I mean:

import System.Directory (createDirectoryIfMissing)
import System.FilePath (dropFileName)

-- Broken URL -> working URL.
brokenLinks :: [(String,String)]
brokenLinks = [
 ("/silk%20road", "/Silk%20Road"),
 ("/silk%20road?revision=20121015013418-f7719-c2d41233ee27eacdb0252a7c6b6f975a30bb220b", "/Silk%20Road"),
 ("/spaced%20repetition", "/Spaced%20repetition"),
 ("/2docs/dnb/1978-zimmer.pdf", "/docs/music-distraction/1978-zimmer.pdf"),
 ("/About.html", "/About"),
 ("/2015-10-27-modafinilsurvey-feedback.csv", "/docs/modafinil/survey/2015-10-27-modafinilsurvey-feedback.csv")
 ]

-- For each broken/working pair, write a redirect stub at the broken
-- path under the target directory.
createRedirects :: FilePath -> [(String,String)] -> IO ()
createRedirects target = mapM_ (\(broken,working) ->
    writeRedirect target broken (createRedirect working))

-- Write the HTML to the broken path, creating intermediate directories.
writeRedirect :: FilePath -> FilePath -> String -> IO ()
writeRedirect prefix fileName html = do
    createDirectoryIfMissing True (prefix ++ dropFileName fileName)
    writeFile (prefix ++ fileName) html

-- The minimal META-refresh + canonical-link page pointing at the working URL.
createRedirect :: String -> String
createRedirect working =
    "<html><head><meta http-equiv=\"refresh\" content=\"0; url=" ++ working
    ++ "\"><link rel=\"canonical\" href=\"" ++ working
    ++ "\"></head><body><p>The page has moved to: <a href=\"" ++ working
    ++ "\">this page</a></p></body></html>"

Invocation in `main`:

createRedirects "_site" brokenLinks

This yields results like this:

$ ls _site/
2015-10-27-modafinilsurvey-feedback.csv About.html
silk%20road?revision=20121015013418-f7719-c2d41233ee27eacdb0252a7c6b6f975a30bb220b
2docs/ silk%20road spaced%20repetition
$ cat _site/About.html
<html><head><meta http-equiv="refresh" content="0; url=/About"><link rel="canonical" href="/About"></head><body><p>The page has moved to: <a href="/About">this page</a></p></body></html>
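
(For reference, the call doesn't have to live next to the Hakyll rules
at all. A minimal sketch of a standalone driver - assuming the draft
above is saved in a hypothetical module named `Redirects` - which you
could compile and run as a separate step after a normal build, once
_site/ exists:

-- Hypothetical standalone driver: build the site as usual, then run
-- this to write the redirect stubs into the generated _site/ tree.
module Main (main) where

import Redirects (brokenLinks, createRedirects)

main :: IO ()
main = createRedirects "_site" brokenLinks

Running it as a separate step sidesteps any question of whether
`hakyll` hands control back to `main` after its command finishes.)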

--
gwern
Gwern Branwen
2017-01-08 16:49:30 UTC
Permalink
After some additional tweaks, I sent in a patch
(https://github.com/jaspervdj/hakyll/issues/403); it was Hakyll-fied
and is now available as 'Hakyll.Web.Redirect'
(https://github.com/jaspervdj/hakyll/blob/master/src/Hakyll/Web/Redirect.hs)
in hakyll-4.9.3.0. Enjoy.
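
Usage inside a site's `main` looks roughly like this (a sketch only; I
believe the merged `createRedirects` takes an association list of
broken identifiers to working URLs and runs in the Rules monad, but
check the module's haddocks for the exact signature):

{-# LANGUAGE OverloadedStrings #-}
import Hakyll
import Hakyll.Web.Redirect (createRedirects)

main :: IO ()
main = hakyll $ do
    -- ordinary site rules ...
    match "css/*" $ do
        route   idRoute
        compile compressCssCompiler

    -- one small redirect page is generated per broken URL
    createRedirects [ ("About.html",  "/About")
                    , ("silk%20road", "/Silk%20Road")
                    ]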
--
gwern
https://www.gwern.net
Gwern Branwen
2017-06-03 17:36:46 UTC
Permalink
I've now been using this functionality for about half a year. I'm up
to 903 defined redirects. Amazingly, I am still defining new redirects
because there is no end to the number of typos and errors people
commit; many seem to be caused by PDFs or PDF readers mangling correct
URLs.

So far I have been able to reduce 404 traffic on gwern.net from ~1100
hits per month to ~150 hits (so perhaps an annual savings of ~11,000
hits?): https://www.dropbox.com/s/zbutd9iqu0wwu5s/Analytics%20www.gwern.net%20Pages%2020160501-20170531.pdf

I haven't been able to get it to 0, because people somehow still
trigger 404s on URLs I have already set up redirects for, on URLs
which can't be expressed as a redirect file because they collide with
directory names, and on URLs which are simply meaningless.

If I were starting from scratch, I think I would avoid trying to fix
the crawl errors in Google Webmaster Tools and prioritize actual 404s.
In retrospect, the crawl errors that GWT flagged were largely a waste
of time to fix, as few people would ever visit those ancient URLs,
while the 404 logs record where traffic is actually going, in order of
importance. So I should've started with those and dealt with GWT
issues later, if at all.

Still, the redirects haven't caused any website or maintenance
problems so far, and saving 900 hits per month is nice, so it seems
like a good thing to have done.
--
gwern
https://www.gwern.net