[hakyll] A simple shell script for generating sitemap.xml

Kyle Marek-Spartz

2017-10-10 19:01:13 UTC

Neat! I opted for a sitemap.txt format since it was simpler to generate:

https://github.com/kmarekspartz/kyle.marek-spartz.org/blob/master/bin/echo_sitemap.sh

And:

https://github.com/kmarekspartz/kyle.marek-spartz.org/blob/master/bin/build_sitemap.sh
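
For readers unfamiliar with the format: a sitemap.txt is just one absolute URL per line. The following is a minimal sketch of that idea, not the contents of the scripts linked above. The function name `echo_sitemap_txt`, the `_site` default, and the `https://example.org` base URL are all assumptions for illustration, and filenames are assumed not to need percent-encoding.

```shell
#!/bin/sh
# Hypothetical helper (not the linked scripts): print one absolute URL per
# line for every HTML file under a site directory.
# Arguments (both illustrative defaults):
#   $1  site directory, default "_site"
#   $2  base URL,       default "https://example.org"
echo_sitemap_txt() {
    site_dir="${1:-_site}"
    site_dir="${site_dir%/}"          # drop a trailing slash, if any
    base_url="${2:-https://example.org}"
    # Replace the leading site directory with the base URL.
    # Assumes paths contain no characters that need percent-encoding.
    find "$site_dir" -type f -name '*.html' | sort |
        sed -e "s|^${site_dir}/|${base_url}/|"
}
```

Something like `echo_sitemap_txt _site https://example.org > _site/sitemap.txt` would then write the file.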

Post by Gwern Branwen
Sitemaps are useful for guiding search engines to pages & files which
are not prominently linked and will not necessarily be spidered on
their own. I noticed that a number of documents I host on gwern.net
did not show up in Google Scholar as I expected, which defeated the
point of hosting them, so I began looking into how to set up a
sitemap.xml.

There are a few examples for Hakyll or even full-blown modules, but I
either couldn't get them to work or they looked absurdly complex &
required hundreds of lines of code. A sitemap.xml is such a simple
format that I found that a little disgusting, so eventually I gave up
and wrote a simple shell script instead. The XML validates despite
containing thousands of URLs, and Google Webmaster Tools reports no
errors & crawling is successful. So perhaps other Hakyll users might
find it useful.

This script depends on xargs, sh/Bash, find, sed, and the CLI tool
'urlencode' (available on Debian/Ubuntu from the 'gridsite-clients'
package) to escape filenames. find looks for files which are big
enough that they are unlikely to be HTML redirects generated by
Hakyll.Redirects; the filenames are then URL-encoded, substituted into
a URL entry, and concatenated, a header and footer are added, and
voila, a sitemap.xml:

(echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
 find _site/ -type f -size +2k -print0 | xargs -L 1 -0 urlencode -m |
   sed -e 's/_site\/\(.*\)/\<url\>\<loc\>https:\/\/www\.gwern\.net\/\1<\/loc><changefreq>monthly<\/changefreq><\/url>/'
 echo "</urlset>") >> ./_site/sitemap.xml

This is run inside the top-level website directory; replace the
'www\.gwern\.net' as appropriate.
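
To make the "replace as appropriate" step concrete, here is a hedged sketch of the same pipeline with the site directory and base URL as parameters. It is a variant under stated assumptions, not the script above: the function name `build_sitemap` and the `https://www.example.org` default are hypothetical, the `urlencode -m` step is left out (paths are assumed to be URL-safe already, apart from `&`, which is XML-escaped), and any existing sitemap.xml is skipped so it does not list itself.

```shell
#!/bin/sh
# Sketch only: build_sitemap, its arguments and defaults are illustrative and
# not part of the original script.
#   $1  site directory, default "_site"
#   $2  base URL,       default "https://www.example.org"
# Assumption: paths need no percent-encoding (the original's `urlencode -m`
# step is omitted here).
build_sitemap() {
    site_dir="${1:-_site}"
    site_dir="${site_dir%/}"          # drop a trailing slash, if any
    base_url="${2:-https://www.example.org}"
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    # Same size filter as the original, so tiny redirect stubs are skipped;
    # also skip a previously generated sitemap.xml.
    find "$site_dir" -type f -size +2k ! -name sitemap.xml | sort |
        sed -e 's/&/\&amp;/g' \
            -e "s|^${site_dir}/\(.*\)|<url><loc>${base_url}/\1</loc><changefreq>monthly</changefreq></url>|"
    echo '</urlset>'
}
```

It could be called as `build_sitemap _site https://www.gwern.net > _site/sitemap.xml`. Redirecting with `>` rather than `>>` overwrites the file on each run, which may be preferable if the output directory is not cleaned between builds.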
--
gwern
https://www.gwern.net

--
You received this message because you are subscribed to the Google Groups "hakyll" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hakyll+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.