[hakyll] A simple shell script for generating sitemap.xml
Gwern Branwen
2017-10-10 17:30:58 UTC
Sitemaps are useful for guiding search engines to pages & files which
are not prominently linked and will not necessarily be spidered on
their own. I noticed that a number of documents I host on gwern.net
did not show up in Google Scholar as I expected, which defeated the
point of hosting them, so I began looking into how to set up a
sitemap.xml.

There are a few examples for Hakyll or even full-blown modules, but I
either couldn't get them to work or they looked absurdly complex &
required hundreds of lines of code. A sitemap.xml is such a simple
format that I found that a little disgusting, so eventually I gave up
and just hacked together a short shell script which appears to work:
the XML validates despite containing thousands of URLs and Google
Webmaster Tools reports no errors & crawling is successful. So perhaps
other Hakyll users might find it useful.

This script depends on xargs, sh/Bash, find, sed, and the CLI tool
'urlencode' (available on Debian/Ubuntu from the 'gridsite-clients'
package) to escape filenames. find selects files larger than 2 KiB
('-size +2k'), which are unlikely to be the small HTML redirects
generated by Hakyll.Web.Redirect; each path is then url-encoded and
substituted into a URL entry, the entries are concatenated between an
XML header and footer, and voila, a sitemap.xml:

# '>' rather than '>>', so that a re-run overwrites the old sitemap
# instead of appending a second XML document to it
(echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
 find _site/ -type f -size +2k -print0 | xargs -L 1 -0 urlencode -m |
   sed -e 's/_site\/\(.*\)/<url><loc>https:\/\/www\.gwern\.net\/\1<\/loc><changefreq>monthly<\/changefreq><\/url>/'
 echo "</urlset>") > ./_site/sitemap.xml

This is run inside the top-level website directory; replace the
'www\.gwern\.net' as appropriate.
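
To see what the sed step produces for a single file, here is a minimal
sketch that isolates it. The function name 'to_url_entry' and the domain
'www.example.com' are placeholders for illustration, not part of the
script above, and the input path is assumed to be url-encoded already:

```shell
#!/bin/sh
# Hypothetical helper: read already-encoded "_site/..." paths on stdin
# (one per line) and wrap each one in a sitemap <url> entry.
# $1 is the site's base URL without a trailing slash; it must not
# contain '|' or '&', which are special in this sed expression.
to_url_entry() {
    base="$1"
    sed -e "s|^_site/\(.*\)|<url><loc>${base}/\1</loc><changefreq>monthly</changefreq></url>|"
}

# Example: one path in, one <url> entry out.
printf '%s\n' '_site/docs/2017-paper.pdf' | to_url_entry 'https://www.example.com'
```

Using '|' as the sed delimiter avoids having to backslash-escape every
'/' in the URL, which is the main source of noise in the one-liner.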
--
gwern
https://www.gwern.net
--
You received this message because you are subscribed to the Google Groups "hakyll" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hakyll+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Kyle Marek-Spartz
2017-10-10 19:01:13 UTC
Neat! I opted for a sitemap.txt format since it was simpler to generate:

https://github.com/kmarekspartz/kyle.marek-spartz.org/blob/master/bin/echo_sitemap.sh

And:

https://github.com/kmarekspartz/kyle.marek-spartz.org/blob/master/bin/build_sitemap.sh
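
For readers who do not want to follow the links: the sitemap.txt format
is just a UTF-8 text file with one absolute URL per line. The following
is a rough sketch of that idea, not a copy of the linked scripts. The
function name 'make_sitemap_txt' and the domain are placeholders, and it
assumes filenames that need no url-encoding:

```shell
#!/bin/sh
# Hypothetical helper: print one absolute URL per line for every .html
# file under a site directory.
# $1 = site directory (e.g. _site), $2 = base URL without trailing slash.
# Assumes filenames contain no newlines and no characters that would
# need url-encoding, and that the base URL contains no '|' or '&'.
make_sitemap_txt() {
    site_dir="$1"
    base_url="$2"
    (cd "$site_dir" &&
        find . -type f -name '*.html' |
        sed -e "s|^\./|${base_url}/|" |
        LC_ALL=C sort)
}

# Usage (run from the top-level website directory):
#   make_sitemap_txt _site https://www.example.com > _site/sitemap.txt
```

The trade-off is that a text sitemap cannot carry per-URL metadata such
as changefreq, which the XML version above does include.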