Gwern Branwen
2017-10-10 17:30:58 UTC
Sitemaps are useful for guiding search engines to pages & files which
are not prominently linked and will not necessarily be spidered on
their own. I noticed that a number of documents I host on gwern.net
did not show up in Google Scholar as I expected, which defeated the
point of hosting them, so I began looking into how to set up a
sitemap.xml.
There are a few examples for Hakyll or even full-blown modules, but I
either couldn't get them to work or they looked absurdly complex &
required hundreds of lines of code. A sitemap.xml is such a simple
format that I found that a little disgusting, so eventually I gave up
and just hacked together a short shell script which appears to work:
the XML validates despite containing thousands of URLs and Google
Webmaster Tools reports no errors & crawling is successful. So perhaps
other Hakyll users might find it useful.
This script depends on xargs, sh/Bash, find, sed, and the CLI tool
'urlencode' (available on Debian/Ubuntu from the 'gridsite-clients'
package) to escape filenames. find selects files larger than 2KB, which
are unlikely to be HTML redirects generated by Hakyll.Redirects; each
path is then URL-encoded and substituted into a URL entry, the entries
are concatenated and wrapped in a header & footer, and voila, a
sitemap.xml:
# '>' rather than '>>', so a sitemap left over from a previous run is overwritten, not appended to
(echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
find _site/ -type f -size +2k -print0 | xargs -L 1 -0 urlencode -m |
sed -e 's/_site\/\(.*\)/\<url\>\<loc\>https:\/\/www\.gwern\.net\/\1<\/loc><changefreq>monthly<\/changefreq><\/url>/'
echo "</urlset>") > ./_site/sitemap.xml
This is run inside the top-level website directory; replace the
'www\.gwern\.net' as appropriate.
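If editing the sed expression by hand is unappealing, the same idea can be
sketched as a small function which takes the site directory & base URL as
arguments. This is only a rough sketch, not what I run: the name
'make_sitemap' and its two arguments are made up for illustration, and it
skips the urlencode step entirely, so it assumes filenames contain no
newlines and nothing that needs URL- or XML-escaping.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script):
#   make_sitemap SITE_DIR BASE_URL
# Prints a sitemap to stdout. ASSUMPTION: paths need no URL/XML escaping;
# the original script pipes them through 'urlencode -m' for that.
make_sitemap() {
    site_dir="${1%/}"    # drop one trailing slash, if present
    base_url="${2%/}"
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    # Same size filter as the original: skip small files (likely redirects).
    find "$site_dir" -type f -size +2k | sort | while IFS= read -r path; do
        # Strip the leading "SITE_DIR/" so only the site-relative path remains.
        printf '<url><loc>%s/%s</loc><changefreq>monthly</changefreq></url>\n' \
            "$base_url" "${path#"$site_dir"/}"
    done
    echo '</urlset>'
}
```

It would be called from the top-level website directory as
'make_sitemap _site https://www.gwern.net > sitemap.xml', with the result
then moved into _site/ so the sitemap does not list itself.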
--
gwern
https://www.gwern.net
--
You received this message because you are subscribed to the Google Groups "hakyll" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hakyll+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.