[hakyll] Passing a post through the sed command before handing it over to Pandoc

Discussion:

Philip

2015-09-23 09:13:03 UTC

Hi,

I want to pass the body of my posts through the sed command before
Pandoc starts processing it. How do I do this?

The details: I want to write posts in Markdown and use hakyll to
convert them to HTML. I would like to use LaTeX-style maths
delimiters, namely \( and \), to denote maths content in my posts, so
that MathJax can then render it as maths when displaying the resulting
HTML.

The trouble with this is that Pandoc sees these as escaped parentheses
and de-escapes them, so that the final HTML does not contain the
back-slashes.

For example, if I have the following in the source of my post:

    \(e^{i\pi}= -1\)

then this gets output in the HTML version as:

    (e^{i\pi}= -1)

with the consequence that this does not look like maths code to MathJax,
and hence gets rendered in the browser as LaTeX code rather than as
formatted maths.

My solution to this problem is to escape the back-slashes, so that they
appear in the resulting HTML document. That is, I would write the above
expression as:

    \\(e^{i\pi}= -1\\)

in my source (note the double back-slashes). The trouble with this is
that typing the extra backslashes gets old very fast. In many cases, I
end up typing 7 characters to display one maths character, such as in:
"Consider a finite set \\(S\\)."

My solution to *this* problem is to run my sources through the following
sed script which replaces each LaTeX maths delimiter with the
double-escaped version:

    sed -e 's/\\(/\\\\(/g; s/\\)/\\\\)/g'

My question is: How can I get hakyll to do this for me, instead of me
running the script whenever I change any post?

I got my site.hs from Alp's repository here:
https://github.com/alpmestan/alpmestan.com . The part of site.hs which
handles posts looks like this:

    match "posts/*" $ do
        route $ setExtension "html"
        compile $ myPandocCompiler
            >>= loadAndApplyTemplate "templates/post.html" (postCtx tags)
            >>= saveSnapshot "content"
            >>= loadAndApplyTemplate "templates/default.html" (defaultContext `mappend` yearCtx year)
            >>= relativizeUrls

--
You received this message because you are subscribed to the Google Groups "hakyll" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hakyll+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jasper Van der Jeugt

2015-09-23 16:02:28 UTC

Hello Philip,

I've added a `unixFilter` to your example so you can see how it works.
You can put it anywhere in the chain, as these are all functions of the
type:

Item String -> Compiler (Item String)

Note that `withItemBody` is the same as `traverse` [1], for the
generalists out there. We need this to have our `unixFilter` work on
the contents of the `Item` rather than the `Item` itself.

    match "posts/*" $ do
        route $ setExtension "html"
        compile $ pandocCompiler
            >>= loadAndApplyTemplate "templates/post.html" (postCtx tags)
            >>= saveSnapshot "content"
            >>= loadAndApplyTemplate "templates/default.html"
                    (defaultContext `mappend` yearCtx year)
            >>= withItemBody (unixFilter "sed" ["s/java/haskell/g"])
            >>= relativizeUrls

[1]: https://hackage.haskell.org/package/base-4.8.1.0/docs/Data-Traversable.html#v:traverse

Peace,
Jasper

On Wed, Sep 23, 2015 at 11:13 AM, Philip wrote:

> [...]
> After a bit of looking around, I figured out that I probably have to use
> the unixFilter function to get what I want. The trouble is that I don't
> know how to work an invocation of this function into the above chain.
> Could you help me with this?
> Thanks and regards,
> Philip

Philip

2015-09-24 10:48:12 UTC

Hi Jasper,

Thank you for your reply
(https://groups.google.com/d/msg/hakyll/CmaToUiiMR4/Vzwey9bHBQAJ), and
sorry for not replying to the thread: I cannot do that without signing
in to a google account, probably because I was not on the reply list.

(Also, a big Thank You for hakyll in general!)

I'm afraid your suggestion does not work. The code does compile and
hakyll generates output, but it doesn't match what I want.

For instance:

1. My input contains the string "\(U\)"
2. By default, Pandoc rewrites this to "(U)" (See:
   http://pandoc.org/README.html#backslash-escapes)
3. This will come out as "(U)" in the HTML, and then MathJax will not
   look at it.

Once I add the line

        >>= withItemBody (unixFilter "sed" ["-e", "s/[\\][(]/\\\\\\(/g; s/[\\][)]/\\\\\\)/g"])

as specified in your reply, the string "\(U\)" in the input *still*
turns out as "(U)" in the generated HTML.

(I need all those backslashes and brackets to safely get the
expression through to sed via Haskell and the shell. The
corresponding sed invocation which works in the bash shell is:

    sed -e 's/[\][(]/\\\\(/g; s/[\][)]/\\\\)/g'

).
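The layers of escaping can be checked in isolation. The following is a minimal sketch, not part of the original exchange: it writes the bash version of the script above as a Haskell string literal, where every literal backslash has to be typed twice, and counts the backslashes that actually end up in the string. The names `sedScript` and `countBackslashes` are hypothetical. It assumes the argument list reaches sed unchanged, without a shell interpreting it in between; whether `unixFilter` starts the process that way is an assumption worth checking against the Hakyll source.

```haskell
-- The sed script exactly as sed should receive it, i.e. what one would
-- type between single quotes in bash:
--
--     s/[\][(]/\\\\(/g; s/[\][)]/\\\\)/g
--
-- In a Haskell string literal each literal backslash is written as "\\".
sedScript :: String
sedScript = "s/[\\][(]/\\\\\\\\(/g; s/[\\][)]/\\\\\\\\)/g"

-- Count the literal backslashes in a string, to compare a Haskell
-- literal against the version typed in the shell.
countBackslashes :: String -> Int
countBackslashes = length . filter (== '\\')
```

Running `putStrLn sedScript` in GHCi prints the script as sed would see it, which makes it easy to compare against the shell version character by character.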

The trouble, I think, is that the line

withItemBody (unixFilter "sed" ["..."])

has its effect *after* Pandoc has processed the input text. This is
supported by the following observation: Once I add the following line

        >>= withItemBody (unixFilter "sed" ["-e", "s/[\\][(]/O/g; s/[\\][)]/C/g"])

(note the different replacement strings this time: now I want to
replace the opening LaTeX maths delimiter with O, and the closing
delimiter with C), I see the following transformations from input to output:

    Input        Output
    ---------    ------
    "\(U\)"      "(U)"
    "\\(U\\)"    "OUC"

So the unixFilter line can indeed find the pattern "\(" and replace it
with something else, but it only gets to work on the output of Pandoc,
which has already replaced "\(" with "(" and "\)" with ")".
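The ordering effect described here can be reproduced with a toy model. This is only a sketch under stated assumptions: `unescape` is a hypothetical, much simplified stand-in for the backslash-escape handling of Pandoc's Markdown reader (a backslash in front of a non-alphanumeric character is dropped), and `protectDelims` plays the role of the sed script. Neither is Hakyll or Pandoc code.

```haskell
import Data.Char (isAlphaNum)

-- Simplified stand-in for the Markdown reader: a backslash followed by a
-- non-alphanumeric character yields just that character. A backslash in
-- front of a letter or digit (as in \pi) is left alone.
unescape :: String -> String
unescape ('\\' : c : rest)
    | not (isAlphaNum c) = c : unescape rest
unescape (c : rest) = c : unescape rest
unescape [] = []

-- Stand-in for the sed script: double the backslash in front of
-- every \( and \).
protectDelims :: String -> String
protectDelims ('\\' : c : rest)
    | c `elem` "()" = '\\' : '\\' : c : protectDelims rest
protectDelims (c : rest) = c : protectDelims rest
protectDelims [] = []

-- Filter placed after the renderer: the delimiters are already gone
-- by the time the filter runs.
filterAfterRender :: String -> String
filterAfterRender = protectDelims . unescape

-- Filter placed before the renderer: the doubled backslashes are reduced
-- to single ones, so \( and \) survive.
filterBeforeRender :: String -> String
filterBeforeRender = unescape . protectDelims
```

In this model, `filterAfterRender` turns the six-character input \(U\) into (U), while `filterBeforeRender` returns it unchanged, which is the behaviour wanted for MathJax.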

I want to pass the body of my posts through the sed command *before*
Pandoc starts processing it. How do I do this?

Thanks a lot for your help!

Regards,
Philip

Jasper Van der Jeugt

2015-09-24 11:56:29 UTC

Right.

`pandocCompiler` can be decomposed further, into
`getResourceBody >>= renderPandoc`. You can put your substitution in
between there.

    match "posts/*" $ do
        route $ setExtension "html"
        compile $ getResourceBody
            >>= withItemBody (unixFilter "sed" ["s/java/haskell/g"])
            >>= renderPandoc
            >>= loadAndApplyTemplate ...
            ...

Note that you can replace `unixFilter` by just a Haskell function which
does the same thing...

            ...
            >>= withItemBody (return . substitute "java" "haskell")
            ...

This is a naive implementation but it'll do the job:

    import Data.List (isPrefixOf)

    substitute :: String -> String -> String -> String
    substitute _ _ [] = []
    substitute needle replacement haystack@(x : xs)
        | needle `isPrefixOf` haystack =
            replacement ++ substitute needle replacement (drop (length needle) haystack)
        | otherwise = x : substitute needle replacement xs
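Applied to the maths delimiters from the original question, the helper might be used as in the following sketch. The name `escapeMathDelimiters` is hypothetical; `substitute` is repeated only so that the block compiles on its own. Two caveats of the naive version: an empty needle makes it loop forever, and running the escaping twice adds yet another backslash, so it should be applied exactly once to the raw source.

```haskell
import Data.List (isPrefixOf)

-- Naive substring replacement, as given in this thread.
-- Caveat: an empty needle never terminates.
substitute :: String -> String -> String -> String
substitute _ _ [] = []
substitute needle replacement haystack@(x : xs)
    | needle `isPrefixOf` haystack =
        replacement ++ substitute needle replacement (drop (length needle) haystack)
    | otherwise = x : substitute needle replacement xs

-- Hypothetical helper: double the backslash in front of \( and \), so
-- that a single backslash is left once the Markdown reader has removed
-- one level of escaping.
escapeMathDelimiters :: String -> String
escapeMathDelimiters =
    substitute "\\(" "\\\\(" . substitute "\\)" "\\\\)"
```

In the Hakyll chain this would sit between `getResourceBody` and `renderPandoc`, as `withItemBody (return . escapeMathDelimiters)`.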

Hope this helps,
Peace,
Jasper

Philip

2015-09-25 06:42:03 UTC

Thank you, that worked! Also, it seems I can just reply to the email I
sent and it will become part of the same thread on the google discussion
list. Neat!

More details below:

I could not copy-paste Jasper's code _per_se_, since "my" site.hs (which
is really Alp Mestanogullari's :
https://github.com/alpmestan/alpmestan.com ) does not have a plain
invocation of pandocCompiler. Instead, it (eventually) calls
pandocCompilerWith, in the following fashion:

    myPandocCompiler' withToc =
        pandocCompilerWith defaultHakyllReaderOptions $
            case withToc of
                Just x | map toLower x `elem` ["true", "yes"] -> writerWithToc
                       | otherwise                            -> writerOpts
                Nothing                                       -> writerOpts

Following Jasper's suggestion, I replaced this with:

    myPandocCompiler' withToc =
        cached "Hakyll.Web.Pandoc.pandocCompilerWith" $
            writePandocWith wopt <$>
                (traverse (return . id)
                    =<< readPandocWith defaultHakyllReaderOptions
                    =<< withItemBody
                            (return . (replace "\\(" "\\\\(") . (replace "\\)" "\\\\)"))
                    =<< getResourceBody)
      where
        wopt = case withToc of
            Just x | map toLower x `elem` ["true", "yes"] -> writerWithToc
                   | otherwise                            -> writerOpts
            Nothing                                       -> writerOpts

(`replace` is just Jasper's `substitute` with the name changed.)

And now it works fine!

Thanks a lot,
Philip