This is for when people paste crap from MS Word into HTML and it brings along all those funky characters that look horrible.
My friend gave me this Perl one-liner (OK, you can use it as a one-liner if you want to) to escape the evil CP-1252 characters in your HTML.
Notes for porting:
ord($1) returns the numeric character code of capture group 1 (the matched character).
The code assumes $str holds your HTML.
Bonus!
the above regex replaces control chars!
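For anyone porting this, here is a rough Python sketch of the same idea, not the original regex. It assumes the offending characters are the ones in the range 0x80 to 0x9F (where CP-1252 curly quotes, dashes and the like end up when the text is read as Latin-1) and that you want each one rewritten as a decimal numeric character reference. The names `escape_cp1252` and `_C1_RANGE` are made up for this example, and the exact character range is an assumption.

```python
import re

# Assumed range: 0x80-0x9F, where CP-1252 "smart" punctuation lands
# when the pasted text has been decoded as Latin-1.
_C1_RANGE = re.compile(r"[\x80-\x9f]")


def escape_cp1252(html: str) -> str:
    """Replace each character in the assumed range with '&#NNN;'.

    ord() plays the same role as Perl's ord($1): it returns the
    numeric code of the matched character.
    """
    return _C1_RANGE.sub(lambda m: "&#{};".format(ord(m.group(0))), html)


if __name__ == "__main__":
    # \x93 and \x94 are the CP-1252 curly double quotes.
    print(escape_cp1252("\x93quoted\x94"))
```

Plain ASCII text passes through untouched, so it should be safe to run this over a whole page.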
There's also the demoroniser, which is a Perl script, but I've munged it into Java for at least one of my apps...
http://www.fourmilab.ch/webtools/demoroniser/
And it's much nicer to look at than a convoluted regex ;)