Thursday, April 10, 2008

Escape CP-1252 chars in HTML

This is for when people paste crap from MS Word into HTML and bring along all those funky characters that look horrible.

My friend gave me this Perl one-liner (OK, you can use it as a one-liner if you want to) to escape the evil CP-1252 characters in your HTML.
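Here is a minimal sketch of that kind of substitution. It is not necessarily my friend's exact regex. The sub name `escape_cp1252` is made up for illustration, and it assumes the input is raw CP-1252 bytes that the core `Encode` module can decode. Anything outside printable ASCII (other than tab, newline, and carriage return) is turned into a decimal numeric character reference.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: assumes $bytes holds raw CP-1252 bytes.
sub escape_cp1252 {
    my ($bytes) = @_;

    # Decode so that, for example, byte 0x93 becomes U+201C (left curly quote).
    my $str = decode('cp1252', $bytes);

    # Replace every character that is not tab, LF, CR, or printable ASCII
    # with "&#NNNN;", where NNNN is the character's code point.
    $str =~ s/([^\t\n\r\x20-\x7E])/'&#' . ord($1) . ';'/ge;

    return $str;
}

print escape_cp1252("\x93Hello\x94 \x96 it\x92s"), "\n";
# prints: &#8220;Hello&#8221; &#8211; it&#8217;s
```

If your HTML is already decoded into Perl characters, the substitution line on its own should be enough.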

Notes for porting:
ord($1) returns the numeric character code of capture group 1 (the matched character).
Assumes $str holds your HTML.

Bonus!
The regex also replaces control chars!

1 comment:

  1. There's also the demoroniser, which is a Perl script, but I've munged it into Java for at least one of my apps...

    http://www.fourmilab.ch/webtools/demoroniser/

    And it's much nicer to look at than a convoluted regex ;)
