John Stracke <francis@ecal.com> writes:

> I just noticed that, when HTML::Parse encounters &nbsp;, it sends
> it to me as \240.  Since I want to treat my files as UTF-8, this
> is a problem.  Is there any way to tell it not to decode
> entities, or do I need to bite the bullet and implement the UTF-8
> option the manpage talks about?

Do you mean HTML::Parse or HTML::Parser here?

HTML::Parser decodes entities with the 'dtext' argspec and leaves
them alone for 'text'.

UTF8 should work nicely with bleadperl (soon to be 5.7.1).  &nbsp; is
still \240 though.

I don't think it makes sense to add a UTF8 option any more.  In fact I
just checked in the following patch:

Index: Parser.pm
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/Parser.pm,v
retrieving revision 2.142
diff -u -p -r2.142 Parser.pm
--- Parser.pm	2001/04/02 23:28:02	2.142
+++ Parser.pm	2001/04/06 19:52:45
@@ -652,12 +652,9 @@ automatically decoded unless the event w
 was between literal start and end tags (C<script>, C<style>,
 C<xmp>, and C<plaintext>).
 
-The ISO 8859-1 character set (aka Latin1) is assumed for entity
-decoding.
-
-It is planned that C<HTML::Parser> will get an C<utf8> option
-at some point that will affect the byte sequence that characters with
-codes greater than 127 will decode into.
+The Unicode character set is assumed for entity decoding.  With perl
+version < 5.7 only the Latin1 range is supported, and entities for
+characters outside the 0..255 range is left unchanged.
 
 This passes undef except for C<text> events.

--Gisle
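
For anyone reading this later, the difference between the 'text' and
'dtext' argspecs mentioned above can be seen with a small script.  This
is only a minimal sketch, assuming the HTML::Parser 3.x handler
interface (api_version => 3).  The sample input, the variable names and
the collect_text helper are made up for illustration and are not part
of the module.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: parse $html and return all text-event data,
# using the given argspec ('text' or 'dtext') for the text handler.
sub collect_text {
    my ($html, $argspec) = @_;
    my @chunks;
    my $p = HTML::Parser->new(
        api_version => 3,
        # Text may be delivered in several pieces, so collect them all.
        text_h      => [ sub { push @chunks, shift }, $argspec ],
    );
    $p->parse($html);
    $p->eof;    # flush any text still buffered by the parser
    return join '', @chunks;
}

my $html    = 'a&nbsp;b &amp; c';
my $raw     = collect_text($html, 'text');   # entities left as written
my $decoded = collect_text($html, 'dtext');  # entities decoded

# Show the no-break space as \240 so it is visible on a terminal.
(my $shown = $decoded) =~ s/\xA0/\\240/g;
print "text:  $raw\n";
print "dtext: $shown\n";
```

With 'text' the handler receives the source text untouched, so the
application can decide for itself how (or whether) to decode entities.
With 'dtext', &nbsp; arrives as the single character \240.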