Re: HTML::Parse: what if I *don't* want entities decoded? - nntp.perl.org

From: Gisle Aas
Date: April 6, 2001 12:56
Subject: Re: HTML::Parse: what if I *don't* want entities decoded?
Message ID: lru2417qh2.fsf@caliper.ActiveState.com

John Stracke <francis@ecal.com> writes:

> I just noticed that, when HTML::Parse encounters &nbsp;, it sends
> it to me as \240.  Since I want to treat my files as UTF-8, this
> is a problem.  Is there any way to tell it not to decode
> entities, or do I need to bite the bullet and implement the UTF-8
> option the manpage talks about?

Do you mean HTML::Parse or HTML::Parser here?

HTML::Parser decodes entities with the 'dtext' argspec and leaves them
alone for 'text'.
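
A minimal sketch of the difference, assuming the version 3 handler API
and a made-up input string.  One text handler asks for both argspecs so
the two forms can be compared side by side:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Sample input; the markup is made up for illustration.
my $html = '<p>fish&nbsp;&amp;&nbsp;chips</p>';

my ($raw, $decoded) = ('', '');

my $p = HTML::Parser->new(
    api_version => 3,
    # 'text' is the source text as written; 'dtext' is the same text
    # with entities decoded.  Text may arrive in several chunks, so
    # the handler appends rather than assigns.
    text_h => [
        sub {
            my ($text, $dtext) = @_;
            $raw     .= $text;
            $decoded .= $dtext;
        },
        'text, dtext',
    ],
);
$p->parse($html);
$p->eof;

# Show non-printable and non-ASCII characters as \x{..} escapes.
(my $shown = $decoded) =~ s/([^\x20-\x7e])/sprintf('\\x{%02x}', ord $1)/ge;
print "text:  $raw\n";
print "dtext: $shown\n";
```

With 'text' alone the handler sees &nbsp; exactly as it appears in the
file, which is probably what the original question was after.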

UTF8 should work nicely with bleadperl (soon to be 5.7.1).  &nbsp; still
decodes to \240 (the character U+00A0) though.
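
As a rough illustration on a later perl (this assumes the Encode module,
which was not yet available when this was written, and uses
HTML::Entities from the same distribution): the decoded &nbsp; is a
single character with code 0xA0, and it only becomes a two-byte sequence
when the string is explicitly encoded as UTF-8.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode qw(encode);   # assumption: a perl that ships Encode (5.8 or later)

# Render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $nbsp = decode_entities('&nbsp;');   # code point 0xA0, i.e. octal 240
my $euro = decode_entities('&euro;');   # U+20AC, outside the 0..255 range

# Encoding turns characters into the UTF-8 byte sequence.
my $nbsp_utf8 = encode('UTF-8', $nbsp);
my $euro_utf8 = encode('UTF-8', $euro);

printf "nbsp: U+%04X -> %s\n", ord($nbsp), hex_bytes($nbsp_utf8);
printf "euro: U+%04X -> %s\n", ord($euro), hex_bytes($euro_utf8);
```

On such a perl this should print C2 A0 for &nbsp; and E2 82 AC for
&euro;.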

I don't think it makes sense to add a UTF8 option any more.  In fact I
just checked in the following patch:

Index: Parser.pm
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/Parser.pm,v
retrieving revision 2.142
diff -u -p -r2.142 Parser.pm
--- Parser.pm   2001/04/02 23:28:02     2.142
+++ Parser.pm   2001/04/06 19:52:45
@@ -652,12 +652,9 @@ automatically decoded unless the event w
 was between literal start and end tags (C<script>, C<style>, C<xmp>,
 and C<plaintext>).
 
-The ISO 8859-1 character set (aka Latin1) is assumed for entity
-decoding.
-
-It is planned that C<HTML::Parser> will get an C<utf8> option
-at some point that will affect the byte sequence that characters with
-codes greater than 127 will decode into.
+The Unicode character set is assumed for entity decoding.  With perl
+version < 5.7 only the Latin1 range is supported, and entities for
+characters outside the 0..255 range is left unchanged.
 
 This passes undef except for C<text> events.
 


--Gisle
