[PATCH][docs] Encode.pm
From:
Anton Tagunov
Date:
March 19, 2002 10:49
Subject:
[PATCH][docs] Encode.pm
Message ID:
8776255810.20020319214806@motor.ru
Hello, developers!
With my improved knowledge of encoding naming, I propose the patch below.
Justification:
1)
Shift-JIS -> Shift_JIS does not hurt anyone, because neither
spelling works: Encode::encode understands only 'shiftjis'.
I would prefer to settle the naming first; I will submit a
separate bug report later for all the aliases that do not work.
2)
I do not care too much if I have classified some encodings
wrongly: I hope that as soon as something like this
gets into the docs we'll get plenty of feedback, enough
to correct even the worst mistakes :-) To me it looks
good enough just to start the section.
<DISCLAIMER>
The main goal was to separate MIME names from
ISO names from proprietary names.
</DISCLAIMER>
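The claim in point 1 can be checked against whatever Encode is installed. This is a minimal sketch, assuming a Perl with the Encode module; the list of spellings and the probe() helper are only illustrative, and the answers depend on the Encode version, so no output is shown:

```perl
use strict;
use warnings;
use Encode ();

# Candidate spellings to probe; the list itself is illustrative.
my @spellings = ('shiftjis', 'Shift_JIS', 'Shift-JIS', 'latin1');

# Hypothetical helper: Encode::resolve_alias() returns the canonical
# name, or a false value if the spelling is not recognised.
sub probe {
    my ($name) = @_;
    my $canonical = Encode::resolve_alias($name);
    return $canonical ? $canonical : undef;
}

for my $name (@spellings) {
    my $canonical = probe($name);
    printf "%-10s => %s\n", $name,
        defined $canonical ? $canonical : '(not recognised)';
}
```

Any spelling reported as not recognised is a candidate for the separate alias bug report.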
Comment:
JIS 0201
JIS 0208
JIS 0212
GB 1988
GB 2312
look very suspicious to me, but I have posted separate mails
about them.
Grumbling:
CNS 11643
GB 12345
really hurt my feelings because they have a space inside,
but I have found no reason to touch them: neither
IANA nor RFC 1345 names them, and everywhere I've seen them
they are written with a space.
Do you think they could still be changed to CNS-.., GB-
for consistency and beauty? :-)
Proposition:
Should Name: HZ-GB-2312 be established as a synonym for HZ?
Or is it not worth the trouble?
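Whatever is decided, a script can set up such a synonym locally. This is a minimal sketch, assuming Encode::Alias and its define_alias() function are available and that the HZ encoding is registered under the canonical name 'hz'; it is a user-level workaround, not part of the patch:

```perl
use strict;
use warnings;
use Encode ();
use Encode::Alias qw(define_alias);

# Local alias only; this does not change what Encode ships with.
define_alias('HZ-GB-2312' => 'hz');

my $enc = Encode::find_encoding('HZ-GB-2312')
    or die "alias did not resolve";
print $enc->name, "\n";

# Round-trip a short ASCII string through the aliased name.
my $bytes = Encode::encode('HZ-GB-2312', "abc");
my $text  = Encode::decode('HZ-GB-2312', $bytes);
```

An alias defined this way lasts only for the running program, so it does not answer the question of what the documentation should list.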
Looking forward to your opinions! :-)))
- Anton
--- ext/Encode/Encode.pm.orig Mon Mar 18 00:20:24 2002
+++ ext/Encode/Encode.pm Tue Mar 19 21:42:26 2002
@@ -500,34 +500,34 @@
ISO 10646-1 => UCS-2
-The ISO 8859 and KOI:
+The ISO-8859 and KOI:
- ISO 8859-1 ISO 8859-6 ISO 8859-11 KOI8-F
- ISO 8859-2 ISO 8859-7 (12 doesn't exist) KOI8-R
- ISO 8859-3 ISO 8859-8 ISO 8859-13 KOI8-U
- ISO 8859-4 ISO 8859-9 ISO 8859-14
- ISO 8859-5 ISO 8859-10 ISO 8859-15
- ISO 8859-16
-
- Latin1 => 8859-1 Latin6 => 8859-10
- Latin2 => 8859-2 Latin7 => 8859-13
- Latin3 => 8859-3 Latin8 => 8859-14
- Latin4 => 8859-4 Latin9 => 8859-15
- Latin5 => 8859-9 Latin10 => 8859-16
-
- Cyrillic => 8859-5
- Arabic => 8859-6
- Greek => 8859-7
- Hebrew => 8859-8
- Thai => 8859-11
- TIS620 => 8859-11
+ ISO-8859-1 ISO-8859-6 ISO-8859-11 KOI8-F
+ ISO-8859-2 ISO-8859-7 (12 doesn't exist) KOI8-R
+ ISO-8859-3 ISO-8859-8 ISO-8859-13 KOI8-U
+ ISO-8859-4 ISO-8859-9 ISO-8859-14
+ ISO-8859-5 ISO-8859-10 ISO-8859-15
+ ISO-8859-16
+
+ Latin1 => ISO-8859-1 Latin6 => ISO-8859-10
+ Latin2 => ISO-8859-2 Latin7 => ISO-8859-13
+ Latin3 => ISO-8859-3 Latin8 => ISO-8859-14
+ Latin4 => ISO-8859-4 Latin9 => ISO-8859-15
+ Latin5 => ISO-8859-9 Latin10 => ISO-8859-16
+
+ Cyrillic => ISO-8859-5
+ Arabic => ISO-8859-6
+ Greek => ISO-8859-7
+ Hebrew => ISO-8859-8
+ Thai => ISO-8859-11
+ TIS620 => ISO-8859-11
The CJKV: Chinese, Japanese, Korean, Vietnamese:
- ISO 2022 ISO 2022 JP-1 JIS 0201 GB 1988 Big5 EUC-CN
- ISO 2022 CN ISO 2022 JP-2 JIS 0208 GB 2312 HZ EUC-JP
- ISO 2022 JP ISO 2022 KR JIS 0210 GB 12345 CNS 11643 EUC-JP-0212
- Shift-JIS GBK Big5-HKSCS EUC-KR
+ ISO-2022 ISO-2022-JP-1 JIS 0201 GB 1988 Big5 EUC-CN
+ ISO-2022-CN ISO-2022-JP-2 JIS 0208 GB 2312 HZ EUC-JP
+ ISO-2022-JP ISO-2022-KR JIS 0210 GB 12345 CNS 11643 EUC-JP-0212
+ Shift_JIS GBK Big5-HKSCS EUC-KR
VISCII ISO-IR-165
(Due to size concerns, additional Chinese encodings including C<GB 18030>,
@@ -572,6 +572,59 @@
DingBats Roman8
GSM 0338 Symbol
+=head2 Encoding Classification
+
+Encodings
+
+ US-ASCII UTF-8 KOI8-R ISO-8859-*
+ ISO-2022-CN ISO-2022-JP ISO-2022-KR Big5
+ EUC-CN EUC-JP EUC-KR
+
+are L<http://www.iana.org/assignments/character-sets>-registered
+as preferred MIME names and may probably be used over the Internet.
+So is
+
+ Shift_JIS
+
+but despite its wide use it bears the label of being
+Microsoft proprietary.
+
+ UTF-16 KOI8-U ISO-2022-JP-2
+
+are IANA-registered preferred MIME names but should probably
+be avoided as encodings for web pages due to lack of browser
+support.
+
+
+ ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
+ ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
+ ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
+ GBK
+ VISCII
+ GB 12345 (only planes 1 and 2 available)
+ GB 18030
+ CNS 11643
+
+are totally valid encodings but not registered at IANA.
+
+ BIG5PLUS
+ EUC-JP-0212 (Encode::lib::Encode::Tcl::Extended)
+
+are a bit proprietary
+
+You may probably get some info on CJK encodings at
+
+ brief description for most of the mentioned CJK encodings
+ http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
+
+ several years old, but still useful
+ http://www.oreilly.com/people/authors/lunde/cjk_inf.html
+
+ and some in-depth reading for the heroes :-)
+ http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM (eq ISO-2022)
+ http://www.faqs.org/rfcs/rfc1345.txt
+
+
=head1 PERL ENCODING API
=head2 Generic Encoding Interface
@@ -598,7 +651,7 @@
internal form and returns the resulting string. For CHECK see
L</"Handling Malformed Data">.
-For example to convert ISO 8859-1 data to UTF-8:
+For example to convert ISO-8859-1 data to UTF-8:
$utf8 = decode("latin1", $latin1);
@@ -611,7 +664,7 @@
encode() or through PerlIO: See L</"Encoding and IO">. For CHECK
see L</"Handling Malformed Data">.
-For example to convert ISO 8859-1 data to UTF-8:
+For example to convert ISO-8859-1 data to UTF-8:
from_to($data, "iso-8859-1", "utf-8");
@@ -848,7 +901,7 @@
"character operations" (e.g. C<lc>, C</\W+/>, ...).
You can also use PerlIO to convert larger amounts of data you don't
-want to bring into memory. For example to convert between ISO 8859-1
+want to bring into memory. For example to convert between ISO-8859-1
(Latin 1) and UTF-8 (or UTF-EBCDIC in EBCDIC machines):
open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!;