NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

SYNOPSIS

 use utf8;
 use utf8 'Greek', 'Arabic';  # allow mixed scripts in identifiers
 no utf8;

 # Convert the internal representation of a Perl scalar to/from UTF-8.

 $num_octets = utf8::upgrade($string);
 $success    = utf8::downgrade($string[, $fail_ok]);

 # Change each character of a Perl scalar to/from a series of
 # characters that represent the UTF-8 bytes of each original character.

 utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
 utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

 # Convert a code point from the platform native character set to
 # Unicode, and vice-versa.
 $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                               # ASCII and EBCDIC
                                               # platforms
 $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                               # platforms; 193 on
                                               # EBCDIC

 $flag = utf8::is_utf8($string); # since Perl 5.8.1
 $flag = utf8::valid($string);

DESCRIPTION

The use utf8 pragma tells the Perl parser to allow UTF-8, and certain mixed scripts other than Latin, Common and Inherited, in the program text in the current lexical scope. This applies to identifiers (package and symbol names, function and variable names) and to literals. It does not declare strings in the source to be UTF-8 encoded or Unicode; for that, see "The 'unicode_strings' feature" in feature.

The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope. (On EBCDIC platforms, use utf8 technically allows UTF-EBCDIC rather than UTF-8, but the distinction is academic, so this document uses the term UTF-8 to mean both.)

Do not use this pragma for anything other than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without use utf8;.

Because it is not possible to reliably tell UTF-8 from native 8-bit encodings, you need either a Byte Order Mark at the beginning of your source code, or use utf8;, to tell perl that the source is UTF-8.

When UTF-8 becomes the standard source format, this pragma without any argument will effectively become a no-op.

See also the effects of the -C switch and its cousin, the PERL_UNICODE environment variable, in perlrun.

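Enabling the utf8 pragma has the following effect: bytes in the source text that are not in the ASCII character set are treated as being part of a literal UTF-8 sequence. This includes most literals, such as identifier names, string constants, and constant regular expression patterns.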

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example embedded Latin-1 in your string literals), use utf8 will be unhappy. If you want to have such bytes under use utf8, you can disable this pragma until the end of the block (or file, if at top level) with no utf8;.
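The lexical scoping of use utf8 and no utf8 can be seen by measuring the same literal under both. This is only a minimal sketch: it assumes the file itself is saved as UTF-8, and the variable names $chars and $bytes are illustrative.

```perl
use strict;
use warnings;
use utf8;                     # assumption: this file is saved as UTF-8

my $chars = length("é");      # the literal is decoded as one character

my $bytes;
{
    no utf8;                  # literal bytes until the end of this block
    $bytes = length("é");     # the two UTF-8 bytes 0xC3 0xA9
}

print "$chars $bytes\n";      # prints "1 2"
```

Once the inner block ends, the enclosing use utf8 is in effect again.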

Valid scripts

use utf8 takes any valid UCD script names as arguments. This declares those scripts as valid for all identifiers; all other scripts besides 'Latin', 'Common' and 'Inherited' are invalid. The declaration is currently global, not lexically scoped. Having to declare valid scripts rules out Unicode confusables from different language families, which might look the same but are not. This does not affect strings, only names, literals and numbers.

The Unicode Standard 12.1 defines 152 scripts, i.e. written language families. They can be listed with:

    perl -alne'/; (\w+) #/ && print $1' lib/unicore/Scripts.txt | \
        sort -u

Adlam Ahom Anatolian_Hieroglyphs Arabic Armenian Avestan Balinese Bamum Bassa_Vah Batak Bengali Bhaiksuki Bopomofo Brahmi Braille Buginese Buhid Canadian_Aboriginal Carian Caucasian_Albanian Chakma Cham Cherokee Common Coptic Cuneiform Cypriot Cyrillic Deseret Devanagari Dogra Duployan Egyptian_Hieroglyphs Elbasan Elymaic Ethiopic Georgian Glagolitic Gothic Grantha Greek Gujarati Gunjala_Gondi Gurmukhi Han Hangul Hanifi_Rohingya Hanunoo Hatran Hebrew Hiragana Imperial_Aramaic Inherited Inscriptional_Pahlavi Inscriptional_Parthian Javanese Kaithi Kannada Katakana Kayah_Li Kharoshthi Khmer Khojki Khudawadi Lao Latin Lepcha Limbu Linear_A Linear_B Lisu Lycian Lydian Mahajani Makasar Malayalam Mandaic Manichaean Marchen Masaram_Gondi Medefaidrin Meetei_Mayek Mende_Kikakui Meroitic_Cursive Meroitic_Hieroglyphs Miao Modi Mongolian Mro Multani Myanmar Nabataean Nandinagari Newa New_Tai_Lue Nko Nushu Nyiakeng_Puachue_Hmong Ogham Ol_Chiki Old_Hungarian Old_Italic Old_North_Arabian Old_Permic Old_Persian Old_Sogdian Old_South_Arabian Old_Turkic Oriya Osage Osmanya Pahawh_Hmong Palmyrene Pau_Cin_Hau Phags_Pa Phoenician Psalter_Pahlavi Rejang Runic Samaritan Saurashtra Sharada Shavian Siddham SignWriting Sinhala Sogdian Sora_Sompeng Soyombo Sundanese Syloti_Nagri Syriac Tagalog Tagbanwa Tai_Le Tai_Tham Tai_Viet Takri Tamil Tangut Telugu Thaana Thai Tibetan Tifinagh Tirhuta Ugaritic Vai Wancho Warang_Citi Yi Zanabazar_Square

Note that this casing matches the UCD and differs slightly from the old-style casing returned by "charscript()" in previous versions of Unicode::UCD.
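To find out which script a given character belongs to, charscript() from Unicode::UCD can be queried with a code point. The following is a small sketch; the three code points were chosen for illustration only, and they are single-word script names, so the casing difference mentioned above does not show up here.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# charscript() takes a code point and returns its script name.
my $latin    = charscript(ord 'A');   # U+0041 LATIN CAPITAL LETTER A
my $greek    = charscript(0x03B1);    # U+03B1 GREEK SMALL LETTER ALPHA
my $cyrillic = charscript(0x0430);    # U+0430 CYRILLIC SMALL LETTER A

print "$latin $greek $cyrillic\n";    # prints "Latin Greek Cyrillic"
```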

We add some aliases for languages using multiple scripts:

   :Japanese => Katakana Hiragana Han
   :Korean   => Hangul Han
   :Hanb     => Han Bopomofo

These three aliases need not be declared. They are allowed script combinations in the Highly Restrictive Level for identifiers.

Certain scripts don't need to be declared:

We follow the Moderately Restrictive Level for identifiers. That is, all characters in each identifier must be from a single script, or from any of the following combinations:

Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan

Latin + Han + Bopomofo; or equivalently: Latn + Hanb

Latin + Han + Hangul; or equivalently: Latn + Kore

Latin may also be combined with any other Recommended or Aspirational script except Cyrillic and Greek. Cyrillic and Greek may not be used together for identifiers in the same file.
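The single-script rule can be illustrated from user code. The sketch below is not how the parser implements the check; scripts_in is a hypothetical helper, written here on top of Unicode::UCD, that collects the scripts used in a name while ignoring Common and Inherited.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: return the sorted list of scripts used in $name,
# skipping Common and Inherited, which are allowed with any script.
sub scripts_in {
    my ($name) = @_;
    my %seen;
    for my $char (split //, $name) {
        my $script = charscript(ord $char);
        next if !defined $script;
        next if $script eq 'Common' || $script eq 'Inherited';
        $seen{$script} = 1;
    }
    return sort keys %seen;
}

my $plain = "paypal";
my $mixed = "p\x{430}yp\x{430}l";    # U+0430 CYRILLIC SMALL LETTER A

print join('+', scripts_in($plain)), "\n";   # prints "Latin"
print join('+', scripts_in($mixed)), "\n";   # prints "Cyrillic+Latin"
```

The second name looks like the first in many fonts, but it mixes Latin with Cyrillic, which is exactly the combination the rules above reject.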

These scripts must therefore always be declared:

Cyrillic Greek Ahom Anatolian_Hieroglyphs Avestan Bassa_Vah Bhaiksuki Brahmi Buginese Buhid Carian Caucasian_Albanian Coptic Cuneiform Cypriot Deseret Dogra Duployan Egyptian_Hieroglyphs Elbasan Elymaic Glagolitic Gothic Grantha Gunjala_Gondi Hanunoo Hatran Imperial_Aramaic Inscriptional_Pahlavi Inscriptional_Parthian Kaithi Kharoshthi Khojki Khudawadi Linear_A Linear_B Lycian Lydian Mahajani Makasar Manichaean Marchen Masaram_Gondi Medefaidrin Mende_Kikakui Meroitic_Cursive Meroitic_Hieroglyphs Modi Mongolian Mro Multani Nabataean Nandinagari Nushu Ogham Old_Hungarian Old_Italic Old_North_Arabian Old_Permic Old_Persian Old_Sogdian Old_South_Arabian Old_Turkic Osmanya Pahawh_Hmong Palmyrene Pau_Cin_Hau Phags_Pa Phoenician Psalter_Pahlavi Rejang Runic Samaritan Sharada Shavian Siddham SignWriting Sogdian Sora_Sompeng Soyombo Tagalog Tagbanwa Takri Tangut Tirhuta Ugaritic Warang_Citi Zanabazar_Square

All Limited Use Scripts have been disallowed since 5.30. See http://www.unicode.org/reports/tr31/#Table_Limited_Use_Scripts.

Adlam Balinese Bamum Batak Canadian_Aboriginal Chakma Cham Cherokee Hanifi_Rohingya Javanese Kayah_Li Lepcha Limbu Lisu Mandaic Meetei_Mayek Miao New_Tai_Lue Newa Nko Nyiakeng_Puachue_Hmong Ol_Chiki Osage Saurashtra Sundanese Syloti_Nagri Syriac Tai_Le Tai_Tham Tai_Viet Tifinagh Vai Wancho Yi Katakana_Or_Hiragana Unknown

Utility functions

The following functions are defined in the utf8:: package by the Perl core. You do not need to say use utf8 to use them, and in fact you should not say it unless you really want to have UTF-8 source code.

utf8::encode is like utf8::upgrade, but the UTF8 flag is cleared. See perlunicode, and the C API functions "sv_utf8_upgrade" in perlapi, "sv_utf8_downgrade" in perlapi, "sv_utf8_encode" in perlapi, and "sv_utf8_decode" in perlapi, which are wrapped by the Perl functions utf8::upgrade, utf8::downgrade, utf8::encode and utf8::decode.

The functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are actually internal, and thus always available without a require utf8 statement.
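The difference between the two pairs of functions is easiest to see side by side. In this sketch the variable names are illustrative: encode and decode change the characters in the string, while upgrade and downgrade change only the internal representation.

```perl
use strict;
use warnings;

# encode/decode change the string's contents.
my $str = "\x{100}";                     # one character, U+0100
utf8::encode($str);                      # now the two characters "\xc4\x80"
my $encoded_len  = length $str;          # 2
my $encoded_flag = utf8::is_utf8($str);  # false: encode clears the UTF8 flag
utf8::decode($str);                      # back to the single character U+0100
my $decoded_len  = length $str;          # 1

# upgrade/downgrade leave the string's value alone.
my $latin1 = "\xe9";
utf8::upgrade($latin1);                  # stored as UTF-8 internally
my $same = $latin1 eq "\xe9";            # true: the value is unchanged

# downgrade cannot represent U+0100 in a single byte. With a true
# second argument ($fail_ok) it returns false instead of dying.
my $wide = "\x{100}";
my $ok   = utf8::downgrade($wide, 1);

print "$encoded_len $decoded_len\n";     # prints "2 1"
```

Without the second argument, utf8::downgrade dies when the string contains a character that does not fit in one byte.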

BUGS

Some filesystems may not support UTF-8 file names, or they may support them in a way that is incompatible with Perl. UTF-8 names that are visible to the filesystem, such as module names, may therefore not work.

perl5 upstream has allowed mixed-script confusables, as described in http://www.unicode.org/reports/tr39/, since 5.16 and is therefore considered insecure.

perl5 upstream has not normalized its Unicode identifiers, as described in http://www.unicode.org/reports/tr15/, since 5.16 and is therefore considered insecure. See http://www.unicode.org/reports/tr36/ for the security risks.

SEE ALSO

perlunitut, perluniintro, perlrun, bytes, perlunicode.

http://www.unicode.org/reports/tr36/#Mixed_Script_Spoofing, http://unicode.org/reports/tr39/#Mixed_Script_Confusables.