From f897eea1d38ed414288327a99b4078239e1a79ec Mon Sep 17 00:00:00 2001 From: CoprDistGit Date: Mon, 10 Apr 2023 11:18:20 +0000 Subject: automatic import of python-langcodes --- .gitignore | 1 + python-langcodes.spec | 2413 +++++++++++++++++++++++++++++++++++++++++++++++++ sources | 1 + 3 files changed, 2415 insertions(+) create mode 100644 python-langcodes.spec create mode 100644 sources diff --git a/.gitignore b/.gitignore index e69de29..e163ad7 100644 --- a/.gitignore +++ b/.gitignore @@ -0,0 +1 @@ +/langcodes-3.3.0.tar.gz diff --git a/python-langcodes.spec b/python-langcodes.spec new file mode 100644 index 0000000..9583db2 --- /dev/null +++ b/python-langcodes.spec @@ -0,0 +1,2413 @@ +%global _empty_manifest_terminate_build 0 +Name: python-langcodes +Version: 3.3.0 +Release: 1 +Summary: Tools for labeling human languages with IETF language tags +License: MIT +URL: https://github.com/rspeer/langcodes +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/5f/ec/9955d772ecac0bdfb5d706d64f185ac68bd0d4092acdc2c5a1882c824369/langcodes-3.3.0.tar.gz +BuildArch: noarch + +Requires: python3-language-data + +%description +# Langcodes: a library for language codes + +**langcodes** knows what languages are. It knows the standardized codes that +refer to them, such as `en` for English, `es` for Spanish and `hi` for Hindi. + +These are [IETF language tags][]. You may know them by their old name, ISO 639 +language codes. IETF has done some important things for backward compatibility +and supporting language variations that you won't find in the ISO standard. + +[IETF language tags]: https://www.w3.org/International/articles/language-tags/ + +It may sound to you like langcodes solves a pretty boring problem. At one +level, that's right. Sometimes you have a boring problem, and it's great when a +library solves it for you. + +But there's an interesting problem hiding in here. How do you work with +language codes? How do you know when two different codes represent the same +thing? How should your code represent relationships between codes, like the +following? + +* `eng` is equivalent to `en`. +* `fra` and `fre` are both equivalent to `fr`. +* `en-GB` might be written as `en-gb` or `en_GB`. Or as 'en-UK', which is + erroneous, but should be treated as the same. +* `en-CA` is not exactly equivalent to `en-US`, but it's really, really close. +* `en-Latn-US` is equivalent to `en-US`, because written English must be written + in the Latin alphabet to be understood. +* The difference between `ar` and `arb` is the difference between "Arabic" and + "Modern Standard Arabic", a difference that may not be relevant to you. +* You'll find Mandarin Chinese tagged as `cmn` on Wiktionary, but many other + resources would call the same language `zh`. +* Chinese is written in different scripts in different territories. Some + software distinguishes the script. Other software distinguishes the territory. + The result is that `zh-CN` and `zh-Hans` are used interchangeably, as are + `zh-TW` and `zh-Hant`, even though occasionally you'll need something + different such as `zh-HK` or `zh-Latn-pinyin`. +* The Indonesian (`id`) and Malaysian (`ms` or `zsm`) languages are mutually + intelligible. +* `jp` is not a language code. (The language code for Japanese is `ja`, but + people confuse it with the country code for Japan.) + +One way to know is to read IETF standards and Unicode technical reports. +Another way is to use a library that implements those standards and guidelines +for you, which langcodes does. 
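+
+For a quick taste of how langcodes resolves equivalences like the ones listed
+above, here is a minimal sketch (these functions are documented in detail
+below; the expected outputs follow from the normalization rules described
+there):
+
+    >>> import langcodes
+    >>> langcodes.standardize_tag('eng')
+    'en'
+    >>> langcodes.standardize_tag('fre')
+    'fr'
+    >>> langcodes.standardize_tag('en_GB')
+    'en-GB'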
+ +When you're working with these short language codes, you may want to see the +name that the language is called _in_ a language: `fr` is called "French" in +English. That language doesn't have to be English: `fr` is called "français" in +French. A supplement to langcodes, [`language_data`][language-data], provides +this information. + +[language-data]: https://github.com/rspeer/language_data + +langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released +as free software under the MIT license. + + +## Standards implemented + +Although this is not the only reason to use it, langcodes will make you more +acronym-compliant. + +langcodes implements [BCP 47](http://tools.ietf.org/html/bcp47), the IETF Best +Current Practices on Tags for Identifying Languages. BCP 47 is also known as +RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also +implements recommendations from the [Unicode CLDR](http://cldr.unicode.org). + +langcodes can also refer to a database of language properties and names, built +from Unicode CLDR and the IANA subtag registry, if you install `language_data`. + +In summary, langcodes takes language codes and does the Right Thing with them, +and if you want to know exactly what the Right Thing is, there are some +documents you can go read. + + +# Documentation + +## Standardizing language tags + +This function standardizes tags, as strings, in several ways. + +It replaces overlong tags with their shortest version, and also formats them +according to the conventions of BCP 47: + + >>> from langcodes import * + >>> standardize_tag('eng_US') + 'en-US' + +It removes script subtags that are redundant with the language: + + >>> standardize_tag('en-Latn') + 'en' + +It replaces deprecated values with their correct versions, if possible: + + >>> standardize_tag('en-uk') + 'en-GB' + +Sometimes this involves complex substitutions, such as replacing Serbo-Croatian +(`sh`) with Serbian in Latin script (`sr-Latn`), or the entire tag `sgn-US` +with `ase` (American Sign Language). + + >>> standardize_tag('sh-QU') + 'sr-Latn-EU' + + >>> standardize_tag('sgn-US') + 'ase' + +If *macro* is True, it uses macrolanguage codes as a replacement for the most +common standardized language within that macrolanguage. + + >>> standardize_tag('arb-Arab', macro=True) + 'ar' + +Even when *macro* is False, it shortens tags that contain both the +macrolanguage and the language: + + >>> standardize_tag('zh-cmn-hans-cn') + 'zh-Hans-CN' + +If the tag can't be parsed according to BCP 47, this will raise a +LanguageTagError (a subclass of ValueError): + + >>> standardize_tag('spa-latn-mx') + 'es-MX' + + >>> standardize_tag('spa-mx-latn') + Traceback (most recent call last): + ... + langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string. + + +## Language objects + +This package defines one class, named Language, which contains the results +of parsing a language tag. Language objects have the following fields, +any of which may be unspecified: + +- *language*: the code for the language itself. +- *script*: the 4-letter code for the writing system being used. +- *territory*: the 2-letter or 3-digit code for the country or similar region + whose usage of the language appears in this text. +- *extlangs*: a list of more specific language codes that follow the language + code. (This is allowed by the language code syntax, but deprecated.) 
+- *variants*: codes for specific variations of language usage that aren't + covered by the *script* or *territory* codes. +- *extensions*: information that's attached to the language code for use in + some specific system, such as Unicode collation orders. +- *private*: a code starting with `x-` that has no defined meaning. + +The `Language.get` method converts a string to a Language instance, and the +`Language.make` method makes a Language instance from its fields. These values +are cached so that calling `Language.get` or `Language.make` again with the +same values returns the same object, for efficiency. + +By default, it will replace non-standard and overlong tags as it interprets +them. To disable this feature and get the codes that literally appear in the +language tag, use the *normalize=False* option. + + >>> Language.get('en-Latn-US') + Language.make(language='en', script='Latn', territory='US') + + >>> Language.get('sgn-US', normalize=False) + Language.make(language='sgn', territory='US') + + >>> Language.get('und') + Language.make() + +Here are some examples of replacing non-standard tags: + + >>> Language.get('sh-QU') + Language.make(language='sr', script='Latn', territory='EU') + + >>> Language.get('sgn-US') + Language.make(language='ase') + + >>> Language.get('zh-cmn-Hant') + Language.make(language='zh', script='Hant') + +Use the `str()` function on a Language object to convert it back to its +standard string form: + + >>> str(Language.get('sh-QU')) + 'sr-Latn-EU' + + >>> str(Language.make(territory='IN')) + 'und-IN' + + +### Checking validity + +A language code is _valid_ when every part of it is assigned a meaning by IANA. +That meaning could be "private use". + +In langcodes, we check the language subtag, script, territory, and variants for +validity. We don't check other parts such as extlangs or Unicode extensions. + +For example, `ja` is a valid language code, and `jp` is not: + + >>> Language.get('ja').is_valid() + True + + >>> Language.get('jp').is_valid() + False + +The top-level function `tag_is_valid(tag)` is possibly more convenient to use, +because it can return False even for tags that don't parse: + + >>> tag_is_valid('C') + False + +If one subtag is invalid, the entire code is invalid: + + >>> tag_is_valid('en-000') + False + +`iw` is valid, though it's a deprecated alias for `he`: + + >>> tag_is_valid('iw') + True + +The empty language tag (`und`) is valid: + + >>> tag_is_valid('und') + True + +Private use codes are valid: + + >>> tag_is_valid('x-other') + True + + >>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this') + True + +Language tags that are very unlikely are still valid: + + >>> tag_is_valid('fr-Cyrl') + True + +Tags with non-ASCII characters are invalid, because they don't parse: + + >>> tag_is_valid('zh-普通话') + False + + +### Getting alpha3 codes + +Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the +variety of human languages by assigning every language a 3-letter code, +including the ones that already had 2-letter codes. + +Unfortunately, this just led to more confusion. Some languages ended up with two +different 3-letter codes for legacy reasons, such as French, which is `fra` as a +"terminology" code, and `fre` as a "biblographic" code. And meanwhile, `fr` was +still a code that you'd be using if you followed ISO 639-1. + +In BCP 47, you should use 2-letter codes whenever they're available, and that's +what langcodes does. 
Fortunately, all the languages that have two different +3-letter codes also have a 2-letter code, so if you prefer the 2-letter code, +you don't have to worry about the distinction. + +But some applications want the 3-letter code in particular, so langcodes +provides a method for getting those, `Language.to_alpha3()`. It returns the +'terminology' code by default, and passing `variant='B'` returns the +bibliographic code. + +When this method returns, it always returns a 3-letter string. + + >>> Language.get('fr').to_alpha3() + 'fra' + >>> Language.get('fr-CA').to_alpha3() + 'fra' + >>> Language.get('fr-CA').to_alpha3(variant='B') + 'fre' + >>> Language.get('de').to_alpha3() + 'deu' + >>> Language.get('no').to_alpha3() + 'nor' + >>> Language.get('un').to_alpha3() + Traceback (most recent call last): + ... + LookupError: 'un' is not a known language code, and has no alpha3 code. + +For many languages, the terminology and bibliographic alpha3 codes are the same. + + >>> Language.get('en').to_alpha3(variant='T') + 'eng' + >>> Language.get('en').to_alpha3(variant='B') + 'eng' + +When you use any of these "overlong" alpha3 codes in langcodes, they normalize +back to the alpha2 code: + + >>> Language.get('zho') + Language.make(language='zh') + + +## Working with language names + +The methods in this section require an optional package called `language_data`. +You can install it with `pip install language_data`, or request the optional +"data" feature of langcodes with `pip install langcodes[data]`. + +The dependency that you put in setup.py should be `langcodes[data]`. + +### Describing Language objects in natural language + +It's often helpful to be able to describe a language code in a way that a user +(or you) can understand, instead of in inscrutable short codes. The +`display_name` method lets you describe a Language object *in a language*. + +The `.display_name(language, min_score)` method will look up the name of the +language. The names come from the IANA language tag registry, which is only in +English, plus CLDR, which names languages in many commonly-used languages. + +The default language for naming things is English: + + >>> Language.make(language='fr').display_name() + 'French' + + >>> Language.make().display_name() + 'Unknown language' + + >>> Language.get('zh-Hans').display_name() + 'Chinese (Simplified)' + + >>> Language.get('en-US').display_name() + 'English (United States)' + +But you can ask for language names in numerous other languages: + + >>> Language.get('fr').display_name('fr') + 'français' + + >>> Language.get('fr').display_name('es') + 'francés' + + >>> Language.make().display_name('es') + 'lengua desconocida' + + >>> Language.get('zh-Hans').display_name('de') + 'Chinesisch (Vereinfacht)' + + >>> Language.get('en-US').display_name('zh-Hans') + '英语(美国)' + +Why does everyone get Slovak and Slovenian confused? Let's ask them. 
+ + >>> Language.get('sl').display_name('sl') + 'slovenščina' + >>> Language.get('sk').display_name('sk') + 'slovenčina' + >>> Language.get('sl').display_name('sk') + 'slovinčina' + >>> Language.get('sk').display_name('sl') + 'slovaščina' + +If the language has a script or territory code attached to it, these will be +described in parentheses: + + >>> Language.get('en-US').display_name() + 'English (United States)' + +Sometimes these can be the result of tag normalization, such as in this case +where the legacy tag 'sh' becomes 'sr-Latn': + + >>> Language.get('sh').display_name() + 'Serbian (Latin)' + + >>> Language.get('sh', normalize=False).display_name() + 'Serbo-Croatian' + +Naming a language in itself is sometimes a useful thing to do, so the +`.autonym()` method makes this easy, providing the display name of a language +in the language itself: + + >>> Language.get('fr').autonym() + 'français' + >>> Language.get('es').autonym() + 'español' + >>> Language.get('ja').autonym() + '日本語' + >>> Language.get('en-AU').autonym() + 'English (Australia)' + >>> Language.get('sr-Latn').autonym() + 'srpski (latinica)' + >>> Language.get('sr-Cyrl').autonym() + 'српски (ћирилица)' + +The names come from the Unicode CLDR data files, and in English they can +also come from the IANA language subtag registry. Together, they can give +you language names in the 196 languages that CLDR supports. + + +### Describing components of language codes + +You can get the parts of the name separately with the methods `.language_name()`, +`.script_name()`, and `.territory_name()`, or get a dictionary of all the parts +that are present using the `.describe()` method. These methods also accept a +language code for what language they should be described in. + + >>> shaw = Language.get('en-Shaw-GB') + >>> shaw.describe('en') + {'language': 'English', 'script': 'Shavian', 'territory': 'United Kingdom'} + + >>> shaw.describe('es') + {'language': 'inglés', 'script': 'shaviano', 'territory': 'Reino Unido'} + + +### Recognizing language names in natural language + +As the reverse of the above operations, you may want to look up a language by +its name, converting a natural language name such as "French" to a code such as +'fr'. + +The name can be in any language that CLDR supports (see "Ambiguity" below). + + >>> import langcodes + >>> langcodes.find('french') + Language.make(language='fr') + + >>> langcodes.find('francés') + Language.make(language='fr') + +However, this method currently ignores the parenthetical expressions that come from +`.display_name()`: + + >>> langcodes.find('English (Canada)') + Language.make(language='en') + +There is still room to improve the way that language names are matched, because +some languages are not consistently named the same way. The method currently +works with hundreds of language names that are used on Wiktionary. + +#### Ambiguity + +For the sake of usability, `langcodes.find()` doesn't require you to specify what +language you're looking up a language in by name. This could potentially lead to +a conflict: what if name "X" is language A's name for language B, and language C's +name for language D? + +We can collect the language codes from CLDR and see how many times this +happens. In the majority of cases like that, B and D are codes whose names are +also overlapping in the _same_ language and can be resolved by some general +principle. 
+ +For example, no matter whether you decide "Tagalog" refers to the language code +`tl` or the largely overlapping code `fil`, that distinction doesn't depend on +the language you're saying "Tagalog" in. We can just return `tl` consistently. + + >>> langcodes.find('tagalog') + Language.make(language='tl') + +In the few cases of actual interlingual ambiguity, langcodes won't match a result. +You can pass in a `language=` parameter to say what language the name is in. + +For example, there are two distinct languages called "Tonga" in various languages. +They are `to`, the language of Tonga which is called "Tongan" in English; and `tog`, +a language of Malawi that can be called "Nyasa Tonga" in English. + + >>> langcodes.find('tongan') + Language.make(language='to') + + >>> langcodes.find('nyasa tonga') + Language.make(language='tog') + + >>> langcodes.find('tonga') + Traceback (most recent call last): + ... + LookupError: Can't find any language named 'tonga' + + >>> langcodes.find('tonga', language='id') + Language.make(language='to') + + >>> langcodes.find('tonga', language='ca') + Language.make(language='tog') + +Other ambiguous names written in Latin letters are "Kiga", "Mbundu", "Roman", and "Ruanda". + + +## Demographic language data + +The `Language.speaking_population()` and `Language.writing_population()` +methods get Unicode's estimates of how many people in the world use a +language. + +As with the language name data, this requires the optional `language_data` +package to be installed. + +`.speaking_population()` estimates how many people speak a language. It can +be limited to a particular territory with a territory code (such as a country +code). + + >>> Language.get('es').speaking_population() + 487664083 + + >>> Language.get('pt').speaking_population() + 237135429 + + >>> Language.get('es-BR').speaking_population() + 76218 + + >>> Language.get('pt-BR').speaking_population() + 192661560 + + >>> Language.get('vo').speaking_population() + 0 + +Script codes will be ignored, because the script is not involved in speaking: + + >>> Language.get('es-Hant').speaking_population() ==\ + ... Language.get('es').speaking_population() + True + +`.writing_population()` estimates how many people write a language. + + >>> all = Language.get('zh').writing_population() + >>> all + 1240326057 + + >>> traditional = Language.get('zh-Hant').writing_population() + >>> traditional + 37019589 + + >>> simplified = Language.get('zh-Hans').writing_population() + >>> all == traditional + simplified + True + +The estimates for "writing population" are often overestimates, as described +in the [CLDR documentation on territory data][overestimates]. In most cases, +they are derived from published data about literacy rates in the places where +those languages are spoken. This doesn't take into account that many literate +people around the world speak a language that isn't typically written, and +write in a _different_ language. + +[overestimates]: https://unicode-org.github.io/cldr-staging/charts/39/supplemental/territory_language_information.html + +Like `.speaking_population()`, this can be limited to a particular territory: + + >>> Language.get('zh-Hant-HK').writing_population() + 6439733 + >>> Language.get('zh-Hans-HK').writing_population() + 338933 + + +## Comparing and matching languages + +The `tag_distance` function returns a number from 0 to 134 indicating the +distance between the language the user desires and a supported language. 
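+
+As a minimal sketch (treat the specific numbers as illustrative, since they
+come from the CLDR data; identical tags are at distance 0, and the table below
+places Australian English → British English in the 1-3 range):
+
+    >>> from langcodes import tag_distance
+    >>> tag_distance('en', 'en')
+    0
+    >>> 1 <= tag_distance('en-AU', 'en-GB') <= 3
+    True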
+ +The distance data comes from CLDR v38.1 and involves a lot of judgment calls +made by the Unicode consortium. + + +### Distance values + +This table summarizes the language distance values: + +| Value | Meaning | Example +| ----: | :------ | :------ +| 0 | These codes represent the same language, possibly after filling in values and normalizing. | Norwegian Bokmål → Norwegian +| 1-3 | These codes indicate a minor regional difference. | Australian English → British English +| 4-9 | These codes indicate a significant but unproblematic regional difference. | American English → British English +| 10-24 | A gray area that depends on your use case. There may be problems with understanding or usability. | Afrikaans → Dutch, Wu Chinese → Mandarin Chinese +| 25-50 | These languages aren't similar, but there are demographic reasons to expect some intelligibility. | Tamil → English, Marathi → Hindi +| 51-79 | There are large barriers to understanding. | Japanese → Japanese in Hepburn romanization +| 80-99 | These are different languages written in the same script. | English → French, Arabic → Urdu +| 100+ | These languages have nothing particularly in common. | English → Japanese, English → Tamil + +See the docstring of `tag_distance` for more explanation and examples. + + +### Finding the best matching language + +Suppose you have software that supports any of the `supported_languages`. The +user wants to use `desired_language`. + +The function `closest_supported_match(desired_language, supported_languages)` +lets you choose the right language, even if there isn't an exact match. +It returns the language tag of the best-supported language, even if there +isn't an exact match. + +The `max_distance` parameter lets you set a cutoff on what counts as language +support. It has a default of 25, a value that is probably okay for simple +cases of i18n, but you might want to set it lower to require more precision. + + >>> closest_supported_match('fr', ['de', 'en', 'fr']) + 'fr' + + >>> closest_supported_match('pt', ['pt-BR', 'pt-PT']) + 'pt-BR' + + >>> closest_supported_match('en-AU', ['en-GB', 'en-US']) + 'en-GB' + + >>> closest_supported_match('af', ['en', 'nl', 'zu']) + 'nl' + + >>> closest_supported_match('und', ['en', 'und']) + 'und' + + >>> print(closest_supported_match('af', ['en', 'nl', 'zu'], max_distance=10)) + None + +A similar function is `closest_match(desired_language, supported_language)`, +which returns both the best matching language tag and the distance. If there is +no match, it returns ('und', 1000). + + >>> closest_match('fr', ['de', 'en', 'fr']) + ('fr', 0) + + >>> closest_match('sh', ['hr', 'bs', 'sr-Latn', 'sr-Cyrl']) + ('sr-Latn', 0) + + >>> closest_match('id', ['zsm', 'mhp']) + ('zsm', 14) + + >>> closest_match('ja', ['ja-Latn-hepburn', 'en']) + ('und', 1000) + + >>> closest_match('ja', ['ja-Latn-hepburn', 'en'], max_distance=60) + ('ja-Latn-hepburn', 50) + +## Further API documentation + +There are many more methods for manipulating and comparing language codes, +and you will find them documented thoroughly in [the code itself][code]. + +The interesting functions all live in this one file, with extensive docstrings +and annotations. Making a separate Sphinx page out of the docstrings would be +the traditional thing to do, but here it just seems redundant. You can go read +the docstrings in context, in their native habitat, and they'll always be up to +date. 
+ +[Code with documentation][code] + +[code]: https://github.com/rspeer/langcodes/blob/master/langcodes/__init__.py + +# Changelog + +## Version 3.3 (November 2021) + +- Updated to CLDR v40. + +- Updated the IANA subtag registry to version 2021-08-06. + +- Bug fix: recognize script codes that appear in the IANA registry even if + they're missing from CLDR for some reason. 'cu-Cyrs' is valid, for example. + +- Switched the build system from `setuptools` to `poetry`. + +To install the package in editable mode before PEP 660 is better supported, use +`poetry install` instead of `pip install -e .`. + +## Version 3.2 (October 2021) + +- Supports Python 3.6 through 3.10. + +- Added the top-level function `tag_is_valid(tag)`, for determining if a string + is a valid language tag without having to parse it first. + +- Added the top-level function `closest_supported_match(desired, supported)`, + which is similar to `closest_match` but with a simpler return value. It + returns the language tag of the closest match, or None if no match is close + enough. + +- Bug fix: a lot of well-formed but invalid language codes appeared to be + valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of + a subtag. The validity regex is now required to match completely. + +- Bug fixes that address some edge cases of validity: + + - A language tag that is entirely private use, like 'x-private', is valid + - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc', + is invalid + - A language tag that uses the same variant twice, like 'de-1901-1901', is + invalid + - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid + +- Updated dependencies so they are compatible with Python 3.10, including + switching back from `marisa-trie-m` to `marisa-trie` in `language_data`. + +- In bugfix release 3.2.1, corrected cases where the parser accepted + ill-formed language tags: + + - All subtags must be made of between 1 and 8 alphanumeric ASCII characters + - Tags with two extension 'singletons' in a row (`en-a-b-ccc`) should be + rejected + +## Version 3.1 (February 2021) + +- Added the `Language.to_alpha3()` method, for getting a three-letter code for a + language according to ISO 639-2. + +- Updated the type annotations from obiwan-style to mypy-style. + + +## Version 3.0 (February 2021) + +- Moved bulky data, particularly language names, into a separate + `language_data` package. In situations where the data isn't needed, + `langcodes` becomes a smaller, pure-Python package with no dependencies. + +- Language codes where the language segment is more than 4 letters no longer + parse: Language.get('nonsense') now returns an error. + + (This is technically stricter than the parse rules of BCP 47, but there are + no valid language codes of this form and there should never be any. An + attempt to parse a language code with 5-8 letters is most likely a mistake or + an attempt to make up a code.) + +- Added a method for checking the validity of a language code. + +- Added methods for estimating language population. + +- Updated to CLDR 38.1, which includes differences in language matching. + +- Tested on Python 3.6 through 3.9; no longer tested on Python 3.5. + + +## Version 2.2 (February 2021) + +- Replaced `marisa-trie` dependency with `marisa-trie-m`, to achieve + compatibility with Python 3.9. 
+ + +## Version 2.1 (June 2020) + +- Added the `display_name` method to be a more intuitive way to get a string + describing a language code, and made the `autonym` method use it instead of + `language_name`. + +- Updated to CLDR v37. + +- Previously, some attempts to get the name of a language would return its + language code instead, perhaps because the name was being requested in a + language for which CLDR doesn't have name data. This is unfortunate because + names and codes should not be interchangeable. + + Now we fall back on English names instead, which exists for all IANA codes. + If the code is unknown, we return a string such as "Unknown language [xx]". + + +## Version 2.0 (April 2020) + +Version 2.0 involves some significant changes that may break compatibility with 1.4, +in addition to updating to version 36.1 of the Unicode CLDR data and the April 2020 +version of the IANA subtag registry. + +This version requires Python 3.5 or later. + +### Match scores replaced with distances + +Originally, the goodness of a match between two different language codes was defined +in terms of a "match score" with a maximum of 100. Around 2016, Unicode started +replacing this with a different measure, the "match distance", which was defined +much more clearly, but we had to keep using the "match score". + +As of langcodes version 2.0, the "score" functions (such as +`Language.match_score`, `tag_match_score`, and `best_match`) are deprecated. +They'll keep using the deprecated language match tables from around CLDR 27. + +For a better measure of the closeness of two language codes, use `Language.distance`, +`tag_distance`, and `closest_match`. + +### 'region' renamed to 'territory' + +We were always out of step with CLDR here. Following the example of the IANA +database, we referred to things like the 'US' in 'en-US' as a "region code", +but the Unicode standards consistently call it a "territory code". + +In langcodes 2.0, parameters, dictionary keys, and attributes named `region` +have been renamed to `territory`. We try to support a few common cases with +deprecation warnings, such as looking up the `region` property of a Language +object. + +A nice benefit of this is that when a dictionary is displayed with 'language', +'script', and 'territory' keys in alphabetical order, they are in the same +order as they are in a language code. + + + +%package -n python3-langcodes +Summary: Tools for labeling human languages with IETF language tags +Provides: python-langcodes +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-langcodes +# Langcodes: a library for language codes + +**langcodes** knows what languages are. It knows the standardized codes that +refer to them, such as `en` for English, `es` for Spanish and `hi` for Hindi. + +These are [IETF language tags][]. You may know them by their old name, ISO 639 +language codes. IETF has done some important things for backward compatibility +and supporting language variations that you won't find in the ISO standard. + +[IETF language tags]: https://www.w3.org/International/articles/language-tags/ + +It may sound to you like langcodes solves a pretty boring problem. At one +level, that's right. Sometimes you have a boring problem, and it's great when a +library solves it for you. + +But there's an interesting problem hiding in here. How do you work with +language codes? How do you know when two different codes represent the same +thing? 
How should your code represent relationships between codes, like the +following? + +* `eng` is equivalent to `en`. +* `fra` and `fre` are both equivalent to `fr`. +* `en-GB` might be written as `en-gb` or `en_GB`. Or as 'en-UK', which is + erroneous, but should be treated as the same. +* `en-CA` is not exactly equivalent to `en-US`, but it's really, really close. +* `en-Latn-US` is equivalent to `en-US`, because written English must be written + in the Latin alphabet to be understood. +* The difference between `ar` and `arb` is the difference between "Arabic" and + "Modern Standard Arabic", a difference that may not be relevant to you. +* You'll find Mandarin Chinese tagged as `cmn` on Wiktionary, but many other + resources would call the same language `zh`. +* Chinese is written in different scripts in different territories. Some + software distinguishes the script. Other software distinguishes the territory. + The result is that `zh-CN` and `zh-Hans` are used interchangeably, as are + `zh-TW` and `zh-Hant`, even though occasionally you'll need something + different such as `zh-HK` or `zh-Latn-pinyin`. +* The Indonesian (`id`) and Malaysian (`ms` or `zsm`) languages are mutually + intelligible. +* `jp` is not a language code. (The language code for Japanese is `ja`, but + people confuse it with the country code for Japan.) + +One way to know is to read IETF standards and Unicode technical reports. +Another way is to use a library that implements those standards and guidelines +for you, which langcodes does. + +When you're working with these short language codes, you may want to see the +name that the language is called _in_ a language: `fr` is called "French" in +English. That language doesn't have to be English: `fr` is called "français" in +French. A supplement to langcodes, [`language_data`][language-data], provides +this information. + +[language-data]: https://github.com/rspeer/language_data + +langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released +as free software under the MIT license. + + +## Standards implemented + +Although this is not the only reason to use it, langcodes will make you more +acronym-compliant. + +langcodes implements [BCP 47](http://tools.ietf.org/html/bcp47), the IETF Best +Current Practices on Tags for Identifying Languages. BCP 47 is also known as +RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also +implements recommendations from the [Unicode CLDR](http://cldr.unicode.org). + +langcodes can also refer to a database of language properties and names, built +from Unicode CLDR and the IANA subtag registry, if you install `language_data`. + +In summary, langcodes takes language codes and does the Right Thing with them, +and if you want to know exactly what the Right Thing is, there are some +documents you can go read. + + +# Documentation + +## Standardizing language tags + +This function standardizes tags, as strings, in several ways. 
+ +It replaces overlong tags with their shortest version, and also formats them +according to the conventions of BCP 47: + + >>> from langcodes import * + >>> standardize_tag('eng_US') + 'en-US' + +It removes script subtags that are redundant with the language: + + >>> standardize_tag('en-Latn') + 'en' + +It replaces deprecated values with their correct versions, if possible: + + >>> standardize_tag('en-uk') + 'en-GB' + +Sometimes this involves complex substitutions, such as replacing Serbo-Croatian +(`sh`) with Serbian in Latin script (`sr-Latn`), or the entire tag `sgn-US` +with `ase` (American Sign Language). + + >>> standardize_tag('sh-QU') + 'sr-Latn-EU' + + >>> standardize_tag('sgn-US') + 'ase' + +If *macro* is True, it uses macrolanguage codes as a replacement for the most +common standardized language within that macrolanguage. + + >>> standardize_tag('arb-Arab', macro=True) + 'ar' + +Even when *macro* is False, it shortens tags that contain both the +macrolanguage and the language: + + >>> standardize_tag('zh-cmn-hans-cn') + 'zh-Hans-CN' + +If the tag can't be parsed according to BCP 47, this will raise a +LanguageTagError (a subclass of ValueError): + + >>> standardize_tag('spa-latn-mx') + 'es-MX' + + >>> standardize_tag('spa-mx-latn') + Traceback (most recent call last): + ... + langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string. + + +## Language objects + +This package defines one class, named Language, which contains the results +of parsing a language tag. Language objects have the following fields, +any of which may be unspecified: + +- *language*: the code for the language itself. +- *script*: the 4-letter code for the writing system being used. +- *territory*: the 2-letter or 3-digit code for the country or similar region + whose usage of the language appears in this text. +- *extlangs*: a list of more specific language codes that follow the language + code. (This is allowed by the language code syntax, but deprecated.) +- *variants*: codes for specific variations of language usage that aren't + covered by the *script* or *territory* codes. +- *extensions*: information that's attached to the language code for use in + some specific system, such as Unicode collation orders. +- *private*: a code starting with `x-` that has no defined meaning. + +The `Language.get` method converts a string to a Language instance, and the +`Language.make` method makes a Language instance from its fields. These values +are cached so that calling `Language.get` or `Language.make` again with the +same values returns the same object, for efficiency. + +By default, it will replace non-standard and overlong tags as it interprets +them. To disable this feature and get the codes that literally appear in the +language tag, use the *normalize=False* option. 
+ + >>> Language.get('en-Latn-US') + Language.make(language='en', script='Latn', territory='US') + + >>> Language.get('sgn-US', normalize=False) + Language.make(language='sgn', territory='US') + + >>> Language.get('und') + Language.make() + +Here are some examples of replacing non-standard tags: + + >>> Language.get('sh-QU') + Language.make(language='sr', script='Latn', territory='EU') + + >>> Language.get('sgn-US') + Language.make(language='ase') + + >>> Language.get('zh-cmn-Hant') + Language.make(language='zh', script='Hant') + +Use the `str()` function on a Language object to convert it back to its +standard string form: + + >>> str(Language.get('sh-QU')) + 'sr-Latn-EU' + + >>> str(Language.make(territory='IN')) + 'und-IN' + + +### Checking validity + +A language code is _valid_ when every part of it is assigned a meaning by IANA. +That meaning could be "private use". + +In langcodes, we check the language subtag, script, territory, and variants for +validity. We don't check other parts such as extlangs or Unicode extensions. + +For example, `ja` is a valid language code, and `jp` is not: + + >>> Language.get('ja').is_valid() + True + + >>> Language.get('jp').is_valid() + False + +The top-level function `tag_is_valid(tag)` is possibly more convenient to use, +because it can return False even for tags that don't parse: + + >>> tag_is_valid('C') + False + +If one subtag is invalid, the entire code is invalid: + + >>> tag_is_valid('en-000') + False + +`iw` is valid, though it's a deprecated alias for `he`: + + >>> tag_is_valid('iw') + True + +The empty language tag (`und`) is valid: + + >>> tag_is_valid('und') + True + +Private use codes are valid: + + >>> tag_is_valid('x-other') + True + + >>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this') + True + +Language tags that are very unlikely are still valid: + + >>> tag_is_valid('fr-Cyrl') + True + +Tags with non-ASCII characters are invalid, because they don't parse: + + >>> tag_is_valid('zh-普通话') + False + + +### Getting alpha3 codes + +Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the +variety of human languages by assigning every language a 3-letter code, +including the ones that already had 2-letter codes. + +Unfortunately, this just led to more confusion. Some languages ended up with two +different 3-letter codes for legacy reasons, such as French, which is `fra` as a +"terminology" code, and `fre` as a "biblographic" code. And meanwhile, `fr` was +still a code that you'd be using if you followed ISO 639-1. + +In BCP 47, you should use 2-letter codes whenever they're available, and that's +what langcodes does. Fortunately, all the languages that have two different +3-letter codes also have a 2-letter code, so if you prefer the 2-letter code, +you don't have to worry about the distinction. + +But some applications want the 3-letter code in particular, so langcodes +provides a method for getting those, `Language.to_alpha3()`. It returns the +'terminology' code by default, and passing `variant='B'` returns the +bibliographic code. + +When this method returns, it always returns a 3-letter string. + + >>> Language.get('fr').to_alpha3() + 'fra' + >>> Language.get('fr-CA').to_alpha3() + 'fra' + >>> Language.get('fr-CA').to_alpha3(variant='B') + 'fre' + >>> Language.get('de').to_alpha3() + 'deu' + >>> Language.get('no').to_alpha3() + 'nor' + >>> Language.get('un').to_alpha3() + Traceback (most recent call last): + ... + LookupError: 'un' is not a known language code, and has no alpha3 code. 
+ +For many languages, the terminology and bibliographic alpha3 codes are the same. + + >>> Language.get('en').to_alpha3(variant='T') + 'eng' + >>> Language.get('en').to_alpha3(variant='B') + 'eng' + +When you use any of these "overlong" alpha3 codes in langcodes, they normalize +back to the alpha2 code: + + >>> Language.get('zho') + Language.make(language='zh') + + +## Working with language names + +The methods in this section require an optional package called `language_data`. +You can install it with `pip install language_data`, or request the optional +"data" feature of langcodes with `pip install langcodes[data]`. + +The dependency that you put in setup.py should be `langcodes[data]`. + +### Describing Language objects in natural language + +It's often helpful to be able to describe a language code in a way that a user +(or you) can understand, instead of in inscrutable short codes. The +`display_name` method lets you describe a Language object *in a language*. + +The `.display_name(language, min_score)` method will look up the name of the +language. The names come from the IANA language tag registry, which is only in +English, plus CLDR, which names languages in many commonly-used languages. + +The default language for naming things is English: + + >>> Language.make(language='fr').display_name() + 'French' + + >>> Language.make().display_name() + 'Unknown language' + + >>> Language.get('zh-Hans').display_name() + 'Chinese (Simplified)' + + >>> Language.get('en-US').display_name() + 'English (United States)' + +But you can ask for language names in numerous other languages: + + >>> Language.get('fr').display_name('fr') + 'français' + + >>> Language.get('fr').display_name('es') + 'francés' + + >>> Language.make().display_name('es') + 'lengua desconocida' + + >>> Language.get('zh-Hans').display_name('de') + 'Chinesisch (Vereinfacht)' + + >>> Language.get('en-US').display_name('zh-Hans') + '英语(美国)' + +Why does everyone get Slovak and Slovenian confused? Let's ask them. + + >>> Language.get('sl').display_name('sl') + 'slovenščina' + >>> Language.get('sk').display_name('sk') + 'slovenčina' + >>> Language.get('sl').display_name('sk') + 'slovinčina' + >>> Language.get('sk').display_name('sl') + 'slovaščina' + +If the language has a script or territory code attached to it, these will be +described in parentheses: + + >>> Language.get('en-US').display_name() + 'English (United States)' + +Sometimes these can be the result of tag normalization, such as in this case +where the legacy tag 'sh' becomes 'sr-Latn': + + >>> Language.get('sh').display_name() + 'Serbian (Latin)' + + >>> Language.get('sh', normalize=False).display_name() + 'Serbo-Croatian' + +Naming a language in itself is sometimes a useful thing to do, so the +`.autonym()` method makes this easy, providing the display name of a language +in the language itself: + + >>> Language.get('fr').autonym() + 'français' + >>> Language.get('es').autonym() + 'español' + >>> Language.get('ja').autonym() + '日本語' + >>> Language.get('en-AU').autonym() + 'English (Australia)' + >>> Language.get('sr-Latn').autonym() + 'srpski (latinica)' + >>> Language.get('sr-Cyrl').autonym() + 'српски (ћирилица)' + +The names come from the Unicode CLDR data files, and in English they can +also come from the IANA language subtag registry. Together, they can give +you language names in the 196 languages that CLDR supports. 
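+
+One common use of autonyms is building a language picker that shows each
+option in its own language. A minimal sketch, reusing the codes from the
+examples above:
+
+    >>> [Language.get(code).autonym() for code in ['fr', 'es', 'ja']]
+    ['français', 'español', '日本語']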
+ + +### Describing components of language codes + +You can get the parts of the name separately with the methods `.language_name()`, +`.script_name()`, and `.territory_name()`, or get a dictionary of all the parts +that are present using the `.describe()` method. These methods also accept a +language code for what language they should be described in. + + >>> shaw = Language.get('en-Shaw-GB') + >>> shaw.describe('en') + {'language': 'English', 'script': 'Shavian', 'territory': 'United Kingdom'} + + >>> shaw.describe('es') + {'language': 'inglés', 'script': 'shaviano', 'territory': 'Reino Unido'} + + +### Recognizing language names in natural language + +As the reverse of the above operations, you may want to look up a language by +its name, converting a natural language name such as "French" to a code such as +'fr'. + +The name can be in any language that CLDR supports (see "Ambiguity" below). + + >>> import langcodes + >>> langcodes.find('french') + Language.make(language='fr') + + >>> langcodes.find('francés') + Language.make(language='fr') + +However, this method currently ignores the parenthetical expressions that come from +`.display_name()`: + + >>> langcodes.find('English (Canada)') + Language.make(language='en') + +There is still room to improve the way that language names are matched, because +some languages are not consistently named the same way. The method currently +works with hundreds of language names that are used on Wiktionary. + +#### Ambiguity + +For the sake of usability, `langcodes.find()` doesn't require you to specify what +language you're looking up a language in by name. This could potentially lead to +a conflict: what if name "X" is language A's name for language B, and language C's +name for language D? + +We can collect the language codes from CLDR and see how many times this +happens. In the majority of cases like that, B and D are codes whose names are +also overlapping in the _same_ language and can be resolved by some general +principle. + +For example, no matter whether you decide "Tagalog" refers to the language code +`tl` or the largely overlapping code `fil`, that distinction doesn't depend on +the language you're saying "Tagalog" in. We can just return `tl` consistently. + + >>> langcodes.find('tagalog') + Language.make(language='tl') + +In the few cases of actual interlingual ambiguity, langcodes won't match a result. +You can pass in a `language=` parameter to say what language the name is in. + +For example, there are two distinct languages called "Tonga" in various languages. +They are `to`, the language of Tonga which is called "Tongan" in English; and `tog`, +a language of Malawi that can be called "Nyasa Tonga" in English. + + >>> langcodes.find('tongan') + Language.make(language='to') + + >>> langcodes.find('nyasa tonga') + Language.make(language='tog') + + >>> langcodes.find('tonga') + Traceback (most recent call last): + ... + LookupError: Can't find any language named 'tonga' + + >>> langcodes.find('tonga', language='id') + Language.make(language='to') + + >>> langcodes.find('tonga', language='ca') + Language.make(language='tog') + +Other ambiguous names written in Latin letters are "Kiga", "Mbundu", "Roman", and "Ruanda". + + +## Demographic language data + +The `Language.speaking_population()` and `Language.writing_population()` +methods get Unicode's estimates of how many people in the world use a +language. + +As with the language name data, this requires the optional `language_data` +package to be installed. 
+ +`.speaking_population()` estimates how many people speak a language. It can +be limited to a particular territory with a territory code (such as a country +code). + + >>> Language.get('es').speaking_population() + 487664083 + + >>> Language.get('pt').speaking_population() + 237135429 + + >>> Language.get('es-BR').speaking_population() + 76218 + + >>> Language.get('pt-BR').speaking_population() + 192661560 + + >>> Language.get('vo').speaking_population() + 0 + +Script codes will be ignored, because the script is not involved in speaking: + + >>> Language.get('es-Hant').speaking_population() ==\ + ... Language.get('es').speaking_population() + True + +`.writing_population()` estimates how many people write a language. + + >>> all = Language.get('zh').writing_population() + >>> all + 1240326057 + + >>> traditional = Language.get('zh-Hant').writing_population() + >>> traditional + 37019589 + + >>> simplified = Language.get('zh-Hans').writing_population() + >>> all == traditional + simplified + True + +The estimates for "writing population" are often overestimates, as described +in the [CLDR documentation on territory data][overestimates]. In most cases, +they are derived from published data about literacy rates in the places where +those languages are spoken. This doesn't take into account that many literate +people around the world speak a language that isn't typically written, and +write in a _different_ language. + +[overestimates]: https://unicode-org.github.io/cldr-staging/charts/39/supplemental/territory_language_information.html + +Like `.speaking_population()`, this can be limited to a particular territory: + + >>> Language.get('zh-Hant-HK').writing_population() + 6439733 + >>> Language.get('zh-Hans-HK').writing_population() + 338933 + + +## Comparing and matching languages + +The `tag_distance` function returns a number from 0 to 134 indicating the +distance between the language the user desires and a supported language. + +The distance data comes from CLDR v38.1 and involves a lot of judgment calls +made by the Unicode consortium. + + +### Distance values + +This table summarizes the language distance values: + +| Value | Meaning | Example +| ----: | :------ | :------ +| 0 | These codes represent the same language, possibly after filling in values and normalizing. | Norwegian Bokmål → Norwegian +| 1-3 | These codes indicate a minor regional difference. | Australian English → British English +| 4-9 | These codes indicate a significant but unproblematic regional difference. | American English → British English +| 10-24 | A gray area that depends on your use case. There may be problems with understanding or usability. | Afrikaans → Dutch, Wu Chinese → Mandarin Chinese +| 25-50 | These languages aren't similar, but there are demographic reasons to expect some intelligibility. | Tamil → English, Marathi → Hindi +| 51-79 | There are large barriers to understanding. | Japanese → Japanese in Hepburn romanization +| 80-99 | These are different languages written in the same script. | English → French, Arabic → Urdu +| 100+ | These languages have nothing particularly in common. | English → Japanese, English → Tamil + +See the docstring of `tag_distance` for more explanation and examples. + + +### Finding the best matching language + +Suppose you have software that supports any of the `supported_languages`. The +user wants to use `desired_language`. 
+ +The function `closest_supported_match(desired_language, supported_languages)` +lets you choose the right language, even if there isn't an exact match. +It returns the language tag of the best-supported language, even if there +isn't an exact match. + +The `max_distance` parameter lets you set a cutoff on what counts as language +support. It has a default of 25, a value that is probably okay for simple +cases of i18n, but you might want to set it lower to require more precision. + + >>> closest_supported_match('fr', ['de', 'en', 'fr']) + 'fr' + + >>> closest_supported_match('pt', ['pt-BR', 'pt-PT']) + 'pt-BR' + + >>> closest_supported_match('en-AU', ['en-GB', 'en-US']) + 'en-GB' + + >>> closest_supported_match('af', ['en', 'nl', 'zu']) + 'nl' + + >>> closest_supported_match('und', ['en', 'und']) + 'und' + + >>> print(closest_supported_match('af', ['en', 'nl', 'zu'], max_distance=10)) + None + +A similar function is `closest_match(desired_language, supported_language)`, +which returns both the best matching language tag and the distance. If there is +no match, it returns ('und', 1000). + + >>> closest_match('fr', ['de', 'en', 'fr']) + ('fr', 0) + + >>> closest_match('sh', ['hr', 'bs', 'sr-Latn', 'sr-Cyrl']) + ('sr-Latn', 0) + + >>> closest_match('id', ['zsm', 'mhp']) + ('zsm', 14) + + >>> closest_match('ja', ['ja-Latn-hepburn', 'en']) + ('und', 1000) + + >>> closest_match('ja', ['ja-Latn-hepburn', 'en'], max_distance=60) + ('ja-Latn-hepburn', 50) + +## Further API documentation + +There are many more methods for manipulating and comparing language codes, +and you will find them documented thoroughly in [the code itself][code]. + +The interesting functions all live in this one file, with extensive docstrings +and annotations. Making a separate Sphinx page out of the docstrings would be +the traditional thing to do, but here it just seems redundant. You can go read +the docstrings in context, in their native habitat, and they'll always be up to +date. + +[Code with documentation][code] + +[code]: https://github.com/rspeer/langcodes/blob/master/langcodes/__init__.py + +# Changelog + +## Version 3.3 (November 2021) + +- Updated to CLDR v40. + +- Updated the IANA subtag registry to version 2021-08-06. + +- Bug fix: recognize script codes that appear in the IANA registry even if + they're missing from CLDR for some reason. 'cu-Cyrs' is valid, for example. + +- Switched the build system from `setuptools` to `poetry`. + +To install the package in editable mode before PEP 660 is better supported, use +`poetry install` instead of `pip install -e .`. + +## Version 3.2 (October 2021) + +- Supports Python 3.6 through 3.10. + +- Added the top-level function `tag_is_valid(tag)`, for determining if a string + is a valid language tag without having to parse it first. + +- Added the top-level function `closest_supported_match(desired, supported)`, + which is similar to `closest_match` but with a simpler return value. It + returns the language tag of the closest match, or None if no match is close + enough. + +- Bug fix: a lot of well-formed but invalid language codes appeared to be + valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of + a subtag. The validity regex is now required to match completely. 
+ +- Bug fixes that address some edge cases of validity: + + - A language tag that is entirely private use, like 'x-private', is valid + - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc', + is invalid + - A language tag that uses the same variant twice, like 'de-1901-1901', is + invalid + - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid + +- Updated dependencies so they are compatible with Python 3.10, including + switching back from `marisa-trie-m` to `marisa-trie` in `language_data`. + +- In bugfix release 3.2.1, corrected cases where the parser accepted + ill-formed language tags: + + - All subtags must be made of between 1 and 8 alphanumeric ASCII characters + - Tags with two extension 'singletons' in a row (`en-a-b-ccc`) should be + rejected + +## Version 3.1 (February 2021) + +- Added the `Language.to_alpha3()` method, for getting a three-letter code for a + language according to ISO 639-2. + +- Updated the type annotations from obiwan-style to mypy-style. + + +## Version 3.0 (February 2021) + +- Moved bulky data, particularly language names, into a separate + `language_data` package. In situations where the data isn't needed, + `langcodes` becomes a smaller, pure-Python package with no dependencies. + +- Language codes where the language segment is more than 4 letters no longer + parse: Language.get('nonsense') now returns an error. + + (This is technically stricter than the parse rules of BCP 47, but there are + no valid language codes of this form and there should never be any. An + attempt to parse a language code with 5-8 letters is most likely a mistake or + an attempt to make up a code.) + +- Added a method for checking the validity of a language code. + +- Added methods for estimating language population. + +- Updated to CLDR 38.1, which includes differences in language matching. + +- Tested on Python 3.6 through 3.9; no longer tested on Python 3.5. + + +## Version 2.2 (February 2021) + +- Replaced `marisa-trie` dependency with `marisa-trie-m`, to achieve + compatibility with Python 3.9. + + +## Version 2.1 (June 2020) + +- Added the `display_name` method to be a more intuitive way to get a string + describing a language code, and made the `autonym` method use it instead of + `language_name`. + +- Updated to CLDR v37. + +- Previously, some attempts to get the name of a language would return its + language code instead, perhaps because the name was being requested in a + language for which CLDR doesn't have name data. This is unfortunate because + names and codes should not be interchangeable. + + Now we fall back on English names instead, which exists for all IANA codes. + If the code is unknown, we return a string such as "Unknown language [xx]". + + +## Version 2.0 (April 2020) + +Version 2.0 involves some significant changes that may break compatibility with 1.4, +in addition to updating to version 36.1 of the Unicode CLDR data and the April 2020 +version of the IANA subtag registry. + +This version requires Python 3.5 or later. + +### Match scores replaced with distances + +Originally, the goodness of a match between two different language codes was defined +in terms of a "match score" with a maximum of 100. Around 2016, Unicode started +replacing this with a different measure, the "match distance", which was defined +much more clearly, but we had to keep using the "match score". + +As of langcodes version 2.0, the "score" functions (such as +`Language.match_score`, `tag_match_score`, and `best_match`) are deprecated. 
+They'll keep using the deprecated language match tables from around CLDR 27. + +For a better measure of the closeness of two language codes, use `Language.distance`, +`tag_distance`, and `closest_match`. + +### 'region' renamed to 'territory' + +We were always out of step with CLDR here. Following the example of the IANA +database, we referred to things like the 'US' in 'en-US' as a "region code", +but the Unicode standards consistently call it a "territory code". + +In langcodes 2.0, parameters, dictionary keys, and attributes named `region` +have been renamed to `territory`. We try to support a few common cases with +deprecation warnings, such as looking up the `region` property of a Language +object. + +A nice benefit of this is that when a dictionary is displayed with 'language', +'script', and 'territory' keys in alphabetical order, they are in the same +order as they are in a language code. + + + +%package help +Summary: Development documents and examples for langcodes +Provides: python3-langcodes-doc +%description help +# Langcodes: a library for language codes + +**langcodes** knows what languages are. It knows the standardized codes that +refer to them, such as `en` for English, `es` for Spanish and `hi` for Hindi. + +These are [IETF language tags][]. You may know them by their old name, ISO 639 +language codes. IETF has done some important things for backward compatibility +and supporting language variations that you won't find in the ISO standard. + +[IETF language tags]: https://www.w3.org/International/articles/language-tags/ + +It may sound to you like langcodes solves a pretty boring problem. At one +level, that's right. Sometimes you have a boring problem, and it's great when a +library solves it for you. + +But there's an interesting problem hiding in here. How do you work with +language codes? How do you know when two different codes represent the same +thing? How should your code represent relationships between codes, like the +following? + +* `eng` is equivalent to `en`. +* `fra` and `fre` are both equivalent to `fr`. +* `en-GB` might be written as `en-gb` or `en_GB`. Or as 'en-UK', which is + erroneous, but should be treated as the same. +* `en-CA` is not exactly equivalent to `en-US`, but it's really, really close. +* `en-Latn-US` is equivalent to `en-US`, because written English must be written + in the Latin alphabet to be understood. +* The difference between `ar` and `arb` is the difference between "Arabic" and + "Modern Standard Arabic", a difference that may not be relevant to you. +* You'll find Mandarin Chinese tagged as `cmn` on Wiktionary, but many other + resources would call the same language `zh`. +* Chinese is written in different scripts in different territories. Some + software distinguishes the script. Other software distinguishes the territory. + The result is that `zh-CN` and `zh-Hans` are used interchangeably, as are + `zh-TW` and `zh-Hant`, even though occasionally you'll need something + different such as `zh-HK` or `zh-Latn-pinyin`. +* The Indonesian (`id`) and Malaysian (`ms` or `zsm`) languages are mutually + intelligible. +* `jp` is not a language code. (The language code for Japanese is `ja`, but + people confuse it with the country code for Japan.) + +One way to know is to read IETF standards and Unicode technical reports. +Another way is to use a library that implements those standards and guidelines +for you, which langcodes does. 
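
For a quick taste, here is a small illustrative session; the functions used
(`standardize_tag` and `tag_is_valid`) are documented in detail below, and the
specific inputs are just examples:

    >>> from langcodes import standardize_tag, tag_is_valid
    >>> standardize_tag('fra')      # an ISO 639-2 code normalizes to 'fr'
    'fr'
    >>> standardize_tag('en_GB')    # underscores and casing are cleaned up
    'en-GB'
    >>> tag_is_valid('jp')          # 'jp' is a country code, not a language code
    False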
+ +When you're working with these short language codes, you may want to see the +name that the language is called _in_ a language: `fr` is called "French" in +English. That language doesn't have to be English: `fr` is called "français" in +French. A supplement to langcodes, [`language_data`][language-data], provides +this information. + +[language-data]: https://github.com/rspeer/language_data + +langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released +as free software under the MIT license. + + +## Standards implemented + +Although this is not the only reason to use it, langcodes will make you more +acronym-compliant. + +langcodes implements [BCP 47](http://tools.ietf.org/html/bcp47), the IETF Best +Current Practices on Tags for Identifying Languages. BCP 47 is also known as +RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also +implements recommendations from the [Unicode CLDR](http://cldr.unicode.org). + +langcodes can also refer to a database of language properties and names, built +from Unicode CLDR and the IANA subtag registry, if you install `language_data`. + +In summary, langcodes takes language codes and does the Right Thing with them, +and if you want to know exactly what the Right Thing is, there are some +documents you can go read. + + +# Documentation + +## Standardizing language tags + +This function standardizes tags, as strings, in several ways. + +It replaces overlong tags with their shortest version, and also formats them +according to the conventions of BCP 47: + + >>> from langcodes import * + >>> standardize_tag('eng_US') + 'en-US' + +It removes script subtags that are redundant with the language: + + >>> standardize_tag('en-Latn') + 'en' + +It replaces deprecated values with their correct versions, if possible: + + >>> standardize_tag('en-uk') + 'en-GB' + +Sometimes this involves complex substitutions, such as replacing Serbo-Croatian +(`sh`) with Serbian in Latin script (`sr-Latn`), or the entire tag `sgn-US` +with `ase` (American Sign Language). + + >>> standardize_tag('sh-QU') + 'sr-Latn-EU' + + >>> standardize_tag('sgn-US') + 'ase' + +If *macro* is True, it uses macrolanguage codes as a replacement for the most +common standardized language within that macrolanguage. + + >>> standardize_tag('arb-Arab', macro=True) + 'ar' + +Even when *macro* is False, it shortens tags that contain both the +macrolanguage and the language: + + >>> standardize_tag('zh-cmn-hans-cn') + 'zh-Hans-CN' + +If the tag can't be parsed according to BCP 47, this will raise a +LanguageTagError (a subclass of ValueError): + + >>> standardize_tag('spa-latn-mx') + 'es-MX' + + >>> standardize_tag('spa-mx-latn') + Traceback (most recent call last): + ... + langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string. + + +## Language objects + +This package defines one class, named Language, which contains the results +of parsing a language tag. Language objects have the following fields, +any of which may be unspecified: + +- *language*: the code for the language itself. +- *script*: the 4-letter code for the writing system being used. +- *territory*: the 2-letter or 3-digit code for the country or similar region + whose usage of the language appears in this text. +- *extlangs*: a list of more specific language codes that follow the language + code. (This is allowed by the language code syntax, but deprecated.) 
+- *variants*: codes for specific variations of language usage that aren't
  covered by the *script* or *territory* codes.
- *extensions*: information that's attached to the language code for use in
  some specific system, such as Unicode collation orders.
- *private*: a code starting with `x-` that has no defined meaning.

The `Language.get` method converts a string to a Language instance, and the
`Language.make` method makes a Language instance from its fields. These values
are cached so that calling `Language.get` or `Language.make` again with the
same values returns the same object, for efficiency.

By default, `Language.get` will replace non-standard and overlong tags as it
interprets them. To disable this feature and get the codes that literally
appear in the language tag, use the *normalize=False* option.

    >>> Language.get('en-Latn-US')
    Language.make(language='en', script='Latn', territory='US')

    >>> Language.get('sgn-US', normalize=False)
    Language.make(language='sgn', territory='US')

    >>> Language.get('und')
    Language.make()

Here are some examples of replacing non-standard tags:

    >>> Language.get('sh-QU')
    Language.make(language='sr', script='Latn', territory='EU')

    >>> Language.get('sgn-US')
    Language.make(language='ase')

    >>> Language.get('zh-cmn-Hant')
    Language.make(language='zh', script='Hant')

Use the `str()` function on a Language object to convert it back to its
standard string form:

    >>> str(Language.get('sh-QU'))
    'sr-Latn-EU'

    >>> str(Language.make(territory='IN'))
    'und-IN'


### Checking validity

A language code is _valid_ when every part of it is assigned a meaning by IANA.
That meaning could be "private use".

In langcodes, we check the language subtag, script, territory, and variants for
validity. We don't check other parts such as extlangs or Unicode extensions.

For example, `ja` is a valid language code, and `jp` is not:

    >>> Language.get('ja').is_valid()
    True

    >>> Language.get('jp').is_valid()
    False

The top-level function `tag_is_valid(tag)` is possibly more convenient to use,
because it can return False even for tags that don't parse:

    >>> tag_is_valid('C')
    False

If one subtag is invalid, the entire code is invalid:

    >>> tag_is_valid('en-000')
    False

`iw` is valid, though it's a deprecated alias for `he`:

    >>> tag_is_valid('iw')
    True

The empty language tag (`und`) is valid:

    >>> tag_is_valid('und')
    True

Private use codes are valid:

    >>> tag_is_valid('x-other')
    True

    >>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this')
    True

Language tags that are very unlikely are still valid:

    >>> tag_is_valid('fr-Cyrl')
    True

Tags with non-ASCII characters are invalid, because they don't parse:

    >>> tag_is_valid('zh-普通话')
    False


### Getting alpha3 codes

Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the
variety of human languages by assigning every language a 3-letter code,
including the ones that already had 2-letter codes.

Unfortunately, this just led to more confusion. Some languages ended up with two
different 3-letter codes for legacy reasons, such as French, which is `fra` as a
"terminology" code, and `fre` as a "bibliographic" code. And meanwhile, `fr` was
still a code that you'd be using if you followed ISO 639-1.

In BCP 47, you should use 2-letter codes whenever they're available, and that's
what langcodes does.
Fortunately, all the languages that have two different +3-letter codes also have a 2-letter code, so if you prefer the 2-letter code, +you don't have to worry about the distinction. + +But some applications want the 3-letter code in particular, so langcodes +provides a method for getting those, `Language.to_alpha3()`. It returns the +'terminology' code by default, and passing `variant='B'` returns the +bibliographic code. + +When this method returns, it always returns a 3-letter string. + + >>> Language.get('fr').to_alpha3() + 'fra' + >>> Language.get('fr-CA').to_alpha3() + 'fra' + >>> Language.get('fr-CA').to_alpha3(variant='B') + 'fre' + >>> Language.get('de').to_alpha3() + 'deu' + >>> Language.get('no').to_alpha3() + 'nor' + >>> Language.get('un').to_alpha3() + Traceback (most recent call last): + ... + LookupError: 'un' is not a known language code, and has no alpha3 code. + +For many languages, the terminology and bibliographic alpha3 codes are the same. + + >>> Language.get('en').to_alpha3(variant='T') + 'eng' + >>> Language.get('en').to_alpha3(variant='B') + 'eng' + +When you use any of these "overlong" alpha3 codes in langcodes, they normalize +back to the alpha2 code: + + >>> Language.get('zho') + Language.make(language='zh') + + +## Working with language names + +The methods in this section require an optional package called `language_data`. +You can install it with `pip install language_data`, or request the optional +"data" feature of langcodes with `pip install langcodes[data]`. + +The dependency that you put in setup.py should be `langcodes[data]`. + +### Describing Language objects in natural language + +It's often helpful to be able to describe a language code in a way that a user +(or you) can understand, instead of in inscrutable short codes. The +`display_name` method lets you describe a Language object *in a language*. + +The `.display_name(language, min_score)` method will look up the name of the +language. The names come from the IANA language tag registry, which is only in +English, plus CLDR, which names languages in many commonly-used languages. + +The default language for naming things is English: + + >>> Language.make(language='fr').display_name() + 'French' + + >>> Language.make().display_name() + 'Unknown language' + + >>> Language.get('zh-Hans').display_name() + 'Chinese (Simplified)' + + >>> Language.get('en-US').display_name() + 'English (United States)' + +But you can ask for language names in numerous other languages: + + >>> Language.get('fr').display_name('fr') + 'français' + + >>> Language.get('fr').display_name('es') + 'francés' + + >>> Language.make().display_name('es') + 'lengua desconocida' + + >>> Language.get('zh-Hans').display_name('de') + 'Chinesisch (Vereinfacht)' + + >>> Language.get('en-US').display_name('zh-Hans') + '英语(美国)' + +Why does everyone get Slovak and Slovenian confused? Let's ask them. 
+ + >>> Language.get('sl').display_name('sl') + 'slovenščina' + >>> Language.get('sk').display_name('sk') + 'slovenčina' + >>> Language.get('sl').display_name('sk') + 'slovinčina' + >>> Language.get('sk').display_name('sl') + 'slovaščina' + +If the language has a script or territory code attached to it, these will be +described in parentheses: + + >>> Language.get('en-US').display_name() + 'English (United States)' + +Sometimes these can be the result of tag normalization, such as in this case +where the legacy tag 'sh' becomes 'sr-Latn': + + >>> Language.get('sh').display_name() + 'Serbian (Latin)' + + >>> Language.get('sh', normalize=False).display_name() + 'Serbo-Croatian' + +Naming a language in itself is sometimes a useful thing to do, so the +`.autonym()` method makes this easy, providing the display name of a language +in the language itself: + + >>> Language.get('fr').autonym() + 'français' + >>> Language.get('es').autonym() + 'español' + >>> Language.get('ja').autonym() + '日本語' + >>> Language.get('en-AU').autonym() + 'English (Australia)' + >>> Language.get('sr-Latn').autonym() + 'srpski (latinica)' + >>> Language.get('sr-Cyrl').autonym() + 'српски (ћирилица)' + +The names come from the Unicode CLDR data files, and in English they can +also come from the IANA language subtag registry. Together, they can give +you language names in the 196 languages that CLDR supports. + + +### Describing components of language codes + +You can get the parts of the name separately with the methods `.language_name()`, +`.script_name()`, and `.territory_name()`, or get a dictionary of all the parts +that are present using the `.describe()` method. These methods also accept a +language code for what language they should be described in. + + >>> shaw = Language.get('en-Shaw-GB') + >>> shaw.describe('en') + {'language': 'English', 'script': 'Shavian', 'territory': 'United Kingdom'} + + >>> shaw.describe('es') + {'language': 'inglés', 'script': 'shaviano', 'territory': 'Reino Unido'} + + +### Recognizing language names in natural language + +As the reverse of the above operations, you may want to look up a language by +its name, converting a natural language name such as "French" to a code such as +'fr'. + +The name can be in any language that CLDR supports (see "Ambiguity" below). + + >>> import langcodes + >>> langcodes.find('french') + Language.make(language='fr') + + >>> langcodes.find('francés') + Language.make(language='fr') + +However, this method currently ignores the parenthetical expressions that come from +`.display_name()`: + + >>> langcodes.find('English (Canada)') + Language.make(language='en') + +There is still room to improve the way that language names are matched, because +some languages are not consistently named the same way. The method currently +works with hundreds of language names that are used on Wiktionary. + +#### Ambiguity + +For the sake of usability, `langcodes.find()` doesn't require you to specify what +language you're looking up a language in by name. This could potentially lead to +a conflict: what if name "X" is language A's name for language B, and language C's +name for language D? + +We can collect the language codes from CLDR and see how many times this +happens. In the majority of cases like that, B and D are codes whose names are +also overlapping in the _same_ language and can be resolved by some general +principle. 
+ +For example, no matter whether you decide "Tagalog" refers to the language code +`tl` or the largely overlapping code `fil`, that distinction doesn't depend on +the language you're saying "Tagalog" in. We can just return `tl` consistently. + + >>> langcodes.find('tagalog') + Language.make(language='tl') + +In the few cases of actual interlingual ambiguity, langcodes won't match a result. +You can pass in a `language=` parameter to say what language the name is in. + +For example, there are two distinct languages called "Tonga" in various languages. +They are `to`, the language of Tonga which is called "Tongan" in English; and `tog`, +a language of Malawi that can be called "Nyasa Tonga" in English. + + >>> langcodes.find('tongan') + Language.make(language='to') + + >>> langcodes.find('nyasa tonga') + Language.make(language='tog') + + >>> langcodes.find('tonga') + Traceback (most recent call last): + ... + LookupError: Can't find any language named 'tonga' + + >>> langcodes.find('tonga', language='id') + Language.make(language='to') + + >>> langcodes.find('tonga', language='ca') + Language.make(language='tog') + +Other ambiguous names written in Latin letters are "Kiga", "Mbundu", "Roman", and "Ruanda". + + +## Demographic language data + +The `Language.speaking_population()` and `Language.writing_population()` +methods get Unicode's estimates of how many people in the world use a +language. + +As with the language name data, this requires the optional `language_data` +package to be installed. + +`.speaking_population()` estimates how many people speak a language. It can +be limited to a particular territory with a territory code (such as a country +code). + + >>> Language.get('es').speaking_population() + 487664083 + + >>> Language.get('pt').speaking_population() + 237135429 + + >>> Language.get('es-BR').speaking_population() + 76218 + + >>> Language.get('pt-BR').speaking_population() + 192661560 + + >>> Language.get('vo').speaking_population() + 0 + +Script codes will be ignored, because the script is not involved in speaking: + + >>> Language.get('es-Hant').speaking_population() ==\ + ... Language.get('es').speaking_population() + True + +`.writing_population()` estimates how many people write a language. + + >>> all = Language.get('zh').writing_population() + >>> all + 1240326057 + + >>> traditional = Language.get('zh-Hant').writing_population() + >>> traditional + 37019589 + + >>> simplified = Language.get('zh-Hans').writing_population() + >>> all == traditional + simplified + True + +The estimates for "writing population" are often overestimates, as described +in the [CLDR documentation on territory data][overestimates]. In most cases, +they are derived from published data about literacy rates in the places where +those languages are spoken. This doesn't take into account that many literate +people around the world speak a language that isn't typically written, and +write in a _different_ language. + +[overestimates]: https://unicode-org.github.io/cldr-staging/charts/39/supplemental/territory_language_information.html + +Like `.speaking_population()`, this can be limited to a particular territory: + + >>> Language.get('zh-Hant-HK').writing_population() + 6439733 + >>> Language.get('zh-Hans-HK').writing_population() + 338933 + + +## Comparing and matching languages + +The `tag_distance` function returns a number from 0 to 134 indicating the +distance between the language the user desires and a supported language. 
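
For a rough sense of these distances in code, here is a small sketch using the
top-level `tag_distance` function; the specific pairs are arbitrary examples,
and the exact numbers depend on the CLDR data in your installed version:

    import langcodes

    # The first argument is the language the user wants; the second is a
    # language you support (the measure is not necessarily symmetric).
    pairs = [('en-AU', 'en-GB'),           # minor regional difference
             ('af', 'nl'),                 # a gray area that depends on your use case
             ('ja', 'ja-Latn-hepburn')]    # large barrier to understanding
    for desired, supported in pairs:
        print(desired, '->', supported, '=', langcodes.tag_distance(desired, supported))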
+ +The distance data comes from CLDR v38.1 and involves a lot of judgment calls +made by the Unicode consortium. + + +### Distance values + +This table summarizes the language distance values: + +| Value | Meaning | Example +| ----: | :------ | :------ +| 0 | These codes represent the same language, possibly after filling in values and normalizing. | Norwegian Bokmål → Norwegian +| 1-3 | These codes indicate a minor regional difference. | Australian English → British English +| 4-9 | These codes indicate a significant but unproblematic regional difference. | American English → British English +| 10-24 | A gray area that depends on your use case. There may be problems with understanding or usability. | Afrikaans → Dutch, Wu Chinese → Mandarin Chinese +| 25-50 | These languages aren't similar, but there are demographic reasons to expect some intelligibility. | Tamil → English, Marathi → Hindi +| 51-79 | There are large barriers to understanding. | Japanese → Japanese in Hepburn romanization +| 80-99 | These are different languages written in the same script. | English → French, Arabic → Urdu +| 100+ | These languages have nothing particularly in common. | English → Japanese, English → Tamil + +See the docstring of `tag_distance` for more explanation and examples. + + +### Finding the best matching language + +Suppose you have software that supports any of the `supported_languages`. The +user wants to use `desired_language`. + +The function `closest_supported_match(desired_language, supported_languages)` +lets you choose the right language, even if there isn't an exact match. +It returns the language tag of the best-supported language, even if there +isn't an exact match. + +The `max_distance` parameter lets you set a cutoff on what counts as language +support. It has a default of 25, a value that is probably okay for simple +cases of i18n, but you might want to set it lower to require more precision. + + >>> closest_supported_match('fr', ['de', 'en', 'fr']) + 'fr' + + >>> closest_supported_match('pt', ['pt-BR', 'pt-PT']) + 'pt-BR' + + >>> closest_supported_match('en-AU', ['en-GB', 'en-US']) + 'en-GB' + + >>> closest_supported_match('af', ['en', 'nl', 'zu']) + 'nl' + + >>> closest_supported_match('und', ['en', 'und']) + 'und' + + >>> print(closest_supported_match('af', ['en', 'nl', 'zu'], max_distance=10)) + None + +A similar function is `closest_match(desired_language, supported_language)`, +which returns both the best matching language tag and the distance. If there is +no match, it returns ('und', 1000). + + >>> closest_match('fr', ['de', 'en', 'fr']) + ('fr', 0) + + >>> closest_match('sh', ['hr', 'bs', 'sr-Latn', 'sr-Cyrl']) + ('sr-Latn', 0) + + >>> closest_match('id', ['zsm', 'mhp']) + ('zsm', 14) + + >>> closest_match('ja', ['ja-Latn-hepburn', 'en']) + ('und', 1000) + + >>> closest_match('ja', ['ja-Latn-hepburn', 'en'], max_distance=60) + ('ja-Latn-hepburn', 50) + +## Further API documentation + +There are many more methods for manipulating and comparing language codes, +and you will find them documented thoroughly in [the code itself][code]. + +The interesting functions all live in this one file, with extensive docstrings +and annotations. Making a separate Sphinx page out of the docstrings would be +the traditional thing to do, but here it just seems redundant. You can go read +the docstrings in context, in their native habitat, and they'll always be up to +date. 
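
If you would rather read them from an interpreter than on GitHub, the standard
`help()` function shows the same docstrings, for example:

    import langcodes
    help(langcodes.closest_match)       # prints the signature and docstring
    help(langcodes.Language.distance)   # method-level documentation works too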
+ +[Code with documentation][code] + +[code]: https://github.com/rspeer/langcodes/blob/master/langcodes/__init__.py + +# Changelog + +## Version 3.3 (November 2021) + +- Updated to CLDR v40. + +- Updated the IANA subtag registry to version 2021-08-06. + +- Bug fix: recognize script codes that appear in the IANA registry even if + they're missing from CLDR for some reason. 'cu-Cyrs' is valid, for example. + +- Switched the build system from `setuptools` to `poetry`. + +To install the package in editable mode before PEP 660 is better supported, use +`poetry install` instead of `pip install -e .`. + +## Version 3.2 (October 2021) + +- Supports Python 3.6 through 3.10. + +- Added the top-level function `tag_is_valid(tag)`, for determining if a string + is a valid language tag without having to parse it first. + +- Added the top-level function `closest_supported_match(desired, supported)`, + which is similar to `closest_match` but with a simpler return value. It + returns the language tag of the closest match, or None if no match is close + enough. + +- Bug fix: a lot of well-formed but invalid language codes appeared to be + valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of + a subtag. The validity regex is now required to match completely. + +- Bug fixes that address some edge cases of validity: + + - A language tag that is entirely private use, like 'x-private', is valid + - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc', + is invalid + - A language tag that uses the same variant twice, like 'de-1901-1901', is + invalid + - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid + +- Updated dependencies so they are compatible with Python 3.10, including + switching back from `marisa-trie-m` to `marisa-trie` in `language_data`. + +- In bugfix release 3.2.1, corrected cases where the parser accepted + ill-formed language tags: + + - All subtags must be made of between 1 and 8 alphanumeric ASCII characters + - Tags with two extension 'singletons' in a row (`en-a-b-ccc`) should be + rejected + +## Version 3.1 (February 2021) + +- Added the `Language.to_alpha3()` method, for getting a three-letter code for a + language according to ISO 639-2. + +- Updated the type annotations from obiwan-style to mypy-style. + + +## Version 3.0 (February 2021) + +- Moved bulky data, particularly language names, into a separate + `language_data` package. In situations where the data isn't needed, + `langcodes` becomes a smaller, pure-Python package with no dependencies. + +- Language codes where the language segment is more than 4 letters no longer + parse: Language.get('nonsense') now returns an error. + + (This is technically stricter than the parse rules of BCP 47, but there are + no valid language codes of this form and there should never be any. An + attempt to parse a language code with 5-8 letters is most likely a mistake or + an attempt to make up a code.) + +- Added a method for checking the validity of a language code. + +- Added methods for estimating language population. + +- Updated to CLDR 38.1, which includes differences in language matching. + +- Tested on Python 3.6 through 3.9; no longer tested on Python 3.5. + + +## Version 2.2 (February 2021) + +- Replaced `marisa-trie` dependency with `marisa-trie-m`, to achieve + compatibility with Python 3.9. 
+ + +## Version 2.1 (June 2020) + +- Added the `display_name` method to be a more intuitive way to get a string + describing a language code, and made the `autonym` method use it instead of + `language_name`. + +- Updated to CLDR v37. + +- Previously, some attempts to get the name of a language would return its + language code instead, perhaps because the name was being requested in a + language for which CLDR doesn't have name data. This is unfortunate because + names and codes should not be interchangeable. + + Now we fall back on English names instead, which exists for all IANA codes. + If the code is unknown, we return a string such as "Unknown language [xx]". + + +## Version 2.0 (April 2020) + +Version 2.0 involves some significant changes that may break compatibility with 1.4, +in addition to updating to version 36.1 of the Unicode CLDR data and the April 2020 +version of the IANA subtag registry. + +This version requires Python 3.5 or later. + +### Match scores replaced with distances + +Originally, the goodness of a match between two different language codes was defined +in terms of a "match score" with a maximum of 100. Around 2016, Unicode started +replacing this with a different measure, the "match distance", which was defined +much more clearly, but we had to keep using the "match score". + +As of langcodes version 2.0, the "score" functions (such as +`Language.match_score`, `tag_match_score`, and `best_match`) are deprecated. +They'll keep using the deprecated language match tables from around CLDR 27. + +For a better measure of the closeness of two language codes, use `Language.distance`, +`tag_distance`, and `closest_match`. + +### 'region' renamed to 'territory' + +We were always out of step with CLDR here. Following the example of the IANA +database, we referred to things like the 'US' in 'en-US' as a "region code", +but the Unicode standards consistently call it a "territory code". + +In langcodes 2.0, parameters, dictionary keys, and attributes named `region` +have been renamed to `territory`. We try to support a few common cases with +deprecation warnings, such as looking up the `region` property of a Language +object. + +A nice benefit of this is that when a dictionary is displayed with 'language', +'script', and 'territory' keys in alphabetical order, they are in the same +order as they are in a language code. + + + +%prep +%autosetup -n langcodes-3.3.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . 
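
# The scriptlet above records every installed file under usr/bin, usr/sbin,
# usr/lib and usr/lib64 in filelist.lst, and the (compressed) man pages in
# doclist.lst; the "files" sections below consume these lists through their
# -f options.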
+ +%files -n python3-langcodes -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Mon Apr 10 2023 Python_Bot - 3.3.0-1 +- Package Spec generated diff --git a/sources b/sources new file mode 100644 index 0000000..ed291d6 --- /dev/null +++ b/sources @@ -0,0 +1 @@ +c257046ae26f6af4097a14d7612fc8f3 langcodes-3.3.0.tar.gz -- cgit v1.2.3