author    CoprDistGit <infra@openeuler.org>    2023-04-10 11:18:20 +0000
committer CoprDistGit <infra@openeuler.org>    2023-04-10 11:18:20 +0000
commit    f897eea1d38ed414288327a99b4078239e1a79ec (patch)
tree      400c48875629feda63b2743781bd23d1196d8dcb /python-langcodes.spec
parent    8d80f1d56d675bde0bd15f81dad6ac1cbec7fa3a (diff)
automatic import of python-langcodes
Diffstat (limited to 'python-langcodes.spec')
-rw-r--r--    python-langcodes.spec    2413
1 files changed, 2413 insertions, 0 deletions
diff --git a/python-langcodes.spec b/python-langcodes.spec
new file mode 100644
index 0000000..9583db2
--- /dev/null
+++ b/python-langcodes.spec
@@ -0,0 +1,2413 @@
+%global _empty_manifest_terminate_build 0
+Name: python-langcodes
+Version: 3.3.0
+Release: 1
+Summary: Tools for labeling human languages with IETF language tags
+License: MIT
+URL: https://github.com/rspeer/langcodes
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/5f/ec/9955d772ecac0bdfb5d706d64f185ac68bd0d4092acdc2c5a1882c824369/langcodes-3.3.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-language-data
+
+%description
+# Langcodes: a library for language codes
+
+**langcodes** knows what languages are. It knows the standardized codes that
+refer to them, such as `en` for English, `es` for Spanish and `hi` for Hindi.
+
+These are [IETF language tags][]. You may know them by their old name, ISO 639
+language codes. IETF has done some important things for backward compatibility
+and supporting language variations that you won't find in the ISO standard.
+
+[IETF language tags]: https://www.w3.org/International/articles/language-tags/
+
+It may sound to you like langcodes solves a pretty boring problem. At one
+level, that's right. Sometimes you have a boring problem, and it's great when a
+library solves it for you.
+
+But there's an interesting problem hiding in here. How do you work with
+language codes? How do you know when two different codes represent the same
+thing? How should your code represent relationships between codes, like the
+following?
+
+* `eng` is equivalent to `en`.
+* `fra` and `fre` are both equivalent to `fr`.
+* `en-GB` might be written as `en-gb` or `en_GB`. Or as 'en-UK', which is
+ erroneous, but should be treated as the same.
+* `en-CA` is not exactly equivalent to `en-US`, but it's really, really close.
+* `en-Latn-US` is equivalent to `en-US`, because written English must be written
+ in the Latin alphabet to be understood.
+* The difference between `ar` and `arb` is the difference between "Arabic" and
+ "Modern Standard Arabic", a difference that may not be relevant to you.
+* You'll find Mandarin Chinese tagged as `cmn` on Wiktionary, but many other
+ resources would call the same language `zh`.
+* Chinese is written in different scripts in different territories. Some
+ software distinguishes the script. Other software distinguishes the territory.
+ The result is that `zh-CN` and `zh-Hans` are used interchangeably, as are
+ `zh-TW` and `zh-Hant`, even though occasionally you'll need something
+ different such as `zh-HK` or `zh-Latn-pinyin`.
+* The Indonesian (`id`) and Malaysian (`ms` or `zsm`) languages are mutually
+ intelligible.
+* `jp` is not a language code. (The language code for Japanese is `ja`, but
+ people confuse it with the country code for Japan.)
+
+One way to know is to read IETF standards and Unicode technical reports.
+Another way is to use a library that implements those standards and guidelines
+for you, which langcodes does.
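+
+For instance, langcodes can confirm several of the relationships above
+directly. A small sketch (the distance values come from the CLDR matching
+data described later in this document):
+
+    >>> from langcodes import standardize_tag, tag_distance
+    >>> standardize_tag('eng')    # 'eng' is equivalent to 'en'
+    'en'
+    >>> standardize_tag('en_GB')  # 'en_GB' normalizes to 'en-GB'
+    'en-GB'
+    >>> tag_distance('en-CA', 'en-US') < tag_distance('en', 'ja')
+    True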
+
+When you're working with these short language codes, you may want to see the
+name that the language is called _in_ a language: `fr` is called "French" in
+English. That language doesn't have to be English: `fr` is called "français" in
+French. A supplement to langcodes, [`language_data`][language-data], provides
+this information.
+
+[language-data]: https://github.com/rspeer/language_data
+
+langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released
+as free software under the MIT license.
+
+
+## Standards implemented
+
+Although this is not the only reason to use it, langcodes will make you more
+acronym-compliant.
+
+langcodes implements [BCP 47](http://tools.ietf.org/html/bcp47), the IETF Best
+Current Practices on Tags for Identifying Languages. BCP 47 is also known as
+RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also
+implements recommendations from the [Unicode CLDR](http://cldr.unicode.org).
+
+langcodes can also refer to a database of language properties and names, built
+from Unicode CLDR and the IANA subtag registry, if you install `language_data`.
+
+In summary, langcodes takes language codes and does the Right Thing with them,
+and if you want to know exactly what the Right Thing is, there are some
+documents you can go read.
+
+
+# Documentation
+
+## Standardizing language tags
+
+The `standardize_tag` function standardizes tags, as strings, in several ways.
+
+It replaces overlong tags with their shortest version, and also formats them
+according to the conventions of BCP 47:
+
+ >>> from langcodes import *
+ >>> standardize_tag('eng_US')
+ 'en-US'
+
+It removes script subtags that are redundant with the language:
+
+ >>> standardize_tag('en-Latn')
+ 'en'
+
+It replaces deprecated values with their correct versions, if possible:
+
+ >>> standardize_tag('en-uk')
+ 'en-GB'
+
+Sometimes this involves complex substitutions, such as replacing Serbo-Croatian
+(`sh`) with Serbian in Latin script (`sr-Latn`), or the entire tag `sgn-US`
+with `ase` (American Sign Language).
+
+ >>> standardize_tag('sh-QU')
+ 'sr-Latn-EU'
+
+ >>> standardize_tag('sgn-US')
+ 'ase'
+
+If *macro* is True, it uses macrolanguage codes as a replacement for the most
+common standardized language within that macrolanguage.
+
+ >>> standardize_tag('arb-Arab', macro=True)
+ 'ar'
+
+Even when *macro* is False, it shortens tags that contain both the
+macrolanguage and the language:
+
+ >>> standardize_tag('zh-cmn-hans-cn')
+ 'zh-Hans-CN'
+
+If the tag can't be parsed according to BCP 47, this will raise a
+LanguageTagError (a subclass of ValueError). The first tag below parses,
+despite its unconventional capitalization; the second raises an error because
+its script subtag appears after the territory subtag:
+
+ >>> standardize_tag('spa-latn-mx')
+ 'es-MX'
+
+ >>> standardize_tag('spa-mx-latn')
+ Traceback (most recent call last):
+ ...
+ langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string.
+
+
+## Language objects
+
+This package defines one class, named Language, which contains the results
+of parsing a language tag. Language objects have the following fields,
+any of which may be unspecified:
+
+- *language*: the code for the language itself.
+- *script*: the 4-letter code for the writing system being used.
+- *territory*: the 2-letter or 3-digit code for the country or similar region
+ whose usage of the language appears in this text.
+- *extlangs*: a list of more specific language codes that follow the language
+ code. (This is allowed by the language code syntax, but deprecated.)
+- *variants*: codes for specific variations of language usage that aren't
+ covered by the *script* or *territory* codes.
+- *extensions*: information that's attached to the language code for use in
+ some specific system, such as Unicode collation orders.
+- *private*: a code starting with `x-` that has no defined meaning.
+
+The `Language.get` method converts a string to a Language instance, and the
+`Language.make` method makes a Language instance from its fields. These values
+are cached so that calling `Language.get` or `Language.make` again with the
+same values returns the same object, for efficiency.
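+
+A minimal way to observe that caching (a sketch relying on the behavior just
+described):
+
+    >>> Language.get('en-US') is Language.get('en-US')
+    True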
+
+By default, it will replace non-standard and overlong tags as it interprets
+them. To disable this feature and get the codes that literally appear in the
+language tag, use the *normalize=False* option.
+
+ >>> Language.get('en-Latn-US')
+ Language.make(language='en', script='Latn', territory='US')
+
+ >>> Language.get('sgn-US', normalize=False)
+ Language.make(language='sgn', territory='US')
+
+ >>> Language.get('und')
+ Language.make()
+
+Here are some examples of replacing non-standard tags:
+
+ >>> Language.get('sh-QU')
+ Language.make(language='sr', script='Latn', territory='EU')
+
+ >>> Language.get('sgn-US')
+ Language.make(language='ase')
+
+ >>> Language.get('zh-cmn-Hant')
+ Language.make(language='zh', script='Hant')
+
+Use the `str()` function on a Language object to convert it back to its
+standard string form:
+
+ >>> str(Language.get('sh-QU'))
+ 'sr-Latn-EU'
+
+ >>> str(Language.make(territory='IN'))
+ 'und-IN'
+
+
+### Checking validity
+
+A language code is _valid_ when every part of it is assigned a meaning by IANA.
+That meaning could be "private use".
+
+In langcodes, we check the language subtag, script, territory, and variants for
+validity. We don't check other parts such as extlangs or Unicode extensions.
+
+For example, `ja` is a valid language code, and `jp` is not:
+
+ >>> Language.get('ja').is_valid()
+ True
+
+ >>> Language.get('jp').is_valid()
+ False
+
+The top-level function `tag_is_valid(tag)` is possibly more convenient to use,
+because it can return False even for tags that don't parse:
+
+ >>> tag_is_valid('C')
+ False
+
+If one subtag is invalid, the entire code is invalid:
+
+ >>> tag_is_valid('en-000')
+ False
+
+`iw` is valid, though it's a deprecated alias for `he`:
+
+ >>> tag_is_valid('iw')
+ True
+
+The empty language tag (`und`) is valid:
+
+ >>> tag_is_valid('und')
+ True
+
+Private use codes are valid:
+
+ >>> tag_is_valid('x-other')
+ True
+
+ >>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this')
+ True
+
+Language tags that are very unlikely are still valid:
+
+ >>> tag_is_valid('fr-Cyrl')
+ True
+
+Tags with non-ASCII characters are invalid, because they don't parse:
+
+ >>> tag_is_valid('zh-普通话')
+ False
+
+
+### Getting alpha3 codes
+
+Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the
+variety of human languages by assigning every language a 3-letter code,
+including the ones that already had 2-letter codes.
+
+Unfortunately, this just led to more confusion. Some languages ended up with two
+different 3-letter codes for legacy reasons, such as French, which is `fra` as a
+"terminology" code, and `fre` as a "biblographic" code. And meanwhile, `fr` was
+still a code that you'd be using if you followed ISO 639-1.
+
+In BCP 47, you should use 2-letter codes whenever they're available, and that's
+what langcodes does. Fortunately, all the languages that have two different
+3-letter codes also have a 2-letter code, so if you prefer the 2-letter code,
+you don't have to worry about the distinction.
+
+But some applications want the 3-letter code in particular, so langcodes
+provides a method for getting those, `Language.to_alpha3()`. It returns the
+'terminology' code by default, and passing `variant='B'` returns the
+bibliographic code.
+
+When it succeeds, this method always returns a 3-letter string:
+
+ >>> Language.get('fr').to_alpha3()
+ 'fra'
+ >>> Language.get('fr-CA').to_alpha3()
+ 'fra'
+ >>> Language.get('fr-CA').to_alpha3(variant='B')
+ 'fre'
+ >>> Language.get('de').to_alpha3()
+ 'deu'
+ >>> Language.get('no').to_alpha3()
+ 'nor'
+ >>> Language.get('un').to_alpha3()
+ Traceback (most recent call last):
+ ...
+ LookupError: 'un' is not a known language code, and has no alpha3 code.
+
+For many languages, the terminology and bibliographic alpha3 codes are the same.
+
+ >>> Language.get('en').to_alpha3(variant='T')
+ 'eng'
+ >>> Language.get('en').to_alpha3(variant='B')
+ 'eng'
+
+When you use any of these "overlong" alpha3 codes in langcodes, they normalize
+back to the alpha2 code:
+
+ >>> Language.get('zho')
+ Language.make(language='zh')
+
+
+## Working with language names
+
+The methods in this section require an optional package called `language_data`.
+You can install it with `pip install language_data`, or request the optional
+"data" feature of langcodes with `pip install langcodes[data]`.
+
+The dependency that you put in setup.py should be `langcodes[data]`.
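+
+For example, in a hypothetical `setup.py` (a sketch; the project name below is
+a placeholder, not part of langcodes):
+
+    from setuptools import setup
+
+    setup(
+        name='my-project',  # hypothetical package name
+        install_requires=['langcodes[data]'],
+    )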
+
+### Describing Language objects in natural language
+
+It's often helpful to be able to describe a language code in a way that a user
+(or you) can understand, instead of in inscrutable short codes. The
+`display_name` method lets you describe a Language object *in a language*.
+
+The `.display_name(language, min_score)` method will look up the name of the
+language. The names come from the IANA language tag registry, which is only in
+English, plus CLDR, which names languages in many commonly-used languages.
+
+The default language for naming things is English:
+
+ >>> Language.make(language='fr').display_name()
+ 'French'
+
+ >>> Language.make().display_name()
+ 'Unknown language'
+
+ >>> Language.get('zh-Hans').display_name()
+ 'Chinese (Simplified)'
+
+ >>> Language.get('en-US').display_name()
+ 'English (United States)'
+
+But you can ask for language names in numerous other languages:
+
+ >>> Language.get('fr').display_name('fr')
+ 'français'
+
+ >>> Language.get('fr').display_name('es')
+ 'francés'
+
+ >>> Language.make().display_name('es')
+ 'lengua desconocida'
+
+ >>> Language.get('zh-Hans').display_name('de')
+ 'Chinesisch (Vereinfacht)'
+
+ >>> Language.get('en-US').display_name('zh-Hans')
+ '英语(美国)'
+
+Why does everyone get Slovak and Slovenian confused? Let's ask them.
+
+ >>> Language.get('sl').display_name('sl')
+ 'slovenščina'
+ >>> Language.get('sk').display_name('sk')
+ 'slovenčina'
+ >>> Language.get('sl').display_name('sk')
+ 'slovinčina'
+ >>> Language.get('sk').display_name('sl')
+ 'slovaščina'
+
+If the language has a script or territory code attached to it, these will be
+described in parentheses:
+
+ >>> Language.get('en-US').display_name()
+ 'English (United States)'
+
+Sometimes these can be the result of tag normalization, such as in this case
+where the legacy tag 'sh' becomes 'sr-Latn':
+
+ >>> Language.get('sh').display_name()
+ 'Serbian (Latin)'
+
+ >>> Language.get('sh', normalize=False).display_name()
+ 'Serbo-Croatian'
+
+Naming a language in itself is sometimes a useful thing to do, so the
+`.autonym()` method makes this easy, providing the display name of a language
+in the language itself:
+
+ >>> Language.get('fr').autonym()
+ 'français'
+ >>> Language.get('es').autonym()
+ 'español'
+ >>> Language.get('ja').autonym()
+ '日本語'
+ >>> Language.get('en-AU').autonym()
+ 'English (Australia)'
+ >>> Language.get('sr-Latn').autonym()
+ 'srpski (latinica)'
+ >>> Language.get('sr-Cyrl').autonym()
+ 'српски (ћирилица)'
+
+The names come from the Unicode CLDR data files, and in English they can
+also come from the IANA language subtag registry. Together, they can give
+you language names in the 196 languages that CLDR supports.
+
+
+### Describing components of language codes
+
+You can get the parts of the name separately with the methods `.language_name()`,
+`.script_name()`, and `.territory_name()`, or get a dictionary of all the parts
+that are present using the `.describe()` method. These methods also accept a
+language code for what language they should be described in.
+
+ >>> shaw = Language.get('en-Shaw-GB')
+ >>> shaw.describe('en')
+ {'language': 'English', 'script': 'Shavian', 'territory': 'United Kingdom'}
+
+ >>> shaw.describe('es')
+ {'language': 'inglés', 'script': 'shaviano', 'territory': 'Reino Unido'}
+
+
+### Recognizing language names in natural language
+
+As the reverse of the above operations, you may want to look up a language by
+its name, converting a natural language name such as "French" to a code such as
+'fr'.
+
+The name can be in any language that CLDR supports (see "Ambiguity" below).
+
+ >>> import langcodes
+ >>> langcodes.find('french')
+ Language.make(language='fr')
+
+ >>> langcodes.find('francés')
+ Language.make(language='fr')
+
+However, this method currently ignores the parenthetical expressions that come from
+`.display_name()`:
+
+ >>> langcodes.find('English (Canada)')
+ Language.make(language='en')
+
+There is still room to improve the way that language names are matched, because
+some languages are not consistently named the same way. The method currently
+works with hundreds of language names that are used on Wiktionary.
+
+#### Ambiguity
+
+For the sake of usability, `langcodes.find()` doesn't require you to specify what
+language you're looking up a language in by name. This could potentially lead to
+a conflict: what if the name "X" is language A's name for language B, and language C's
+name for language D?
+
+We can collect the language codes from CLDR and see how many times this
+happens. In the majority of cases like that, B and D are codes whose names are
+also overlapping in the _same_ language and can be resolved by some general
+principle.
+
+For example, no matter whether you decide "Tagalog" refers to the language code
+`tl` or the largely overlapping code `fil`, that distinction doesn't depend on
+the language you're saying "Tagalog" in. We can just return `tl` consistently.
+
+ >>> langcodes.find('tagalog')
+ Language.make(language='tl')
+
+In the few cases of actual interlingual ambiguity, langcodes won't match a result.
+You can pass in a `language=` parameter to say what language the name is in.
+
+For example, there are two distinct languages called "Tonga" in various languages.
+They are `to`, the language of Tonga which is called "Tongan" in English; and `tog`,
+a language of Malawi that can be called "Nyasa Tonga" in English.
+
+ >>> langcodes.find('tongan')
+ Language.make(language='to')
+
+ >>> langcodes.find('nyasa tonga')
+ Language.make(language='tog')
+
+ >>> langcodes.find('tonga')
+ Traceback (most recent call last):
+ ...
+ LookupError: Can't find any language named 'tonga'
+
+ >>> langcodes.find('tonga', language='id')
+ Language.make(language='to')
+
+ >>> langcodes.find('tonga', language='ca')
+ Language.make(language='tog')
+
+Other ambiguous names written in Latin letters are "Kiga", "Mbundu", "Roman", and "Ruanda".
+
+
+## Demographic language data
+
+The `Language.speaking_population()` and `Language.writing_population()`
+methods get Unicode's estimates of how many people in the world use a
+language.
+
+As with the language name data, this requires the optional `language_data`
+package to be installed.
+
+`.speaking_population()` estimates how many people speak a language. It can
+be limited to a particular territory with a territory code (such as a country
+code).
+
+ >>> Language.get('es').speaking_population()
+ 487664083
+
+ >>> Language.get('pt').speaking_population()
+ 237135429
+
+ >>> Language.get('es-BR').speaking_population()
+ 76218
+
+ >>> Language.get('pt-BR').speaking_population()
+ 192661560
+
+ >>> Language.get('vo').speaking_population()
+ 0
+
+Script codes will be ignored, because the script is not involved in speaking:
+
+ >>> Language.get('es-Hant').speaking_population() ==\
+ ... Language.get('es').speaking_population()
+ True
+
+`.writing_population()` estimates how many people write a language.
+
+ >>> all = Language.get('zh').writing_population()
+ >>> all
+ 1240326057
+
+ >>> traditional = Language.get('zh-Hant').writing_population()
+ >>> traditional
+ 37019589
+
+ >>> simplified = Language.get('zh-Hans').writing_population()
+ >>> all == traditional + simplified
+ True
+
+The estimates for "writing population" are often overestimates, as described
+in the [CLDR documentation on territory data][overestimates]. In most cases,
+they are derived from published data about literacy rates in the places where
+those languages are spoken. This doesn't take into account that many literate
+people around the world speak a language that isn't typically written, and
+write in a _different_ language.
+
+[overestimates]: https://unicode-org.github.io/cldr-staging/charts/39/supplemental/territory_language_information.html
+
+Like `.speaking_population()`, this can be limited to a particular territory:
+
+ >>> Language.get('zh-Hant-HK').writing_population()
+ 6439733
+ >>> Language.get('zh-Hans-HK').writing_population()
+ 338933
+
+
+## Comparing and matching languages
+
+The `tag_distance` function returns a number from 0 to 134 indicating the
+distance between the language the user desires and a supported language.
+
+The distance data comes from CLDR v38.1 and involves a lot of judgment calls
+made by the Unicode Consortium.
+
+
+### Distance values
+
+This table summarizes the language distance values:
+
+| Value | Meaning | Example
+| ----: | :------ | :------
+| 0 | These codes represent the same language, possibly after filling in values and normalizing. | Norwegian Bokmål → Norwegian
+| 1-3 | These codes indicate a minor regional difference. | Australian English → British English
+| 4-9 | These codes indicate a significant but unproblematic regional difference. | American English → British English
+| 10-24 | A gray area that depends on your use case. There may be problems with understanding or usability. | Afrikaans → Dutch, Wu Chinese → Mandarin Chinese
+| 25-50 | These languages aren't similar, but there are demographic reasons to expect some intelligibility. | Tamil → English, Marathi → Hindi
+| 51-79 | There are large barriers to understanding. | Japanese → Japanese in Hepburn romanization
+| 80-99 | These are different languages written in the same script. | English → French, Arabic → Urdu
+| 100+ | These languages have nothing particularly in common. | English → Japanese, English → Tamil
+
+See the docstring of `tag_distance` for more explanation and examples.
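+
+As a rough sketch of how the table reads in code (comparisons are used instead
+of exact values, which depend on the CLDR version):
+
+    >>> tag_distance('en', 'en')             # the same language
+    0
+    >>> tag_distance('en-AU', 'en-GB') < 10  # a minor regional difference
+    True
+    >>> tag_distance('en', 'ja') >= 100      # nothing particularly in common
+    True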
+
+
+### Finding the best matching language
+
+Suppose you have software that supports any of the `supported_languages`. The
+user wants to use `desired_language`.
+
+The function `closest_supported_match(desired_language, supported_languages)`
+lets you choose the right language, even if there isn't an exact match.
+It returns the language tag of the best-supported language, even if there
+isn't an exact match.
+
+The `max_distance` parameter lets you set a cutoff on what counts as language
+support. It has a default of 25, a value that is probably okay for simple
+cases of i18n, but you might want to set it lower to require more precision.
+
+ >>> closest_supported_match('fr', ['de', 'en', 'fr'])
+ 'fr'
+
+ >>> closest_supported_match('pt', ['pt-BR', 'pt-PT'])
+ 'pt-BR'
+
+ >>> closest_supported_match('en-AU', ['en-GB', 'en-US'])
+ 'en-GB'
+
+ >>> closest_supported_match('af', ['en', 'nl', 'zu'])
+ 'nl'
+
+ >>> closest_supported_match('und', ['en', 'und'])
+ 'und'
+
+ >>> print(closest_supported_match('af', ['en', 'nl', 'zu'], max_distance=10))
+ None
+
+A similar function is `closest_match(desired_language, supported_languages)`,
+which returns both the best matching language tag and the distance. If there is
+no match, it returns ('und', 1000).
+
+ >>> closest_match('fr', ['de', 'en', 'fr'])
+ ('fr', 0)
+
+ >>> closest_match('sh', ['hr', 'bs', 'sr-Latn', 'sr-Cyrl'])
+ ('sr-Latn', 0)
+
+ >>> closest_match('id', ['zsm', 'mhp'])
+ ('zsm', 14)
+
+ >>> closest_match('ja', ['ja-Latn-hepburn', 'en'])
+ ('und', 1000)
+
+ >>> closest_match('ja', ['ja-Latn-hepburn', 'en'], max_distance=60)
+ ('ja-Latn-hepburn', 50)
+
+## Further API documentation
+
+There are many more methods for manipulating and comparing language codes,
+and you will find them documented thoroughly in [the code itself][code].
+
+The interesting functions all live in this one file, with extensive docstrings
+and annotations. Making a separate Sphinx page out of the docstrings would be
+the traditional thing to do, but here it just seems redundant. You can go read
+the docstrings in context, in their native habitat, and they'll always be up to
+date.
+
+[Code with documentation][code]
+
+[code]: https://github.com/rspeer/langcodes/blob/master/langcodes/__init__.py
+
+# Changelog
+
+## Version 3.3 (November 2021)
+
+- Updated to CLDR v40.
+
+- Updated the IANA subtag registry to version 2021-08-06.
+
+- Bug fix: recognize script codes that appear in the IANA registry even if
+ they're missing from CLDR for some reason. 'cu-Cyrs' is valid, for example.
+
+- Switched the build system from `setuptools` to `poetry`.
+
+Until PEP 660 is better supported, install the package in editable mode with
+`poetry install` instead of `pip install -e .`.
+
+## Version 3.2 (October 2021)
+
+- Supports Python 3.6 through 3.10.
+
+- Added the top-level function `tag_is_valid(tag)`, for determining if a string
+ is a valid language tag without having to parse it first.
+
+- Added the top-level function `closest_supported_match(desired, supported)`,
+ which is similar to `closest_match` but with a simpler return value. It
+ returns the language tag of the closest match, or None if no match is close
+ enough.
+
+- Bug fix: a lot of well-formed but invalid language codes appeared to be
+ valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of
+ a subtag. The validity regex is now required to match completely.
+
+- Bug fixes that address some edge cases of validity:
+
+ - A language tag that is entirely private use, like 'x-private', is valid
+ - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc',
+ is invalid
+ - A language tag that uses the same variant twice, like 'de-1901-1901', is
+ invalid
+ - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid
+
+- Updated dependencies so they are compatible with Python 3.10, including
+ switching back from `marisa-trie-m` to `marisa-trie` in `language_data`.
+
+- In bugfix release 3.2.1, corrected cases where the parser accepted
+ ill-formed language tags:
+
+ - All subtags must be made of between 1 and 8 alphanumeric ASCII characters
+ - Tags with two extension 'singletons' in a row (`en-a-b-ccc`) should be
+ rejected
+
+## Version 3.1 (February 2021)
+
+- Added the `Language.to_alpha3()` method, for getting a three-letter code for a
+ language according to ISO 639-2.
+
+- Updated the type annotations from obiwan-style to mypy-style.
+
+
+## Version 3.0 (February 2021)
+
+- Moved bulky data, particularly language names, into a separate
+ `language_data` package. In situations where the data isn't needed,
+ `langcodes` becomes a smaller, pure-Python package with no dependencies.
+
+- Language codes where the language segment is more than 4 letters no longer
+  parse: Language.get('nonsense') now raises an error.
+
+ (This is technically stricter than the parse rules of BCP 47, but there are
+ no valid language codes of this form and there should never be any. An
+ attempt to parse a language code with 5-8 letters is most likely a mistake or
+ an attempt to make up a code.)
+
+- Added a method for checking the validity of a language code.
+
+- Added methods for estimating language population.
+
+- Updated to CLDR 38.1, which includes differences in language matching.
+
+- Tested on Python 3.6 through 3.9; no longer tested on Python 3.5.
+
+
+## Version 2.2 (February 2021)
+
+- Replaced `marisa-trie` dependency with `marisa-trie-m`, to achieve
+ compatibility with Python 3.9.
+
+
+## Version 2.1 (June 2020)
+
+- Added the `display_name` method to be a more intuitive way to get a string
+ describing a language code, and made the `autonym` method use it instead of
+ `language_name`.
+
+- Updated to CLDR v37.
+
+- Previously, some attempts to get the name of a language would return its
+ language code instead, perhaps because the name was being requested in a
+ language for which CLDR doesn't have name data. This is unfortunate because
+ names and codes should not be interchangeable.
+
+  Now we fall back on English names instead, which exist for all IANA codes.
+ If the code is unknown, we return a string such as "Unknown language [xx]".
+
+
+## Version 2.0 (April 2020)
+
+Version 2.0 involves some significant changes that may break compatibility with 1.4,
+in addition to updating to version 36.1 of the Unicode CLDR data and the April 2020
+version of the IANA subtag registry.
+
+This version requires Python 3.5 or later.
+
+### Match scores replaced with distances
+
+Originally, the goodness of a match between two different language codes was defined
+in terms of a "match score" with a maximum of 100. Around 2016, Unicode started
+replacing this with a different measure, the "match distance", which was defined
+much more clearly, but we had to keep using the "match score".
+
+As of langcodes version 2.0, the "score" functions (such as
+`Language.match_score`, `tag_match_score`, and `best_match`) are deprecated.
+They'll keep using the deprecated language match tables from around CLDR 27.
+
+For a better measure of the closeness of two language codes, use `Language.distance`,
+`tag_distance`, and `closest_match`.
+
+### 'region' renamed to 'territory'
+
+We were always out of step with CLDR here. Following the example of the IANA
+database, we referred to things like the 'US' in 'en-US' as a "region code",
+but the Unicode standards consistently call it a "territory code".
+
+In langcodes 2.0, parameters, dictionary keys, and attributes named `region`
+have been renamed to `territory`. We try to support a few common cases with
+deprecation warnings, such as looking up the `region` property of a Language
+object.
+
+A nice benefit of this is that when a dictionary is displayed with 'language',
+'script', and 'territory' keys in alphabetical order, they are in the same
+order as they are in a language code.
+
+
+
+%package -n python3-langcodes
+Summary: Tools for labeling human languages with IETF language tags
+Provides: python-langcodes
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-langcodes
+# Langcodes: a library for language codes
+
+**langcodes** knows what languages are. It knows the standardized codes that
+refer to them, such as `en` for English, `es` for Spanish and `hi` for Hindi.
+
+These are [IETF language tags][]. You may know them by their old name, ISO 639
+language codes. IETF has done some important things for backward compatibility
+and supporting language variations that you won't find in the ISO standard.
+
+[IETF language tags]: https://www.w3.org/International/articles/language-tags/
+
+It may sound to you like langcodes solves a pretty boring problem. At one
+level, that's right. Sometimes you have a boring problem, and it's great when a
+library solves it for you.
+
+But there's an interesting problem hiding in here. How do you work with
+language codes? How do you know when two different codes represent the same
+thing? How should your code represent relationships between codes, like the
+following?
+
+* `eng` is equivalent to `en`.
+* `fra` and `fre` are both equivalent to `fr`.
+* `en-GB` might be written as `en-gb` or `en_GB`. Or as 'en-UK', which is
+ erroneous, but should be treated as the same.
+* `en-CA` is not exactly equivalent to `en-US`, but it's really, really close.
+* `en-Latn-US` is equivalent to `en-US`, because written English must be written
+ in the Latin alphabet to be understood.
+* The difference between `ar` and `arb` is the difference between "Arabic" and
+ "Modern Standard Arabic", a difference that may not be relevant to you.
+* You'll find Mandarin Chinese tagged as `cmn` on Wiktionary, but many other
+ resources would call the same language `zh`.
+* Chinese is written in different scripts in different territories. Some
+ software distinguishes the script. Other software distinguishes the territory.
+ The result is that `zh-CN` and `zh-Hans` are used interchangeably, as are
+ `zh-TW` and `zh-Hant`, even though occasionally you'll need something
+ different such as `zh-HK` or `zh-Latn-pinyin`.
+* The Indonesian (`id`) and Malaysian (`ms` or `zsm`) languages are mutually
+ intelligible.
+* `jp` is not a language code. (The language code for Japanese is `ja`, but
+ people confuse it with the country code for Japan.)
+
+One way to know is to read IETF standards and Unicode technical reports.
+Another way is to use a library that implements those standards and guidelines
+for you, which langcodes does.
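+
+For instance, langcodes can confirm several of the relationships above
+directly. A small sketch (the distance values come from the CLDR matching
+data described later in this document):
+
+    >>> from langcodes import standardize_tag, tag_distance
+    >>> standardize_tag('eng')    # 'eng' is equivalent to 'en'
+    'en'
+    >>> standardize_tag('en_GB')  # 'en_GB' normalizes to 'en-GB'
+    'en-GB'
+    >>> tag_distance('en-CA', 'en-US') < tag_distance('en', 'ja')
+    True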
+
+When you're working with these short language codes, you may want to see the
+name that the language is called _in_ a language: `fr` is called "French" in
+English. That language doesn't have to be English: `fr` is called "français" in
+French. A supplement to langcodes, [`language_data`][language-data], provides
+this information.
+
+[language-data]: https://github.com/rspeer/language_data
+
+langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released
+as free software under the MIT license.
+
+
+## Standards implemented
+
+Although this is not the only reason to use it, langcodes will make you more
+acronym-compliant.
+
+langcodes implements [BCP 47](http://tools.ietf.org/html/bcp47), the IETF Best
+Current Practices on Tags for Identifying Languages. BCP 47 is also known as
+RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also
+implements recommendations from the [Unicode CLDR](http://cldr.unicode.org).
+
+langcodes can also refer to a database of language properties and names, built
+from Unicode CLDR and the IANA subtag registry, if you install `language_data`.
+
+In summary, langcodes takes language codes and does the Right Thing with them,
+and if you want to know exactly what the Right Thing is, there are some
+documents you can go read.
+
+
+# Documentation
+
+## Standardizing language tags
+
+The `standardize_tag` function standardizes tags, as strings, in several ways.
+
+It replaces overlong tags with their shortest version, and also formats them
+according to the conventions of BCP 47:
+
+ >>> from langcodes import *
+ >>> standardize_tag('eng_US')
+ 'en-US'
+
+It removes script subtags that are redundant with the language:
+
+ >>> standardize_tag('en-Latn')
+ 'en'
+
+It replaces deprecated values with their correct versions, if possible:
+
+ >>> standardize_tag('en-uk')
+ 'en-GB'
+
+Sometimes this involves complex substitutions, such as replacing Serbo-Croatian
+(`sh`) with Serbian in Latin script (`sr-Latn`), or the entire tag `sgn-US`
+with `ase` (American Sign Language).
+
+ >>> standardize_tag('sh-QU')
+ 'sr-Latn-EU'
+
+ >>> standardize_tag('sgn-US')
+ 'ase'
+
+If *macro* is True, it uses macrolanguage codes as a replacement for the most
+common standardized language within that macrolanguage.
+
+ >>> standardize_tag('arb-Arab', macro=True)
+ 'ar'
+
+Even when *macro* is False, it shortens tags that contain both the
+macrolanguage and the language:
+
+ >>> standardize_tag('zh-cmn-hans-cn')
+ 'zh-Hans-CN'
+
+If the tag can't be parsed according to BCP 47, this will raise a
+LanguageTagError (a subclass of ValueError). The first tag below parses,
+despite its unconventional capitalization; the second raises an error because
+its script subtag appears after the territory subtag:
+
+ >>> standardize_tag('spa-latn-mx')
+ 'es-MX'
+
+ >>> standardize_tag('spa-mx-latn')
+ Traceback (most recent call last):
+ ...
+ langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string.
+
+
+## Language objects
+
+This package defines one class, named Language, which contains the results
+of parsing a language tag. Language objects have the following fields,
+any of which may be unspecified:
+
+- *language*: the code for the language itself.
+- *script*: the 4-letter code for the writing system being used.
+- *territory*: the 2-letter or 3-digit code for the country or similar region
+ whose usage of the language appears in this text.
+- *extlangs*: a list of more specific language codes that follow the language
+ code. (This is allowed by the language code syntax, but deprecated.)
+- *variants*: codes for specific variations of language usage that aren't
+ covered by the *script* or *territory* codes.
+- *extensions*: information that's attached to the language code for use in
+ some specific system, such as Unicode collation orders.
+- *private*: a code starting with `x-` that has no defined meaning.
+
+The `Language.get` method converts a string to a Language instance, and the
+`Language.make` method makes a Language instance from its fields. These values
+are cached so that calling `Language.get` or `Language.make` again with the
+same values returns the same object, for efficiency.
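+
+A minimal way to observe that caching (a sketch relying on the behavior just
+described):
+
+    >>> Language.get('en-US') is Language.get('en-US')
+    True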
+
+By default, it will replace non-standard and overlong tags as it interprets
+them. To disable this feature and get the codes that literally appear in the
+language tag, use the *normalize=False* option.
+
+ >>> Language.get('en-Latn-US')
+ Language.make(language='en', script='Latn', territory='US')
+
+ >>> Language.get('sgn-US', normalize=False)
+ Language.make(language='sgn', territory='US')
+
+ >>> Language.get('und')
+ Language.make()
+
+Here are some examples of replacing non-standard tags:
+
+ >>> Language.get('sh-QU')
+ Language.make(language='sr', script='Latn', territory='EU')
+
+ >>> Language.get('sgn-US')
+ Language.make(language='ase')
+
+ >>> Language.get('zh-cmn-Hant')
+ Language.make(language='zh', script='Hant')
+
+Use the `str()` function on a Language object to convert it back to its
+standard string form:
+
+ >>> str(Language.get('sh-QU'))
+ 'sr-Latn-EU'
+
+ >>> str(Language.make(territory='IN'))
+ 'und-IN'
+
+
+### Checking validity
+
+A language code is _valid_ when every part of it is assigned a meaning by IANA.
+That meaning could be "private use".
+
+In langcodes, we check the language subtag, script, territory, and variants for
+validity. We don't check other parts such as extlangs or Unicode extensions.
+
+For example, `ja` is a valid language code, and `jp` is not:
+
+ >>> Language.get('ja').is_valid()
+ True
+
+ >>> Language.get('jp').is_valid()
+ False
+
+The top-level function `tag_is_valid(tag)` is possibly more convenient to use,
+because it can return False even for tags that don't parse:
+
+ >>> tag_is_valid('C')
+ False
+
+If one subtag is invalid, the entire code is invalid:
+
+ >>> tag_is_valid('en-000')
+ False
+
+`iw` is valid, though it's a deprecated alias for `he`:
+
+ >>> tag_is_valid('iw')
+ True
+
+The empty language tag (`und`) is valid:
+
+ >>> tag_is_valid('und')
+ True
+
+Private use codes are valid:
+
+ >>> tag_is_valid('x-other')
+ True
+
+ >>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this')
+ True
+
+Language tags that are very unlikely are still valid:
+
+ >>> tag_is_valid('fr-Cyrl')
+ True
+
+Tags with non-ASCII characters are invalid, because they don't parse:
+
+ >>> tag_is_valid('zh-普通话')
+ False
+
+
+### Getting alpha3 codes
+
+Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the
+variety of human languages by assigning every language a 3-letter code,
+including the ones that already had 2-letter codes.
+
+Unfortunately, this just led to more confusion. Some languages ended up with two
+different 3-letter codes for legacy reasons, such as French, which is `fra` as a
+"terminology" code, and `fre` as a "biblographic" code. And meanwhile, `fr` was
+still a code that you'd be using if you followed ISO 639-1.
+
+In BCP 47, you should use 2-letter codes whenever they're available, and that's
+what langcodes does. Fortunately, all the languages that have two different
+3-letter codes also have a 2-letter code, so if you prefer the 2-letter code,
+you don't have to worry about the distinction.
+
+But some applications want the 3-letter code in particular, so langcodes
+provides a method for getting those, `Language.to_alpha3()`. It returns the
+'terminology' code by default, and passing `variant='B'` returns the
+bibliographic code.
+
+When it succeeds, this method always returns a 3-letter string:
+
+ >>> Language.get('fr').to_alpha3()
+ 'fra'
+ >>> Language.get('fr-CA').to_alpha3()
+ 'fra'
+ >>> Language.get('fr-CA').to_alpha3(variant='B')
+ 'fre'
+ >>> Language.get('de').to_alpha3()
+ 'deu'
+ >>> Language.get('no').to_alpha3()
+ 'nor'
+ >>> Language.get('un').to_alpha3()
+ Traceback (most recent call last):
+ ...
+ LookupError: 'un' is not a known language code, and has no alpha3 code.
+
+For many languages, the terminology and bibliographic alpha3 codes are the same.
+
+ >>> Language.get('en').to_alpha3(variant='T')
+ 'eng'
+ >>> Language.get('en').to_alpha3(variant='B')
+ 'eng'
+
+When you use any of these "overlong" alpha3 codes in langcodes, they normalize
+back to the alpha2 code:
+
+ >>> Language.get('zho')
+ Language.make(language='zh')
+
+
+## Working with language names
+
+The methods in this section require an optional package called `language_data`.
+You can install it with `pip install language_data`, or request the optional
+"data" feature of langcodes with `pip install langcodes[data]`.
+
+The dependency that you put in setup.py should be `langcodes[data]`.
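+
+For example, in a hypothetical `setup.py` (a sketch; the project name below is
+a placeholder, not part of langcodes):
+
+    from setuptools import setup
+
+    setup(
+        name='my-project',  # hypothetical package name
+        install_requires=['langcodes[data]'],
+    )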
+
+### Describing Language objects in natural language
+
+It's often helpful to be able to describe a language code in a way that a user
+(or you) can understand, instead of in inscrutable short codes. The
+`display_name` method lets you describe a Language object *in a language*.
+
+The `.display_name(language, min_score)` method will look up the name of the
+language. The names come from the IANA language tag registry, which is only in
+English, plus CLDR, which names languages in many commonly-used languages.
+
+The default language for naming things is English:
+
+ >>> Language.make(language='fr').display_name()
+ 'French'
+
+ >>> Language.make().display_name()
+ 'Unknown language'
+
+ >>> Language.get('zh-Hans').display_name()
+ 'Chinese (Simplified)'
+
+ >>> Language.get('en-US').display_name()
+ 'English (United States)'
+
+But you can ask for language names in numerous other languages:
+
+ >>> Language.get('fr').display_name('fr')
+ 'français'
+
+ >>> Language.get('fr').display_name('es')
+ 'francés'
+
+ >>> Language.make().display_name('es')
+ 'lengua desconocida'
+
+ >>> Language.get('zh-Hans').display_name('de')
+ 'Chinesisch (Vereinfacht)'
+
+ >>> Language.get('en-US').display_name('zh-Hans')
+ '英语(美国)'
+
+Why does everyone get Slovak and Slovenian confused? Let's ask them.
+
+ >>> Language.get('sl').display_name('sl')
+ 'slovenščina'
+ >>> Language.get('sk').display_name('sk')
+ 'slovenčina'
+ >>> Language.get('sl').display_name('sk')
+ 'slovinčina'
+ >>> Language.get('sk').display_name('sl')
+ 'slovaščina'
+
+If the language has a script or territory code attached to it, these will be
+described in parentheses:
+
+ >>> Language.get('en-US').display_name()
+ 'English (United States)'
+
+Sometimes these can be the result of tag normalization, such as in this case
+where the legacy tag 'sh' becomes 'sr-Latn':
+
+ >>> Language.get('sh').display_name()
+ 'Serbian (Latin)'
+
+ >>> Language.get('sh', normalize=False).display_name()
+ 'Serbo-Croatian'
+
+Naming a language in itself is sometimes a useful thing to do, so the
+`.autonym()` method makes this easy, providing the display name of a language
+in the language itself:
+
+ >>> Language.get('fr').autonym()
+ 'français'
+ >>> Language.get('es').autonym()
+ 'español'
+ >>> Language.get('ja').autonym()
+ '日本語'
+ >>> Language.get('en-AU').autonym()
+ 'English (Australia)'
+ >>> Language.get('sr-Latn').autonym()
+ 'srpski (latinica)'
+ >>> Language.get('sr-Cyrl').autonym()
+ 'српски (ћирилица)'
+
+The names come from the Unicode CLDR data files, and in English they can
+also come from the IANA language subtag registry. Together, they can give
+you language names in the 196 languages that CLDR supports.
+
+
+### Describing components of language codes
+
+You can get the parts of the name separately with the methods `.language_name()`,
+`.script_name()`, and `.territory_name()`, or get a dictionary of all the parts
+that are present using the `.describe()` method. These methods also accept a
+language code for what language they should be described in.
+
+ >>> shaw = Language.get('en-Shaw-GB')
+ >>> shaw.describe('en')
+ {'language': 'English', 'script': 'Shavian', 'territory': 'United Kingdom'}
+
+ >>> shaw.describe('es')
+ {'language': 'inglés', 'script': 'shaviano', 'territory': 'Reino Unido'}
+
+
+### Recognizing language names in natural language
+
+As the reverse of the above operations, you may want to look up a language by
+its name, converting a natural language name such as "French" to a code such as
+'fr'.
+
+The name can be in any language that CLDR supports (see "Ambiguity" below).
+
+ >>> import langcodes
+ >>> langcodes.find('french')
+ Language.make(language='fr')
+
+ >>> langcodes.find('francés')
+ Language.make(language='fr')
+
+However, this method currently ignores the parenthetical expressions that come from
+`.display_name()`:
+
+ >>> langcodes.find('English (Canada)')
+ Language.make(language='en')
+
+There is still room to improve the way that language names are matched, because
+some languages are not consistently named the same way. The method currently
+works with hundreds of language names that are used on Wiktionary.
+
+#### Ambiguity
+
+For the sake of usability, `langcodes.find()` doesn't require you to specify what
+language you're looking up a language in by name. This could potentially lead to
+a conflict: what if the name "X" is language A's name for language B, and language C's
+name for language D?
+
+We can collect the language codes from CLDR and see how many times this
+happens. In the majority of cases like that, B and D are codes whose names are
+also overlapping in the _same_ language and can be resolved by some general
+principle.
+
+For example, no matter whether you decide "Tagalog" refers to the language code
+`tl` or the largely overlapping code `fil`, that distinction doesn't depend on
+the language you're saying "Tagalog" in. We can just return `tl` consistently.
+
+ >>> langcodes.find('tagalog')
+ Language.make(language='tl')
+
+In the few cases of actual interlingual ambiguity, langcodes won't match a result.
+You can pass in a `language=` parameter to say what language the name is in.
+
+For example, there are two distinct languages called "Tonga" in various languages.
+They are `to`, the language of Tonga which is called "Tongan" in English; and `tog`,
+a language of Malawi that can be called "Nyasa Tonga" in English.
+
+ >>> langcodes.find('tongan')
+ Language.make(language='to')
+
+ >>> langcodes.find('nyasa tonga')
+ Language.make(language='tog')
+
+ >>> langcodes.find('tonga')
+ Traceback (most recent call last):
+ ...
+ LookupError: Can't find any language named 'tonga'
+
+ >>> langcodes.find('tonga', language='id')
+ Language.make(language='to')
+
+ >>> langcodes.find('tonga', language='ca')
+ Language.make(language='tog')
+
+Other ambiguous names written in Latin letters are "Kiga", "Mbundu", "Roman", and "Ruanda".
+
+
+## Demographic language data
+
+The `Language.speaking_population()` and `Language.writing_population()`
+methods get Unicode's estimates of how many people in the world use a
+language.
+
+As with the language name data, this requires the optional `language_data`
+package to be installed.
+
+`.speaking_population()` estimates how many people speak a language. It can
+be limited to a particular territory with a territory code (such as a country
+code).
+
+ >>> Language.get('es').speaking_population()
+ 487664083
+
+ >>> Language.get('pt').speaking_population()
+ 237135429
+
+ >>> Language.get('es-BR').speaking_population()
+ 76218
+
+ >>> Language.get('pt-BR').speaking_population()
+ 192661560
+
+ >>> Language.get('vo').speaking_population()
+ 0
+
+Script codes will be ignored, because the script is not involved in speaking:
+
+ >>> Language.get('es-Hant').speaking_population() ==\
+ ... Language.get('es').speaking_population()
+ True
+
+`.writing_population()` estimates how many people write a language.
+
+ >>> all = Language.get('zh').writing_population()
+ >>> all
+ 1240326057
+
+ >>> traditional = Language.get('zh-Hant').writing_population()
+ >>> traditional
+ 37019589
+
+ >>> simplified = Language.get('zh-Hans').writing_population()
+ >>> all == traditional + simplified
+ True
+
+The estimates for "writing population" are often overestimates, as described
+in the [CLDR documentation on territory data][overestimates]. In most cases,
+they are derived from published data about literacy rates in the places where
+those languages are spoken. This doesn't take into account that many literate
+people around the world speak a language that isn't typically written, and
+write in a _different_ language.
+
+[overestimates]: https://unicode-org.github.io/cldr-staging/charts/39/supplemental/territory_language_information.html
+
+Like `.speaking_population()`, this can be limited to a particular territory:
+
+ >>> Language.get('zh-Hant-HK').writing_population()
+ 6439733
+ >>> Language.get('zh-Hans-HK').writing_population()
+ 338933
+
+
+## Comparing and matching languages
+
+The `tag_distance` function returns a number from 0 to 134 indicating the
+distance between the language the user desires and a supported language.
+
+The distance data comes from CLDR v38.1 and involves a lot of judgment calls
+made by the Unicode Consortium.
+
+
+### Distance values
+
+This table summarizes the language distance values:
+
+| Value | Meaning | Example
+| ----: | :------ | :------
+| 0 | These codes represent the same language, possibly after filling in values and normalizing. | Norwegian Bokmål → Norwegian
+| 1-3 | These codes indicate a minor regional difference. | Australian English → British English
+| 4-9 | These codes indicate a significant but unproblematic regional difference. | American English → British English
+| 10-24 | A gray area that depends on your use case. There may be problems with understanding or usability. | Afrikaans → Dutch, Wu Chinese → Mandarin Chinese
+| 25-50 | These languages aren't similar, but there are demographic reasons to expect some intelligibility. | Tamil → English, Marathi → Hindi
+| 51-79 | There are large barriers to understanding. | Japanese → Japanese in Hepburn romanization
+| 80-99 | These are different languages written in the same script. | English → French, Arabic → Urdu
+| 100+ | These languages have nothing particularly in common. | English → Japanese, English → Tamil
+
+See the docstring of `tag_distance` for more explanation and examples.
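+
+As a rough sketch of how the table reads in code (comparisons are used instead
+of exact values, which depend on the CLDR version):
+
+    >>> tag_distance('en', 'en')             # the same language
+    0
+    >>> tag_distance('en-AU', 'en-GB') < 10  # a minor regional difference
+    True
+    >>> tag_distance('en', 'ja') >= 100      # nothing particularly in common
+    True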
+
+
+### Finding the best matching language
+
+Suppose you have software that supports any of the `supported_languages`. The
+user wants to use `desired_language`.
+
+The function `closest_supported_match(desired_language, supported_languages)`
+lets you choose the right language, even if there isn't an exact match.
+It returns the language tag of the best-supported language, even if there
+isn't an exact match.
+
+The `max_distance` parameter lets you set a cutoff on what counts as language
+support. It has a default of 25, a value that is probably okay for simple
+cases of i18n, but you might want to set it lower to require more precision.
+
+ >>> closest_supported_match('fr', ['de', 'en', 'fr'])
+ 'fr'
+
+ >>> closest_supported_match('pt', ['pt-BR', 'pt-PT'])
+ 'pt-BR'
+
+ >>> closest_supported_match('en-AU', ['en-GB', 'en-US'])
+ 'en-GB'
+
+ >>> closest_supported_match('af', ['en', 'nl', 'zu'])
+ 'nl'
+
+ >>> closest_supported_match('und', ['en', 'und'])
+ 'und'
+
+ >>> print(closest_supported_match('af', ['en', 'nl', 'zu'], max_distance=10))
+ None
+
+A similar function is `closest_match(desired_language, supported_languages)`,
+which returns both the best matching language tag and the distance. If there is
+no match, it returns ('und', 1000).
+
+ >>> closest_match('fr', ['de', 'en', 'fr'])
+ ('fr', 0)
+
+ >>> closest_match('sh', ['hr', 'bs', 'sr-Latn', 'sr-Cyrl'])
+ ('sr-Latn', 0)
+
+ >>> closest_match('id', ['zsm', 'mhp'])
+ ('zsm', 14)
+
+ >>> closest_match('ja', ['ja-Latn-hepburn', 'en'])
+ ('und', 1000)
+
+ >>> closest_match('ja', ['ja-Latn-hepburn', 'en'], max_distance=60)
+ ('ja-Latn-hepburn', 50)
+
+## Further API documentation
+
+There are many more methods for manipulating and comparing language codes,
+and you will find them documented thoroughly in [the code itself][code].
+
+The interesting functions all live in this one file, with extensive docstrings
+and annotations. Making a separate Sphinx page out of the docstrings would be
+the traditional thing to do, but here it just seems redundant. You can go read
+the docstrings in context, in their native habitat, and they'll always be up to
+date.
+
+[Code with documentation][code]
+
+[code]: https://github.com/rspeer/langcodes/blob/master/langcodes/__init__.py
+
+# Changelog
+
+## Version 3.3 (November 2021)
+
+- Updated to CLDR v40.
+
+- Updated the IANA subtag registry to version 2021-08-06.
+
+- Bug fix: recognize script codes that appear in the IANA registry even if
+ they're missing from CLDR for some reason. 'cu-Cyrs' is valid, for example.
+
+- Switched the build system from `setuptools` to `poetry`.
+
+Until PEP 660 is better supported, install the package in editable mode with
+`poetry install` instead of `pip install -e .`.
+
+## Version 3.2 (October 2021)
+
+- Supports Python 3.6 through 3.10.
+
+- Added the top-level function `tag_is_valid(tag)`, for determining if a string
+ is a valid language tag without having to parse it first.
+
+- Added the top-level function `closest_supported_match(desired, supported)`,
+ which is similar to `closest_match` but with a simpler return value. It
+ returns the language tag of the closest match, or None if no match is close
+ enough.
+
+- Bug fix: a lot of well-formed but invalid language codes appeared to be
+ valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of
+ a subtag. The validity regex is now required to match completely.
+
+- Bug fixes that address some edge cases of validity:
+
+ - A language tag that is entirely private use, like 'x-private', is valid
+ - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc',
+ is invalid
+ - A language tag that uses the same variant twice, like 'de-1901-1901', is
+ invalid
+ - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid
+
+- Updated dependencies so they are compatible with Python 3.10, including
+ switching back from `marisa-trie-m` to `marisa-trie` in `language_data`.
+
+- In bugfix release 3.2.1, corrected cases where the parser accepted
+ ill-formed language tags:
+
+ - All subtags must be made of between 1 and 8 alphanumeric ASCII characters
+ - Tags with two extension 'singletons' in a row (`en-a-b-ccc`) should be
+ rejected
+
+## Version 3.1 (February 2021)
+
+- Added the `Language.to_alpha3()` method, for getting a three-letter code for a
+ language according to ISO 639-2.
+
+- Updated the type annotations from obiwan-style to mypy-style.
+
+
+## Version 3.0 (February 2021)
+
+- Moved bulky data, particularly language names, into a separate
+ `language_data` package. In situations where the data isn't needed,
+ `langcodes` becomes a smaller, pure-Python package with no dependencies.
+
+- Language codes where the language segment is more than 4 letters no longer
+  parse: Language.get('nonsense') now raises an error.
+
+ (This is technically stricter than the parse rules of BCP 47, but there are
+ no valid language codes of this form and there should never be any. An
+ attempt to parse a language code with 5-8 letters is most likely a mistake or
+ an attempt to make up a code.)
+
+- Added a method for checking the validity of a language code.
+
+- Added methods for estimating language population.
+
+- Updated to CLDR 38.1, which includes differences in language matching.
+
+- Tested on Python 3.6 through 3.9; no longer tested on Python 3.5.
+
+
+## Version 2.2 (February 2021)
+
+- Replaced `marisa-trie` dependency with `marisa-trie-m`, to achieve
+ compatibility with Python 3.9.
+
+
+## Version 2.1 (June 2020)
+
+- Added the `display_name` method to be a more intuitive way to get a string
+ describing a language code, and made the `autonym` method use it instead of
+ `language_name`.
+
+- Updated to CLDR v37.
+
+- Previously, some attempts to get the name of a language would return its
+ language code instead, perhaps because the name was being requested in a
+ language for which CLDR doesn't have name data. This is unfortunate because
+ names and codes should not be interchangeable.
+
+  Now we fall back on English names instead, which exist for all IANA codes.
+ If the code is unknown, we return a string such as "Unknown language [xx]".
+
+
+## Version 2.0 (April 2020)
+
+Version 2.0 involves some significant changes that may break compatibility with 1.4,
+in addition to updating to version 36.1 of the Unicode CLDR data and the April 2020
+version of the IANA subtag registry.
+
+This version requires Python 3.5 or later.
+
+### Match scores replaced with distances
+
+Originally, the goodness of a match between two different language codes was defined
+in terms of a "match score" with a maximum of 100. Around 2016, Unicode started
+replacing this with a different measure, the "match distance", which was defined
+much more clearly, but we had to keep using the "match score".
+
+As of langcodes version 2.0, the "score" functions (such as
+`Language.match_score`, `tag_match_score`, and `best_match`) are deprecated.
+They'll keep using the deprecated language match tables from around CLDR 27.
+
+For a better measure of the closeness of two language codes, use `Language.distance`,
+`tag_distance`, and `closest_match`.
+
+### 'region' renamed to 'territory'
+
+We were always out of step with CLDR here. Following the example of the IANA
+database, we referred to things like the 'US' in 'en-US' as a "region code",
+but the Unicode standards consistently call it a "territory code".
+
+In langcodes 2.0, parameters, dictionary keys, and attributes named `region`
+have been renamed to `territory`. We try to support a few common cases with
+deprecation warnings, such as looking up the `region` property of a Language
+object.
+
+A nice benefit of this is that when a dictionary is displayed with 'language',
+'script', and 'territory' keys in alphabetical order, they are in the same
+order as they are in a language code.
+
+
+
+%package help
+Summary: Development documents and examples for langcodes
+Provides: python3-langcodes-doc
+%description help
+# Langcodes: a library for language codes
+
+**langcodes** knows what languages are. It knows the standardized codes that
+refer to them, such as `en` for English, `es` for Spanish and `hi` for Hindi.
+
+These are [IETF language tags][]. You may know them by their old name, ISO 639
+language codes. IETF has done some important things for backward compatibility
+and supporting language variations that you won't find in the ISO standard.
+
+[IETF language tags]: https://www.w3.org/International/articles/language-tags/
+
+It may sound to you like langcodes solves a pretty boring problem. At one
+level, that's right. Sometimes you have a boring problem, and it's great when a
+library solves it for you.
+
+But there's an interesting problem hiding in here. How do you work with
+language codes? How do you know when two different codes represent the same
+thing? How should your code represent relationships between codes, like the
+following?
+
+* `eng` is equivalent to `en`.
+* `fra` and `fre` are both equivalent to `fr`.
+* `en-GB` might be written as `en-gb` or `en_GB`. Or as 'en-UK', which is
+ erroneous, but should be treated as the same.
+* `en-CA` is not exactly equivalent to `en-US`, but it's really, really close.
+* `en-Latn-US` is equivalent to `en-US`, because written English must be written
+ in the Latin alphabet to be understood.
+* The difference between `ar` and `arb` is the difference between "Arabic" and
+ "Modern Standard Arabic", a difference that may not be relevant to you.
+* You'll find Mandarin Chinese tagged as `cmn` on Wiktionary, but many other
+ resources would call the same language `zh`.
+* Chinese is written in different scripts in different territories. Some
+ software distinguishes the script. Other software distinguishes the territory.
+ The result is that `zh-CN` and `zh-Hans` are used interchangeably, as are
+ `zh-TW` and `zh-Hant`, even though occasionally you'll need something
+ different such as `zh-HK` or `zh-Latn-pinyin`.
+* The Indonesian (`id`) and Malaysian (`ms` or `zsm`) languages are mutually
+ intelligible.
+* `jp` is not a language code. (The language code for Japanese is `ja`, but
+ people confuse it with the country code for Japan.)
+
+One way to know is to read IETF standards and Unicode technical reports.
+Another way is to use a library that implements those standards and guidelines
+for you, which langcodes does.
+
+When you're working with these short language codes, you may want to see the
+name that the language is called _in_ a language: `fr` is called "French" in
+English. That language doesn't have to be English: `fr` is called "français" in
+French. A supplement to langcodes, [`language_data`][language-data], provides
+this information.
+
+[language-data]: https://github.com/rspeer/language_data
+
+langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released
+as free software under the MIT license.
+
+
+## Standards implemented
+
+Although this is not the only reason to use it, langcodes will make you more
+acronym-compliant.
+
+langcodes implements [BCP 47](http://tools.ietf.org/html/bcp47), the IETF Best
+Current Practices on Tags for Identifying Languages, also known as RFC 5646.
+BCP 47 subsumes ISO 639 and is backward compatible with it. langcodes also
+implements recommendations from the [Unicode CLDR](http://cldr.unicode.org).
+
+langcodes can also refer to a database of language properties and names, built
+from Unicode CLDR and the IANA subtag registry, if you install `language_data`.
+
+In summary, langcodes takes language codes and does the Right Thing with them,
+and if you want to know exactly what the Right Thing is, there are some
+documents you can go read.
+
+
+# Documentation
+
+## Standardizing language tags
+
+The `standardize_tag` function standardizes tags, as strings, in several ways.
+
+It replaces overlong tags with their shortest version, and also formats them
+according to the conventions of BCP 47:
+
+ >>> from langcodes import *
+ >>> standardize_tag('eng_US')
+ 'en-US'
+
+It removes script subtags that are redundant with the language:
+
+ >>> standardize_tag('en-Latn')
+ 'en'
+
+It replaces deprecated values with their correct versions, if possible:
+
+ >>> standardize_tag('en-uk')
+ 'en-GB'
+
+Sometimes this involves complex substitutions, such as replacing Serbo-Croatian
+(`sh`) with Serbian in Latin script (`sr-Latn`), or the entire tag `sgn-US`
+with `ase` (American Sign Language).
+
+ >>> standardize_tag('sh-QU')
+ 'sr-Latn-EU'
+
+ >>> standardize_tag('sgn-US')
+ 'ase'
+
+If *macro* is True, it uses macrolanguage codes as a replacement for the most
+common standardized language within that macrolanguage.
+
+ >>> standardize_tag('arb-Arab', macro=True)
+ 'ar'
+
+Even when *macro* is False, it shortens tags that contain both the
+macrolanguage and the language:
+
+ >>> standardize_tag('zh-cmn-hans-cn')
+ 'zh-Hans-CN'
+
+A tag whose subtags are in the standard order can be parsed even with
+nonstandard capitalization. But if the tag can't be parsed according to BCP 47
+at all, such as when subtags appear out of order, this will raise a
+LanguageTagError (a subclass of ValueError):
+
+ >>> standardize_tag('spa-latn-mx')
+ 'es-MX'
+
+ >>> standardize_tag('spa-mx-latn')
+ Traceback (most recent call last):
+ ...
+ langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string.
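+
+In application code, you would typically catch this error rather than letting
+it propagate. A minimal sketch, using the `langcodes.tag_parser.LanguageTagError`
+class shown in the traceback above:
+
+    >>> from langcodes.tag_parser import LanguageTagError
+    >>> try:
+    ...     standardize_tag('spa-mx-latn')
+    ... except LanguageTagError:
+    ...     print('not a parseable language tag')
+    not a parseable language tag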
+
+
+## Language objects
+
+This package defines one class, named Language, which contains the results
+of parsing a language tag. Language objects have the following fields,
+any of which may be unspecified:
+
+- *language*: the code for the language itself.
+- *script*: the 4-letter code for the writing system being used.
+- *territory*: the 2-letter or 3-digit code for the country or similar region
+ whose usage of the language appears in this text.
+- *extlangs*: a list of more specific language codes that follow the language
+ code. (This is allowed by the language code syntax, but deprecated.)
+- *variants*: codes for specific variations of language usage that aren't
+ covered by the *script* or *territory* codes.
+- *extensions*: information that's attached to the language code for use in
+ some specific system, such as Unicode collation orders.
+- *private*: a code starting with `x-` that has no defined meaning.
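+
+For example, once a tag has been parsed with `Language.get` (described next),
+each of these fields is available as an attribute. A quick sketch, reusing a
+tag from the examples below:
+
+    >>> t = Language.get('en-Latn-US')
+    >>> (t.language, t.script, t.territory)
+    ('en', 'Latn', 'US')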
+
+The `Language.get` method converts a string to a Language instance, and the
+`Language.make` method makes a Language instance from its fields. These values
+are cached so that calling `Language.get` or `Language.make` again with the
+same values returns the same object, for efficiency.
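+
+One consequence of this caching, shown as a quick sketch that relies only on
+the behavior just described:
+
+    >>> Language.get('en-US') is Language.get('en-US')
+    True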
+
+By default, it will replace non-standard and overlong tags as it interprets
+them. To disable this feature and get the codes that literally appear in the
+language tag, use the *normalize=False* option.
+
+ >>> Language.get('en-Latn-US')
+ Language.make(language='en', script='Latn', territory='US')
+
+ >>> Language.get('sgn-US', normalize=False)
+ Language.make(language='sgn', territory='US')
+
+ >>> Language.get('und')
+ Language.make()
+
+Here are some examples of replacing non-standard tags:
+
+ >>> Language.get('sh-QU')
+ Language.make(language='sr', script='Latn', territory='EU')
+
+ >>> Language.get('sgn-US')
+ Language.make(language='ase')
+
+ >>> Language.get('zh-cmn-Hant')
+ Language.make(language='zh', script='Hant')
+
+Use the `str()` function on a Language object to convert it back to its
+standard string form:
+
+ >>> str(Language.get('sh-QU'))
+ 'sr-Latn-EU'
+
+ >>> str(Language.make(territory='IN'))
+ 'und-IN'
+
+
+### Checking validity
+
+A language code is _valid_ when every part of it is assigned a meaning by IANA.
+That meaning could be "private use".
+
+In langcodes, we check the language subtag, script, territory, and variants for
+validity. We don't check other parts such as extlangs or Unicode extensions.
+
+For example, `ja` is a valid language code, and `jp` is not:
+
+ >>> Language.get('ja').is_valid()
+ True
+
+ >>> Language.get('jp').is_valid()
+ False
+
+The top-level function `tag_is_valid(tag)` is possibly more convenient to use,
+because it can return False even for tags that don't parse:
+
+ >>> tag_is_valid('C')
+ False
+
+If one subtag is invalid, the entire code is invalid:
+
+ >>> tag_is_valid('en-000')
+ False
+
+`iw` is valid, though it's a deprecated alias for `he`:
+
+ >>> tag_is_valid('iw')
+ True
+
+The empty language tag (`und`) is valid:
+
+ >>> tag_is_valid('und')
+ True
+
+Private use codes are valid:
+
+ >>> tag_is_valid('x-other')
+ True
+
+ >>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this')
+ True
+
+Language tags that are very unlikely are still valid:
+
+ >>> tag_is_valid('fr-Cyrl')
+ True
+
+Tags with non-ASCII characters are invalid, because they don't parse:
+
+ >>> tag_is_valid('zh-普通话')
+ False
+
+
+### Getting alpha3 codes
+
+Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the
+variety of human languages by assigning every language a 3-letter code,
+including the ones that already had 2-letter codes.
+
+Unfortunately, this just led to more confusion. Some languages ended up with two
+different 3-letter codes for legacy reasons, such as French, which is `fra` as a
+"terminology" code, and `fre` as a "biblographic" code. And meanwhile, `fr` was
+still a code that you'd be using if you followed ISO 639-1.
+
+In BCP 47, you should use 2-letter codes whenever they're available, and that's
+what langcodes does. Fortunately, all the languages that have two different
+3-letter codes also have a 2-letter code, so if you prefer the 2-letter code,
+you don't have to worry about the distinction.
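+
+For instance, both of French's 3-letter codes normalize to the same 2-letter
+code (a quick check, relying on the normalization described earlier):
+
+    >>> standardize_tag('fra')
+    'fr'
+    >>> standardize_tag('fre')
+    'fr'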
+
+But some applications want the 3-letter code in particular, so langcodes
+provides a method for getting those, `Language.to_alpha3()`. It returns the
+'terminology' code by default, and passing `variant='B'` returns the
+bibliographic code.
+
+When this method succeeds, it always returns a 3-letter string; for a code
+with no alpha3 equivalent, it raises a LookupError:
+
+ >>> Language.get('fr').to_alpha3()
+ 'fra'
+ >>> Language.get('fr-CA').to_alpha3()
+ 'fra'
+ >>> Language.get('fr-CA').to_alpha3(variant='B')
+ 'fre'
+ >>> Language.get('de').to_alpha3()
+ 'deu'
+ >>> Language.get('no').to_alpha3()
+ 'nor'
+ >>> Language.get('un').to_alpha3()
+ Traceback (most recent call last):
+ ...
+ LookupError: 'un' is not a known language code, and has no alpha3 code.
+
+For many languages, the terminology and bibliographic alpha3 codes are the same.
+
+ >>> Language.get('en').to_alpha3(variant='T')
+ 'eng'
+ >>> Language.get('en').to_alpha3(variant='B')
+ 'eng'
+
+When you use any of these "overlong" alpha3 codes in langcodes, they normalize
+back to the alpha2 code:
+
+ >>> Language.get('zho')
+ Language.make(language='zh')
+
+
+## Working with language names
+
+The methods in this section require an optional package called `language_data`.
+You can install it with `pip install language_data`, or request the optional
+"data" feature of langcodes with `pip install langcodes[data]`.
+
+The dependency that you put in setup.py should be `langcodes[data]`.
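+
+For example, in a `setup.py` (a minimal sketch; the package name is a
+hypothetical placeholder):
+
+    from setuptools import setup
+
+    setup(
+        name='my-application',  # hypothetical project name
+        install_requires=['langcodes[data]'],
+    )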
+
+### Describing Language objects in natural language
+
+It's often helpful to be able to describe a language code in a way that a user
+(or you) can understand, instead of in inscrutable short codes. The
+`display_name` method lets you describe a Language object *in a language*.
+
+The `.display_name(language, max_distance)` method will look up the name of
+the language. The names come from the IANA language tag registry, which is
+only in English, plus CLDR, which names languages in many commonly-used
+languages.
+
+The default language for naming things is English:
+
+ >>> Language.make(language='fr').display_name()
+ 'French'
+
+ >>> Language.make().display_name()
+ 'Unknown language'
+
+ >>> Language.get('zh-Hans').display_name()
+ 'Chinese (Simplified)'
+
+ >>> Language.get('en-US').display_name()
+ 'English (United States)'
+
+But you can ask for language names in numerous other languages:
+
+ >>> Language.get('fr').display_name('fr')
+ 'français'
+
+ >>> Language.get('fr').display_name('es')
+ 'francés'
+
+ >>> Language.make().display_name('es')
+ 'lengua desconocida'
+
+ >>> Language.get('zh-Hans').display_name('de')
+ 'Chinesisch (Vereinfacht)'
+
+ >>> Language.get('en-US').display_name('zh-Hans')
+ '英语(美国)'
+
+Why does everyone get Slovak and Slovenian confused? Let's ask them.
+
+ >>> Language.get('sl').display_name('sl')
+ 'slovenščina'
+ >>> Language.get('sk').display_name('sk')
+ 'slovenčina'
+ >>> Language.get('sl').display_name('sk')
+ 'slovinčina'
+ >>> Language.get('sk').display_name('sl')
+ 'slovaščina'
+
+If the language has a script or territory code attached to it, these will be
+described in parentheses:
+
+ >>> Language.get('en-US').display_name()
+ 'English (United States)'
+
+Sometimes these can be the result of tag normalization, such as in this case
+where the legacy tag 'sh' becomes 'sr-Latn':
+
+ >>> Language.get('sh').display_name()
+ 'Serbian (Latin)'
+
+ >>> Language.get('sh', normalize=False).display_name()
+ 'Serbo-Croatian'
+
+Naming a language in itself is sometimes a useful thing to do, so the
+`.autonym()` method makes this easy, providing the display name of a language
+in the language itself:
+
+ >>> Language.get('fr').autonym()
+ 'français'
+ >>> Language.get('es').autonym()
+ 'español'
+ >>> Language.get('ja').autonym()
+ '日本語'
+ >>> Language.get('en-AU').autonym()
+ 'English (Australia)'
+ >>> Language.get('sr-Latn').autonym()
+ 'srpski (latinica)'
+ >>> Language.get('sr-Cyrl').autonym()
+ 'српски (ћирилица)'
+
+The names come from the Unicode CLDR data files, and in English they can
+also come from the IANA language subtag registry. Together, they can give
+you language names in the 196 languages that CLDR supports.
+
+
+### Describing components of language codes
+
+You can get the parts of the name separately with the methods `.language_name()`,
+`.script_name()`, and `.territory_name()`, or get a dictionary of all the parts
+that are present using the `.describe()` method. These methods also accept a
+language code for what language they should be described in.
+
+ >>> shaw = Language.get('en-Shaw-GB')
+ >>> shaw.describe('en')
+ {'language': 'English', 'script': 'Shavian', 'territory': 'United Kingdom'}
+
+ >>> shaw.describe('es')
+ {'language': 'inglés', 'script': 'shaviano', 'territory': 'Reino Unido'}
+
+
+### Recognizing language names in natural language
+
+As the reverse of the above operations, you may want to look up a language by
+its name, converting a natural language name such as "French" to a code such as
+'fr'.
+
+The name can be in any language that CLDR supports (see "Ambiguity" below).
+
+ >>> import langcodes
+ >>> langcodes.find('french')
+ Language.make(language='fr')
+
+ >>> langcodes.find('francés')
+ Language.make(language='fr')
+
+However, this method currently ignores the parenthetical expressions that come from
+`.display_name()`:
+
+ >>> langcodes.find('English (Canada)')
+ Language.make(language='en')
+
+There is still room to improve the way that language names are matched,
+because some languages are not named consistently. The method currently works
+with hundreds of language names that are used on Wiktionary.
+
+#### Ambiguity
+
+For the sake of usability, `langcodes.find()` doesn't require you to specify
+which language the name you're looking up is written in. This could
+potentially lead to a conflict: what if name "X" is language A's name for
+language B, and language C's name for language D?
+
+We can collect the language codes from CLDR and see how often this happens.
+In the majority of such cases, B and D are codes whose names also overlap in
+the _same_ language, so the conflict can be resolved by some general
+principle.
+
+For example, no matter whether you decide "Tagalog" refers to the language code
+`tl` or the largely overlapping code `fil`, that distinction doesn't depend on
+the language you're saying "Tagalog" in. We can just return `tl` consistently.
+
+ >>> langcodes.find('tagalog')
+ Language.make(language='tl')
+
+In the few cases of actual interlingual ambiguity, langcodes won't return a
+match. You can pass in a `language=` parameter to say what language the name
+is in.
+
+For example, there are two distinct languages called "Tonga" in various languages.
+They are `to`, the language of Tonga which is called "Tongan" in English; and `tog`,
+a language of Malawi that can be called "Nyasa Tonga" in English.
+
+ >>> langcodes.find('tongan')
+ Language.make(language='to')
+
+ >>> langcodes.find('nyasa tonga')
+ Language.make(language='tog')
+
+ >>> langcodes.find('tonga')
+ Traceback (most recent call last):
+ ...
+ LookupError: Can't find any language named 'tonga'
+
+ >>> langcodes.find('tonga', language='id')
+ Language.make(language='to')
+
+ >>> langcodes.find('tonga', language='ca')
+ Language.make(language='tog')
+
+Other ambiguous names written in Latin letters are "Kiga", "Mbundu", "Roman", and "Ruanda".
+
+
+## Demographic language data
+
+The `Language.speaking_population()` and `Language.writing_population()`
+methods get Unicode's estimates of how many people in the world use a
+language.
+
+As with the language name data, this requires the optional `language_data`
+package to be installed.
+
+`.speaking_population()` estimates how many people speak a language. It can
+be limited to a particular territory with a territory code (such as a country
+code).
+
+ >>> Language.get('es').speaking_population()
+ 487664083
+
+ >>> Language.get('pt').speaking_population()
+ 237135429
+
+ >>> Language.get('es-BR').speaking_population()
+ 76218
+
+ >>> Language.get('pt-BR').speaking_population()
+ 192661560
+
+ >>> Language.get('vo').speaking_population()
+ 0
+
+Script codes will be ignored, because the script is not involved in speaking:
+
+ >>> Language.get('es-Hant').speaking_population() ==\
+ ... Language.get('es').speaking_population()
+ True
+
+`.writing_population()` estimates how many people write a language.
+
+    >>> total = Language.get('zh').writing_population()
+    >>> total
+    1240326057
+
+    >>> traditional = Language.get('zh-Hant').writing_population()
+    >>> traditional
+    37019589
+
+    >>> simplified = Language.get('zh-Hans').writing_population()
+    >>> total == traditional + simplified
+    True
+
+The estimates for "writing population" are often overestimates, as described
+in the [CLDR documentation on territory data][overestimates]. In most cases,
+they are derived from published data about literacy rates in the places where
+those languages are spoken. This doesn't take into account that many literate
+people around the world speak a language that isn't typically written, and
+write in a _different_ language.
+
+[overestimates]: https://unicode-org.github.io/cldr-staging/charts/39/supplemental/territory_language_information.html
+
+Like `.speaking_population()`, this can be limited to a particular territory:
+
+ >>> Language.get('zh-Hant-HK').writing_population()
+ 6439733
+ >>> Language.get('zh-Hans-HK').writing_population()
+ 338933
+
+
+## Comparing and matching languages
+
+The `tag_distance` function returns a number from 0 to 134 indicating the
+distance between the language the user desires and a supported language.
+
+The distance data comes from CLDR v38.1 and involves a lot of judgment calls
+made by the Unicode Consortium.
+
+
+### Distance values
+
+This table summarizes the language distance values:
+
+| Value | Meaning | Example
+| ----: | :------ | :------
+| 0 | These codes represent the same language, possibly after filling in values and normalizing. | Norwegian Bokmål → Norwegian
+| 1-3 | These codes indicate a minor regional difference. | Australian English → British English
+| 4-9 | These codes indicate a significant but unproblematic regional difference. | American English → British English
+| 10-24 | A gray area that depends on your use case. There may be problems with understanding or usability. | Afrikaans → Dutch, Wu Chinese → Mandarin Chinese
+| 25-50 | These languages aren't similar, but there are demographic reasons to expect some intelligibility. | Tamil → English, Marathi → Hindi
+| 51-79 | There are large barriers to understanding. | Japanese → Japanese in Hepburn romanization
+| 80-99 | These are different languages written in the same script. | English → French, Arabic → Urdu
+| 100+ | These languages have nothing particularly in common. | English → Japanese, English → Tamil
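+
+As a quick sanity check against the table's own examples (a sketch; exact
+values depend on the CLDR release that langcodes bundles):
+
+    >>> from langcodes import tag_distance
+    >>> tag_distance('en-GB', 'en-GB')
+    0
+    >>> 1 <= tag_distance('en-AU', 'en-GB') <= 3
+    True
+    >>> tag_distance('en', 'ja') >= 100
+    True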
+
+See the docstring of `tag_distance` for more explanation and examples.
+
+
+### Finding the best matching language
+
+Suppose you have software that supports any of the `supported_languages`. The
+user wants to use `desired_language`.
+
+The function `closest_supported_match(desired_language, supported_languages)`
+lets you choose the right language even when there isn't an exact match: it
+returns the language tag of the best-supported language.
+
+The `max_distance` parameter lets you set a cutoff on what counts as language
+support. It has a default of 25, a value that is probably okay for simple
+cases of i18n, but you might want to set it lower to require more precision.
+
+ >>> closest_supported_match('fr', ['de', 'en', 'fr'])
+ 'fr'
+
+ >>> closest_supported_match('pt', ['pt-BR', 'pt-PT'])
+ 'pt-BR'
+
+ >>> closest_supported_match('en-AU', ['en-GB', 'en-US'])
+ 'en-GB'
+
+ >>> closest_supported_match('af', ['en', 'nl', 'zu'])
+ 'nl'
+
+ >>> closest_supported_match('und', ['en', 'und'])
+ 'und'
+
+ >>> print(closest_supported_match('af', ['en', 'nl', 'zu'], max_distance=10))
+ None
+
+A similar function is `closest_match(desired_language, supported_languages)`,
+which returns both the best matching language tag and the distance. If there is
+no match, it returns ('und', 1000).
+
+ >>> closest_match('fr', ['de', 'en', 'fr'])
+ ('fr', 0)
+
+ >>> closest_match('sh', ['hr', 'bs', 'sr-Latn', 'sr-Cyrl'])
+ ('sr-Latn', 0)
+
+ >>> closest_match('id', ['zsm', 'mhp'])
+ ('zsm', 14)
+
+ >>> closest_match('ja', ['ja-Latn-hepburn', 'en'])
+ ('und', 1000)
+
+ >>> closest_match('ja', ['ja-Latn-hepburn', 'en'], max_distance=60)
+ ('ja-Latn-hepburn', 50)
+
+## Further API documentation
+
+There are many more methods for manipulating and comparing language codes,
+and you will find them documented thoroughly in [the code itself][code].
+
+The interesting functions all live in this one file, with extensive docstrings
+and annotations. Making a separate Sphinx page out of the docstrings would be
+the traditional thing to do, but here it just seems redundant. You can go read
+the docstrings in context, in their native habitat, and they'll always be up to
+date.
+
+[Code with documentation][code]
+
+[code]: https://github.com/rspeer/langcodes/blob/master/langcodes/__init__.py
+
+# Changelog
+
+## Version 3.3 (November 2021)
+
+- Updated to CLDR v40.
+
+- Updated the IANA subtag registry to version 2021-08-06.
+
+- Bug fix: recognize script codes that appear in the IANA registry even if
+ they're missing from CLDR for some reason. 'cu-Cyrs' is valid, for example.
+
+- Switched the build system from `setuptools` to `poetry`.
+
+Until PEP 660 is better supported, install the package in editable mode with
+`poetry install` instead of `pip install -e .`.
+
+## Version 3.2 (October 2021)
+
+- Supports Python 3.6 through 3.10.
+
+- Added the top-level function `tag_is_valid(tag)`, for determining if a string
+ is a valid language tag without having to parse it first.
+
+- Added the top-level function `closest_supported_match(desired, supported)`,
+ which is similar to `closest_match` but with a simpler return value. It
+ returns the language tag of the closest match, or None if no match is close
+ enough.
+
+- Bug fix: a lot of well-formed but invalid language codes appeared to be
+ valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of
+ a subtag. The validity regex is now required to match completely.
+
+- Bug fixes that address some edge cases of validity:
+
+ - A language tag that is entirely private use, like 'x-private', is valid
+ - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc',
+ is invalid
+ - A language tag that uses the same variant twice, like 'de-1901-1901', is
+ invalid
+ - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid
+
+- Updated dependencies so they are compatible with Python 3.10, including
+ switching back from `marisa-trie-m` to `marisa-trie` in `language_data`.
+
+- Bugfix release 3.2.1 corrected cases where the parser accepted ill-formed
+  language tags:
+
+ - All subtags must be made of between 1 and 8 alphanumeric ASCII characters
+ - Tags with two extension 'singletons' in a row (`en-a-b-ccc`) should be
+ rejected
+
+## Version 3.1 (February 2021)
+
+- Added the `Language.to_alpha3()` method, for getting a three-letter code for a
+ language according to ISO 639-2.
+
+- Updated the type annotations from obiwan-style to mypy-style.
+
+
+## Version 3.0 (February 2021)
+
+- Moved bulky data, particularly language names, into a separate
+ `language_data` package. In situations where the data isn't needed,
+ `langcodes` becomes a smaller, pure-Python package with no dependencies.
+
+- Language codes where the language segment is more than 4 letters no longer
+  parse: Language.get('nonsense') now raises an error.
+
+ (This is technically stricter than the parse rules of BCP 47, but there are
+ no valid language codes of this form and there should never be any. An
+ attempt to parse a language code with 5-8 letters is most likely a mistake or
+ an attempt to make up a code.)
+
+- Added a method for checking the validity of a language code.
+
+- Added methods for estimating language population.
+
+- Updated to CLDR 38.1, which includes differences in language matching.
+
+- Tested on Python 3.6 through 3.9; no longer tested on Python 3.5.
+
+
+## Version 2.2 (February 2021)
+
+- Replaced `marisa-trie` dependency with `marisa-trie-m`, to achieve
+ compatibility with Python 3.9.
+
+
+## Version 2.1 (June 2020)
+
+- Added the `display_name` method as a more intuitive way to get a string
+  describing a language code, and made the `autonym` method use it instead of
+  `language_name`.
+
+- Updated to CLDR v37.
+
+- Previously, some attempts to get the name of a language would return its
+ language code instead, perhaps because the name was being requested in a
+ language for which CLDR doesn't have name data. This is unfortunate because
+ names and codes should not be interchangeable.
+
+  Now we fall back on English names instead, which exist for all IANA codes.
+ If the code is unknown, we return a string such as "Unknown language [xx]".
+
+
+## Version 2.0 (April 2020)
+
+Version 2.0 involves some significant changes that may break compatibility with 1.4,
+in addition to updating to version 36.1 of the Unicode CLDR data and the April 2020
+version of the IANA subtag registry.
+
+This version requires Python 3.5 or later.
+
+### Match scores replaced with distances
+
+Originally, the goodness of a match between two different language codes was defined
+in terms of a "match score" with a maximum of 100. Around 2016, Unicode started
+replacing this with a different measure, the "match distance", which was defined
+much more clearly, but we had to keep using the "match score".
+
+As of langcodes version 2.0, the "score" functions (such as
+`Language.match_score`, `tag_match_score`, and `best_match`) are deprecated.
+They'll keep using the deprecated language match tables from around CLDR 27.
+
+For a better measure of the closeness of two language codes, use `Language.distance`,
+`tag_distance`, and `closest_match`.
+
+### 'region' renamed to 'territory'
+
+We were always out of step with CLDR here. Following the example of the IANA
+database, we referred to things like the 'US' in 'en-US' as a "region code",
+but the Unicode standards consistently call it a "territory code".
+
+In langcodes 2.0, parameters, dictionary keys, and attributes named `region`
+have been renamed to `territory`. We try to support a few common cases with
+deprecation warnings, such as looking up the `region` property of a Language
+object.
+
+A nice benefit of this is that when a dictionary is displayed with 'language',
+'script', and 'territory' keys in alphabetical order, they are in the same
+order as they are in a language code.
+
+
+
+%prep
+%autosetup -n langcodes-3.3.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-langcodes -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 3.3.0-1
+- Package Spec generated