%global _empty_manifest_terminate_build 0
Name:		python-keras-bert
Version:	0.89.0
Release:	1
Summary:	BERT implemented in Keras
License:	MIT
URL:		https://github.com/CyberZHG/keras-bert
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/74/0a/ffc65dfa4b31942ee8348e0026d2a7ee57e1769e9266c677141a3e2cac9c/keras-bert-0.89.0.tar.gz
BuildArch:	noarch

%description
# Keras BERT

[![Version](https://img.shields.io/pypi/v/keras-bert.svg)](https://pypi.org/project/keras-bert/)
![License](https://img.shields.io/pypi/l/keras-bert.svg)

\[[中文](https://github.com/CyberZHG/keras-bert/blob/master/README.zh-CN.md)|[English](https://github.com/CyberZHG/keras-bert/blob/master/README.md)\]

Implementation of [BERT](https://arxiv.org/pdf/1810.04805.pdf). Official pre-trained models can be loaded for feature extraction and prediction.

## Install

```bash
pip install keras-bert
```

## Usage

* [Load Official Pre-trained Models](#Load-Official-Pre-trained-Models)
* [Tokenizer](#Tokenizer)
* [Train & Use](#Train-&-Use)
* [Use Warmup](#Use-Warmup)
* [Download Pretrained Checkpoints](#Download-Pretrained-Checkpoints)
* [Extract Features](#Extract-Features)

### External Links

* [Kashgari is a production-ready NLP transfer learning framework for text-labeling and text-classification](https://github.com/BrikerMan/Kashgari)
* [Keras ALBERT](https://github.com/TinkerMob/keras_albert_model)

### Load Official Pre-trained Models

In the [feature extraction demo](./demo/load_model/load_and_extract.py), you should be able to get the same extraction results as the official model `chinese_L-12_H-768_A-12`. And in the [prediction demo](./demo/load_model/load_and_predict.py), the missing word in the sentence can be predicted.
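
Outside the demos, a checkpoint can be loaded directly with `load_trained_model_from_checkpoint`. A minimal sketch follows; the checkpoint directory is a placeholder for wherever you unpacked an official model, and the file names are the ones shipped in the official archives:

```python
import os
from keras_bert import load_trained_model_from_checkpoint

# Placeholder path; point this at a downloaded, uncompressed checkpoint.
checkpoint_dir = 'xxx/yyy/uncased_L-12_H-768_A-12'
config_path = os.path.join(checkpoint_dir, 'bert_config.json')
checkpoint_path = os.path.join(checkpoint_dir, 'bert_model.ckpt')

# `training=False` builds the model for inference / feature extraction.
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=False, seq_len=128)
model.summary()
```
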
### Run on TPU

The [extraction demo](https://colab.research.google.com/github/CyberZHG/keras-bert/blob/master/demo/load_model/keras_bert_load_and_extract_tpu.ipynb) shows how to convert the model to one that runs on TPU.

The [classification demo](https://colab.research.google.com/github/CyberZHG/keras-bert/blob/master/demo/tune/keras_bert_classification_tpu.ipynb) shows how to apply the model to simple classification tasks.

### Tokenizer

The `Tokenizer` class is used for splitting texts and generating indices:

```python
from keras_bert import Tokenizer

token_dict = {
    '[CLS]': 0,
    '[SEP]': 1,
    'un': 2,
    '##aff': 3,
    '##able': 4,
    '[UNK]': 5,
}
tokenizer = Tokenizer(token_dict)
print(tokenizer.tokenize('unaffable'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]']`
indices, segments = tokenizer.encode('unaffable')
print(indices)   # Should be `[0, 2, 3, 4, 1]`
print(segments)  # Should be `[0, 0, 0, 0, 0]`

print(tokenizer.tokenize(first='unaffable', second='钢'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]', '钢', '[SEP]']`
indices, segments = tokenizer.encode(first='unaffable', second='钢', max_len=10)
print(indices)   # Should be `[0, 2, 3, 4, 1, 5, 1, 0, 0, 0]`
print(segments)  # Should be `[0, 0, 0, 0, 0, 1, 1, 0, 0, 0]`
```

### Train & Use

```python
from tensorflow import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

# A toy input example
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build token dictionary
token_dict = get_base_dict()  # A dict that contains some special tokens
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
token_list = list(token_dict.keys())  # Used for selecting a random word

# Build & train the model
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()

def _generator():
    while True:
        yield gen_batch_inputs(
            sentence_pairs,
            token_dict,
            token_list,
            seq_len=20,
            mask_rate=0.3,
            swap_sentence_rate=1.0,
        )

model.fit_generator(
    generator=_generator(),
    steps_per_epoch=1000,
    epochs=100,
    validation_data=_generator(),
    validation_steps=100,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
)

# Use the trained model
inputs, output_layer = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
    training=False,      # The input layers and output layer will be returned if `training` is `False`
    trainable=False,     # Whether the model is trainable. The default value is the same as `training`
    output_layer_num=4,  # The number of layers whose outputs will be concatenated as a single output.
                         # Only available when `training` is `False`.
)
```

### Use Warmup

The `AdamWarmup` optimizer is provided for warmup and decay. The learning rate will reach `lr` in `warmup_steps` steps, and decay to `min_lr` in `decay_steps` steps. There is a helper function `calc_train_steps` for calculating the two steps:

```python
import numpy as np
from keras_bert import AdamWarmup, calc_train_steps

train_x = np.random.standard_normal((1024, 100))

total_steps, warmup_steps = calc_train_steps(
    num_example=train_x.shape[0],
    batch_size=32,
    epochs=10,
    warmup_proportion=0.1,
)

optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-3, min_lr=1e-5)
```
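
The returned optimizer behaves like any other Keras optimizer, so it can be passed straight to `compile`. A minimal sketch; the model and loss here are placeholders for your own:

```python
# Hypothetical usage: `model` is any Keras model you have built,
# `optimizer` is the AdamWarmup instance created above.
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
```
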
### Download Pretrained Checkpoints

Several download URLs have been added. You can get the downloaded and uncompressed path of a checkpoint by:

```python
from keras_bert import get_pretrained, PretrainedList, get_checkpoint_paths

model_path = get_pretrained(PretrainedList.multi_cased_base)
paths = get_checkpoint_paths(model_path)
print(paths.config, paths.checkpoint, paths.vocab)
```

### Extract Features

You can use the helper function `extract_embeddings` if the features of tokens or sentences (without further tuning) are what you need. To extract the features of all tokens:

```python
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = ['all work and no play', 'makes jack a dull boy~']

embeddings = extract_embeddings(model_path, texts)
```

The returned result is a list with the same length as `texts`. Each item in the list is a numpy array truncated by the length of the input. The shapes of the outputs in this example are `(7, 768)` and `(8, 768)`.

When the inputs are paired sentences and you need the outputs of `NSP` and max-pooling of the last 4 layers:

```python
from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = [
    ('all work and no play', 'makes jack a dull boy'),
    ('makes jack a dull boy', 'all work and no play'),
]

embeddings = extract_embeddings(model_path, texts, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])
```

There are no token features in the results. The outputs of `NSP` and max-pooling will be concatenated, with the final shape `(768 x 4 x 2,)`.

The second argument of the helper function can also be a generator. To extract features from a file:

```python
import codecs
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'

with codecs.open('xxx.txt', 'r', 'utf8') as reader:
    texts = map(lambda x: x.strip(), reader)
    embeddings = extract_embeddings(model_path, texts)
```

### Use `tensorflow.python.keras`

Add `TF_KERAS=1` to environment variables to use `tensorflow.python.keras`.

%package -n python3-keras-bert
Summary:	BERT implemented in Keras
Provides:	python-keras-bert
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip

%description -n python3-keras-bert
# Keras BERT

[![Version](https://img.shields.io/pypi/v/keras-bert.svg)](https://pypi.org/project/keras-bert/)
![License](https://img.shields.io/pypi/l/keras-bert.svg)

\[[中文](https://github.com/CyberZHG/keras-bert/blob/master/README.zh-CN.md)|[English](https://github.com/CyberZHG/keras-bert/blob/master/README.md)\]

Implementation of [BERT](https://arxiv.org/pdf/1810.04805.pdf). Official pre-trained models can be loaded for feature extraction and prediction.

## Install

```bash
pip install keras-bert
```

## Usage

* [Load Official Pre-trained Models](#Load-Official-Pre-trained-Models)
* [Tokenizer](#Tokenizer)
* [Train & Use](#Train-&-Use)
* [Use Warmup](#Use-Warmup)
* [Download Pretrained Checkpoints](#Download-Pretrained-Checkpoints)
* [Extract Features](#Extract-Features)

### External Links

* [Kashgari is a production-ready NLP transfer learning framework for text-labeling and text-classification](https://github.com/BrikerMan/Kashgari)
* [Keras ALBERT](https://github.com/TinkerMob/keras_albert_model)

### Load Official Pre-trained Models

In the [feature extraction demo](./demo/load_model/load_and_extract.py), you should be able to get the same extraction results as the official model `chinese_L-12_H-768_A-12`. And in the [prediction demo](./demo/load_model/load_and_predict.py), the missing word in the sentence can be predicted.
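
As above, a checkpoint can also be loaded directly with `load_trained_model_from_checkpoint`; a minimal sketch, with placeholder paths pointing at an unpacked official model:

```python
import os
from keras_bert import load_trained_model_from_checkpoint

# Placeholder path; point this at a downloaded, uncompressed checkpoint.
checkpoint_dir = 'xxx/yyy/uncased_L-12_H-768_A-12'
config_path = os.path.join(checkpoint_dir, 'bert_config.json')
checkpoint_path = os.path.join(checkpoint_dir, 'bert_model.ckpt')

# `training=False` builds the model for inference / feature extraction.
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=False, seq_len=128)
model.summary()
```
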
### Run on TPU

The [extraction demo](https://colab.research.google.com/github/CyberZHG/keras-bert/blob/master/demo/load_model/keras_bert_load_and_extract_tpu.ipynb) shows how to convert the model to one that runs on TPU.

The [classification demo](https://colab.research.google.com/github/CyberZHG/keras-bert/blob/master/demo/tune/keras_bert_classification_tpu.ipynb) shows how to apply the model to simple classification tasks.

### Tokenizer

The `Tokenizer` class is used for splitting texts and generating indices:

```python
from keras_bert import Tokenizer

token_dict = {
    '[CLS]': 0,
    '[SEP]': 1,
    'un': 2,
    '##aff': 3,
    '##able': 4,
    '[UNK]': 5,
}
tokenizer = Tokenizer(token_dict)
print(tokenizer.tokenize('unaffable'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]']`
indices, segments = tokenizer.encode('unaffable')
print(indices)   # Should be `[0, 2, 3, 4, 1]`
print(segments)  # Should be `[0, 0, 0, 0, 0]`

print(tokenizer.tokenize(first='unaffable', second='钢'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]', '钢', '[SEP]']`
indices, segments = tokenizer.encode(first='unaffable', second='钢', max_len=10)
print(indices)   # Should be `[0, 2, 3, 4, 1, 5, 1, 0, 0, 0]`
print(segments)  # Should be `[0, 0, 0, 0, 0, 1, 1, 0, 0, 0]`
```

### Train & Use

```python
from tensorflow import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

# A toy input example
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build token dictionary
token_dict = get_base_dict()  # A dict that contains some special tokens
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
token_list = list(token_dict.keys())  # Used for selecting a random word

# Build & train the model
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()

def _generator():
    while True:
        yield gen_batch_inputs(
            sentence_pairs,
            token_dict,
            token_list,
            seq_len=20,
            mask_rate=0.3,
            swap_sentence_rate=1.0,
        )

model.fit_generator(
    generator=_generator(),
    steps_per_epoch=1000,
    epochs=100,
    validation_data=_generator(),
    validation_steps=100,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
)

# Use the trained model
inputs, output_layer = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
    training=False,      # The input layers and output layer will be returned if `training` is `False`
    trainable=False,     # Whether the model is trainable. The default value is the same as `training`
    output_layer_num=4,  # The number of layers whose outputs will be concatenated as a single output.
                         # Only available when `training` is `False`.
)
```
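
To actually run the feature extractor built above, the returned input layers and output layer can be wrapped in a plain Keras `Model`. A minimal sketch under the assumption that `get_model(training=False)` returns the token and segment input layers in that order; the dummy batch shapes follow the `seq_len=20` used above:

```python
import numpy as np
from tensorflow import keras

# Wrap the returned layers in a functional model.
feature_model = keras.models.Model(inputs=inputs, outputs=output_layer)

# Dummy batch: token indices and segment ids, each shaped (batch, seq_len).
token_input = np.zeros((1, 20))
segment_input = np.zeros((1, 20))
features = feature_model.predict([token_input, segment_input])
print(features.shape)  # Expected (1, 20, embed_dim * output_layer_num) = (1, 20, 100)
```
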
### Use Warmup

The `AdamWarmup` optimizer is provided for warmup and decay. The learning rate will reach `lr` in `warmup_steps` steps, and decay to `min_lr` in `decay_steps` steps. There is a helper function `calc_train_steps` for calculating the two steps:

```python
import numpy as np
from keras_bert import AdamWarmup, calc_train_steps

train_x = np.random.standard_normal((1024, 100))

total_steps, warmup_steps = calc_train_steps(
    num_example=train_x.shape[0],
    batch_size=32,
    epochs=10,
    warmup_proportion=0.1,
)

optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-3, min_lr=1e-5)
```

### Download Pretrained Checkpoints

Several download URLs have been added. You can get the downloaded and uncompressed path of a checkpoint by:

```python
from keras_bert import get_pretrained, PretrainedList, get_checkpoint_paths

model_path = get_pretrained(PretrainedList.multi_cased_base)
paths = get_checkpoint_paths(model_path)
print(paths.config, paths.checkpoint, paths.vocab)
```

### Extract Features

You can use the helper function `extract_embeddings` if the features of tokens or sentences (without further tuning) are what you need. To extract the features of all tokens:

```python
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = ['all work and no play', 'makes jack a dull boy~']

embeddings = extract_embeddings(model_path, texts)
```

The returned result is a list with the same length as `texts`. Each item in the list is a numpy array truncated by the length of the input. The shapes of the outputs in this example are `(7, 768)` and `(8, 768)`.

When the inputs are paired sentences and you need the outputs of `NSP` and max-pooling of the last 4 layers:

```python
from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = [
    ('all work and no play', 'makes jack a dull boy'),
    ('makes jack a dull boy', 'all work and no play'),
]

embeddings = extract_embeddings(model_path, texts, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])
```

There are no token features in the results. The outputs of `NSP` and max-pooling will be concatenated, with the final shape `(768 x 4 x 2,)`.

The second argument of the helper function can also be a generator. To extract features from a file:

```python
import codecs
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'

with codecs.open('xxx.txt', 'r', 'utf8') as reader:
    texts = map(lambda x: x.strip(), reader)
    embeddings = extract_embeddings(model_path, texts)
```

### Use `tensorflow.python.keras`

Add `TF_KERAS=1` to environment variables to use `tensorflow.python.keras`.

%package help
Summary:	Development documents and examples for keras-bert
Provides:	python3-keras-bert-doc

%description help
# Keras BERT

[![Version](https://img.shields.io/pypi/v/keras-bert.svg)](https://pypi.org/project/keras-bert/)
![License](https://img.shields.io/pypi/l/keras-bert.svg)

\[[中文](https://github.com/CyberZHG/keras-bert/blob/master/README.zh-CN.md)|[English](https://github.com/CyberZHG/keras-bert/blob/master/README.md)\]

Implementation of [BERT](https://arxiv.org/pdf/1810.04805.pdf). Official pre-trained models can be loaded for feature extraction and prediction.
## Install

```bash
pip install keras-bert
```

## Usage

* [Load Official Pre-trained Models](#Load-Official-Pre-trained-Models)
* [Tokenizer](#Tokenizer)
* [Train & Use](#Train-&-Use)
* [Use Warmup](#Use-Warmup)
* [Download Pretrained Checkpoints](#Download-Pretrained-Checkpoints)
* [Extract Features](#Extract-Features)

### External Links

* [Kashgari is a production-ready NLP transfer learning framework for text-labeling and text-classification](https://github.com/BrikerMan/Kashgari)
* [Keras ALBERT](https://github.com/TinkerMob/keras_albert_model)

### Load Official Pre-trained Models

In the [feature extraction demo](./demo/load_model/load_and_extract.py), you should be able to get the same extraction results as the official model `chinese_L-12_H-768_A-12`. And in the [prediction demo](./demo/load_model/load_and_predict.py), the missing word in the sentence can be predicted.

### Run on TPU

The [extraction demo](https://colab.research.google.com/github/CyberZHG/keras-bert/blob/master/demo/load_model/keras_bert_load_and_extract_tpu.ipynb) shows how to convert the model to one that runs on TPU.

The [classification demo](https://colab.research.google.com/github/CyberZHG/keras-bert/blob/master/demo/tune/keras_bert_classification_tpu.ipynb) shows how to apply the model to simple classification tasks.

### Tokenizer

The `Tokenizer` class is used for splitting texts and generating indices:

```python
from keras_bert import Tokenizer

token_dict = {
    '[CLS]': 0,
    '[SEP]': 1,
    'un': 2,
    '##aff': 3,
    '##able': 4,
    '[UNK]': 5,
}
tokenizer = Tokenizer(token_dict)
print(tokenizer.tokenize('unaffable'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]']`
indices, segments = tokenizer.encode('unaffable')
print(indices)   # Should be `[0, 2, 3, 4, 1]`
print(segments)  # Should be `[0, 0, 0, 0, 0]`

print(tokenizer.tokenize(first='unaffable', second='钢'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]', '钢', '[SEP]']`
indices, segments = tokenizer.encode(first='unaffable', second='钢', max_len=10)
print(indices)   # Should be `[0, 2, 3, 4, 1, 5, 1, 0, 0, 0]`
print(segments)  # Should be `[0, 0, 0, 0, 0, 1, 1, 0, 0, 0]`
```

### Train & Use

```python
from tensorflow import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

# A toy input example
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build token dictionary
token_dict = get_base_dict()  # A dict that contains some special tokens
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
token_list = list(token_dict.keys())  # Used for selecting a random word

# Build & train the model
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()

def _generator():
    while True:
        yield gen_batch_inputs(
            sentence_pairs,
            token_dict,
            token_list,
            seq_len=20,
            mask_rate=0.3,
            swap_sentence_rate=1.0,
        )

model.fit_generator(
    generator=_generator(),
    steps_per_epoch=1000,
    epochs=100,
    validation_data=_generator(),
    validation_steps=100,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
)

# Use the trained model
inputs, output_layer = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
    training=False,      # The input layers and output layer will be returned if `training` is `False`
    trainable=False,     # Whether the model is trainable. The default value is the same as `training`
    output_layer_num=4,  # The number of layers whose outputs will be concatenated as a single output.
                         # Only available when `training` is `False`.
)
```

### Use Warmup

The `AdamWarmup` optimizer is provided for warmup and decay. The learning rate will reach `lr` in `warmup_steps` steps, and decay to `min_lr` in `decay_steps` steps. There is a helper function `calc_train_steps` for calculating the two steps:

```python
import numpy as np
from keras_bert import AdamWarmup, calc_train_steps

train_x = np.random.standard_normal((1024, 100))

total_steps, warmup_steps = calc_train_steps(
    num_example=train_x.shape[0],
    batch_size=32,
    epochs=10,
    warmup_proportion=0.1,
)

optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-3, min_lr=1e-5)
```

### Download Pretrained Checkpoints

Several download URLs have been added. You can get the downloaded and uncompressed path of a checkpoint by:

```python
from keras_bert import get_pretrained, PretrainedList, get_checkpoint_paths

model_path = get_pretrained(PretrainedList.multi_cased_base)
paths = get_checkpoint_paths(model_path)
print(paths.config, paths.checkpoint, paths.vocab)
```

### Extract Features

You can use the helper function `extract_embeddings` if the features of tokens or sentences (without further tuning) are what you need. To extract the features of all tokens:

```python
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = ['all work and no play', 'makes jack a dull boy~']

embeddings = extract_embeddings(model_path, texts)
```

The returned result is a list with the same length as `texts`. Each item in the list is a numpy array truncated by the length of the input. The shapes of the outputs in this example are `(7, 768)` and `(8, 768)`.

When the inputs are paired sentences and you need the outputs of `NSP` and max-pooling of the last 4 layers:

```python
from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = [
    ('all work and no play', 'makes jack a dull boy'),
    ('makes jack a dull boy', 'all work and no play'),
]

embeddings = extract_embeddings(model_path, texts, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])
```

There are no token features in the results. The outputs of `NSP` and max-pooling will be concatenated, with the final shape `(768 x 4 x 2,)`.

The second argument of the helper function can also be a generator. To extract features from a file:

```python
import codecs
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'

with codecs.open('xxx.txt', 'r', 'utf8') as reader:
    texts = map(lambda x: x.strip(), reader)
    embeddings = extract_embeddings(model_path, texts)
```

### Use `tensorflow.python.keras`

Add `TF_KERAS=1` to environment variables to use `tensorflow.python.keras`.
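
For example, the variable can be exported in the shell before launching your script, or set from Python. A minimal sketch, assuming the backend is selected when the library is first imported:

```python
import os

# Must be set before keras_bert (or its dependencies) are imported.
os.environ['TF_KERAS'] = '1'

from keras_bert import get_model  # now backed by tensorflow.python.keras
```
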
%prep
%autosetup -n keras-bert-0.89.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-keras-bert -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Sun Apr 23 2023 Python_Bot - 0.89.0-1
- Package Spec generated