author     CoprDistGit <infra@openeuler.org>    2023-06-20 06:26:10 +0000
committer  CoprDistGit <infra@openeuler.org>    2023-06-20 06:26:10 +0000
commit     030be3a6abf6fcceef0979a2da24c81ad8a2c71a (patch)
tree       f032545c2a5ae67c98501eed7ec176fe0524a412
parent     dd08805cf06c20fe7e64cf2dcb818ab3b890eb0a (diff)
automatic import of python-kobert-transformers (branch: openeuler20.03)
-rw-r--r--  .gitignore                       |   1
-rw-r--r--  python-kobert-transformers.spec  | 516
-rw-r--r--  sources                          |   1

3 files changed, 518 insertions(+), 0 deletions(-)
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+/kobert-transformers-0.5.1.tar.gz
diff --git a/python-kobert-transformers.spec b/python-kobert-transformers.spec
new file mode 100644
index 0000000..d7e6358
--- /dev/null
+++ b/python-kobert-transformers.spec
@@ -0,0 +1,516 @@
%global _empty_manifest_terminate_build 0
Name:           python-kobert-transformers
Version:        0.5.1
Release:        1
Summary:        Transformers library for KoBERT, DistilKoBERT
License:        Apache License 2.0
URL:            https://github.com/monologg/KoBERT-Transformers
Source0:        https://mirrors.aliyun.com/pypi/web/packages/36/04/3b41292198e1c7429c2104dcb05b8912ddb18582c8021b324d233313a807/kobert-transformers-0.5.1.tar.gz
BuildArch:      noarch

Requires:       python3-torch
Requires:       python3-transformers
Requires:       python3-sentencepiece

%description
# KoBERT-Transformers

`KoBERT` & `DistilKoBERT` on 🤗 Huggingface Transformers 🤗

The KoBERT model is identical to the one in the [official repo](https://github.com/SKTBrain/KoBERT). This repo exists to support the **full Huggingface tokenizer API**.

## 🚨 Important! 🚨

### 📣 TL;DR

1. Be sure to install `transformers` `v3.0` or higher!
2. For the `tokenizer`, use `kobert_transformers/tokenization_kobert.py` from this repo!

### 1. Tokenizer compatibility

`Huggingface Transformers` changed part of its tokenization API starting with `v2.9.0`. The existing `tokenization_kobert.py` has been updated accordingly for the newer versions.

### 2. The `padding_idx` issue in Embedding

`BertEmbeddings` in `BertModel` has long **hard-coded** `padding_idx=0`.
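Why a hard-coded pad index is harmful can be sketched without the library itself. The torch-free toy below (the `embed` function, table, and values are illustrative, not from the package) mimics an embedding table that zeroes the row at `padding_idx`, and shows what happens when the tokenizer actually pads with a different id, as SentencePiece's default `pad_token_id=1` does.

```python
def embed(token_ids, table, padding_idx):
    """Toy embedding lookup: the row at padding_idx is forced to zero,
    mirroring how nn.Embedding handles its padding_idx row."""
    dim = len(table[0])
    return [[0.0] * dim if t == padding_idx else table[t] for t in token_ids]

# A 4-id vocabulary with 2-dim vectors; id 1 is the tokenizer's real pad token.
table = [[0.9, 0.9], [0.1, 0.1], [0.5, 0.5], [0.7, 0.7]]
padded_seq = [2, 3, 1, 1]  # SentencePiece-style padding with pad_token_id=1

wrong = embed(padded_seq, table, padding_idx=0)  # hard-coded 0
right = embed(padded_seq, table, padding_idx=1)  # taken from the config

assert wrong[2] == [0.1, 0.1]  # pad position keeps a trained, non-zero vector
assert right[2] == [0.0, 0.0]  # pad position is correctly zeroed
```

With the hard-coded index, every pad position contributes a real, trained vector to the model's input, which is exactly the "unwanted results" described below.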
(See the code below.)

```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

SentencePiece, however, defaults to `pad_token_id=1` and `unk_token_id=0` (KoBERT does the same), and a `BertModel` that uses those ids as-is can produce unwanted results.

Huggingface recently became aware of this issue and fixed it in `v2.9.0`. ([Related PR #3793](https://github.com/huggingface/transformers/pull/3793)) `pad_token_id=1` can now be set in the config, which resolves the problem.

```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

In `v2.9.0`, however, the issue remained unfixed for `DistilBERT`, `ALBERT`, and others, so I submitted a PR myself ([related PR #3965](https://github.com/huggingface/transformers/pull/3965)), and **it was finally merged and released in `v2.9.1`.**

The code below shows the difference between the previous and current versions.

```python
# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)

# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
```

## KoBERT / DistilKoBERT on 🤗 Transformers 🤗

### Dependencies

- torch>=1.1.0
- transformers>=3,<5

### How to Use

```python
>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
```

**To use the tokenizer, copy the [`kobert_transformers/tokenization_kobert.py`](https://github.com/monologg/KoBERT-Transformers/blob/master/kobert_transformers/tokenization_kobert.py) file and import `KoBertTokenizer` from it.**

- KoBERT and DistilKoBERT use the same tokenizer.
- **The original KoBERT had an issue where special tokens were not split correctly**; this has been fixed and incorporated here. ([Issue link](https://github.com/SKTBrain/KoBERT/issues/11))

```python
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Kobert-Transformers (Pip library)

[PyPI](https://pypi.org/project/kobert-transformers/) · [License](https://github.com/monologg/DistilKoBERT/blob/master/LICENSE) · [Downloads](https://pepy.tech/project/kobert-transformers)

- A Python library that wraps `tokenization_kobert.py`
- Provides KoBERT and DistilKoBERT in the form of the Huggingface Transformers library
- As of `v0.5.1`, `transformers v3.0` or higher is installed by default. (Usable without issues up to `transformers v4.0`.)

### Install Kobert-Transformers

```bash
pip3 install kobert-transformers
```

### How to Use

```python
>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
```

```python
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Reference

- [KoBERT](https://github.com/SKTBrain/KoBERT)
- [DistilKoBERT](https://github.com/monologg/DistilKoBERT)
- [Huggingface Transformers](https://github.com/huggingface/transformers)

%package -n python3-kobert-transformers
Summary:        Transformers library for KoBERT, DistilKoBERT
Provides:       python-kobert-transformers
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip
%description -n python3-kobert-transformers
# KoBERT-Transformers

`KoBERT` & `DistilKoBERT` on 🤗 Huggingface Transformers 🤗

The KoBERT model is identical to the one in the [official repo](https://github.com/SKTBrain/KoBERT). This repo exists to support the **full Huggingface tokenizer API**.

## 🚨 Important! 🚨

### 📣 TL;DR

1. Be sure to install `transformers` `v3.0` or higher!
2. For the `tokenizer`, use `kobert_transformers/tokenization_kobert.py` from this repo!

### 1. Tokenizer compatibility

`Huggingface Transformers` changed part of its tokenization API starting with `v2.9.0`. The existing `tokenization_kobert.py` has been updated accordingly for the newer versions.

### 2. The `padding_idx` issue in Embedding

`BertEmbeddings` in `BertModel` has long **hard-coded** `padding_idx=0`.
(See the code below.)

```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

SentencePiece, however, defaults to `pad_token_id=1` and `unk_token_id=0` (KoBERT does the same), and a `BertModel` that uses those ids as-is can produce unwanted results.

Huggingface recently became aware of this issue and fixed it in `v2.9.0`. ([Related PR #3793](https://github.com/huggingface/transformers/pull/3793)) `pad_token_id=1` can now be set in the config, which resolves the problem.

```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

In `v2.9.0`, however, the issue remained unfixed for `DistilBERT`, `ALBERT`, and others, so I submitted a PR myself ([related PR #3965](https://github.com/huggingface/transformers/pull/3965)), and **it was finally merged and released in `v2.9.1`.**

The code below shows the difference between the previous and current versions.

```python
# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)

# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
```

## KoBERT / DistilKoBERT on 🤗 Transformers 🤗

### Dependencies

- torch>=1.1.0
- transformers>=3,<5

### How to Use

```python
>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
```

**To use the tokenizer, copy the [`kobert_transformers/tokenization_kobert.py`](https://github.com/monologg/KoBERT-Transformers/blob/master/kobert_transformers/tokenization_kobert.py) file and import `KoBertTokenizer` from it.**

- KoBERT and DistilKoBERT use the same tokenizer.
- **The original KoBERT had an issue where special tokens were not split correctly**; this has been fixed and incorporated here. ([Issue link](https://github.com/SKTBrain/KoBERT/issues/11))

```python
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Kobert-Transformers (Pip library)

[PyPI](https://pypi.org/project/kobert-transformers/) · [License](https://github.com/monologg/DistilKoBERT/blob/master/LICENSE) · [Downloads](https://pepy.tech/project/kobert-transformers)

- A Python library that wraps `tokenization_kobert.py`
- Provides KoBERT and DistilKoBERT in the form of the Huggingface Transformers library
- As of `v0.5.1`, `transformers v3.0` or higher is installed by default. (Usable without issues up to `transformers v4.0`.)

### Install Kobert-Transformers

```bash
pip3 install kobert-transformers
```

### How to Use

```python
>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
```

```python
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Reference

- [KoBERT](https://github.com/SKTBrain/KoBERT)
- [DistilKoBERT](https://github.com/monologg/DistilKoBERT)
- [Huggingface Transformers](https://github.com/huggingface/transformers)

%package help
Summary:        Development documents and examples for kobert-transformers
Provides:       python3-kobert-transformers-doc
%description help
# KoBERT-Transformers

`KoBERT` & `DistilKoBERT` on 🤗 Huggingface Transformers 🤗

The KoBERT model is identical to the one in the [official repo](https://github.com/SKTBrain/KoBERT). This repo exists to support the **full Huggingface tokenizer API**.

## 🚨 Important! 🚨

### 📣 TL;DR

1. Be sure to install `transformers` `v3.0` or higher!
2. For the `tokenizer`, use `kobert_transformers/tokenization_kobert.py` from this repo!

### 1. Tokenizer compatibility

`Huggingface Transformers` changed part of its tokenization API starting with `v2.9.0`. The existing `tokenization_kobert.py` has been updated accordingly for the newer versions.

### 2. The `padding_idx` issue in Embedding

`BertEmbeddings` in `BertModel` has long **hard-coded** `padding_idx=0`. (See the code below.)

```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

SentencePiece, however, defaults to `pad_token_id=1` and `unk_token_id=0` (KoBERT does the same), and a `BertModel` that uses those ids as-is can produce unwanted results.

Huggingface recently became aware of this issue and fixed it in `v2.9.0`.
([Related PR #3793](https://github.com/huggingface/transformers/pull/3793)) `pad_token_id=1` can now be set in the config, which resolves the problem.

```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

In `v2.9.0`, however, the issue remained unfixed for `DistilBERT`, `ALBERT`, and others, so I submitted a PR myself ([related PR #3965](https://github.com/huggingface/transformers/pull/3965)), and **it was finally merged and released in `v2.9.1`.**

The code below shows the difference between the previous and current versions.

```python
# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)

# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
```

## KoBERT / DistilKoBERT on 🤗 Transformers 🤗

### Dependencies

- torch>=1.1.0
- transformers>=3,<5

### How to Use

```python
>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
```

**To use the tokenizer, copy the [`kobert_transformers/tokenization_kobert.py`](https://github.com/monologg/KoBERT-Transformers/blob/master/kobert_transformers/tokenization_kobert.py) file and import `KoBertTokenizer` from it.**

- KoBERT and DistilKoBERT use the same tokenizer.
- **The original KoBERT had an issue where special tokens were not split correctly**; this has been fixed and incorporated here. ([Issue link](https://github.com/SKTBrain/KoBERT/issues/11))

```python
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Kobert-Transformers (Pip library)

[PyPI](https://pypi.org/project/kobert-transformers/) · [License](https://github.com/monologg/DistilKoBERT/blob/master/LICENSE) · [Downloads](https://pepy.tech/project/kobert-transformers)

- A Python library that wraps `tokenization_kobert.py`
- Provides KoBERT and DistilKoBERT in the form of the Huggingface Transformers library
- As of `v0.5.1`, `transformers v3.0` or higher is installed by default. (Usable without issues up to `transformers v4.0`.)

### Install Kobert-Transformers

```bash
pip3 install kobert-transformers
```

### How to Use

```python
>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
```

```python
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Reference

- [KoBERT](https://github.com/SKTBrain/KoBERT)
- [DistilKoBERT](https://github.com/monologg/DistilKoBERT)
- [Huggingface Transformers](https://github.com/huggingface/transformers)

%prep
%autosetup -n kobert-transformers-0.5.1

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-kobert-transformers -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.1-1
- Package Spec generated
diff --git a/sources b/sources
@@ -0,0 +1 @@
+33625e69e5551325dbe9edbcb7151136  kobert-transformers-0.5.1.tar.gz
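The %install scriptlet above builds its `%files` lists by walking the buildroot with GNU `find -printf`, quoting each installed path on its own line. This behavior can be exercised outside rpmbuild; the sketch below uses a throwaway temp directory and a hypothetical installed module purely for illustration.

```shell
#!/bin/sh
# Reproduce the spec's filelist trick against a scratch buildroot.
set -e
buildroot=$(mktemp -d)
mkdir -p "$buildroot/usr/lib/python3/site-packages/demo"
touch "$buildroot/usr/lib/python3/site-packages/demo/__init__.py"
cd "$buildroot"
# %h = leading directories, %f = basename; the leading "/" makes the
# relative walk print absolute, quoted install paths.
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
cat filelist.lst
```

Quoting each path guards against whitespace in filenames when rpmbuild later reads `filelist.lst` via `%files -f`.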