-rw-r--r-- | .gitignore          |   1 |
-rw-r--r-- | python-keybert.spec | 870 |
-rw-r--r-- | sources             |   1 |
3 files changed, 872 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
new file mode 100644
@@ -0,0 +1 @@
+/keybert-0.7.0.tar.gz
diff --git a/python-keybert.spec b/python-keybert.spec
new file mode 100644
index 0000000..4244764
--- /dev/null
+++ b/python-keybert.spec
@@ -0,0 +1,870 @@
+%global _empty_manifest_terminate_build 0
+Name:           python-keybert
+Version:        0.7.0
+Release:        1
+Summary:        KeyBERT performs keyword extraction with state-of-the-art transformer models.
+License:        MIT License
+URL:            https://github.com/MaartenGr/keyBERT
+Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/9a/41/b7b21fb0abee8381b83db942fd6dc31c9d61d59a6af0f0f78e310a5cf908/keybert-0.7.0.tar.gz
+BuildArch:      noarch
+
+
+%description
+[PyPI](https://pypi.org/project/keybert/) · [License](https://github.com/MaartenGr/keybert/blob/master/LICENSE) · [Open In Colab](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)
+
+<img src="images/logo.png" width="35%" height="35%" align="right" />
+
+# KeyBERT
+
+KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
+create keywords and keyphrases that are most similar to a document.
+
+The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).
+
+<a name="toc"/></a>
+## Table of Contents
+<!--ts-->
+ 1. [About the Project](#about)
+ 2. [Getting Started](#gettingstarted)
+ 2.1. [Installation](#installation)
+ 2.2. [Basic Usage](#usage)
+ 2.3. [Max Sum Distance](#maxsum)
+ 2.4. [Maximal Marginal Relevance](#maximal)
+ 2.5. [Embedding Models](#embeddings)
+<!--te-->
+
+
+<a name="about"/></a>
+## 1. About the Project
+[Back to ToC](#toc)
+
+Although there are already many methods available for keyword generation
+(e.g.,
+[Rake](https://github.com/aneesha/RAKE),
+[YAKE!](https://github.com/LIAAD/yake), and TF-IDF),
+I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases.
+This is where **KeyBERT** comes in: it uses BERT embeddings and simple cosine similarity
+to find the sub-phrases in a document that are most similar to the document itself.
+
+First, document embeddings are extracted with BERT to get a document-level representation.
+Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
+to find the words/phrases that are the most similar to the document. The most similar words could
+then be identified as the words that best describe the entire document.
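+
+A rough sketch of that pipeline (a minimal illustration, not KeyBERT's exact internals;
+the `CountVectorizer` candidate step and the model name are assumptions):
+
+```python
+from sentence_transformers import SentenceTransformer
+from sklearn.feature_extraction.text import CountVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+
+doc = "Supervised learning is the machine learning task of learning a function ..."
+
+# Candidate words come from a simple n-gram count vectorizer.
+candidates = CountVectorizer(ngram_range=(1, 1), stop_words="english").fit([doc]).get_feature_names_out()
+
+# Embed the document and every candidate with the same model.
+model = SentenceTransformer("all-MiniLM-L6-v2")
+doc_emb = model.encode([doc])
+cand_embs = model.encode(list(candidates))
+
+# Rank candidates by cosine similarity to the document.
+sims = cosine_similarity(doc_emb, cand_embs)[0]
+top_5 = sorted(zip(candidates, sims), key=lambda pair: -pair[1])[:5]
+```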
+
+KeyBERT is by no means unique and was created as a quick and easy method
+for creating keywords and keyphrases. Although there are many great
+papers and solutions out there that use BERT embeddings
+(e.g.,
+[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
+[2](https://github.com/ibatra/BERT-Keyword-Extractor),
+[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
+I could not find a BERT-based solution that did not have to be trained from scratch and
+that beginners could use out of the box (**correct me if I'm wrong!**).
+Thus, the goal was a `pip install keybert` and at most three lines of code to use it.
+
+<a name="gettingstarted"/></a>
+## 2. Getting Started
+[Back to ToC](#toc)
+
+<a name="installation"/></a>
+### 2.1. Installation
+Installation can be done using [PyPI](https://pypi.org/project/keybert/):
+
+```
+pip install keybert
+```
+
+You may want to install more dependencies depending on the transformer and language backends that you will be using. The possible installations are:
+
+```
+pip install keybert[flair]
+pip install keybert[gensim]
+pip install keybert[spacy]
+pip install keybert[use]
+```
+
+<a name="usage"/></a>
+### 2.2. Usage
+
+The most minimal example of keyword extraction can be seen below:
+```python
+from keybert import KeyBERT
+
+doc = """
+ Supervised learning is the machine learning task of learning a function that
+ maps an input to an output based on example input-output pairs. It infers a
+ function from labeled training data consisting of a set of training examples.
+ In supervised learning, each example is a pair consisting of an input object
+ (typically a vector) and a desired output value (also called the supervisory signal).
+ A supervised learning algorithm analyzes the training data and produces an inferred function,
+ which can be used for mapping new examples. An optimal scenario will allow for the
+ algorithm to correctly determine the class labels for unseen instances. This requires
+ the learning algorithm to generalize from the training data to unseen situations in a
+ 'reasonable' way (see inductive bias).
+ """
+kw_model = KeyBERT()
+keywords = kw_model.extract_keywords(doc)
+```
+
+You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
+[('learning', 0.4604),
+ ('algorithm', 0.4556),
+ ('training', 0.4487),
+ ('class', 0.4086),
+ ('mapping', 0.3700)]
+```
+
+To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
+of words you would like in the resulting keyphrases:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
+[('learning algorithm', 0.6978),
+ ('machine learning', 0.6305),
+ ('supervised learning', 0.5985),
+ ('algorithm analyzes', 0.5860),
+ ('learning function', 0.5850)]
+```
+
+We can highlight the keywords in the document by simply setting `highlight=True`:
+
+```python
+keywords = kw_model.extract_keywords(doc, highlight=True)
+```
+<img src="images/highlight.png" width="75%" height="75%" />
+
+
+**NOTE**: For a full overview of all possible transformer models see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
+I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
+for multilingual documents or documents in any other language.
+
+<a name="maxsum"/></a>
+### 2.3. Max Sum Distance
+
+To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
+Then, from those 2 x top_n candidates we take all top_n combinations and extract the combination
+whose members are the least similar to each other by cosine similarity.
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_maxsum=True, nr_candidates=20, top_n=5)
+[('set training examples', 0.7504),
+ ('generalize training data', 0.7727),
+ ('requires learning algorithm', 0.5050),
+ ('supervised learning algorithm', 0.3779),
+ ('learning machine learning', 0.2891)]
+```
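+
+A minimal sketch of that selection step (an illustration under the assumption that
+`doc_sim` holds candidate-to-document similarities and `pair_sim` candidate-to-candidate
+similarities; this is not KeyBERT's exact code):
+
+```python
+import itertools
+
+import numpy as np
+
+def max_sum_sketch(doc_sim, pair_sim, top_n=5, nr_candidates=20):
+    # Keep the nr_candidates phrases most similar to the document.
+    idx = np.argsort(doc_sim)[-nr_candidates:]
+    # Among those, pick the top_n combination whose summed pairwise
+    # similarity is smallest, i.e. the most mutually dissimilar set.
+    best = min(
+        itertools.combinations(range(len(idx)), top_n),
+        key=lambda combo: sum(pair_sim[idx[i], idx[j]]
+                              for i, j in itertools.combinations(combo, 2)),
+    )
+    return [int(idx[i]) for i in best]
+```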
+
+
+<a name="maximal"/></a>
+### 2.4. Maximal Marginal Relevance
+
+To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
+keywords/keyphrases; it is likewise based on cosine similarity. The results
+with **high diversity**:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_mmr=True, diversity=0.7)
+[('algorithm generalize training', 0.7727),
+ ('labels unseen instances', 0.1649),
+ ('new examples optimal', 0.4185),
+ ('determine class labels', 0.4774),
+ ('supervised learning algorithm', 0.7502)]
+```
+
+The results with **low diversity**:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_mmr=True, diversity=0.2)
+[('algorithm generalize training', 0.7727),
+ ('supervised learning algorithm', 0.7502),
+ ('learning machine learning', 0.7577),
+ ('learning algorithm analyzes', 0.7587),
+ ('learning algorithm generalize', 0.7514)]
+```
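+
+One way to sketch the MMR loop itself (a simplified assumption, not the library's exact
+implementation; `doc_sim` and `pair_sim` are hypothetical precomputed cosine similarities):
+
+```python
+import numpy as np
+
+def mmr_sketch(doc_sim, pair_sim, top_n=5, diversity=0.5):
+    # Start with the candidate most similar to the document.
+    selected = [int(np.argmax(doc_sim))]
+    rest = [i for i in range(len(doc_sim)) if i != selected[0]]
+    while len(selected) < top_n and rest:
+        # Trade off relevance to the document against redundancy
+        # with the candidates already selected.
+        scores = [(1 - diversity) * doc_sim[i]
+                  - diversity * max(pair_sim[i, j] for j in selected)
+                  for i in rest]
+        selected.append(rest.pop(int(np.argmax(scores))))
+    return selected
+```
+
+Higher `diversity` weights the redundancy penalty more heavily, which is why the
+`diversity=0.7` results above share fewer words than the `diversity=0.2` results.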
+
+
+<a name="embeddings"/></a>
+### 2.5. Embedding Models
+KeyBERT supports many embedding models that can be used to embed the documents and words:
+
+* Sentence-Transformers
+* Flair
+* spaCy
+* Gensim
+* USE
+
+Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.
+
+**Sentence-Transformers**
+You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
+and pass it to KeyBERT via the `model` parameter:
+
+```python
+from keybert import KeyBERT
+kw_model = KeyBERT(model='all-MiniLM-L6-v2')
+```
+
+Or select a SentenceTransformer model with your own parameters:
+
+```python
+from keybert import KeyBERT
+from sentence_transformers import SentenceTransformer
+
+sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
+kw_model = KeyBERT(model=sentence_model)
+```
+
+**Flair**
+[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
+is publicly available. Flair can be used as follows:
+
+```python
+from keybert import KeyBERT
+from flair.embeddings import TransformerDocumentEmbeddings
+
+roberta = TransformerDocumentEmbeddings('roberta-base')
+kw_model = KeyBERT(model=roberta)
+```
+
+You can select any 🤗 transformers model [here](https://huggingface.co/models).
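+
+**spaCy**
+The remaining backends work the same way. For example, per the embeddings guide linked
+above, a spaCy pipeline can reportedly be passed in directly (a sketch; the model name and
+excluded components are assumptions):
+
+```python
+import spacy
+from keybert import KeyBERT
+
+# Assumed example: exclude pipeline components not needed for embeddings.
+nlp = spacy.load("en_core_web_md", exclude=["tagger", "parser", "ner",
+                                            "attribute_ruler", "lemmatizer"])
+kw_model = KeyBERT(model=nlp)
+```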
+
+
+## Citation
+To cite KeyBERT in your work, please use the following BibTeX reference:
+
+```bibtex
+@misc{grootendorst2020keybert,
+ author = {Maarten Grootendorst},
+ title = {KeyBERT: Minimal keyword extraction with BERT.},
+ year = 2020,
+ publisher = {Zenodo},
+ version = {v0.3.0},
+ doi = {10.5281/zenodo.4461265},
+ url = {https://doi.org/10.5281/zenodo.4461265}
+}
+```
+
+## References
+Below, you can find several resources that were used for the creation of KeyBERT.
+Most importantly, these are amazing resources for creating impressive keyword extraction models:
+
+**Papers**:
+* Sharma, P., & Li, Y. (2019). [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)
+
+**Github Repos**:
+* https://github.com/thunlp/BERT-KPE
+* https://github.com/ibatra/BERT-Keyword-Extractor
+* https://github.com/pranav-ust/BERT-keyphrase-extraction
+* https://github.com/swisscom/ai-research-keyphrase-extraction
+
+**MMR**:
+The selection of keywords/keyphrases was modeled after:
+* https://github.com/swisscom/ai-research-keyphrase-extraction
+
+**NOTE**: If you find a paper or GitHub repo that has an easy-to-use implementation
+of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
+add a reference to it in this repo.
+
+
+
+
+%package -n python3-keybert
+Summary:        KeyBERT performs keyword extraction with state-of-the-art transformer models.
+Provides:       python-keybert
+BuildRequires:  python3-devel
+BuildRequires:  python3-setuptools
+BuildRequires:  python3-pip
+%description -n python3-keybert
+[PyPI](https://pypi.org/project/keybert/) · [License](https://github.com/MaartenGr/keybert/blob/master/LICENSE) · [Open In Colab](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)
+
+<img src="images/logo.png" width="35%" height="35%" align="right" />
+
+# KeyBERT
+
+KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
+create keywords and keyphrases that are most similar to a document.
+
+The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).
+
+<a name="toc"/></a>
+## Table of Contents
+<!--ts-->
+ 1. [About the Project](#about)
+ 2. [Getting Started](#gettingstarted)
+ 2.1. [Installation](#installation)
+ 2.2. [Basic Usage](#usage)
+ 2.3. [Max Sum Distance](#maxsum)
+ 2.4. [Maximal Marginal Relevance](#maximal)
+ 2.5. [Embedding Models](#embeddings)
+<!--te-->
+
+
+<a name="about"/></a>
+## 1. About the Project
+[Back to ToC](#toc)
+
+Although there are already many methods available for keyword generation
+(e.g.,
+[Rake](https://github.com/aneesha/RAKE),
+[YAKE!](https://github.com/LIAAD/yake), and TF-IDF),
+I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases.
+This is where **KeyBERT** comes in: it uses BERT embeddings and simple cosine similarity
+to find the sub-phrases in a document that are most similar to the document itself.
+
+First, document embeddings are extracted with BERT to get a document-level representation.
+Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
+to find the words/phrases that are the most similar to the document. The most similar words could
+then be identified as the words that best describe the entire document.
+
+KeyBERT is by no means unique and was created as a quick and easy method
+for creating keywords and keyphrases. Although there are many great
+papers and solutions out there that use BERT embeddings
+(e.g.,
+[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
+[2](https://github.com/ibatra/BERT-Keyword-Extractor),
+[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
+I could not find a BERT-based solution that did not have to be trained from scratch and
+that beginners could use out of the box (**correct me if I'm wrong!**).
+Thus, the goal was a `pip install keybert` and at most three lines of code to use it.
+
+<a name="gettingstarted"/></a>
+## 2. Getting Started
+[Back to ToC](#toc)
+
+<a name="installation"/></a>
+### 2.1. Installation
+Installation can be done using [PyPI](https://pypi.org/project/keybert/):
+
+```
+pip install keybert
+```
+
+You may want to install more dependencies depending on the transformer and language backends that you will be using. The possible installations are:
+
+```
+pip install keybert[flair]
+pip install keybert[gensim]
+pip install keybert[spacy]
+pip install keybert[use]
+```
+
+<a name="usage"/></a>
+### 2.2. Usage
+
+The most minimal example of keyword extraction can be seen below:
+```python
+from keybert import KeyBERT
+
+doc = """
+ Supervised learning is the machine learning task of learning a function that
+ maps an input to an output based on example input-output pairs. It infers a
+ function from labeled training data consisting of a set of training examples.
+ In supervised learning, each example is a pair consisting of an input object
+ (typically a vector) and a desired output value (also called the supervisory signal).
+ A supervised learning algorithm analyzes the training data and produces an inferred function,
+ which can be used for mapping new examples. An optimal scenario will allow for the
+ algorithm to correctly determine the class labels for unseen instances. This requires
+ the learning algorithm to generalize from the training data to unseen situations in a
+ 'reasonable' way (see inductive bias).
+ """
+kw_model = KeyBERT()
+keywords = kw_model.extract_keywords(doc)
+```
+
+You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
+[('learning', 0.4604),
+ ('algorithm', 0.4556),
+ ('training', 0.4487),
+ ('class', 0.4086),
+ ('mapping', 0.3700)]
+```
+
+To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
+of words you would like in the resulting keyphrases:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
+[('learning algorithm', 0.6978),
+ ('machine learning', 0.6305),
+ ('supervised learning', 0.5985),
+ ('algorithm analyzes', 0.5860),
+ ('learning function', 0.5850)]
+```
+
+We can highlight the keywords in the document by simply setting `highlight=True`:
+
+```python
+keywords = kw_model.extract_keywords(doc, highlight=True)
+```
+<img src="images/highlight.png" width="75%" height="75%" />
+
+
+**NOTE**: For a full overview of all possible transformer models see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
+I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
+for multilingual documents or documents in any other language.
+
+<a name="maxsum"/></a>
+### 2.3. Max Sum Distance
+
+To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
+Then, from those 2 x top_n candidates we take all top_n combinations and extract the combination
+whose members are the least similar to each other by cosine similarity.
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_maxsum=True, nr_candidates=20, top_n=5)
+[('set training examples', 0.7504),
+ ('generalize training data', 0.7727),
+ ('requires learning algorithm', 0.5050),
+ ('supervised learning algorithm', 0.3779),
+ ('learning machine learning', 0.2891)]
+```
+
+
+<a name="maximal"/></a>
+### 2.4. Maximal Marginal Relevance
+
+To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
+keywords/keyphrases; it is likewise based on cosine similarity. The results
+with **high diversity**:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_mmr=True, diversity=0.7)
+[('algorithm generalize training', 0.7727),
+ ('labels unseen instances', 0.1649),
+ ('new examples optimal', 0.4185),
+ ('determine class labels', 0.4774),
+ ('supervised learning algorithm', 0.7502)]
+```
+
+The results with **low diversity**:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_mmr=True, diversity=0.2)
+[('algorithm generalize training', 0.7727),
+ ('supervised learning algorithm', 0.7502),
+ ('learning machine learning', 0.7577),
+ ('learning algorithm analyzes', 0.7587),
+ ('learning algorithm generalize', 0.7514)]
+```
+
+
+<a name="embeddings"/></a>
+### 2.5. Embedding Models
+KeyBERT supports many embedding models that can be used to embed the documents and words:
+
+* Sentence-Transformers
+* Flair
+* spaCy
+* Gensim
+* USE
+
+Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.
+
+**Sentence-Transformers**
+You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
+and pass it to KeyBERT via the `model` parameter:
+
+```python
+from keybert import KeyBERT
+kw_model = KeyBERT(model='all-MiniLM-L6-v2')
+```
+
+Or select a SentenceTransformer model with your own parameters:
+
+```python
+from keybert import KeyBERT
+from sentence_transformers import SentenceTransformer
+
+sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
+kw_model = KeyBERT(model=sentence_model)
+```
+
+**Flair**
+[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
+is publicly available. Flair can be used as follows:
+
+```python
+from keybert import KeyBERT
+from flair.embeddings import TransformerDocumentEmbeddings
+
+roberta = TransformerDocumentEmbeddings('roberta-base')
+kw_model = KeyBERT(model=roberta)
+```
+
+You can select any 🤗 transformers model [here](https://huggingface.co/models).
+
+
+## Citation
+To cite KeyBERT in your work, please use the following BibTeX reference:
+
+```bibtex
+@misc{grootendorst2020keybert,
+ author = {Maarten Grootendorst},
+ title = {KeyBERT: Minimal keyword extraction with BERT.},
+ year = 2020,
+ publisher = {Zenodo},
+ version = {v0.3.0},
+ doi = {10.5281/zenodo.4461265},
+ url = {https://doi.org/10.5281/zenodo.4461265}
+}
+```
+
+## References
+Below, you can find several resources that were used for the creation of KeyBERT.
+Most importantly, these are amazing resources for creating impressive keyword extraction models:
+
+**Papers**:
+* Sharma, P., & Li, Y. (2019). [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)
+
+**Github Repos**:
+* https://github.com/thunlp/BERT-KPE
+* https://github.com/ibatra/BERT-Keyword-Extractor
+* https://github.com/pranav-ust/BERT-keyphrase-extraction
+* https://github.com/swisscom/ai-research-keyphrase-extraction
+
+**MMR**:
+The selection of keywords/keyphrases was modeled after:
+* https://github.com/swisscom/ai-research-keyphrase-extraction
+
+**NOTE**: If you find a paper or GitHub repo that has an easy-to-use implementation
+of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
+add a reference to it in this repo.
+
+
+
+
+%package help
+Summary:        Development documents and examples for keybert
+Provides:       python3-keybert-doc
+%description help
+[PyPI](https://pypi.org/project/keybert/) · [License](https://github.com/MaartenGr/keybert/blob/master/LICENSE) · [Open In Colab](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)
+
+<img src="images/logo.png" width="35%" height="35%" align="right" />
+
+# KeyBERT
+
+KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
+create keywords and keyphrases that are most similar to a document.
+
+The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).
+
+<a name="toc"/></a>
+## Table of Contents
+<!--ts-->
+ 1. [About the Project](#about)
+ 2. [Getting Started](#gettingstarted)
+ 2.1. [Installation](#installation)
+ 2.2. [Basic Usage](#usage)
+ 2.3. [Max Sum Distance](#maxsum)
+ 2.4. [Maximal Marginal Relevance](#maximal)
+ 2.5. [Embedding Models](#embeddings)
+<!--te-->
+
+
+<a name="about"/></a>
+## 1. About the Project
+[Back to ToC](#toc)
+
+Although there are already many methods available for keyword generation
+(e.g.,
+[Rake](https://github.com/aneesha/RAKE),
+[YAKE!](https://github.com/LIAAD/yake), and TF-IDF),
+I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases.
+This is where **KeyBERT** comes in: it uses BERT embeddings and simple cosine similarity
+to find the sub-phrases in a document that are most similar to the document itself.
+
+First, document embeddings are extracted with BERT to get a document-level representation.
+Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
+to find the words/phrases that are the most similar to the document. The most similar words could
+then be identified as the words that best describe the entire document.
+
+KeyBERT is by no means unique and was created as a quick and easy method
+for creating keywords and keyphrases. Although there are many great
+papers and solutions out there that use BERT embeddings
+(e.g.,
+[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
+[2](https://github.com/ibatra/BERT-Keyword-Extractor),
+[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
+I could not find a BERT-based solution that did not have to be trained from scratch and
+that beginners could use out of the box (**correct me if I'm wrong!**).
+Thus, the goal was a `pip install keybert` and at most three lines of code to use it.
+
+<a name="gettingstarted"/></a>
+## 2. Getting Started
+[Back to ToC](#toc)
+
+<a name="installation"/></a>
+### 2.1. Installation
+Installation can be done using [PyPI](https://pypi.org/project/keybert/):
+
+```
+pip install keybert
+```
+
+You may want to install more dependencies depending on the transformer and language backends that you will be using. The possible installations are:
+
+```
+pip install keybert[flair]
+pip install keybert[gensim]
+pip install keybert[spacy]
+pip install keybert[use]
+```
+
+<a name="usage"/></a>
+### 2.2. Usage
+
+The most minimal example of keyword extraction can be seen below:
+```python
+from keybert import KeyBERT
+
+doc = """
+ Supervised learning is the machine learning task of learning a function that
+ maps an input to an output based on example input-output pairs. It infers a
+ function from labeled training data consisting of a set of training examples.
+ In supervised learning, each example is a pair consisting of an input object
+ (typically a vector) and a desired output value (also called the supervisory signal).
+ A supervised learning algorithm analyzes the training data and produces an inferred function,
+ which can be used for mapping new examples. An optimal scenario will allow for the
+ algorithm to correctly determine the class labels for unseen instances. This requires
+ the learning algorithm to generalize from the training data to unseen situations in a
+ 'reasonable' way (see inductive bias).
+ """
+kw_model = KeyBERT()
+keywords = kw_model.extract_keywords(doc)
+```
+
+You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
+[('learning', 0.4604),
+ ('algorithm', 0.4556),
+ ('training', 0.4487),
+ ('class', 0.4086),
+ ('mapping', 0.3700)]
+```
+
+To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
+of words you would like in the resulting keyphrases:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
+[('learning algorithm', 0.6978),
+ ('machine learning', 0.6305),
+ ('supervised learning', 0.5985),
+ ('algorithm analyzes', 0.5860),
+ ('learning function', 0.5850)]
+```
+
+We can highlight the keywords in the document by simply setting `highlight=True`:
+
+```python
+keywords = kw_model.extract_keywords(doc, highlight=True)
+```
+<img src="images/highlight.png" width="75%" height="75%" />
+
+
+**NOTE**: For a full overview of all possible transformer models see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
+I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
+for multilingual documents or documents in any other language.
+
+<a name="maxsum"/></a>
+### 2.3. Max Sum Distance
+
+To diversify the results, we take the 2 x top_n most similar words/phrases to the document.
+Then, from those 2 x top_n candidates we take all top_n combinations and extract the combination
+whose members are the least similar to each other by cosine similarity.
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_maxsum=True, nr_candidates=20, top_n=5)
+[('set training examples', 0.7504),
+ ('generalize training data', 0.7727),
+ ('requires learning algorithm', 0.5050),
+ ('supervised learning algorithm', 0.3779),
+ ('learning machine learning', 0.2891)]
+```
+
+
+<a name="maximal"/></a>
+### 2.4. Maximal Marginal Relevance
+
+To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
+keywords/keyphrases; it is likewise based on cosine similarity. The results
+with **high diversity**:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_mmr=True, diversity=0.7)
+[('algorithm generalize training', 0.7727),
+ ('labels unseen instances', 0.1649),
+ ('new examples optimal', 0.4185),
+ ('determine class labels', 0.4774),
+ ('supervised learning algorithm', 0.7502)]
+```
+
+The results with **low diversity**:
+
+```python
+>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
+ use_mmr=True, diversity=0.2)
+[('algorithm generalize training', 0.7727),
+ ('supervised learning algorithm', 0.7502),
+ ('learning machine learning', 0.7577),
+ ('learning algorithm analyzes', 0.7587),
+ ('learning algorithm generalize', 0.7514)]
+```
+
+
+<a name="embeddings"/></a>
+### 2.5. Embedding Models
+KeyBERT supports many embedding models that can be used to embed the documents and words:
+
+* Sentence-Transformers
+* Flair
+* spaCy
+* Gensim
+* USE
+
+Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.
+
+**Sentence-Transformers**
+You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
+and pass it to KeyBERT via the `model` parameter:
+
+```python
+from keybert import KeyBERT
+kw_model = KeyBERT(model='all-MiniLM-L6-v2')
+```
+
+Or select a SentenceTransformer model with your own parameters:
+
+```python
+from keybert import KeyBERT
+from sentence_transformers import SentenceTransformer
+
+sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
+kw_model = KeyBERT(model=sentence_model)
+```
+
+**Flair**
+[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
+is publicly available. Flair can be used as follows:
+
+```python
+from keybert import KeyBERT
+from flair.embeddings import TransformerDocumentEmbeddings
+
+roberta = TransformerDocumentEmbeddings('roberta-base')
+kw_model = KeyBERT(model=roberta)
+```
+
+You can select any 🤗 transformers model [here](https://huggingface.co/models).
+
+
+## Citation
+To cite KeyBERT in your work, please use the following BibTeX reference:
+
+```bibtex
+@misc{grootendorst2020keybert,
+ author = {Maarten Grootendorst},
+ title = {KeyBERT: Minimal keyword extraction with BERT.},
+ year = 2020,
+ publisher = {Zenodo},
+ version = {v0.3.0},
+ doi = {10.5281/zenodo.4461265},
+ url = {https://doi.org/10.5281/zenodo.4461265}
+}
+```
+
+## References
+Below, you can find several resources that were used for the creation of KeyBERT.
+Most importantly, these are amazing resources for creating impressive keyword extraction models:
+
+**Papers**:
+* Sharma, P., & Li, Y. (2019). [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)
+
+**Github Repos**:
+* https://github.com/thunlp/BERT-KPE
+* https://github.com/ibatra/BERT-Keyword-Extractor
+* https://github.com/pranav-ust/BERT-keyphrase-extraction
+* https://github.com/swisscom/ai-research-keyphrase-extraction
+
+**MMR**:
+The selection of keywords/keyphrases was modeled after:
+* https://github.com/swisscom/ai-research-keyphrase-extraction
+
+**NOTE**: If you find a paper or GitHub repo that has an easy-to-use implementation
+of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
+add a reference to it in this repo.
+
+
+
+
+%prep
+%autosetup -n keybert-0.7.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-keybert -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed Apr 12 2023 Python_Bot <Python_Bot@openeuler.org> - 0.7.0-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
@@ -0,0 +1 @@
+c17a1fc0c3c0c2c6cc8acf378e5f11ab keybert-0.7.0.tar.gz