From 70f12a5dbc6603235545bb29e9e0b22c62d81669 Mon Sep 17 00:00:00 2001
From: CoprDistGit
Date: Wed, 12 Apr 2023 00:58:00 +0000
Subject: automatic import of python-keybert
---
 .gitignore          |   1 +
 python-keybert.spec | 870 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 sources             |   1 +
 3 files changed, 872 insertions(+)
 create mode 100644 python-keybert.spec
 create mode 100644 sources

diff --git a/.gitignore b/.gitignore
index e69de29..cec9f57 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/keybert-0.7.0.tar.gz
diff --git a/python-keybert.spec b/python-keybert.spec
new file mode 100644
index 0000000..4244764
--- /dev/null
+++ b/python-keybert.spec
@@ -0,0 +1,870 @@
%global _empty_manifest_terminate_build 0
Name:		python-keybert
Version:	0.7.0
Release:	1
Summary:	KeyBERT performs keyword extraction with state-of-the-art transformer models.
License:	MIT License
URL:		https://github.com/MaartenGr/keyBERT
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/9a/41/b7b21fb0abee8381b83db942fd6dc31c9d61d59a6af0f0f78e310a5cf908/keybert-0.7.0.tar.gz
BuildArch:	noarch


%description
[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keybert/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)

# KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
create keywords and keyphrases that are most similar to a document.

The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).

## Table of Contents

1. [About the Project](#about)
2. [Getting Started](#gettingstarted)
    2.1. [Installation](#installation)
    2.2. [Basic Usage](#usage)
    2.3. [Max Sum Distance](#maxsum)
    2.4. [Maximal Marginal Relevance](#maximal)
    2.5. [Embedding Models](#embeddings)

## 1. About the Project
[Back to ToC](#toc)

Although there are already many methods available for keyword generation
(e.g., [Rake](https://github.com/aneesha/RAKE), [YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.),
I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases.
This is where **KeyBERT** comes in: it uses BERT embeddings and simple cosine similarity
to find the sub-phrases in a document that are most similar to the document itself.

First, document embeddings are extracted with BERT to get a document-level representation.
Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
to find the words/phrases that are most similar to the document. The most similar words can
then be identified as the words that best describe the entire document.
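The whole pipeline can be sketched in a few lines of Python. The sketch below is an illustration of the idea rather than KeyBERT's actual implementation; the helper name `simple_keybert` and the use of `CountVectorizer` for candidate generation are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def simple_keybert(doc, top_n=5, ngram_range=(1, 1)):
    # Candidate keywords/keyphrases: the document's own n-grams
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit([doc])
    candidates = list(vectorizer.get_feature_names_out())

    # Embed the document and all candidates with the same model
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_embedding = model.encode([doc])
    candidate_embeddings = model.encode(candidates)

    # Rank candidates by cosine similarity to the document embedding
    sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    top_idx = np.argsort(sims)[-top_n:][::-1]
    return [(candidates[i], round(float(sims[i]), 4)) for i in top_idx]
```

KeyBERT itself wraps this idea with proper candidate handling, the diversification strategies described below, and multiple embedding backends.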
KeyBERT is by no means unique and was created as a quick and easy method
for creating keywords and keyphrases. Although there are many great
papers and solutions out there that use BERT embeddings
(e.g.,
[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
[2](https://github.com/ibatra/BERT-Keyword-Extractor),
[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
I could not find a BERT-based solution that did not have to be trained from scratch and
that beginners could use (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

## 2. Getting Started
[Back to ToC](#toc)

### 2.1. Installation
Installation can be done using [PyPI](https://pypi.org/project/keybert/):

```
pip install keybert
```

You may want to install extras depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

### 2.2. Usage

The most minimal example of keyword extraction can be seen below:

```python
from keybert import KeyBERT

doc = """
    Supervised learning is the machine learning task of learning a function that
    maps an input to an output based on example input-output pairs. It infers a
    function from labeled training data consisting of a set of training examples.
    In supervised learning, each example is a pair consisting of an input object
    (typically a vector) and a desired output value (also called the supervisory signal).
    A supervised learning algorithm analyzes the training data and produces an inferred function,
    which can be used for mapping new examples. An optimal scenario will allow for the
    algorithm to correctly determine the class labels for unseen instances. This requires
    the learning algorithm to generalize from the training data to unseen situations in a
    'reasonable' way (see inductive bias).
    """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher, depending on the number
of words you would like in the resulting keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise `"all-MiniLM-L6-v2"` for English documents and `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or any other language.

### 2.3. Max Sum Distance

To diversify the results, we take the 2 x top_n words/phrases most similar to the document.
Then, we take all top_n combinations from those 2 x top_n words and extract the combination
whose members are least similar to each other by cosine similarity.

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ('supervised learning algorithm', 0.3779),
 ('learning machine learning', 0.2891)]
```
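Under the hood, the call above does roughly the following. This is a minimal sketch, not KeyBERT's exact code; it assumes `doc_embedding` (shape `1 x d`) and `candidate_embeddings` (shape `n x d`) have already been computed, as in the pipeline sketch earlier:

```python
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_distance(doc_embedding, candidate_embeddings, candidates, top_n, nr_candidates):
    # Similarity of each candidate to the document, and between candidates
    doc_sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    word_sims = cosine_similarity(candidate_embeddings)

    # Keep only the nr_candidates words most similar to the document
    best_idx = np.argsort(doc_sims)[-nr_candidates:]

    # Among those, pick the top_n combination whose members are least
    # similar to each other (smallest sum of pairwise similarities)
    best_combo, min_sim = None, float("inf")
    for combo in itertools.combinations(best_idx, top_n):
        sim = sum(word_sims[i][j] for i, j in itertools.combinations(combo, 2))
        if sim < min_sim:
            best_combo, min_sim = combo, sim
    return [(candidates[i], round(float(doc_sims[i]), 4)) for i in best_combo]
```

Here `nr_candidates` plays the role of the "2 x top_n" candidate pool described above.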
### 2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases, which is likewise based on cosine similarity. The results
with **high diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185),
 ('determine class labels', 0.4774),
 ('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577),
 ('learning algorithm analyzes', 0.7587),
 ('learning algorithm generalize', 0.7514)]
```
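Conceptually, MMR starts from the candidate most similar to the document and then repeatedly adds the candidate that best trades off relevance to the document against redundancy with the keywords already selected. A minimal sketch under the same assumptions as above (precomputed embeddings; not KeyBERT's exact code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.7):
    doc_sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    word_sims = cosine_similarity(candidate_embeddings)

    # Start with the single candidate most similar to the document
    selected = [int(np.argmax(doc_sims))]
    remaining = [i for i in range(len(candidates)) if i not in selected]

    while len(selected) < top_n and remaining:
        # Relevance to the document minus redundancy with what is already chosen
        redundancy = word_sims[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sims[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [(candidates[i], round(float(doc_sims[i]), 4)) for i in selected]
```

With `diversity` near 0 this reduces to plain similarity ranking; near 1 it strongly penalizes redundancy, matching the two examples above.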
### 2.5. Embedding Models
KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* spaCy
* Gensim
* USE

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.

**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

**Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

## Citation
To cite KeyBERT in your work, please use the following BibTeX reference:

```bibtex
@misc{grootendorst2020keybert,
  author    = {Maarten Grootendorst},
  title     = {KeyBERT: Minimal keyword extraction with BERT.},
  year      = 2020,
  publisher = {Zenodo},
  version   = {v0.3.0},
  doi       = {10.5281/zenodo.4461265},
  url       = {https://doi.org/10.5281/zenodo.4461265}
}
```

## References
Below, you can find several resources that were used for the creation of KeyBERT;
most importantly, they are excellent resources for building keyword extraction models:

**Papers**:
* Sharma, P., & Li, Y. (2019). [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)

**Github Repos**:
* https://github.com/thunlp/BERT-KPE
* https://github.com/ibatra/BERT-Keyword-Extractor
* https://github.com/pranav-ust/BERT-keyphrase-extraction
* https://github.com/swisscom/ai-research-keyphrase-extraction

**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**NOTE**: If you find a paper or GitHub repo with an easy-to-use implementation
of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
add a reference to this repo.


%package -n python3-keybert
Summary:	KeyBERT performs keyword extraction with state-of-the-art transformer models.
Provides:	python-keybert
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-keybert
[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keybert/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)

# KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
create keywords and keyphrases that are most similar to a document.

The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).

## Table of Contents

1. [About the Project](#about)
2. [Getting Started](#gettingstarted)
    2.1. [Installation](#installation)
    2.2. [Basic Usage](#usage)
    2.3. [Max Sum Distance](#maxsum)
    2.4. [Maximal Marginal Relevance](#maximal)
    2.5. [Embedding Models](#embeddings)

## 1. About the Project
[Back to ToC](#toc)

Although there are already many methods available for keyword generation
(e.g., [Rake](https://github.com/aneesha/RAKE), [YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.),
I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases.
This is where **KeyBERT** comes in: it uses BERT embeddings and simple cosine similarity
to find the sub-phrases in a document that are most similar to the document itself.

First, document embeddings are extracted with BERT to get a document-level representation.
Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
to find the words/phrases that are most similar to the document. The most similar words can
then be identified as the words that best describe the entire document.
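The whole pipeline can be sketched in a few lines of Python. The sketch below is an illustration of the idea rather than KeyBERT's actual implementation; the helper name `simple_keybert` and the use of `CountVectorizer` for candidate generation are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def simple_keybert(doc, top_n=5, ngram_range=(1, 1)):
    # Candidate keywords/keyphrases: the document's own n-grams
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit([doc])
    candidates = list(vectorizer.get_feature_names_out())

    # Embed the document and all candidates with the same model
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_embedding = model.encode([doc])
    candidate_embeddings = model.encode(candidates)

    # Rank candidates by cosine similarity to the document embedding
    sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    top_idx = np.argsort(sims)[-top_n:][::-1]
    return [(candidates[i], round(float(sims[i]), 4)) for i in top_idx]
```

KeyBERT itself wraps this idea with proper candidate handling, the diversification strategies described below, and multiple embedding backends.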
KeyBERT is by no means unique and was created as a quick and easy method
for creating keywords and keyphrases. Although there are many great
papers and solutions out there that use BERT embeddings
(e.g.,
[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
[2](https://github.com/ibatra/BERT-Keyword-Extractor),
[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
I could not find a BERT-based solution that did not have to be trained from scratch and
that beginners could use (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

## 2. Getting Started
[Back to ToC](#toc)

### 2.1. Installation
Installation can be done using [PyPI](https://pypi.org/project/keybert/):

```
pip install keybert
```

You may want to install extras depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

### 2.2. Usage

The most minimal example of keyword extraction can be seen below:

```python
from keybert import KeyBERT

doc = """
    Supervised learning is the machine learning task of learning a function that
    maps an input to an output based on example input-output pairs. It infers a
    function from labeled training data consisting of a set of training examples.
    In supervised learning, each example is a pair consisting of an input object
    (typically a vector) and a desired output value (also called the supervisory signal).
    A supervised learning algorithm analyzes the training data and produces an inferred function,
    which can be used for mapping new examples. An optimal scenario will allow for the
    algorithm to correctly determine the class labels for unseen instances. This requires
    the learning algorithm to generalize from the training data to unseen situations in a
    'reasonable' way (see inductive bias).
    """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher, depending on the number
of words you would like in the resulting keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise `"all-MiniLM-L6-v2"` for English documents and `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or any other language.

### 2.3. Max Sum Distance

To diversify the results, we take the 2 x top_n words/phrases most similar to the document.
Then, we take all top_n combinations from those 2 x top_n words and extract the combination
whose members are least similar to each other by cosine similarity.

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ('supervised learning algorithm', 0.3779),
 ('learning machine learning', 0.2891)]
```
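Under the hood, the call above does roughly the following. This is a minimal sketch, not KeyBERT's exact code; it assumes `doc_embedding` (shape `1 x d`) and `candidate_embeddings` (shape `n x d`) have already been computed, as in the pipeline sketch earlier:

```python
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_distance(doc_embedding, candidate_embeddings, candidates, top_n, nr_candidates):
    # Similarity of each candidate to the document, and between candidates
    doc_sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    word_sims = cosine_similarity(candidate_embeddings)

    # Keep only the nr_candidates words most similar to the document
    best_idx = np.argsort(doc_sims)[-nr_candidates:]

    # Among those, pick the top_n combination whose members are least
    # similar to each other (smallest sum of pairwise similarities)
    best_combo, min_sim = None, float("inf")
    for combo in itertools.combinations(best_idx, top_n):
        sim = sum(word_sims[i][j] for i, j in itertools.combinations(combo, 2))
        if sim < min_sim:
            best_combo, min_sim = combo, sim
    return [(candidates[i], round(float(doc_sims[i]), 4)) for i in best_combo]
```

Here `nr_candidates` plays the role of the "2 x top_n" candidate pool described above.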
### 2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases, which is likewise based on cosine similarity. The results
with **high diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185),
 ('determine class labels', 0.4774),
 ('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577),
 ('learning algorithm analyzes', 0.7587),
 ('learning algorithm generalize', 0.7514)]
```
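Conceptually, MMR starts from the candidate most similar to the document and then repeatedly adds the candidate that best trades off relevance to the document against redundancy with the keywords already selected. A minimal sketch under the same assumptions as above (precomputed embeddings; not KeyBERT's exact code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.7):
    doc_sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    word_sims = cosine_similarity(candidate_embeddings)

    # Start with the single candidate most similar to the document
    selected = [int(np.argmax(doc_sims))]
    remaining = [i for i in range(len(candidates)) if i not in selected]

    while len(selected) < top_n and remaining:
        # Relevance to the document minus redundancy with what is already chosen
        redundancy = word_sims[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sims[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [(candidates[i], round(float(doc_sims[i]), 4)) for i in selected]
```

With `diversity` near 0 this reduces to plain similarity ranking; near 1 it strongly penalizes redundancy, matching the two examples above.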
### 2.5. Embedding Models
KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* spaCy
* Gensim
* USE

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.

**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

**Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

## Citation
To cite KeyBERT in your work, please use the following BibTeX reference:

```bibtex
@misc{grootendorst2020keybert,
  author    = {Maarten Grootendorst},
  title     = {KeyBERT: Minimal keyword extraction with BERT.},
  year      = 2020,
  publisher = {Zenodo},
  version   = {v0.3.0},
  doi       = {10.5281/zenodo.4461265},
  url       = {https://doi.org/10.5281/zenodo.4461265}
}
```

## References
Below, you can find several resources that were used for the creation of KeyBERT;
most importantly, they are excellent resources for building keyword extraction models:

**Papers**:
* Sharma, P., & Li, Y. (2019). [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)

**Github Repos**:
* https://github.com/thunlp/BERT-KPE
* https://github.com/ibatra/BERT-Keyword-Extractor
* https://github.com/pranav-ust/BERT-keyphrase-extraction
* https://github.com/swisscom/ai-research-keyphrase-extraction

**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**NOTE**: If you find a paper or GitHub repo with an easy-to-use implementation
of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
add a reference to this repo.


%package help
Summary:	Development documents and examples for keybert
Provides:	python3-keybert-doc
%description help
[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keybert/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/keyBERT/Code%20Checks/master)](https://pypi.org/project/keybert/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)

# KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
create keywords and keyphrases that are most similar to a document.

The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).

## Table of Contents

1. [About the Project](#about)
2. [Getting Started](#gettingstarted)
    2.1. [Installation](#installation)
    2.2. [Basic Usage](#usage)
    2.3. [Max Sum Distance](#maxsum)
    2.4. [Maximal Marginal Relevance](#maximal)
    2.5. [Embedding Models](#embeddings)

## 1. About the Project
[Back to ToC](#toc)

Although there are already many methods available for keyword generation
(e.g., [Rake](https://github.com/aneesha/RAKE), [YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.),
I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases.
This is where **KeyBERT** comes in: it uses BERT embeddings and simple cosine similarity
to find the sub-phrases in a document that are most similar to the document itself.

First, document embeddings are extracted with BERT to get a document-level representation.
Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
to find the words/phrases that are most similar to the document. The most similar words can
then be identified as the words that best describe the entire document.
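The whole pipeline can be sketched in a few lines of Python. The sketch below is an illustration of the idea rather than KeyBERT's actual implementation; the helper name `simple_keybert` and the use of `CountVectorizer` for candidate generation are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def simple_keybert(doc, top_n=5, ngram_range=(1, 1)):
    # Candidate keywords/keyphrases: the document's own n-grams
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit([doc])
    candidates = list(vectorizer.get_feature_names_out())

    # Embed the document and all candidates with the same model
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_embedding = model.encode([doc])
    candidate_embeddings = model.encode(candidates)

    # Rank candidates by cosine similarity to the document embedding
    sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    top_idx = np.argsort(sims)[-top_n:][::-1]
    return [(candidates[i], round(float(sims[i]), 4)) for i in top_idx]
```

KeyBERT itself wraps this idea with proper candidate handling, the diversification strategies described below, and multiple embedding backends.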
KeyBERT is by no means unique and was created as a quick and easy method
for creating keywords and keyphrases. Although there are many great
papers and solutions out there that use BERT embeddings
(e.g.,
[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
[2](https://github.com/ibatra/BERT-Keyword-Extractor),
[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
I could not find a BERT-based solution that did not have to be trained from scratch and
that beginners could use (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keybert` and at most 3 lines of code in usage.

## 2. Getting Started
[Back to ToC](#toc)

### 2.1. Installation
Installation can be done using [PyPI](https://pypi.org/project/keybert/):

```
pip install keybert
```

You may want to install extras depending on the transformer and language backends that you will be using. The possible installations are:

```
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
```

### 2.2. Usage

The most minimal example of keyword extraction can be seen below:

```python
from keybert import KeyBERT

doc = """
    Supervised learning is the machine learning task of learning a function that
    maps an input to an output based on example input-output pairs. It infers a
    function from labeled training data consisting of a set of training examples.
    In supervised learning, each example is a pair consisting of an input object
    (typically a vector) and a desired output value (also called the supervisory signal).
    A supervised learning algorithm analyzes the training data and produces an inferred function,
    which can be used for mapping new examples. An optimal scenario will allow for the
    algorithm to correctly determine the class labels for unseen instances. This requires
    the learning algorithm to generalize from the training data to unseen situations in a
    'reasonable' way (see inductive bias).
    """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
```

You can set `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]
```

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher, depending on the number
of words you would like in the resulting keyphrases:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]
```

We can highlight the keywords in the document by simply setting `highlight=True`:

```python
keywords = kw_model.extract_keywords(doc, highlight=True)
```

**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise `"all-MiniLM-L6-v2"` for English documents and `"paraphrase-multilingual-MiniLM-L12-v2"`
for multilingual documents or any other language.

### 2.3. Max Sum Distance

To diversify the results, we take the 2 x top_n words/phrases most similar to the document.
Then, we take all top_n combinations from those 2 x top_n words and extract the combination
whose members are least similar to each other by cosine similarity.

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ('supervised learning algorithm', 0.3779),
 ('learning machine learning', 0.2891)]
```
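Under the hood, the call above does roughly the following. This is a minimal sketch, not KeyBERT's exact code; it assumes `doc_embedding` (shape `1 x d`) and `candidate_embeddings` (shape `n x d`) have already been computed, as in the pipeline sketch earlier:

```python
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_distance(doc_embedding, candidate_embeddings, candidates, top_n, nr_candidates):
    # Similarity of each candidate to the document, and between candidates
    doc_sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    word_sims = cosine_similarity(candidate_embeddings)

    # Keep only the nr_candidates words most similar to the document
    best_idx = np.argsort(doc_sims)[-nr_candidates:]

    # Among those, pick the top_n combination whose members are least
    # similar to each other (smallest sum of pairwise similarities)
    best_combo, min_sim = None, float("inf")
    for combo in itertools.combinations(best_idx, top_n):
        sim = sum(word_sims[i][j] for i, j in itertools.combinations(combo, 2))
        if sim < min_sim:
            best_combo, min_sim = combo, sim
    return [(candidates[i], round(float(doc_sims[i]), 4)) for i in best_combo]
```

Here `nr_candidates` plays the role of the "2 x top_n" candidate pool described above.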
### 2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases, which is likewise based on cosine similarity. The results
with **high diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185),
 ('determine class labels', 0.4774),
 ('supervised learning algorithm', 0.7502)]
```

The results with **low diversity**:

```python
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577),
 ('learning algorithm analyzes', 0.7587),
 ('learning algorithm generalize', 0.7514)]
```
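Conceptually, MMR starts from the candidate most similar to the document and then repeatedly adds the candidate that best trades off relevance to the document against redundancy with the keywords already selected. A minimal sketch under the same assumptions as above (precomputed embeddings; not KeyBERT's exact code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.7):
    doc_sims = cosine_similarity(doc_embedding, candidate_embeddings).flatten()
    word_sims = cosine_similarity(candidate_embeddings)

    # Start with the single candidate most similar to the document
    selected = [int(np.argmax(doc_sims))]
    remaining = [i for i in range(len(candidates)) if i not in selected]

    while len(selected) < top_n and remaining:
        # Relevance to the document minus redundancy with what is already chosen
        redundancy = word_sims[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sims[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [(candidates[i], round(float(doc_sims[i]), 4)) for i in selected]
```

With `diversity` near 0 this reduces to plain similarity ranking; near 1 it strongly penalizes redundancy, matching the two examples above.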
### 2.5. Embedding Models
KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* spaCy
* Gensim
* USE

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.

**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it to KeyBERT with `model`:

```python
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
```

Or select a SentenceTransformer model with your own parameters:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```

**Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

```python
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

## Citation
To cite KeyBERT in your work, please use the following BibTeX reference:

```bibtex
@misc{grootendorst2020keybert,
  author    = {Maarten Grootendorst},
  title     = {KeyBERT: Minimal keyword extraction with BERT.},
  year      = 2020,
  publisher = {Zenodo},
  version   = {v0.3.0},
  doi       = {10.5281/zenodo.4461265},
  url       = {https://doi.org/10.5281/zenodo.4461265}
}
```

## References
Below, you can find several resources that were used for the creation of KeyBERT;
most importantly, they are excellent resources for building keyword extraction models:

**Papers**:
* Sharma, P., & Li, Y. (2019). [Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)

**Github Repos**:
* https://github.com/thunlp/BERT-KPE
* https://github.com/ibatra/BERT-Keyword-Extractor
* https://github.com/pranav-ust/BERT-keyphrase-extraction
* https://github.com/swisscom/ai-research-keyphrase-extraction

**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction

**NOTE**: If you find a paper or GitHub repo with an easy-to-use implementation
of BERT embeddings for keyword/keyphrase extraction, let me know! I'll make sure to
add a reference to this repo.


%prep
%autosetup -n keybert-0.7.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
# Collect the installed files into the lists consumed by %files below
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-keybert -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Wed Apr 12 2023 Python_Bot - 0.7.0-1
- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..d3f0763
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+c17a1fc0c3c0c2c6cc8acf378e5f11ab  keybert-0.7.0.tar.gz
-- 
cgit v1.2.3