automatic import of python-polyfuzz

author: CoprDistGit <infra@openeuler.org> 2023-04-11 18:37:48 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-04-11 18:37:48 +0000
commit: cc342b1bfcaa48a4e6adbce607ca782948db6d50
tree: ec1e41d99d90a6300bd207e3cf8f9cb2b66c443d
parent: 106de98e95c328ef20854693f88cb348807acbd1
3 files changed, 1012 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..b5a6881 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/polyfuzz-0.4.0.tar.gz
diff --git a/python-polyfuzz.spec b/python-polyfuzz.spec
new file mode 100644
index 0000000..40ac8e2
--- /dev/null
+++ b/python-polyfuzz.spec
@@ -0,0 +1,1010 @@
+%global _empty_manifest_terminate_build 0
+Name:		python-polyfuzz
+Version:	0.4.0
+Release:	1
+Summary:	PolyFuzz performs fuzzy string matching, grouping, and evaluation.
+License:	MIT License
+URL:		https://github.com/MaartenGr/PolyFuzz
+Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/fe/90/79ac771627a14ef47d16f2e3d1662332af790d5b942c8af55f1a32aa8ef6/polyfuzz-0.4.0.tar.gz
+BuildArch:	noarch
+
+Requires:	python3-numpy
+Requires:	python3-scipy
+Requires:	python3-pandas
+Requires:	python3-tqdm
+Requires:	python3-joblib
+Requires:	python3-matplotlib
+Requires:	python3-seaborn
+Requires:	python3-rapidfuzz
+Requires:	python3-scikit-learn
+Requires:	python3-mkdocs
+Requires:	python3-mkdocs-material
+Requires:	python3-mkdocstrings
+Requires:	python3-pytest
+Requires:	python3-pytest-cov
+Requires:	python3-torch
+Requires:	python3-flair
+Requires:	python3-sparse-dot-topn
+Requires:	python3-sentence-transformers
+Requires:	python3-spacy
+Requires:	python3-tensorflow
+Requires:	python3-tensorflow-hub
+Requires:	python3-tensorflow-text
+Requires:	python3-mkdocs
+Requires:	python3-mkdocs-material
+Requires:	python3-mkdocstrings
+Requires:	python3-sparse-dot-topn
+Requires:	python3-torch
+Requires:	python3-flair
+Requires:	python3-gensim
+Requires:	python3-sentence-transformers
+Requires:	python3-pytest
+Requires:	python3-pytest-cov
+Requires:	python3-tensorflow
+Requires:	python3-tensorflow-hub
+Requires:	python3-tensorflow-text
+
+%description
+<img src="images/logo.png" width="70%" height="70%"/>
+
+[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/polyfuzz/)
+[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/PolyFuzz/blob/master/LICENSE)
+[![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/)
+[![Build](https://img.shields.io/github/workflow/status/MaartenGr/polyfuzz/Code%20Checks/master)](https://pypi.org/project/polyfuzz/)
+[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)  
+**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions. 
+PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.
+
+Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding
+techniques such as FastText and GloVe, and 🤗 transformers embeddings.
+
+The corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).
+
+
+<a name="installation"></a>
+## Installation
+You can install **`PolyFuzz`** via pip:
+ 
+```bash
+pip install polyfuzz
+```
+
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
+
+```bash
+pip install polyfuzz[sbert]
+pip install polyfuzz[flair]
+pip install polyfuzz[gensim]
+pip install polyfuzz[spacy]
+pip install polyfuzz[use]
+```
+
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, 
+you can use `sparse_dot_topn` which is installed via:
+
+```bash
+pip install polyfuzz[fast]
+```
+
+<details>
+<summary>Installation Issues</summary>
+
+You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many 
+is to install it via conda before installing PolyFuzz:
+
+```bash
+conda install -c conda-forge sparse_dot_topn
+```
+
+If that does not work, I would advise you to look through their 
+[issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`.
+
+</details>  
+
+
+<a name="gettingstarted"></a>
+## Getting Started
+
+For an in-depth overview of the possibilities of `PolyFuzz` 
+you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along 
+with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).
+
+### Quick Start
+
+The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. 
+We start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create 
+n-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity 
+between strings by calculating the cosine similarity between vector representations. 
+
+We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+```  
+
+The resulting matches can be accessed through `model.get_matches()`:
+
+```python
+>>> model.get_matches()
+         From      To    Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.783751
+3       recal    None    0.000000
+4       house   mouse    0.587927
+5  similarity    None    0.000000
+
+``` 
+
+**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`
+
+**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly 
+access Levenshtein and FastText (English) respectively. 
+
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz 
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. 
+
+Let's say that we have a list of words that we know to be correct called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
+
+```python
+from polyfuzz import PolyFuzz
+
+# train_words: strings known to be correct
+# unseen_words: incoming strings to map onto train_words
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we are using `fit` on `train_words` to calculate the TF-IDF representation of those words which are saved to be used again in `transform`. 
+This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`. 
+
+Then, we can save and load the model as follows for use in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
+
+### Group Matches
+We can group the matches `To` as there might be significant overlap in strings in our to_list. 
+To do this, we calculate the similarity within strings in to_list and use `single linkage` to then 
+group the strings with a high similarity.
+
+When we extract the new matches, we can see an additional column `Group` that shows the group each `To` match was assigned to:
+
+```python
+>>> model.group(link_min_similarity=0.75)
+>>> model.get_matches()
+	      From	To		Similarity	Group
+0	     apple	apple	1.000000	apples
+1	    apples	apples	1.000000	apples
+2	      appl	apple	0.783751	apples
+3	     recal	None	0.000000	None
+4	     house	mouse	0.587927	mouse
+5	similarity	None	0.000000	None
+```
+
+As can be seen above, we grouped `apple` and `apples` together such that when a string is mapped to `apple` it 
+will fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster, which is `apples`.
+
+### Precision-Recall Curve  
+Next, we would like to see how well our model is doing on our data. We express our results as 
+**`precision`** and **`recall`**, where precision is defined as the minimum similarity score before a match is considered correct and 
+recall as the percentage of matches found at a certain minimum similarity score.
+
+Creating the visualizations is as simple as:
+
+```python
+model.visualize_precision_recall()
+```
+<img src="images/tfidf.png" width="100%" height="100%"/> 
+
+## Models
+Currently, the following models are implemented in PolyFuzz:
+* TF-IDF
+* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
+* FastText and GloVe
+* 🤗 Transformers
+
+With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html). 
+We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.
+
+All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:
+
+```python
+from polyfuzz.models import EditDistance, TFIDF, Embeddings
+from flair.embeddings import TransformerWordEmbeddings
+
+embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
+bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
+tfidf = TFIDF(min_similarity=0)
+edit = EditDistance()
+
+string_models = [bert, tfidf, edit]
+model = PolyFuzz(string_models)
+model.match(from_list, to_list)
+```
+
+To access the results, we again can call `get_matches`, but since we have multiple models we get back a dictionary 
+of dataframes.
+
+In order to access the results of a specific model, call `get_matches` with the correct id: 
+
+```python
+>>> model.get_matches("BERT")
+        From	    To          Similarity
+0	apple	    apple	1.000000
+1	apples	    apples	1.000000
+2	appl	    apple	0.928045
+3	recal	    apples	0.825268
+4	house	    mouse	0.887524
+5	similarity  mouse	0.791548
+``` 
+
+Finally, visualize the results to compare the models:
+
+```python
+model.visualize_precision_recall(kde=True)
+```
+
+<img src="images/multiple_models.png" width="100%" height="100%"/>
+
+## Custom Grouper
+We can even use one of the `polyfuzz.models` as the grouper in case you would like to use 
+something other than the standard TF-IDF model:
+
+```python
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+
+edit_grouper = EditDistance(n_jobs=1)
+model.group(edit_grouper)
+```
+
+## Custom Models
+Although the options above are a great solution for comparing different models, what if you have developed your own? 
+If you follow the structure of PolyFuzz's `BaseMatcher`  
+you can quickly implement any model you would like.
+
+Below, we are implementing the ratio similarity measure from RapidFuzz.
+
+```python
+import numpy as np
+import pandas as pd
+from rapidfuzz import fuzz
+from polyfuzz.models import BaseMatcher
+
+
+class MyModel(BaseMatcher):
+    def match(self, from_list, to_list, **kwargs):
+        # Calculate similarity scores (fuzz.ratio returns 0-100, so scale to 0-1)
+        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list] 
+                    for from_string in from_list]
+        
+        # Get best matches
+        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
+        scores = np.max(matches, axis=1)
+        
+        # Prepare dataframe
+        matches = pd.DataFrame({'From': from_list,'To': mappings, 'Similarity': scores})
+        return matches
+```
+Then, we can simply create an instance of MyModel and pass it through PolyFuzz:
+```python
+custom_model = MyModel()
+model = PolyFuzz(custom_model)
+```
+
+## Citation
+To cite PolyFuzz in your work, please use the following bibtex reference:
+
+```bibtex
+@misc{grootendorst2020polyfuzz,
+  author       = {Maarten Grootendorst},
+  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
+  year         = 2020,
+  publisher    = {Zenodo},
+  version      = {v0.2.2},
+  doi          = {10.5281/zenodo.4461050},
+  url          = {https://doi.org/10.5281/zenodo.4461050}
+}
+```
+
+## References
+Below, you can find several resources that were used for, or served as inspiration when, developing PolyFuzz:
+  
+**Edit distance algorithms**:  
+These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:
+
+* https://github.com/jamesturk/jellyfish
+* https://github.com/ztane/python-Levenshtein
+* https://github.com/seatgeek/fuzzywuzzy
+* https://github.com/maxbachmann/rapidfuzz
+* https://github.com/roy-ht/editdistance
+
+**Other interesting repos**:
+
+* https://github.com/ing-bank/sparse_dot_topn
+    * Used in PolyFuzz for fast cosine similarity between sparse matrices
+
+
+
+
+%package -n python3-polyfuzz
+Summary:	PolyFuzz performs fuzzy string matching, grouping, and evaluation.
+Provides:	python-polyfuzz
+BuildRequires:	python3-devel
+BuildRequires:	python3-setuptools
+BuildRequires:	python3-pip
+%description -n python3-polyfuzz
+<img src="images/logo.png" width="70%" height="70%"/>
+
+[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/polyfuzz/)
+[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/PolyFuzz/blob/master/LICENSE)
+[![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/)
+[![Build](https://img.shields.io/github/workflow/status/MaartenGr/polyfuzz/Code%20Checks/master)](https://pypi.org/project/polyfuzz/)
+[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)  
+**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions. 
+PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.
+
+Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding
+techniques such as FastText and GloVe, and 🤗 transformers embeddings.
+
+The corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).
+
+
+<a name="installation"></a>
+## Installation
+You can install **`PolyFuzz`** via pip:
+ 
+```bash
+pip install polyfuzz
+```
+
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
+
+```bash
+pip install polyfuzz[sbert]
+pip install polyfuzz[flair]
+pip install polyfuzz[gensim]
+pip install polyfuzz[spacy]
+pip install polyfuzz[use]
+```
+
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, 
+you can use `sparse_dot_topn` which is installed via:
+
+```bash
+pip install polyfuzz[fast]
+```
+
+<details>
+<summary>Installation Issues</summary>
+
+You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many 
+is to install it via conda before installing PolyFuzz:
+
+```bash
+conda install -c conda-forge sparse_dot_topn
+```
+
+If that does not work, I would advise you to look through their 
+[issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`.
+
+</details>  
+
+
+<a name="gettingstarted"></a>
+## Getting Started
+
+For an in-depth overview of the possibilities of `PolyFuzz` 
+you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along 
+with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).
+
+### Quick Start
+
+The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. 
+We start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create 
+n-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity 
+between strings by calculating the cosine similarity between vector representations. 
+
+We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+```  
+
+The resulting matches can be accessed through `model.get_matches()`:
+
+```python
+>>> model.get_matches()
+         From      To    Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.783751
+3       recal    None    0.000000
+4       house   mouse    0.587927
+5  similarity    None    0.000000
+
+``` 
+
+**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`
+
+**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly 
+access Levenshtein and FastText (English) respectively. 
+
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz 
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. 
+
+Let's say that we have a list of words that we know to be correct called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
+
+```python
+from polyfuzz import PolyFuzz
+
+# train_words: strings known to be correct
+# unseen_words: incoming strings to map onto train_words
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we are using `fit` on `train_words` to calculate the TF-IDF representation of those words which are saved to be used again in `transform`. 
+This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`. 
+
+Then, we can save and load the model as follows for use in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
+
+### Group Matches
+We can group the matches `To` as there might be significant overlap in strings in our to_list. 
+To do this, we calculate the similarity within strings in to_list and use `single linkage` to then 
+group the strings with a high similarity.
+
+When we extract the new matches, we can see an additional column `Group` that shows the group each `To` match was assigned to:
+
+```python
+>>> model.group(link_min_similarity=0.75)
+>>> model.get_matches()
+	      From	To		Similarity	Group
+0	     apple	apple	1.000000	apples
+1	    apples	apples	1.000000	apples
+2	      appl	apple	0.783751	apples
+3	     recal	None	0.000000	None
+4	     house	mouse	0.587927	mouse
+5	similarity	None	0.000000	None
+```
+
+As can be seen above, we grouped `apple` and `apples` together such that when a string is mapped to `apple` it 
+will fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster, which is `apples`.
+
+### Precision-Recall Curve  
+Next, we would like to see how well our model is doing on our data. We express our results as 
+**`precision`** and **`recall`**, where precision is defined as the minimum similarity score before a match is considered correct and 
+recall as the percentage of matches found at a certain minimum similarity score.
+
+Creating the visualizations is as simple as:
+
+```python
+model.visualize_precision_recall()
+```
+<img src="images/tfidf.png" width="100%" height="100%"/> 
+
+## Models
+Currently, the following models are implemented in PolyFuzz:
+* TF-IDF
+* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
+* FastText and GloVe
+* 🤗 Transformers
+
+With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html). 
+We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.
+
+All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:
+
+```python
+from polyfuzz.models import EditDistance, TFIDF, Embeddings
+from flair.embeddings import TransformerWordEmbeddings
+
+embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
+bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
+tfidf = TFIDF(min_similarity=0)
+edit = EditDistance()
+
+string_models = [bert, tfidf, edit]
+model = PolyFuzz(string_models)
+model.match(from_list, to_list)
+```
+
+To access the results, we again can call `get_matches`, but since we have multiple models we get back a dictionary 
+of dataframes.
+
+In order to access the results of a specific model, call `get_matches` with the correct id: 
+
+```python
+>>> model.get_matches("BERT")
+        From	    To          Similarity
+0	apple	    apple	1.000000
+1	apples	    apples	1.000000
+2	appl	    apple	0.928045
+3	recal	    apples	0.825268
+4	house	    mouse	0.887524
+5	similarity  mouse	0.791548
+``` 
+
+Finally, visualize the results to compare the models:
+
+```python
+model.visualize_precision_recall(kde=True)
+```
+
+<img src="images/multiple_models.png" width="100%" height="100%"/>
+
+## Custom Grouper
+We can even use one of the `polyfuzz.models` as the grouper in case you would like to use 
+something other than the standard TF-IDF model:
+
+```python
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+
+edit_grouper = EditDistance(n_jobs=1)
+model.group(edit_grouper)
+```
+
+## Custom Models
+Although the options above are a great solution for comparing different models, what if you have developed your own? 
+If you follow the structure of PolyFuzz's `BaseMatcher`  
+you can quickly implement any model you would like.
+
+Below, we are implementing the ratio similarity measure from RapidFuzz.
+
+```python
+import numpy as np
+import pandas as pd
+from rapidfuzz import fuzz
+from polyfuzz.models import BaseMatcher
+
+
+class MyModel(BaseMatcher):
+    def match(self, from_list, to_list, **kwargs):
+        # Calculate similarity scores (fuzz.ratio returns 0-100, so scale to 0-1)
+        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list] 
+                    for from_string in from_list]
+        
+        # Get best matches
+        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
+        scores = np.max(matches, axis=1)
+        
+        # Prepare dataframe
+        matches = pd.DataFrame({'From': from_list,'To': mappings, 'Similarity': scores})
+        return matches
+```
+Then, we can simply create an instance of MyModel and pass it through PolyFuzz:
+```python
+custom_model = MyModel()
+model = PolyFuzz(custom_model)
+```
+
+## Citation
+To cite PolyFuzz in your work, please use the following bibtex reference:
+
+```bibtex
+@misc{grootendorst2020polyfuzz,
+  author       = {Maarten Grootendorst},
+  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
+  year         = 2020,
+  publisher    = {Zenodo},
+  version      = {v0.2.2},
+  doi          = {10.5281/zenodo.4461050},
+  url          = {https://doi.org/10.5281/zenodo.4461050}
+}
+```
+
+## References
+Below, you can find several resources that were used for, or served as inspiration when, developing PolyFuzz:
+  
+**Edit distance algorithms**:  
+These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:
+
+* https://github.com/jamesturk/jellyfish
+* https://github.com/ztane/python-Levenshtein
+* https://github.com/seatgeek/fuzzywuzzy
+* https://github.com/maxbachmann/rapidfuzz
+* https://github.com/roy-ht/editdistance
+
+**Other interesting repos**:
+
+* https://github.com/ing-bank/sparse_dot_topn
+    * Used in PolyFuzz for fast cosine similarity between sparse matrices
+
+
+
+
+%package help
+Summary:	Development documents and examples for polyfuzz
+Provides:	python3-polyfuzz-doc
+%description help
+<img src="images/logo.png" width="70%" height="70%"/>
+
+[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/polyfuzz/)
+[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/PolyFuzz/blob/master/LICENSE)
+[![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/)
+[![Build](https://img.shields.io/github/workflow/status/MaartenGr/polyfuzz/Code%20Checks/master)](https://pypi.org/project/polyfuzz/)
+[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)  
+**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions. 
+PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.
+
+Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding
+techniques such as FastText and GloVe, and 🤗 transformers embeddings.
+
+The corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).
+
+
+<a name="installation"></a>
+## Installation
+You can install **`PolyFuzz`** via pip:
+ 
+```bash
+pip install polyfuzz
+```
+
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
+
+```bash
+pip install polyfuzz[sbert]
+pip install polyfuzz[flair]
+pip install polyfuzz[gensim]
+pip install polyfuzz[spacy]
+pip install polyfuzz[use]
+```
+
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, 
+you can use `sparse_dot_topn` which is installed via:
+
+```bash
+pip install polyfuzz[fast]
+```
+
+<details>
+<summary>Installation Issues</summary>
+
+You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many 
+is to install it via conda before installing PolyFuzz:
+
+```bash
+conda install -c conda-forge sparse_dot_topn
+```
+
+If that does not work, I would advise you to look through their 
+[issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`.
+
+</details>  
+
+
+<a name="gettingstarted"></a>
+## Getting Started
+
+For an in-depth overview of the possibilities of `PolyFuzz` 
+you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along 
+with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).
+
+### Quick Start
+
+The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. 
+We start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create 
+n-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity 
+between strings by calculating the cosine similarity between vector representations. 
+
+We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+```  
+
+The resulting matches can be accessed through `model.get_matches()`:
+
+```python
+>>> model.get_matches()
+         From      To    Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.783751
+3       recal    None    0.000000
+4       house   mouse    0.587927
+5  similarity    None    0.000000
+
+``` 
+
+**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`
+
+**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly 
+access Levenshtein and FastText (English) respectively. 
+
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz 
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. 
+
+Let's say that we have a list of words that we know to be correct called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
+
+```python
+from polyfuzz import PolyFuzz
+
+# train_words: strings known to be correct
+# unseen_words: incoming strings to map onto train_words
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we are using `fit` on `train_words` to calculate the TF-IDF representation of those words which are saved to be used again in `transform`. 
+This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`. 
+
+Then, we can save and load the model as follows for use in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
+
+### Group Matches
+We can group the matches `To` as there might be significant overlap in strings in our to_list. 
+To do this, we calculate the similarity within strings in to_list and use `single linkage` to then 
+group the strings with a high similarity.
+
+When we extract the new matches, we can see an additional column `Group` that shows the group each `To` match was assigned to:
+
+```python
+>>> model.group(link_min_similarity=0.75)
+>>> model.get_matches()
+	      From	To		Similarity	Group
+0	     apple	apple	1.000000	apples
+1	    apples	apples	1.000000	apples
+2	      appl	apple	0.783751	apples
+3	     recal	None	0.000000	None
+4	     house	mouse	0.587927	mouse
+5	similarity	None	0.000000	None
+```
+
+As can be seen above, we grouped `apple` and `apples` together such that when a string is mapped to `apple` it 
+will fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster, which is `apples`.
+
+### Precision-Recall Curve  
+Next, we would like to see how well our model is doing on our data. We express our results as 
+**`precision`** and **`recall`**, where precision is defined as the minimum similarity score before a match is considered correct and 
+recall as the percentage of matches found at a certain minimum similarity score.
+
+Creating the visualizations is as simple as:
+
+```python
+model.visualize_precision_recall()
+```
+<img src="images/tfidf.png" width="100%" height="100%"/> 
+
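Under the definitions given above, the curve can be tabulated by hand: for each minimum similarity threshold, recall is the share of all `From` strings whose best match reaches that threshold. The sketch below follows that description using the similarity scores from the grouped example. The function name `recall_at_thresholds` and the choice of thresholds are assumptions, and PolyFuzz's own plotting code may compute the curve differently.

```python
def recall_at_thresholds(similarities, thresholds):
    """Return (threshold, recall) pairs for a list of similarity scores."""
    total = len(similarities)
    curve = []
    for threshold in thresholds:
        # A match counts as "found" if its score reaches the threshold.
        found = sum(score >= threshold for score in similarities)
        curve.append((threshold, found / total))
    return curve


# Similarity column from the TF-IDF example above.
scores = [1.0, 1.0, 0.783751, 0.0, 0.587927, 0.0]
curve = recall_at_thresholds(scores, [0.0, 0.25, 0.5, 0.75, 1.0])

for threshold, recall in curve:
    print(f"min similarity {threshold:.2f} -> recall {recall:.2f}")
```

Raising the threshold can only lower recall, which is the trade-off the plot makes visible.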
+## Models
+Currently, the following models are implemented in PolyFuzz:
+* TF-IDF
+* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
+* FastText and GloVe
+* 🤗 Transformers
+
+With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html).
+We simply have to instantiate any Flair word embedding class and pass it to PolyFuzz.
+
+All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:
+
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.models import EditDistance, TFIDF, Embeddings
+from flair.embeddings import TransformerWordEmbeddings
+
+embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
+bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
+tfidf = TFIDF(min_similarity=0)
+edit = EditDistance()
+
+string_models = [bert, tfidf, edit]
+model = PolyFuzz(string_models)
+model.match(from_list, to_list)
+```
+
+To access the results, we can again call `get_matches`, but since we have multiple models, we get
+a dictionary of dataframes back.
+
+In order to access the results of a specific model, call `get_matches` with the correct id: 
+
+```python
+>>> model.get_matches("BERT")
+        From	    To          Similarity
+0	apple	    apple	1.000000
+1	apples	    apples	1.000000
+2	appl	    apple	0.928045
+3	recal	    apples	0.825268
+4	house	    mouse	0.887524
+5	similarity  mouse	0.791548
+``` 
+
+Finally, visualize the results to compare the models:
+
+```python
+model.visualize_precision_recall(kde=True)
+```
+
+<img src="images/multiple_models.png" width="100%" height="100%"/>
+
+## Custom Grouper
+We can also use any of the models in `polyfuzz.models` as the grouper, in case you would like to use
+something other than the standard TF-IDF model:
+
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.models import EditDistance
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+
+edit_grouper = EditDistance(n_jobs=1)
+model.group(edit_grouper)
+```
+
+## Custom Models
+Although the options above are a great solution for comparing different models, what if you have developed your own? 
+If you follow the structure of PolyFuzz's `BaseMatcher`  
+you can quickly implement any model you would like.
+
+Below, we are implementing the ratio similarity measure from RapidFuzz.
+
+```python
+import numpy as np
+import pandas as pd
+from rapidfuzz import fuzz
+from polyfuzz.models import BaseMatcher
+
+
+class MyModel(BaseMatcher):
+    def match(self, from_list, to_list, **kwargs):
+        # Calculate distances
+        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list] 
+                    for from_string in from_list]
+        
+        # Get best matches
+        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
+        scores = np.max(matches, axis=1)
+        
+        # Prepare dataframe
+        matches = pd.DataFrame({'From': from_list, 'To': mappings, 'Similarity': scores})
+        return matches
+```
+Then, we can simply create an instance of MyModel and pass it through PolyFuzz:
+```python
+custom_model = MyModel()
+model = PolyFuzz(custom_model)
+```
+
+## Citation
+To cite PolyFuzz in your work, please use the following bibtex reference:
+
+```bibtex
+@misc{grootendorst2020polyfuzz,
+  author       = {Maarten Grootendorst},
+  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
+  year         = 2020,
+  publisher    = {Zenodo},
+  version      = {v0.2.2},
+  doi          = {10.5281/zenodo.4461050},
+  url          = {https://doi.org/10.5281/zenodo.4461050}
+}
+```
+
+## References
+Below, you can find several resources that were used for, or served as inspiration for, developing PolyFuzz:  
+  
+**Edit distance algorithms**:  
+These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:
+
+* https://github.com/jamesturk/jellyfish
+* https://github.com/ztane/python-Levenshtein
+* https://github.com/seatgeek/fuzzywuzzy
+* https://github.com/maxbachmann/rapidfuzz
+* https://github.com/roy-ht/editdistance
+
+**Other interesting repos**:
+
+* https://github.com/ing-bank/sparse_dot_topn
+    * Used in PolyFuzz for fast cosine similarity between sparse matrices
+
+
+
+
+%prep
+%autosetup -n polyfuzz-0.4.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-polyfuzz -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Apr 11 2023 Python_Bot <Python_Bot@openeuler.org> - 0.4.0-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..cf4b83e
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+a603e85e2c4135f8bea3ca0b737c948b  polyfuzz-0.4.0.tar.gz