-rw-r--r-- | .gitignore | 1
-rw-r--r-- | python-polyfuzz.spec | 1010
-rw-r--r-- | sources | 1
3 files changed, 1012 insertions, 0 deletions
@@ -0,0 +1 @@
+/polyfuzz-0.4.0.tar.gz
diff --git a/python-polyfuzz.spec b/python-polyfuzz.spec
new file mode 100644
index 0000000..40ac8e2
--- /dev/null
+++ b/python-polyfuzz.spec
@@ -0,0 +1,1010 @@
+%global _empty_manifest_terminate_build 0
+Name: python-polyfuzz
+Version: 0.4.0
+Release: 1
+Summary: PolyFuzz performs fuzzy string matching, grouping, and evaluation.
+License: MIT License
+URL: https://github.com/MaartenGr/PolyFuzz
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/fe/90/79ac771627a14ef47d16f2e3d1662332af790d5b942c8af55f1a32aa8ef6/polyfuzz-0.4.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-numpy
+Requires: python3-scipy
+Requires: python3-pandas
+Requires: python3-tqdm
+Requires: python3-joblib
+Requires: python3-matplotlib
+Requires: python3-seaborn
+Requires: python3-rapidfuzz
+Requires: python3-scikit-learn
+Requires: python3-mkdocs
+Requires: python3-mkdocs-material
+Requires: python3-mkdocstrings
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-torch
+Requires: python3-flair
+Requires: python3-sparse-dot-topn
+Requires: python3-sentence-transformers
+Requires: python3-spacy
+Requires: python3-tensorflow
+Requires: python3-tensorflow-hub
+Requires: python3-tensorflow-text
+Requires: python3-mkdocs
+Requires: python3-mkdocs-material
+Requires: python3-mkdocstrings
+Requires: python3-sparse-dot-topn
+Requires: python3-torch
+Requires: python3-flair
+Requires: python3-gensim
+Requires: python3-sentence-transformers
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-tensorflow
+Requires: python3-tensorflow-hub
+Requires: python3-tensorflow-text
+
+%description
+<img src="images/logo.png" width="70%" height="70%"/>
+
+[](https://pypi.org/project/keybert/)
+[](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
+[](https://pypi.org/project/polyfuzz/)
+[](https://pypi.org/project/polyfuzz/)
+[](https://maartengr.github.io/PolyFuzz/)
+**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions.
+PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.
+
+Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding
+techniques such as FastText and GloVe, and 🤗 transformers embeddings.
+
+Corresponding medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).
+
+
+<a name="installation"/></a>
+## Installation
+You can install **`PolyFuzz`** via pip:
+
+```bash
+pip install polyfuzz
+```
+
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
+
+```python
+pip install bertopic[sbert]
+pip install bertopic[flair]
+pip install bertopic[gensim]
+pip install bertopic[spacy]
+pip install bertopic[use]
+```
+
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
+you can use `sparse_dot_topn` which is installed via:
+
+```bash
+pip install polyfuzz[fast]
+```
+
+<details>
+<summary>Installation Issues</summary>
+
+You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many
+is by installing it via conda first before installing PolyFuzz:
+
+```bash
+conda install -c conda-forge sparse_dot_topn
+```
+
+If that does not work, I would advise you to look through their
+issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`.
+
+</details>
+
+
+<a name="gettingstarted"/></a>
+## Getting Started
+
+For an in-depth overview of the possibilities of `PolyFuzz`
+you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along
+with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).
+
+### Quick Start
+
+The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings.
+We start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create
+n-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity
+between strings by calculating the cosine similarity between vector representations.
+
+We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+```
+
+The resulting matches can be accessed through `model.get_matches()`:
+
+```python
+>>> model.get_matches()
+         From      To  Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.783751
+3       recal    None    0.000000
+4       house   mouse    0.587927
+5  similarity    None    0.000000
+
+```
+
+**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`
+
+**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly
+access Levenshtein and FastText (English) respectively.
+
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.
+
+Let's say that we have a list of words that we know to be correct called `train_words`. We want to any incoming word to mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
+
+```python
+from sklearn.datasets import fetch_20newsgroups
+from sklearn.feature_extraction.text import CountVectorizer
+from polyfuzz import PolyFuzz
+
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we are using `fit` on `train_words` to calculate the TF-IDF representation of those words which are saved to be used again in `transform`.
+This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`.
+
+Then, we apply save and load the model as follows to be used in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
+
+### Group Matches
+We can group the matches `To` as there might be significant overlap in strings in our to_list.
+To do this, we calculate the similarity within strings in to_list and use `single linkage` to then
+group the strings with a high similarity.
+
+When we extract the new matches, we can see an additional column `Group` in which all the `To` matches were grouped to:
+
+```python
+>>> model.group(link_min_similarity=0.75)
+>>> model.get_matches()
+         From      To  Similarity   Group
+0       apple   apple    1.000000  apples
+1      apples  apples    1.000000  apples
+2        appl   apple    0.783751  apples
+3       recal    None    0.000000    None
+4       house   mouse    0.587927   mouse
+5  similarity    None    0.000000    None
+```
+
+As can be seen above, we grouped apple and apples together to `apple` such that when a string is mapped to `apple` it
+will fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster which is `apples`.
+
+### Precision-Recall Curve
+Next, we would like to see how well our model is doing on our data. We express our results as
+**`precision`** and **`recall`** where precision is defined as the minimum similarity score before a match is correct and
+recall the percentage of matches found at a certain minimum similarity score.
+
+Creating the visualizations is as simple as:
+
+```
+model.visualize_precision_recall()
+```
+<img src="images/tfidf.png" width="100%" height="100%"/>
+
+## Models
+Currently, the following models are implemented in PolyFuzz:
+* TF-IDF
+* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
+* FastText and GloVe
+* 🤗 Transformers
+
+With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html).
+We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzzy.
+
+All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:
+
+```python
+from polyfuzz.models import EditDistance, TFIDF, Embeddings
+from flair.embeddings import TransformerWordEmbeddings
+
+embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
+bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
+tfidf = TFIDF(min_similarity=0)
+edit = EditDistance()
+
+string_models = [bert, tfidf, edit]
+model = PolyFuzz(string_models)
+model.match(from_list, to_list)
+```
+
+To access the results, we again can call `get_matches` but since we have multiple models we get back a dictionary
+of dataframes back.
+
+In order to access the results of a specific model, call `get_matches` with the correct id:
+
+```python
+>>> model.get_matches("BERT")
+         From      To  Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.928045
+3       recal  apples    0.825268
+4       house   mouse    0.887524
+5  similarity   mouse    0.791548
+```
+
+Finally, visualize the results to compare the models:
+
+```python
+model.visualize_precision_recall(kde=True)
+```
+
+<img src="images/multiple_models.png" width="100%" height="100%"/>
+
+## Custom Grouper
+We can even use one of the `polyfuzz.models` to be used as the grouper in case you would like to use
+something else than the standard TF-IDF model:
+
+```python
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+
+edit_grouper = EditDistance(n_jobs=1)
+model.group(edit_grouper)
+```
+
+## Custom Models
+Although the options above are a great solution for comparing different models, what if you have developed your own?
+If you follow the structure of PolyFuzz's `BaseMatcher`
+you can quickly implement any model you would like.
+
+Below, we are implementing the ratio similarity measure from RapidFuzz.
+
+```python
+import numpy as np
+import pandas as pd
+from rapidfuzz import fuzz
+from polyfuzz.models import BaseMatcher
+
+
+class MyModel(BaseMatcher):
+    def match(self, from_list, to_list, **kwargs):
+        # Calculate distances
+        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
+                    for from_string in from_list]
+
+        # Get best matches
+        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
+        scores = np.max(matches, axis=1)
+
+        # Prepare dataframe
+        matches = pd.DataFrame({'From': from_list,'To': mappings, 'Similarity': scores})
+        return matches
+```
+Then, we can simply create an instance of MyModel and pass it through PolyFuzz:
+```python
+custom_model = MyModel()
+model = PolyFuzz(custom_model)
+```
+
+## Citation
+To cite PolyFuzz in your work, please use the following bibtex reference:
+
+```bibtex
+@misc{grootendorst2020polyfuzz,
+  author = {Maarten Grootendorst},
+  title = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
+  year = 2020,
+  publisher = {Zenodo},
+  version = {v0.2.2},
+  doi = {10.5281/zenodo.4461050},
+  url = {https://doi.org/10.5281/zenodo.4461050}
+}
+```
+
+## References
+Below, you can find several resources that were used for or inspired by when developing PolyFuzz:
+
+**Edit distance algorithms**:
+These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:
+
+* https://github.com/jamesturk/jellyfish
+* https://github.com/ztane/python-Levenshtein
+* https://github.com/seatgeek/fuzzywuzzy
+* https://github.com/maxbachmann/rapidfuzz
+* https://github.com/roy-ht/editdistance
+
+**Other interesting repos**:
+
+* https://github.com/ing-bank/sparse_dot_topn
+    * Used in PolyFuzz for fast cosine similarity between sparse matrices
+
+
+
+
+%package -n python3-polyfuzz
+Summary: PolyFuzz performs fuzzy string matching, grouping, and evaluation.
+Provides: python-polyfuzz
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-polyfuzz
+<img src="images/logo.png" width="70%" height="70%"/>
+
+[](https://pypi.org/project/keybert/)
+[](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
+[](https://pypi.org/project/polyfuzz/)
+[](https://pypi.org/project/polyfuzz/)
+[](https://maartengr.github.io/PolyFuzz/)
+**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions.
+PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.
+
+Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding
+techniques such as FastText and GloVe, and 🤗 transformers embeddings.
+
+Corresponding medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).
+
+
+<a name="installation"/></a>
+## Installation
+You can install **`PolyFuzz`** via pip:
+
+```bash
+pip install polyfuzz
+```
+
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
+
+```python
+pip install bertopic[sbert]
+pip install bertopic[flair]
+pip install bertopic[gensim]
+pip install bertopic[spacy]
+pip install bertopic[use]
+```
+
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
+you can use `sparse_dot_topn` which is installed via:
+
+```bash
+pip install polyfuzz[fast]
+```
+
+<details>
+<summary>Installation Issues</summary>
+
+You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many
+is by installing it via conda first before installing PolyFuzz:
+
+```bash
+conda install -c conda-forge sparse_dot_topn
+```
+
+If that does not work, I would advise you to look through their
+issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`.
+
+</details>
+
+
+<a name="gettingstarted"/></a>
+## Getting Started
+
+For an in-depth overview of the possibilities of `PolyFuzz`
+you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along
+with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).
+
+### Quick Start
+
+The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings.
+We start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create
+n-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity
+between strings by calculating the cosine similarity between vector representations.
+
+We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+```
+
+The resulting matches can be accessed through `model.get_matches()`:
+
+```python
+>>> model.get_matches()
+         From      To  Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.783751
+3       recal    None    0.000000
+4       house   mouse    0.587927
+5  similarity    None    0.000000
+
+```
+
+**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`
+
+**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly
+access Levenshtein and FastText (English) respectively.
+
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.
+
+Let's say that we have a list of words that we know to be correct called `train_words`. We want to any incoming word to mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
+
+```python
+from sklearn.datasets import fetch_20newsgroups
+from sklearn.feature_extraction.text import CountVectorizer
+from polyfuzz import PolyFuzz
+
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we are using `fit` on `train_words` to calculate the TF-IDF representation of those words which are saved to be used again in `transform`.
+This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`.
+
+Then, we apply save and load the model as follows to be used in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
+
+### Group Matches
+We can group the matches `To` as there might be significant overlap in strings in our to_list.
+To do this, we calculate the similarity within strings in to_list and use `single linkage` to then
+group the strings with a high similarity.
+
+When we extract the new matches, we can see an additional column `Group` in which all the `To` matches were grouped to:
+
+```python
+>>> model.group(link_min_similarity=0.75)
+>>> model.get_matches()
+         From      To  Similarity   Group
+0       apple   apple    1.000000  apples
+1      apples  apples    1.000000  apples
+2        appl   apple    0.783751  apples
+3       recal    None    0.000000    None
+4       house   mouse    0.587927   mouse
+5  similarity    None    0.000000    None
+```
+
+As can be seen above, we grouped apple and apples together to `apple` such that when a string is mapped to `apple` it
+will fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster which is `apples`.
+
+### Precision-Recall Curve
+Next, we would like to see how well our model is doing on our data. We express our results as
+**`precision`** and **`recall`** where precision is defined as the minimum similarity score before a match is correct and
+recall the percentage of matches found at a certain minimum similarity score.
+
+Creating the visualizations is as simple as:
+
+```
+model.visualize_precision_recall()
+```
+<img src="images/tfidf.png" width="100%" height="100%"/>
+
+## Models
+Currently, the following models are implemented in PolyFuzz:
+* TF-IDF
+* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
+* FastText and GloVe
+* 🤗 Transformers
+
+With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html).
+We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzzy.
+
+All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:
+
+```python
+from polyfuzz.models import EditDistance, TFIDF, Embeddings
+from flair.embeddings import TransformerWordEmbeddings
+
+embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
+bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
+tfidf = TFIDF(min_similarity=0)
+edit = EditDistance()
+
+string_models = [bert, tfidf, edit]
+model = PolyFuzz(string_models)
+model.match(from_list, to_list)
+```
+
+To access the results, we again can call `get_matches` but since we have multiple models we get back a dictionary
+of dataframes back.
+
+In order to access the results of a specific model, call `get_matches` with the correct id:
+
+```python
+>>> model.get_matches("BERT")
+         From      To  Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.928045
+3       recal  apples    0.825268
+4       house   mouse    0.887524
+5  similarity   mouse    0.791548
+```
+
+Finally, visualize the results to compare the models:
+
+```python
+model.visualize_precision_recall(kde=True)
+```
+
+<img src="images/multiple_models.png" width="100%" height="100%"/>
+
+## Custom Grouper
+We can even use one of the `polyfuzz.models` to be used as the grouper in case you would like to use
+something else than the standard TF-IDF model:
+
+```python
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+
+edit_grouper = EditDistance(n_jobs=1)
+model.group(edit_grouper)
+```
+
+## Custom Models
+Although the options above are a great solution for comparing different models, what if you have developed your own?
+If you follow the structure of PolyFuzz's `BaseMatcher`
+you can quickly implement any model you would like.
+
+Below, we are implementing the ratio similarity measure from RapidFuzz.
+
+```python
+import numpy as np
+import pandas as pd
+from rapidfuzz import fuzz
+from polyfuzz.models import BaseMatcher
+
+
+class MyModel(BaseMatcher):
+    def match(self, from_list, to_list, **kwargs):
+        # Calculate distances
+        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
+                    for from_string in from_list]
+
+        # Get best matches
+        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
+        scores = np.max(matches, axis=1)
+
+        # Prepare dataframe
+        matches = pd.DataFrame({'From': from_list,'To': mappings, 'Similarity': scores})
+        return matches
+```
+Then, we can simply create an instance of MyModel and pass it through PolyFuzz:
+```python
+custom_model = MyModel()
+model = PolyFuzz(custom_model)
+```
+
+## Citation
+To cite PolyFuzz in your work, please use the following bibtex reference:
+
+```bibtex
+@misc{grootendorst2020polyfuzz,
+  author = {Maarten Grootendorst},
+  title = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
+  year = 2020,
+  publisher = {Zenodo},
+  version = {v0.2.2},
+  doi = {10.5281/zenodo.4461050},
+  url = {https://doi.org/10.5281/zenodo.4461050}
+}
+```
+
+## References
+Below, you can find several resources that were used for or inspired by when developing PolyFuzz:
+
+**Edit distance algorithms**:
+These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:
+
+* https://github.com/jamesturk/jellyfish
+* https://github.com/ztane/python-Levenshtein
+* https://github.com/seatgeek/fuzzywuzzy
+* https://github.com/maxbachmann/rapidfuzz
+* https://github.com/roy-ht/editdistance
+
+**Other interesting repos**:
+
+* https://github.com/ing-bank/sparse_dot_topn
+    * Used in PolyFuzz for fast cosine similarity between sparse matrices
+
+
+
+
+%package help
+Summary: Development documents and examples for polyfuzz
+Provides: python3-polyfuzz-doc
+%description help
+<img src="images/logo.png" width="70%" height="70%"/>
+
+[](https://pypi.org/project/keybert/)
+[](https://github.com/MaartenGr/keybert/blob/master/LICENSE)
+[](https://pypi.org/project/polyfuzz/)
+[](https://pypi.org/project/polyfuzz/)
+[](https://maartengr.github.io/PolyFuzz/)
+**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions.
+PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.
+
+Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding
+techniques such as FastText and GloVe, and 🤗 transformers embeddings.
+
+Corresponding medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).
+
+
+<a name="installation"/></a>
+## Installation
+You can install **`PolyFuzz`** via pip:
+
+```bash
+pip install polyfuzz
+```
+
+You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
+
+```python
+pip install bertopic[sbert]
+pip install bertopic[flair]
+pip install bertopic[gensim]
+pip install bertopic[spacy]
+pip install bertopic[use]
+```
+
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
+you can use `sparse_dot_topn` which is installed via:
+
+```bash
+pip install polyfuzz[fast]
+```
+
+<details>
+<summary>Installation Issues</summary>
+
+You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many
+is by installing it via conda first before installing PolyFuzz:
+
+```bash
+conda install -c conda-forge sparse_dot_topn
+```
+
+If that does not work, I would advise you to look through their
+issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`.
+
+</details>
+
+
+<a name="gettingstarted"/></a>
+## Getting Started
+
+For an in-depth overview of the possibilities of `PolyFuzz`
+you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along
+with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).
+
+### Quick Start
+
+The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings.
+We start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create
+n-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity
+between strings by calculating the cosine similarity between vector representations.
+
+We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+```
+
+The resulting matches can be accessed through `model.get_matches()`:
+
+```python
+>>> model.get_matches()
+         From      To  Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.783751
+3       recal    None    0.000000
+4       house   mouse    0.587927
+5  similarity    None    0.000000
+
+```
+
+**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`
+
+**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly
+access Levenshtein and FastText (English) respectively.
+
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.
+
+Let's say that we have a list of words that we know to be correct called `train_words`. We want to any incoming word to mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
+
+```python
+from sklearn.datasets import fetch_20newsgroups
+from sklearn.feature_extraction.text import CountVectorizer
+from polyfuzz import PolyFuzz
+
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we are using `fit` on `train_words` to calculate the TF-IDF representation of those words which are saved to be used again in `transform`.
+This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`.
+
+Then, we apply save and load the model as follows to be used in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
+
+### Group Matches
+We can group the matches `To` as there might be significant overlap in strings in our to_list.
+To do this, we calculate the similarity within strings in to_list and use `single linkage` to then
+group the strings with a high similarity.
+
+When we extract the new matches, we can see an additional column `Group` in which all the `To` matches were grouped to:
+
+```python
+>>> model.group(link_min_similarity=0.75)
+>>> model.get_matches()
+         From      To  Similarity   Group
+0       apple   apple    1.000000  apples
+1      apples  apples    1.000000  apples
+2        appl   apple    0.783751  apples
+3       recal    None    0.000000    None
+4       house   mouse    0.587927   mouse
+5  similarity    None    0.000000    None
+```
+
+As can be seen above, we grouped apple and apples together to `apple` such that when a string is mapped to `apple` it
+will fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster which is `apples`.
+
+### Precision-Recall Curve
+Next, we would like to see how well our model is doing on our data. We express our results as
+**`precision`** and **`recall`** where precision is defined as the minimum similarity score before a match is correct and
+recall the percentage of matches found at a certain minimum similarity score.
+
+Creating the visualizations is as simple as:
+
+```
+model.visualize_precision_recall()
+```
+<img src="images/tfidf.png" width="100%" height="100%"/>
+
+## Models
+Currently, the following models are implemented in PolyFuzz:
+* TF-IDF
+* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
+* FastText and GloVe
+* 🤗 Transformers
+
+With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html).
+We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzzy.
+
+All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:
+
+```python
+from polyfuzz.models import EditDistance, TFIDF, Embeddings
+from flair.embeddings import TransformerWordEmbeddings
+
+embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
+bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
+tfidf = TFIDF(min_similarity=0)
+edit = EditDistance()
+
+string_models = [bert, tfidf, edit]
+model = PolyFuzz(string_models)
+model.match(from_list, to_list)
+```
+
+To access the results, we again can call `get_matches` but since we have multiple models we get back a dictionary
+of dataframes back.
+
+In order to access the results of a specific model, call `get_matches` with the correct id:
+
+```python
+>>> model.get_matches("BERT")
+         From      To  Similarity
+0       apple   apple    1.000000
+1      apples  apples    1.000000
+2        appl   apple    0.928045
+3       recal  apples    0.825268
+4       house   mouse    0.887524
+5  similarity   mouse    0.791548
+```
+
+Finally, visualize the results to compare the models:
+
+```python
+model.visualize_precision_recall(kde=True)
+```
+
+<img src="images/multiple_models.png" width="100%" height="100%"/>
+
+## Custom Grouper
+We can even use one of the `polyfuzz.models` to be used as the grouper in case you would like to use
+something else than the standard TF-IDF model:
+
+```python
+model = PolyFuzz("TF-IDF")
+model.match(from_list, to_list)
+
+edit_grouper = EditDistance(n_jobs=1)
+model.group(edit_grouper)
+```
+
+## Custom Models
+Although the options above are a great solution for comparing different models, what if you have developed your own?
+If you follow the structure of PolyFuzz's `BaseMatcher`
+you can quickly implement any model you would like.
+
+Below, we are implementing the ratio similarity measure from RapidFuzz.
+
+```python
+import numpy as np
+import pandas as pd
+from rapidfuzz import fuzz
+from polyfuzz.models import BaseMatcher
+
+
+class MyModel(BaseMatcher):
+    def match(self, from_list, to_list, **kwargs):
+        # Calculate distances
+        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
+                    for from_string in from_list]
+
+        # Get best matches
+        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
+        scores = np.max(matches, axis=1)
+
+        # Prepare dataframe
+        matches = pd.DataFrame({'From': from_list,'To': mappings, 'Similarity': scores})
+        return matches
+```
+Then, we can simply create an instance of MyModel and pass it through PolyFuzz:
+```python
+custom_model = MyModel()
+model = PolyFuzz(custom_model)
+```
+
+## Citation
+To cite PolyFuzz in your work, please use the following bibtex reference:
+
+```bibtex
+@misc{grootendorst2020polyfuzz,
+  author = {Maarten Grootendorst},
+  title = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
+  year = 2020,
+  publisher = {Zenodo},
+  version = {v0.2.2},
+  doi = {10.5281/zenodo.4461050},
+  url = {https://doi.org/10.5281/zenodo.4461050}
+}
+```
+
+## References
+Below, you can find several resources that were used for or inspired by when developing PolyFuzz:
+
+**Edit distance algorithms**:
+These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:
+
+* https://github.com/jamesturk/jellyfish
+* https://github.com/ztane/python-Levenshtein
+* https://github.com/seatgeek/fuzzywuzzy
+* https://github.com/maxbachmann/rapidfuzz
+* https://github.com/roy-ht/editdistance
+
+**Other interesting repos**:
+
+* https://github.com/ing-bank/sparse_dot_topn
+    * Used in PolyFuzz for fast cosine similarity between sparse matrices
+
+
+
+
+%prep
+%autosetup -n polyfuzz-0.4.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-polyfuzz -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Apr 11 2023 Python_Bot <Python_Bot@openeuler.org> - 0.4.0-1
+- Package Spec generated
@@ -0,0 +1 @@
+a603e85e2c4135f8bea3ca0b737c948b polyfuzz-0.4.0.tar.gz
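
The final hunk pairs a 32-character hex digest with the tarball name. A minimal sketch of checking a downloaded tarball against such a line follows. It assumes the digest is MD5 (the length is consistent with that, but the diff does not say so) and that the tarball sits in a local directory; the helper names `parse_sources_line`, `file_md5`, and `matches_sources` are hypothetical and not part of any packaging tool.

```python
import hashlib
from pathlib import Path


def parse_sources_line(line: str) -> tuple[str, str]:
    """Split a '<hex digest> <filename>' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def matches_sources(line: str, directory: Path = Path(".")) -> bool:
    """Check the file named in the line against the digest in the line.

    Assumption: the digest is MD5; this is inferred from its length only.
    """
    digest, filename = parse_sources_line(line)
    return file_md5(directory / filename) == digest
```

With the tarball downloaded next to the script, `matches_sources("a603e85e2c4135f8bea3ca0b737c948b polyfuzz-0.4.0.tar.gz")` would report whether the local file agrees with the recorded digest.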