-rw-r--r--  .gitignore               1
-rw-r--r--  python-rank-bm25.spec  290
-rw-r--r--  sources                  1
3 files changed, 292 insertions, 0 deletions
@@ -0,0 +1 @@
+/rank_bm25-0.2.2.tar.gz
diff --git a/python-rank-bm25.spec b/python-rank-bm25.spec
new file mode 100644
index 0000000..ff6ac5c
--- /dev/null
+++ b/python-rank-bm25.spec
@@ -0,0 +1,290 @@
+%global _empty_manifest_terminate_build 0
+Name: python-rank-bm25
+Version: 0.2.2
+Release: 1
+Summary: Various BM25 algorithms for document ranking
+License: Apache2.0
+URL: https://github.com/dorianbrown/rank_bm25
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/fc/0a/f9579384aa017d8b4c15613f86954b92a95a93d641cc849182467cf0bb3b/rank_bm25-0.2.2.tar.gz
+BuildArch: noarch
+
+Requires: python3-numpy
+Requires: python3-pytest
+
+%description
+
+# Rank-BM25: A two line search engine
+
+
+[](https://badge.fury.io/py/rank-bm25)
+[](https://zenodo.org/badge/latestdoi/166720547)
+
+A collection of algorithms for querying a set of documents and returning the ones most relevant to the query. The most common use case for these algorithms is, as you might have guessed, to create search engines.
+
+So far the algorithms that have been implemented are:
+- [x] Okapi BM25
+- [x] BM25L
+- [x] BM25+
+- [ ] BM25-Adpt
+- [ ] BM25T
+
+These algorithms were taken from [this paper](http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf), which gives a nice overview of each method, and also benchmarks them against each other. A nice inclusion is that they compare different kinds of preprocessing like stemming vs no-stemming, stopword removal or not, etc. Great read if you're new to the topic.
+
+## Installation
+The easiest way to install this package is through `pip`, using
+```bash
+pip install rank_bm25
+```
+If you want to be sure you're getting the newest version, you can install it directly from github with
+```bash
+pip install git+ssh://git@github.com/dorianbrown/rank_bm25.git
+```
+
+## Usage
+For this example we'll be using the `BM25Okapi` algorithm, but the others are used in pretty much the same way.
+
+### Initalizing
+
+First thing to do is create an instance of the BM25 class, which reads in a corpus of text and does some indexing on it:
+```python
+from rank_bm25 import BM25Okapi
+
+corpus = [
+    "Hello there good man!",
+    "It is quite windy in London",
+    "How is the weather today?"
+]
+
+tokenized_corpus = [doc.split(" ") for doc in corpus]
+
+bm25 = BM25Okapi(tokenized_corpus)
+# <rank_bm25.BM25Okapi at 0x1047881d0>
+```
+Note that this package doesn't do any text preprocessing. If you want to do things like lowercasing, stopword removal, stemming, etc, you need to do it yourself.
+
+The only requirements is that the class receives a list of lists of strings, which are the document tokens.
+
+### Ranking of documents
+
+Now that we've created our document indexes, we can give it queries and see which documents are the most relevant:
+```python
+query = "windy London"
+tokenized_query = query.split(" ")
+
+doc_scores = bm25.get_scores(tokenized_query)
+# array([0.        , 0.93729472, 0.        ])
+```
+Good to note that we also need to tokenize our query, and apply the same preprocessing steps we did to the documents in order to have an apples-to-apples comparison
+
+Instead of getting the document scores, you can also just retrieve the best documents with
+```python
+bm25.get_top_n(tokenized_query, corpus, n=1)
+# ['It is quite windy in London']
+```
+And that's pretty much it!
+
+
+
+
+%package -n python3-rank-bm25
+Summary: Various BM25 algorithms for document ranking
+Provides: python-rank-bm25
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-rank-bm25
+
+# Rank-BM25: A two line search engine
+
+
+[](https://badge.fury.io/py/rank-bm25)
+[](https://zenodo.org/badge/latestdoi/166720547)
+
+A collection of algorithms for querying a set of documents and returning the ones most relevant to the query. The most common use case for these algorithms is, as you might have guessed, to create search engines.
+
+So far the algorithms that have been implemented are:
+- [x] Okapi BM25
+- [x] BM25L
+- [x] BM25+
+- [ ] BM25-Adpt
+- [ ] BM25T
+
+These algorithms were taken from [this paper](http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf), which gives a nice overview of each method, and also benchmarks them against each other. A nice inclusion is that they compare different kinds of preprocessing like stemming vs no-stemming, stopword removal or not, etc. Great read if you're new to the topic.
+
+## Installation
+The easiest way to install this package is through `pip`, using
+```bash
+pip install rank_bm25
+```
+If you want to be sure you're getting the newest version, you can install it directly from github with
+```bash
+pip install git+ssh://git@github.com/dorianbrown/rank_bm25.git
+```
+
+## Usage
+For this example we'll be using the `BM25Okapi` algorithm, but the others are used in pretty much the same way.
+
+### Initalizing
+
+First thing to do is create an instance of the BM25 class, which reads in a corpus of text and does some indexing on it:
+```python
+from rank_bm25 import BM25Okapi
+
+corpus = [
+    "Hello there good man!",
+    "It is quite windy in London",
+    "How is the weather today?"
+]
+
+tokenized_corpus = [doc.split(" ") for doc in corpus]
+
+bm25 = BM25Okapi(tokenized_corpus)
+# <rank_bm25.BM25Okapi at 0x1047881d0>
+```
+Note that this package doesn't do any text preprocessing. If you want to do things like lowercasing, stopword removal, stemming, etc, you need to do it yourself.
+
+The only requirements is that the class receives a list of lists of strings, which are the document tokens.
+
+### Ranking of documents
+
+Now that we've created our document indexes, we can give it queries and see which documents are the most relevant:
+```python
+query = "windy London"
+tokenized_query = query.split(" ")
+
+doc_scores = bm25.get_scores(tokenized_query)
+# array([0.        , 0.93729472, 0.        ])
+```
+Good to note that we also need to tokenize our query, and apply the same preprocessing steps we did to the documents in order to have an apples-to-apples comparison
+
+Instead of getting the document scores, you can also just retrieve the best documents with
+```python
+bm25.get_top_n(tokenized_query, corpus, n=1)
+# ['It is quite windy in London']
+```
+And that's pretty much it!
+
+
+
+
+%package help
+Summary: Development documents and examples for rank-bm25
+Provides: python3-rank-bm25-doc
+%description help
+
+# Rank-BM25: A two line search engine
+
+
+[](https://badge.fury.io/py/rank-bm25)
+[](https://zenodo.org/badge/latestdoi/166720547)
+
+A collection of algorithms for querying a set of documents and returning the ones most relevant to the query. The most common use case for these algorithms is, as you might have guessed, to create search engines.
+
+So far the algorithms that have been implemented are:
+- [x] Okapi BM25
+- [x] BM25L
+- [x] BM25+
+- [ ] BM25-Adpt
+- [ ] BM25T
+
+These algorithms were taken from [this paper](http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf), which gives a nice overview of each method, and also benchmarks them against each other. A nice inclusion is that they compare different kinds of preprocessing like stemming vs no-stemming, stopword removal or not, etc. Great read if you're new to the topic.
+
+## Installation
+The easiest way to install this package is through `pip`, using
+```bash
+pip install rank_bm25
+```
+If you want to be sure you're getting the newest version, you can install it directly from github with
+```bash
+pip install git+ssh://git@github.com/dorianbrown/rank_bm25.git
+```
+
+## Usage
+For this example we'll be using the `BM25Okapi` algorithm, but the others are used in pretty much the same way.
+
+### Initalizing
+
+First thing to do is create an instance of the BM25 class, which reads in a corpus of text and does some indexing on it:
+```python
+from rank_bm25 import BM25Okapi
+
+corpus = [
+    "Hello there good man!",
+    "It is quite windy in London",
+    "How is the weather today?"
+]
+
+tokenized_corpus = [doc.split(" ") for doc in corpus]
+
+bm25 = BM25Okapi(tokenized_corpus)
+# <rank_bm25.BM25Okapi at 0x1047881d0>
+```
+Note that this package doesn't do any text preprocessing. If you want to do things like lowercasing, stopword removal, stemming, etc, you need to do it yourself.
+
+The only requirements is that the class receives a list of lists of strings, which are the document tokens.
+
+### Ranking of documents
+
+Now that we've created our document indexes, we can give it queries and see which documents are the most relevant:
+```python
+query = "windy London"
+tokenized_query = query.split(" ")
+
+doc_scores = bm25.get_scores(tokenized_query)
+# array([0.        , 0.93729472, 0.        ])
+```
+Good to note that we also need to tokenize our query, and apply the same preprocessing steps we did to the documents in order to have an apples-to-apples comparison
+
+Instead of getting the document scores, you can also just retrieve the best documents with
+```python
+bm25.get_top_n(tokenized_query, corpus, n=1)
+# ['It is quite windy in London']
+```
+And that's pretty much it!
+
+
+
+
+%prep
+%autosetup -n rank-bm25-0.2.2
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-rank-bm25 -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 0.2.2-1
+- Package Spec generated
@@ -0,0 +1 @@
+48562f27ad8795c3097bff5fec1721eb rank_bm25-0.2.2.tar.gz
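The README embedded in the spec's `%description` shows how to call `BM25Okapi` but not how a score is computed. The following is a minimal, dependency-free sketch of the textbook Okapi BM25 formula, not the `rank_bm25` source. The function name `bm25_okapi_scores` is hypothetical, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration. Under those assumptions it gives roughly 0.9373 for the second document on the README's example query, in line with the value quoted there.

```python
import math
from collections import Counter


def bm25_okapi_scores(tokenized_corpus, tokenized_query, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula.

    Hypothetical helper for illustration; k1 and b defaults are assumptions.
    """
    n_docs = len(tokenized_corpus)
    doc_lengths = [len(doc) for doc in tokenized_corpus]
    avg_len = sum(doc_lengths) / n_docs

    # Document frequency: number of documents in which each token appears.
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc, doc_len in zip(tokenized_corpus, doc_lengths):
        term_freq = Counter(doc)
        score = 0.0
        for token in tokenized_query:
            tf = term_freq.get(token, 0)
            if tf == 0:
                continue  # tokens absent from the document contribute nothing
            df = doc_freq[token]
            # This IDF can go negative for tokens found in more than half of
            # the documents; this sketch does not correct for that.
            idf = math.log(n_docs - df + 0.5) - math.log(df + 0.5)
            # Longer-than-average documents are penalised in proportion to b.
            length_norm = 1 - b + b * doc_len / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
scores = bm25_okapi_scores(tokenized_corpus, "windy London".split(" "))

# Pick the highest-scoring document, analogous to asking for the top 1 result.
best = corpus[max(range(len(corpus)), key=scores.__getitem__)]
print(scores, best)
```

As in the README, the query has to be tokenized exactly like the corpus: a query of `"london"` would score zero everywhere here because no lowercasing is applied on either side.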