author     CoprDistGit <infra@openeuler.org>  2023-05-31 06:22:55 +0000
committer  CoprDistGit <infra@openeuler.org>  2023-05-31 06:22:55 +0000
commit     eb4188075c5899b26c7f95b5d58287755a1e5a57 (patch)
tree       982bfd473853ca8f1dae6b8584f325339b1f7462
parent     c3890fd5727cf7e2713ff9d66a37a81c78a65b55 (diff)
automatic import of python-lazyreader
-rw-r--r--  .gitignore                1
-rw-r--r--  python-lazyreader.spec  348
-rw-r--r--  sources                   1
3 files changed, 350 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..921f45a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/lazyreader-1.0.1.tar.gz
diff --git a/python-lazyreader.spec b/python-lazyreader.spec
new file mode 100644
index 0000000..1fc92e4
--- /dev/null
+++ b/python-lazyreader.spec
@@ -0,0 +1,348 @@
+%global _empty_manifest_terminate_build 0
+Name: python-lazyreader
+Version: 1.0.1
+Release: 1
+Summary: Lazy reading of file objects for efficient batch processing
+License: MIT
+URL: https://github.com/alexwlchan/lazyreader
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d6/4b/6b310bced42a4777819d94d1734a6da9978fcde2c297fccb07a3bb5eb3f2/lazyreader-1.0.1.tar.gz
+BuildArch: noarch
+
+
+%description
+lazyreader is a Python module for doing lazy reading of file objects.
+The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
+For example:
+ with open('large_file.txt') as f:
+ for line in f:
+ print(line)
+lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
+For example:
+ from lazyreader import lazyread
+ with open('large_file.txt') as f:
+ for doc in lazyread(f, delimiter=';'):
+ print(doc)
+This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
+We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
+Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
+Installation
+************
+lazyreader is available from PyPI:
+ $ pip install lazyreader
+Examples
+********
+If we have a file stored locally, we can open it and split based on any choice of delimiter.
+For example, if we had a text file in which records were separated by commas:
+ with open('lots_of_records.txt') as f:
+ for doc in lazyread(f, delimiter=','):
+ print(doc)
+Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
+The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
+ import boto3
+ client = boto3.client('s3')
+ s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
+ body = s3_object['Body']
+ for doc in lazyread(body, delimiter=b'\n'):
+ print(doc)
+(This is the use case for which this code was originally written.)
+One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
+Like so:
+ import urllib.request
+ with urllib.request.urlopen('https://example.org/') as f:
+ for doc in lazyread(f, delimiter=b'<br>'):
+ print(doc)
+Advanced usage
+**************
+``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
+First example: we have a file which contains a list of JSON objects, one per line.
+(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
+What the caller really needs is Python dictionaries, not JSON strings.
+We can wrap ``lazyread()`` like so:
+ import json
+ def lazyjson(f, delimiter=b'\n'):
+ for doc in lazyread(f, delimiter=delimiter):
+ # Ignore empty lines, e.g. the last line in a file
+ if not doc.strip():
+ continue
+ yield json.loads(doc)
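+For instance, assuming a line-delimited JSON file named ``records.json`` (the filename is made up for illustration), the wrapper might be used like this:
+    with open('records.json', 'rb') as f:
+        for record in lazyjson(f):
+            print(record)
+Opening the file in binary mode matters here, because the default delimiter above is the bytestring ``b'\n'``.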
+Another example: we want to parse a large XML file, but not load it all into memory at once.
+We can write the following wrapper:
+ from lxml import etree
+ def lazyxmlstrings(f, opening_tag, closing_tag):
+ for doc in lazyread(f, delimiter=closing_tag):
+ if opening_tag not in doc:
+ continue
+ # We want complete XML blocks, so look for the opening tag and
+ # just return its contents
+ block = doc.split(opening_tag)[-1]
+ yield opening_tag + block
+ def lazyxml(f, opening_tag, closing_tag):
+ for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
+ yield etree.fromstring(xml_string)
+We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
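+As a rough sketch of how those pieces can fit together (the bucket, key, and tag names below are made up for illustration), one could stream XML records straight from S3:
+    import boto3
+    client = boto3.client('s3')
+    body = client.get_object(Bucket='example-bucket', Key='records.xml')['Body']
+    for element in lazyxml(body, opening_tag=b'<record>', closing_tag=b'</record>'):
+        print(element.tag)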
+Isn't this a bit simple to be a module?
+***************************************
+Maybe.
+There are recipes on Stack Overflow that do something very similar, but I find it useful to have it in a standalone module.
+And it's not completely trivial -- at least, not for me.
+I made two mistakes when I first wrote this:
+* I was hard-coding the initial running string as
+ running = b''
+ That only works if your file object is returning bytestrings.
+ If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
+ String types are important!
+* After I'd read another 1024 characters from the file, I checked for the delimiter like so:
+ running += new_data
+ if delimiter in running:
+ curr, running = running.split(delimiter)
+ yield curr + delimiter
+ For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
+ But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
+ So now the code correctly checks and handles the case where a single read includes more than one delimiter.
+Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
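+For illustration only -- this is a rough sketch, not the exact lazyreader implementation -- a reader that avoids both of those mistakes might look like this:
+    def lazyread_sketch(f, delimiter):
+        # Seed the buffer from the file object itself, so it has the same
+        # type (bytes or str) as whatever f.read() returns.
+        running = f.read(0)
+        while True:
+            new_data = f.read(1024)
+            if not new_data:
+                # End of file: yield whatever trails the last delimiter.
+                if running:
+                    yield running
+                return
+            running += new_data
+            # A single read may contain several delimiters, so peel off
+            # every complete document before reading more data.
+            while delimiter in running:
+                curr, running = running.split(delimiter, 1)
+                yield curr + delimiter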
+License
+*******
+MIT.
+
+%package -n python3-lazyreader
+Summary: Lazy reading of file objects for efficient batch processing
+Provides: python-lazyreader
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-lazyreader
+lazyreader is a Python module for doing lazy reading of file objects.
+The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
+For example:
+ with open('large_file.txt') as f:
+ for line in f:
+ print(line)
+lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
+For example:
+ from lazyreader import lazyread
+ with open('large_file.txt') as f:
+ for doc in lazyread(f, delimiter=';'):
+ print(doc)
+This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
+We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
+Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
+Installation
+************
+lazyreader is available from PyPI:
+ $ pip install lazyreader
+Examples
+********
+If we have a file stored locally, we can open it and split based on any choice of delimiter.
+For example, if we had a text file in which records were separated by commas:
+ with open('lots_of_records.txt') as f:
+ for doc in lazyread(f, delimiter=','):
+ print(doc)
+Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
+The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
+ import boto3
+ client = boto3.client('s3')
+ s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
+ body = s3_object['Body']
+ for doc in lazyread(body, delimiter=b'\n'):
+ print(doc)
+(This is the use case for which this code was originally written.)
+One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
+Like so:
+ import urllib.request
+ with urllib.request.urlopen('https://example.org/') as f:
+ for doc in lazyread(f, delimiter=b'<br>'):
+ print(doc)
+Advanced usage
+**************
+``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
+First example: we have a file which contains a list of JSON objects, one per line.
+(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
+What the caller really needs is Python dictionaries, not JSON strings.
+We can wrap ``lazyread()`` like so:
+ import json
+ def lazyjson(f, delimiter=b'\n'):
+ for doc in lazyread(f, delimiter=delimiter):
+ # Ignore empty lines, e.g. the last line in a file
+ if not doc.strip():
+ continue
+ yield json.loads(doc)
+Another example: we want to parse a large XML file, but not load it all into memory at once.
+We can write the following wrapper:
+ from lxml import etree
+ def lazyxmlstrings(f, opening_tag, closing_tag):
+ for doc in lazyread(f, delimiter=closing_tag):
+ if opening_tag not in doc:
+ continue
+ # We want complete XML blocks, so look for the opening tag and
+ # just return its contents
+ block = doc.split(opening_tag)[-1]
+ yield opening_tag + block
+ def lazyxml(f, opening_tag, closing_tag):
+ for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
+ yield etree.fromstring(xml_string)
+We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
+Isn't this a bit simple to be a module?
+***************************************
+Maybe.
+There are recipes on Stack Overflow that do something very similar, but I find it useful to have it in a standalone module.
+And it's not completely trivial -- at least, not for me.
+I made two mistakes when I first wrote this:
+* I was hard-coding the initial running string as
+ running = b''
+ That only works if your file object is returning bytestrings.
+ If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
+ String types are important!
+* After I'd read another 1024 characters from the file, I checked for the delimiter like so:
+ running += new_data
+ if delimiter in running:
+ curr, running = running.split(delimiter)
+ yield curr + delimiter
+ For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
+ But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
+ So now the code correctly checks and handles the case where a single read includes more than one delimiter.
+Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
+License
+*******
+MIT.
+
+%package help
+Summary: Development documents and examples for lazyreader
+Provides: python3-lazyreader-doc
+%description help
+lazyreader is a Python module for doing lazy reading of file objects.
+The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
+For example:
+ with open('large_file.txt') as f:
+ for line in f:
+ print(line)
+lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
+For example:
+ from lazyreader import lazyread
+ with open('large_file.txt') as f:
+ for doc in lazyread(f, delimiter=';'):
+ print(doc)
+This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
+We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
+Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
+Installation
+************
+lazyreader is available from PyPI:
+ $ pip install lazyreader
+Examples
+********
+If we have a file stored locally, we can open it and split based on any choice of delimiter.
+For example, if we had a text file in which records were separated by commas:
+ with open('lots_of_records.txt') as f:
+ for doc in lazyread(f, delimiter=','):
+ print(doc)
+Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
+The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
+ import boto3
+ client = boto3.client('s3')
+ s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
+ body = s3_object['Body']
+ for doc in lazyread(body, delimiter=b'\n'):
+ print(doc)
+(This is the use case for which this code was originally written.)
+One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
+Like so:
+ import urllib.request
+ with urllib.request.urlopen('https://example.org/') as f:
+ for doc in lazyread(f, delimiter=b'<br>'):
+ print(doc)
+Advanced usage
+**************
+``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
+First example: we have a file which contains a list of JSON objects, one per line.
+(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
+What the caller really needs is Python dictionaries, not JSON strings.
+We can wrap ``lazyread()`` like so:
+ import json
+ def lazyjson(f, delimiter=b'\n'):
+ for doc in lazyread(f, delimiter=delimiter):
+ # Ignore empty lines, e.g. the last line in a file
+ if not doc.strip():
+ continue
+ yield json.loads(doc)
+Another example: we want to parse a large XML file, but not load it all into memory at once.
+We can write the following wrapper:
+ from lxml import etree
+ def lazyxmlstrings(f, opening_tag, closing_tag):
+ for doc in lazyread(f, delimiter=closing_tag):
+ if opening_tag not in doc:
+ continue
+ # We want complete XML blocks, so look for the opening tag and
+ # just return its contents
+ block = doc.split(opening_tag)[-1]
+ yield opening_tag + block
+ def lazyxml(f, opening_tag, closing_tag):
+ for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
+ yield etree.fromstring(xml_string)
+We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
+Isn't this a bit simple to be a module?
+***************************************
+Maybe.
+There are recipes on Stack Overflow that do something very similar, but I find it useful to have it in a standalone module.
+And it's not completely trivial -- at least, not for me.
+I made two mistakes when I first wrote this:
+* I was hard-coding the initial running string as
+ running = b''
+ That only works if your file object is returning bytestrings.
+ If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
+ String types are important!
+* After I'd read another 1024 characters from the file, I checked for the delimiter like so:
+ running += new_data
+ if delimiter in running:
+ curr, running = running.split(delimiter)
+ yield curr + delimiter
+ For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
+ But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
+ So now the code correctly checks and handles the case where a single read includes more than one delimiter.
+Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
+License
+*******
+MIT.
+
+%prep
+%autosetup -n lazyreader-1.0.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-lazyreader -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.1-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..476868d
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+184799baf8d848a3dfbeb10a93c5cd9f lazyreader-1.0.1.tar.gz