author     CoprDistGit <infra@openeuler.org>  2023-05-31 06:22:55 +0000
committer  CoprDistGit <infra@openeuler.org>  2023-05-31 06:22:55 +0000
commit     eb4188075c5899b26c7f95b5d58287755a1e5a57 (patch)
tree       982bfd473853ca8f1dae6b8584f325339b1f7462
parent     c3890fd5727cf7e2713ff9d66a37a81c78a65b55 (diff)
automatic import of python-lazyreader
-rw-r--r--  .gitignore                1
-rw-r--r--  python-lazyreader.spec  348
-rw-r--r--  sources                   1
3 files changed, 350 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..921f45a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/lazyreader-1.0.1.tar.gz
diff --git a/python-lazyreader.spec b/python-lazyreader.spec
new file mode 100644
index 0000000..1fc92e4
--- /dev/null
+++ b/python-lazyreader.spec
@@ -0,0 +1,348 @@
+%global _empty_manifest_terminate_build 0
+Name: python-lazyreader
+Version: 1.0.1
+Release: 1
+Summary: Lazy reading of file objects for efficient batch processing
+License: MIT
+URL: https://github.com/alexwlchan/lazyreader
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d6/4b/6b310bced42a4777819d94d1734a6da9978fcde2c297fccb07a3bb5eb3f2/lazyreader-1.0.1.tar.gz
+BuildArch: noarch
+
+
+%description
+lazyreader is a Python module for doing lazy reading of file objects.
+The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
+For example:
+ with open('large_file.txt') as f:
+ for line in f:
+ print(line)
+lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
+For example:
+ from lazyreader import lazyread
+ with open('large_file.txt') as f:
+ for doc in lazyread(f, delimiter=';'):
+ print(doc)
+This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
+We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
+Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
+Installation
+************
+lazyreader is available from PyPI:
+ $ pip install lazyreader
+Examples
+********
+If we have a file stored locally, we can open it and split based on any choice of delimiter.
+For example, if we had a text file in which records were separated by commas:
+ with open('lots_of_records.txt') as f:
+ for doc in lazyread(f, delimiter=','):
+ print(doc)
+Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
+The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
+ import boto3
+ client = boto3.client('s3')
+ s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
+ body = s3_object['Body']
+ for doc in lazyread(body, delimiter=b'\n'):
+ print(doc)
+(This is the use case for which this code was originally written.)
+One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
+Like so:
+ import urllib.request
+ with urllib.request.urlopen('https://example.org/') as f:
+ for doc in lazyread(f, delimiter=b'<br>'):
+ print(doc)
+Advanced usage
+**************
+``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
+First example: we have a file which contains a list of JSON objects, one per line.
+(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
+What the caller really needs is Python dictionaries, not JSON strings.
+We can wrap ``lazyread()`` like so:
+ import json
+ def lazyjson(f, delimiter=b'\n'):
+ for doc in lazyread(f, delimiter=delimiter):
+ # Ignore empty lines, e.g. the last line in a file
+ if not doc.strip():
+ continue
+ yield json.loads(doc)
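+For instance, assuming a line-delimited JSON file named ``records.json`` (the filename is made up for illustration), the wrapper might be used like this:
+    with open('records.json', 'rb') as f:
+        for record in lazyjson(f):
+            print(record)
+Opening the file in binary mode matters here, because the default delimiter above is the bytestring ``b'\n'``.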
+Another example: we want to parse a large XML file, but not load it all into memory at once.
+We can write the following wrapper:
+ from lxml import etree
+ def lazyxmlstrings(f, opening_tag, closing_tag):
+ for doc in lazyread(f, delimiter=closing_tag):
+ if opening_tag not in doc:
+ continue
+ # We want complete XML blocks, so look for the opening tag and
+ # just return its contents
+ block = doc.split(opening_tag)[-1]
+ yield opening_tag + block
+ def lazyxml(f, opening_tag, closing_tag):
+ for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
+ yield etree.fromstring(xml_string)
+We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
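+As a rough sketch of how those pieces can fit together (the bucket, key, and tag names below are made up for illustration), one could stream XML records straight from S3:
+    import boto3
+    client = boto3.client('s3')
+    body = client.get_object(Bucket='example-bucket', Key='records.xml')['Body']
+    for element in lazyxml(body, opening_tag=b'<record>', closing_tag=b'</record>'):
+        print(element.tag)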
+Isn't this a bit simple to be a module?
+***************************************
+Maybe.
+There are recipes on Stack Overflow that do something very similar, but I find it useful to have it in a standalone module.
+And it's not completely trivial -- at least, not for me.
+I made two mistakes when I first wrote this:
+* I was hard-coding the initial running string as
+ running = b''
+ That only works if your file object is returning bytestrings.
+ If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
+ String types are important!
+* After I'd read another 1024 characters from the file, I checked for the delimiter like so:
+ running += new_data
+ if delimiter in running:
+ curr, running = running.split(delimiter)
+ yield curr + delimiter
+ For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
+ But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
+ So now the code correctly checks and handles the case where a single read includes more than one delimiter.
+Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
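+For illustration only -- this is a rough sketch, not the exact lazyreader implementation -- a reader that avoids both of those mistakes might look like this:
+    def lazyread_sketch(f, delimiter):
+        # Seed the buffer from the file object itself, so it has the same
+        # type (bytes or str) as whatever f.read() returns.
+        running = f.read(0)
+        while True:
+            new_data = f.read(1024)
+            if not new_data:
+                # End of file: yield whatever trails the last delimiter.
+                if running:
+                    yield running
+                return
+            running += new_data
+            # A single read may contain several delimiters, so peel off
+            # every complete document before reading more data.
+            while delimiter in running:
+                curr, running = running.split(delimiter, 1)
+                yield curr + delimiter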
+License
+*******
+MIT.
+
+%package -n python3-lazyreader
+Summary: Lazy reading of file objects for efficient batch processing
+Provides: python-lazyreader
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-lazyreader
+lazyreader is a Python module for doing lazy reading of file objects.
+The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
+For example:
+ with open('large_file.txt') as f:
+ for line in f:
+ print(line)
+lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
+For example:
+ from lazyreader import lazyread
+ with open('large_file.txt') as f:
+ for doc in lazyread(f, delimiter=';'):
+ print(doc)
+This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
+We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
+Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
+Installation
+************
+lazyreader is available from PyPI:
+ $ pip install lazyreader
+Examples
+********
+If we have a file stored locally, we can open it and split based on any choice of delimiter.
+For example, if we had a text file in which records were separated by commas:
+ with open('lots_of_records.txt') as f:
+ for doc in lazyread(f, delimiter=','):
+ print(doc)
+Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
+The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
+ import boto3
+ client = boto3.client('s3')
+ s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
+ body = s3_object['Body']
+ for doc in lazyread(body, delimiter=b'\n'):
+ print(doc)
+(This is the use case for which this code was originally written.)
+One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
+Like so:
+ import urllib.request
+ with urllib.request.urlopen('https://example.org/') as f:
+ for doc in lazyread(f, delimiter=b'<br>'):
+ print(doc)
+Advanced usage
+**************
+``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
+First example: we have a file which contains a list of JSON objects, one per line.
+(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
+What the caller really needs is Python dictionaries, not JSON strings.
+We can wrap ``lazyread()`` like so:
+ import json
+ def lazyjson(f, delimiter=b'\n'):
+ for doc in lazyread(f, delimiter=delimiter):
+ # Ignore empty lines, e.g. the last line in a file
+ if not doc.strip():
+ continue
+ yield json.loads(doc)
+Another example: we want to parse a large XML file, but not load it all into memory at once.
+We can write the following wrapper:
+ from lxml import etree
+ def lazyxmlstrings(f, opening_tag, closing_tag):
+ for doc in lazyread(f, delimiter=closing_tag):
+ if opening_tag not in doc:
+ continue
+ # We want complete XML blocks, so look for the opening tag and
+ # just return its contents
+ block = doc.split(opening_tag)[-1]
+ yield opening_tag + block
+ def lazyxml(f, opening_tag, closing_tag):
+ for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
+ yield etree.fromstring(xml_string)
+We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
+Isn't this a bit simple to be a module?
+***************************************
+Maybe.
+There are recipes on Stack Overflow that do something very similar, but I find it useful to have it in a standalone module.
+And it's not completely trivial -- at least, not for me.
+I made two mistakes when I first wrote this:
+* I was hard-coding the initial running string as
+ running = b''
+ That only works if your file object is returning bytestrings.
+ If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
+ String types are important!
+* After I'd read another 1024 characters from the file, I checked for the delimiter like so:
+ running += new_data
+ if delimiter in running:
+ curr, running = running.split(delimiter)
+ yield curr + delimiter
+ For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
+ But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
+ So now the code correctly checks and handles the case where a single read includes more than one delimiter.
+Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
+License
+*******
+MIT.
+
+%package help
+Summary: Development documents and examples for lazyreader
+Provides: python3-lazyreader-doc
+%description help
+lazyreader is a Python module for doing lazy reading of file objects.
+The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
+For example:
+ with open('large_file.txt') as f:
+ for line in f:
+ print(line)
+lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
+For example:
+ from lazyreader import lazyread
+ with open('large_file.txt') as f:
+ for doc in lazyread(f, delimiter=';'):
+ print(doc)
+This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
+We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
+Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
+Installation
+************
+lazyreader is available from PyPI:
+ $ pip install lazyreader
+Examples
+********
+If we have a file stored locally, we can open it and split based on any choice of delimiter.
+For example, if we had a text file in which records were separated by commas:
+ with open('lots_of_records.txt') as f:
+ for doc in lazyread(f, delimiter=','):
+ print(doc)
+Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
+The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
+ import boto3
+ client = boto3.client('s3')
+ s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
+ body = s3_object['Body']
+ for doc in lazyread(body, delimiter=b'\n'):
+ print(doc)
+(This is the use case for which this code was originally written.)
+One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
+Like so:
+ import urllib.request
+ with urllib.request.urlopen('https://example.org/') as f:
+ for doc in lazyread(f, delimiter=b'<br>'):
+ print(doc)
+Advanced usage
+**************
+``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
+First example: we have a file which contains a list of JSON objects, one per line.
+(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
+What the caller really needs is Python dictionaries, not JSON strings.
+We can wrap ``lazyread()`` like so:
+ import json
+ def lazyjson(f, delimiter=b'\n'):
+ for doc in lazyread(f, delimiter=delimiter):
+ # Ignore empty lines, e.g. the last line in a file
+ if not doc.strip():
+ continue
+ yield json.loads(doc)
+Another example: we want to parse a large XML file, but not load it all into memory at once.
+We can write the following wrapper:
+ from lxml import etree
+ def lazyxmlstrings(f, opening_tag, closing_tag):
+ for doc in lazyread(f, delimiter=closing_tag):
+ if opening_tag not in doc:
+ continue
+ # We want complete XML blocks, so look for the opening tag and
+ # just return its contents
+ block = doc.split(opening_tag)[-1]
+ yield opening_tag + block
+ def lazyxml(f, opening_tag, closing_tag):
+ for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
+ yield etree.fromstring(xml_string)
+We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
+Isn't this a bit simple to be a module?
+***************************************
+Maybe.
+There are recipes on Stack Overflow that do something very similar, but I find it useful to have it in a standalone module.
+And it's not completely trivial -- at least, not for me.
+I made two mistakes when I first wrote this:
+* I was hard-coding the initial running string as
+ running = b''
+ That only works if your file object is returning bytestrings.
+ If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
+ String types are important!
+* After I'd read another 1024 characters from the file, I checked for the delimiter like so:
+ running += new_data
+ if delimiter in running:
+ curr, running = running.split(delimiter)
+ yield curr + delimiter
+ For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
+ But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
+ So now the code correctly checks and handles the case where a single read includes more than one delimiter.
+Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
+License
+*******
+MIT.
+
+%prep
+%autosetup -n lazyreader-1.0.1
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-lazyreader -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.1-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..476868d
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+184799baf8d848a3dfbeb10a93c5cd9f lazyreader-1.0.1.tar.gz