-rw-r--r--   .gitignore                1
-rw-r--r--   python-lazyreader.spec    348
-rw-r--r--   sources                   1
3 files changed, 350 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
new file mode 100644
@@ -0,0 +1 @@
/lazyreader-1.0.1.tar.gz
diff --git a/python-lazyreader.spec b/python-lazyreader.spec
new file mode 100644
index 0000000..1fc92e4
--- /dev/null
+++ b/python-lazyreader.spec
@@ -0,0 +1,348 @@
%global _empty_manifest_terminate_build 0
Name:		python-lazyreader
Version:	1.0.1
Release:	1
Summary:	Lazy reading of file objects for efficient batch processing
License:	MIT
URL:		https://github.com/alexwlchan/lazyreader
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/d6/4b/6b310bced42a4777819d94d1734a6da9978fcde2c297fccb07a3bb5eb3f2/lazyreader-1.0.1.tar.gz
BuildArch:	noarch


%description
lazyreader is a Python module for lazy reading of file objects.
The Python standard library lets you read a file one line at a time, saving you from loading the entire file into memory.
For example:
   with open('large_file.txt') as f:
       for line in f:
           print(line)
lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
For example:
   from lazyreader import lazyread
   with open('large_file.txt') as f:
       for doc in lazyread(f, delimiter=';'):
           print(doc)
This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
Installation
************
lazyreader is available from PyPI:
   $ pip install lazyreader
Examples
********
If we have a file stored locally, we can open it and split based on any choice of delimiter.
For example, if we had a text file in which records were separated by commas:
   with open('lots_of_records.txt') as f:
       for doc in lazyread(f, delimiter=','):
           print(doc)
Another example: we have a file stored in Amazon S3, and we'd like to read it line by line.
The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
   import boto3
   client = boto3.client('s3')
   s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
   body = s3_object['Body']
   for doc in lazyread(body, delimiter=b'\n'):
       print(doc)
(This is the use case for which this code was originally written.)
One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
Like so:
   import urllib.request
   with urllib.request.urlopen('https://example.org/') as f:
       for doc in lazyread(f, delimiter=b'<br>'):
           print(doc)
Advanced usage
**************
``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
First example: we have a file which contains a list of JSON objects, one per line.
(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
What the caller really needs is Python dictionaries, not JSON strings.
We can wrap ``lazyread()`` like so:
   import json
   def lazyjson(f, delimiter=b'\n'):
       for doc in lazyread(f, delimiter=delimiter):
           # Ignore empty lines, e.g. the last line in a file
           if not doc.strip():
               continue
           yield json.loads(doc)
Another example: we want to parse a large XML file, but not load it all into memory at once.
We can write the following wrapper:
   from lxml import etree
   def lazyxmlstrings(f, opening_tag, closing_tag):
       for doc in lazyread(f, delimiter=closing_tag):
           if opening_tag not in doc:
               continue
           # We want complete XML blocks, so look for the opening tag and
           # just return its contents
           block = doc.split(opening_tag)[-1]
           yield opening_tag + block
   def lazyxml(f, opening_tag, closing_tag):
       for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
           yield etree.fromstring(xml_string)
We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
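For illustration, the two can be chained together with the boto3 example from earlier; this is only a sketch, the bucket and key are hypothetical, and ``lazyjson()`` is the wrapper defined above:
   import boto3

   client = boto3.client('s3')
   s3_object = client.get_object(Bucket='example-bucket', Key='records.ndjson')
   body = s3_object['Body']

   # body.read() returns bytes, which matches lazyjson's default b'\n' delimiter.
   # Only one parsed document is held in memory at any time.
   for record in lazyjson(body):
       print(record)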
Isn't this a bit simple to be a module?
***************************************
Maybe.
There are recipes on Stack Overflow that do something very similar, but I find it useful to have in a standalone module.
And it's not completely trivial -- at least, not for me.
I made two mistakes when I first wrote this:
*  I was hard-coding the initial running string as
      running = b''
   That only works if your file object is returning bytestrings.
   If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
   String types are important!
*  After I'd read another 1024 characters from the file, I checked for the delimiter like so:
      running += new_data
      if delimiter in running:
          curr, running = running.split(delimiter)
          yield curr + delimiter
   For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
   But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
   So now the code correctly checks and handles the case where a single read includes more than one delimiter.
Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
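To illustrate both fixes, here is a minimal sketch of the idea -- not the module's actual implementation -- which seeds the buffer from the delimiter's own type and keeps splitting until a read contains no more delimiters:
   def lazyread_sketch(f, delimiter):
       # Seed the buffer with an empty string of the same type as the
       # delimiter, so bytes and str file objects both work.
       running = delimiter[:0]
       while True:
           new_data = f.read(1024)
           if not new_data:
               # End of file: yield whatever is left in the buffer.
               if running:
                   yield running
               return
           running += new_data
           # A single read may contain several delimiters, so keep
           # splitting until none remain.
           while delimiter in running:
               curr, running = running.split(delimiter, 1)
               yield curr + delimiter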
License
*******
MIT.

%package -n python3-lazyreader
Summary:	Lazy reading of file objects for efficient batch processing
Provides:	python-lazyreader
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-lazyreader
lazyreader is a Python module for lazy reading of file objects.
The Python standard library lets you read a file one line at a time, saving you from loading the entire file into memory.
For example:
   with open('large_file.txt') as f:
       for line in f:
           print(line)
lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
For example:
   from lazyreader import lazyread
   with open('large_file.txt') as f:
       for doc in lazyread(f, delimiter=';'):
           print(doc)
This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
Installation
************
lazyreader is available from PyPI:
   $ pip install lazyreader
Examples
********
If we have a file stored locally, we can open it and split based on any choice of delimiter.
For example, if we had a text file in which records were separated by commas:
   with open('lots_of_records.txt') as f:
       for doc in lazyread(f, delimiter=','):
           print(doc)
Another example: we have a file stored in Amazon S3, and we'd like to read it line by line.
The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
   import boto3
   client = boto3.client('s3')
   s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
   body = s3_object['Body']
   for doc in lazyread(body, delimiter=b'\n'):
       print(doc)
(This is the use case for which this code was originally written.)
One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
Like so:
   import urllib.request
   with urllib.request.urlopen('https://example.org/') as f:
       for doc in lazyread(f, delimiter=b'<br>'):
           print(doc)
Advanced usage
**************
``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
First example: we have a file which contains a list of JSON objects, one per line.
(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
What the caller really needs is Python dictionaries, not JSON strings.
We can wrap ``lazyread()`` like so:
   import json
   def lazyjson(f, delimiter=b'\n'):
       for doc in lazyread(f, delimiter=delimiter):
           # Ignore empty lines, e.g. the last line in a file
           if not doc.strip():
               continue
           yield json.loads(doc)
Another example: we want to parse a large XML file, but not load it all into memory at once.
We can write the following wrapper:
   from lxml import etree
   def lazyxmlstrings(f, opening_tag, closing_tag):
       for doc in lazyread(f, delimiter=closing_tag):
           if opening_tag not in doc:
               continue
           # We want complete XML blocks, so look for the opening tag and
           # just return its contents
           block = doc.split(opening_tag)[-1]
           yield opening_tag + block
   def lazyxml(f, opening_tag, closing_tag):
       for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
           yield etree.fromstring(xml_string)
We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
Isn't this a bit simple to be a module?
***************************************
Maybe.
There are recipes on Stack Overflow that do something very similar, but I find it useful to have in a standalone module.
And it's not completely trivial -- at least, not for me.
I made two mistakes when I first wrote this:
*  I was hard-coding the initial running string as
      running = b''
   That only works if your file object is returning bytestrings.
   If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
   String types are important!
*  After I'd read another 1024 characters from the file, I checked for the delimiter like so:
      running += new_data
      if delimiter in running:
          curr, running = running.split(delimiter)
          yield curr + delimiter
   For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
   But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
   So now the code correctly checks and handles the case where a single read includes more than one delimiter.
Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
License
*******
MIT.

%package help
Summary:	Development documents and examples for lazyreader
Provides:	python3-lazyreader-doc
%description help
lazyreader is a Python module for lazy reading of file objects.
The Python standard library lets you read a file one line at a time, saving you from loading the entire file into memory.
For example:
   with open('large_file.txt') as f:
       for line in f:
           print(line)
lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
For example:
   from lazyreader import lazyread
   with open('large_file.txt') as f:
       for doc in lazyread(f, delimiter=';'):
           print(doc)
This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
Installation
************
lazyreader is available from PyPI:
   $ pip install lazyreader
Examples
********
If we have a file stored locally, we can open it and split based on any choice of delimiter.
For example, if we had a text file in which records were separated by commas:
   with open('lots_of_records.txt') as f:
       for doc in lazyread(f, delimiter=','):
           print(doc)
Another example: we have a file stored in Amazon S3, and we'd like to read it line by line.
The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
   import boto3
   client = boto3.client('s3')
   s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
   body = s3_object['Body']
   for doc in lazyread(body, delimiter=b'\n'):
       print(doc)
(This is the use case for which this code was originally written.)
One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
Like so:
   import urllib.request
   with urllib.request.urlopen('https://example.org/') as f:
       for doc in lazyread(f, delimiter=b'<br>'):
           print(doc)
Advanced usage
**************
``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
First example: we have a file which contains a list of JSON objects, one per line.
(This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
What the caller really needs is Python dictionaries, not JSON strings.
We can wrap ``lazyread()`` like so:
   import json
   def lazyjson(f, delimiter=b'\n'):
       for doc in lazyread(f, delimiter=delimiter):
           # Ignore empty lines, e.g. the last line in a file
           if not doc.strip():
               continue
           yield json.loads(doc)
Another example: we want to parse a large XML file, but not load it all into memory at once.
We can write the following wrapper:
   from lxml import etree
   def lazyxmlstrings(f, opening_tag, closing_tag):
       for doc in lazyread(f, delimiter=closing_tag):
           if opening_tag not in doc:
               continue
           # We want complete XML blocks, so look for the opening tag and
           # just return its contents
           block = doc.split(opening_tag)[-1]
           yield opening_tag + block
   def lazyxml(f, opening_tag, closing_tag):
       for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
           yield etree.fromstring(xml_string)
We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
Isn't this a bit simple to be a module?
***************************************
Maybe.
There are recipes on Stack Overflow that do something very similar, but I find it useful to have in a standalone module.
And it's not completely trivial -- at least, not for me.
I made two mistakes when I first wrote this:
*  I was hard-coding the initial running string as
      running = b''
   That only works if your file object is returning bytestrings.
   If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
   String types are important!
*  After I'd read another 1024 characters from the file, I checked for the delimiter like so:
      running += new_data
      if delimiter in running:
          curr, running = running.split(delimiter)
          yield curr + delimiter
   For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
   But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
   So now the code correctly checks and handles the case where a single read includes more than one delimiter.
Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
License
*******
MIT.

%prep
%autosetup -n lazyreader-1.0.1

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-lazyreader -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.1-1
- Package Spec generated

diff --git a/sources b/sources
new file mode 100644
@@ -0,0 +1 @@
184799baf8d848a3dfbeb10a93c5cd9f  lazyreader-1.0.1.tar.gz
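For completeness, a quick sketch of checking a downloaded tarball against the md5 recorded in ``sources`` above; the local filename is an assumption:
   import hashlib

   # Assumed local path to the tarball fetched from Source0.
   path = 'lazyreader-1.0.1.tar.gz'

   md5 = hashlib.md5()
   with open(path, 'rb') as f:
       for chunk in iter(lambda: f.read(8192), b''):
           md5.update(chunk)

   # Expect 184799baf8d848a3dfbeb10a93c5cd9f, the value recorded in sources.
   print(md5.hexdigest())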
