%global _empty_manifest_terminate_build 0
Name: python-TextSpitter
Version: 0.3.6
Release: 1
Summary: Python package that spits out text from your document files!
License: MIT License
URL: https://github.com/fsecada01/TextSpitter
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/9a/85/a8f8829ed4b17609fddef9ce240c63b601600c54a81464a488cfc85ec0b8/TextSpitter-0.3.6.tar.gz
BuildArch: noarch
Requires: python3-docx
Requires: python3-PyPDF2
%description
# THANK YOU FOR USING TEXTSPITTER!! #
I created this little app to help me process documents from folder sets and batches. Instead of trying to determine each file type and process it accordingly, I thought it would be more prudent to read file names and route each file to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work **because of damn Poppler**. So after 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.
This is my first Python module, so I hope I did this well!
## Installation ##
* Type `pip install TextSpitter`
* **OPTIONAL** type `pip install PyMuPDF` to install the Python-MuPDF engine for better text extraction fidelity (e.g., preserving correct whitespace)
* You will need to follow the instructions to ensure that PyMuPDF's dependencies install on your system. There are wheels and binaries available for Windows, Linux, and macOS, though if you're on something less common like NetBSD, FreeBSD, or a specialty Linux distro, you may be SOL. Fortunately, package managers like Yum, Pkgin, and Apt-Get will have packages available straight from the terminal.
* For detailed instructions, please visit https://github.com/rk700/PyMuPDF and maybe give those guys some kudos, because they worked their tails off.
## Directions ##
This module is designed to run as simply as possible. Just pass the file location as a string argument, and get your text returned to you.
```
from TextSpitter import TextSpitter as TS
# Build the paths of the documents to read
folder_loc = 'foo/bar/'
docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'
doc_tup = (docx_file, pdf_file, text_file)
# Extract the text of each file, then join the results into one string
raw_text_payload = [TS(filename=ele) for ele in doc_tup]
text = '\n'.join(raw_text_payload)
print(text)
```
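The routing idea described above (read the file name, then hand the file to the matching extraction function) can be sketched roughly as follows. This is only an illustration, not TextSpitter's actual implementation: `read_txt`, `EXTRACTORS`, and `extract_text` are hypothetical names, and only the plain-text route is filled in so the sketch runs with the standard library alone. Routes for `.docx` and `.pdf` would presumably be registered the same way, backed by the python-docx and PyPDF2 dependencies.

```python
from pathlib import Path


def read_txt(path):
    """Return the contents of a plain-text file (UTF-8 is assumed here)."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical routing table: lower-case file extension -> extraction function.
# Further extractors (for example for .docx or .pdf) would be added here.
EXTRACTORS = {
    ".txt": read_txt,
}


def extract_text(path):
    """Pick an extractor based on the file extension and run it."""
    suffix = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError("unsupported file type: " + repr(suffix))
    return extractor(path)
```

Keeping the mapping in a dictionary means that supporting another file type is a one-line change, which lines up with the "expand functionality to other file types" item in the TO DOs.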
## TO DOs ##
* [x] Spruce up documentation
* [x] Add stream functionality for S3-based file reading
* [ ] Expand functionality to other file types
* [ ] TBD
## WANT TO CONTRIBUTE!? ##
_*OH MY GOD, PLEASE DO.*_
Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.
Thanks, everyone!
%package -n python3-TextSpitter
Summary: Python package that spits out text from your document files!
Provides: python-TextSpitter
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-TextSpitter
# THANK YOU FOR USING TEXTSPITTER!! #
I created this little app to help me process documents from folder sets and batches. Instead of trying to determine each file type and process it accordingly, I thought it would be more prudent to read file names and route each file to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work **because of damn Poppler**. So after 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.
This is my first Python module, so I hope I did this well!
## Installation ##
* Type `pip install TextSpitter`
* **OPTIONAL** type `pip install PyMuPDF` to install the Python-MuPDF engine for better text extraction fidelity (e.g., preserving correct whitespace)
* You will need to follow the instructions to ensure that PyMuPDF's dependencies install on your system. There are wheels and binaries available for Windows, Linux, and macOS, though if you're on something less common like NetBSD, FreeBSD, or a specialty Linux distro, you may be SOL. Fortunately, package managers like Yum, Pkgin, and Apt-Get will have packages available straight from the terminal.
* For detailed instructions, please visit https://github.com/rk700/PyMuPDF and maybe give those guys some kudos, because they worked their tails off.
## Directions ##
This module is designed to run as simply as possible. Just pass the file location as a string argument, and get your text returned to you.
```
from TextSpitter import TextSpitter as TS
# Build the paths of the documents to read
folder_loc = 'foo/bar/'
docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'
doc_tup = (docx_file, pdf_file, text_file)
# Extract the text of each file, then join the results into one string
raw_text_payload = [TS(filename=ele) for ele in doc_tup]
text = '\n'.join(raw_text_payload)
print(text)
```
## TO DOs ##
* [x] Spruce up documentation
* [x] Add stream functionality for S3-based file reading
* [ ] Expand functionality to other file types
* [ ] TBD
## WANT TO CONTRIBUTE!? ##
_*OH MY GOD, PLEASE DO.*_
Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.
Thanks, everyone!
%package help
Summary: Development documents and examples for TextSpitter
Provides: python3-TextSpitter-doc
%description help
# THANK YOU FOR USING TEXTSPITTER!! #
I created this little app to help me process documents from folder sets and batches. Instead of trying to determine each file type and process it accordingly, I thought it would be more prudent to read file names and route each file to the matching text extraction function. Also, I was having a really difficult time getting textract/pdftotext to work **because of damn Poppler**. So after 6+ hours of troubleshooting that whole process, I figured writing this was more time-efficient.
This is my first Python module, so I hope I did this well!
## Installation ##
* Type `pip install TextSpitter`
* **OPTIONAL** type `pip install PyMuPDF` to install the Python-MuPDF engine for better text extraction fidelity (e.g., preserving correct whitespace)
* You will need to follow the instructions to ensure that PyMuPDF's dependencies install on your system. There are wheels and binaries available for Windows, Linux, and macOS, though if you're on something less common like NetBSD, FreeBSD, or a specialty Linux distro, you may be SOL. Fortunately, package managers like Yum, Pkgin, and Apt-Get will have packages available straight from the terminal.
* For detailed instructions, please visit https://github.com/rk700/PyMuPDF and maybe give those guys some kudos, because they worked their tails off.
## Directions ##
This module is designed to run as simply as possible. Just pass the file location as a string argument, and get your text returned to you.
```
from TextSpitter import TextSpitter as TS
# Build the paths of the documents to read
folder_loc = 'foo/bar/'
docx_file = folder_loc + 'file_thing.docx'
pdf_file = folder_loc + 'file_thing.pdf'
text_file = folder_loc + 'file_thing.txt'
doc_tup = (docx_file, pdf_file, text_file)
# Extract the text of each file, then join the results into one string
raw_text_payload = [TS(filename=ele) for ele in doc_tup]
text = '\n'.join(raw_text_payload)
print(text)
```
## TO DOs ##
* [x] Spruce up documentation
* [x] Add stream functionality for S3-based file reading
* [ ] Expand functionality to other file types
* [ ] TBD
## WANT TO CONTRIBUTE!? ##
_*OH MY GOD, PLEASE DO.*_
Just make a pull request and add whatever you want (or fix whatever you want). I'll review and approve if everything seems good.
Thanks, everyone!
%prep
%autosetup -n TextSpitter-0.3.6
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-TextSpitter -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Mon May 29 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3.6-1
- Package Spec generated