%global _empty_manifest_terminate_build 0
Name:           python-act-scio
Version:        0.0.61
Release:        1
Summary:        ACT SCIO
License:        ISC
URL:            https://github.com/mnemonic-no/act-scio2
Source0:        https://mirrors.aliyun.com/pypi/web/packages/d6/5b/11e0c07061a840377bebd11665db21a295428e429a6d2b14da4d3662da1e/act-scio-0.0.61.tar.gz
BuildArch:      noarch

%description
# act-scio2

Scio v2 is a reimplementation of [Scio](https://github.com/mnemonic-no/act-scio) in Python3.

Scio uses [tika](https://tika.apache.org) to extract text from documents (PDF, HTML, DOC, etc.). The result is sent to the Scio Analyzer, which extracts information using a combination of NLP (Natural Language Processing) and pattern matching.

## Changelog

### 0.0.42

SCIO now supports setting TLP on data upload, annotating documents with a `tlp` tag. Documents downloaded by feeds will have a default TLP of white, but this can be changed in the config for feeds.

## Source code

The source code for the workers is available on [github](https://github.com/mnemonic-no/act-scio2).

## Setup

To set up, first install from PyPI:

```bash
sudo pip3 install act-scio
```

You will also need to install [beanstalkd](https://beanstalkd.github.io/). On Debian/Ubuntu you can run:

```bash
sudo apt install beanstalkd
```

Configure beanstalkd to accept larger payloads with the `-z` option. For Red Hat derived setups this can be configured in `/etc/sysconfig/beanstalkd`:

```bash
MAX_JOB_SIZE=-z 524288
```

You then need to install the NLTK data files. A helper utility to do this is included:

```bash
scio-nltk-download
```

You will also need to create a default configuration:

```bash
scio-config user
```

## API

To run the API, execute:

```bash
scio-api
```

This will set up the API on 127.0.0.1:3000. Use `--port` and `--host` to listen on another port and/or another interface. For documentation of the API endpoint see [API.md](API.md).
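As a minimal sketch of building a submit request in Python: the field names `content`, `filename` and `tlp` are assumptions for illustration, not the confirmed schema; consult [API.md](API.md) for the authoritative request format.

```python
import base64


def build_submit_payload(path: str, tlp: str = "white") -> dict:
    """Build a JSON-serializable payload for a SCIO submit request.

    The field names used here are assumptions for illustration;
    see API.md for the real schema.
    """
    with open(path, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")
    return {"content": encoded, "filename": path, "tlp": tlp}


# Example (assuming report.pdf exists), posting to the API
# started by `scio-api` on its default address:
#   payload = build_submit_payload("report.pdf")
#   requests.post("http://127.0.0.1:3000/submit", json=payload)
```

Base64-encoding keeps arbitrary binary documents (PDF, DOC) safe inside a JSON body.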
## Configuration

You can create a default configuration using this command (it should be run as the user running scio):

```bash
scio-config user
```

Common configuration can be found under `~/.config/scio/etc/scio.ini`.

## Running Manually

### Scio Tika Server

The Scio Tika server reads jobs from the beanstalk tube `scio_doc`, and the extracted text is sent to the tube `scio_analyze`. The first time the server runs, it will download tika using maven. It will use a proxy if `$https_proxy` is set.

```bash
scio-tika-server
```

`scio-tika-server` uses [tika-python](https://github.com/chrismattmann/tika-python), which depends on tika-server.jar. If your server has internet access, this will be downloaded automatically. If not, or if you need a proxy to connect to the internet, follow the instructions under "Airgap Environment Setup" here: [https://github.com/chrismattmann/tika-python](https://github.com/chrismattmann/tika-python). Currently only tested with tika-server version 2.7.0.

### Scio Analyze Server

Scio Analyze Server reads jobs (by default) from the beanstalk tube `scio_analyze`.

```bash
scio-analyze
```

You can also read directly from stdin like this:

```bash
echo "The companies in the Bus; Financial, Aviation and Automobile industry are large." | scio-analyze --beanstalk= --elasticsearch=
```

### Scio Submit

Submit a document (from a file or a URI) to `scio_api`. Example:

```bash
scio-submit \
  --uri https://www2.fireeye.com/rs/848-DID-242/images/rpt-apt29-hammertoss.pdf \
  --scio-baseuri http://localhost:3000/submit \
  --tlp white
```

## Running as a service

Systemd-compatible service scripts can be found under examples/systemd.
To install:

```bash
sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze
```

## scio-feed cron job

To continuously fetch new content from feeds, you can add scio-feeds to cron like this (make sure the directory `$HOME/logs` exists):

```
# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1

# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
```

## Local development

Use pip to install in [local development mode](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs). act-scio uses namespacing, so it is not compatible with `setup.py install` or `setup.py develop`.

In the repository, run:

```bash
pip3 install --user -e .
```

%package -n python3-act-scio
Summary:        ACT SCIO
Provides:       python-act-scio
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-act-scio
# act-scio2

Scio v2 is a reimplementation of [Scio](https://github.com/mnemonic-no/act-scio) in Python3.

Scio uses [tika](https://tika.apache.org) to extract text from documents (PDF, HTML, DOC, etc.). The result is sent to the Scio Analyzer, which extracts information using a combination of NLP (Natural Language Processing) and pattern matching.

## Changelog

### 0.0.42

SCIO now supports setting TLP on data upload, annotating documents with a `tlp` tag. Documents downloaded by feeds will have a default TLP of white, but this can be changed in the config for feeds.

## Source code

The source code for the workers is available on [github](https://github.com/mnemonic-no/act-scio2).

## Setup

To set up, first install from PyPI:

```bash
sudo pip3 install act-scio
```

You will also need to install [beanstalkd](https://beanstalkd.github.io/).
On Debian/Ubuntu you can run:

```bash
sudo apt install beanstalkd
```

Configure beanstalkd to accept larger payloads with the `-z` option. For Red Hat derived setups this can be configured in `/etc/sysconfig/beanstalkd`:

```bash
MAX_JOB_SIZE=-z 524288
```

You then need to install the NLTK data files. A helper utility to do this is included:

```bash
scio-nltk-download
```

You will also need to create a default configuration:

```bash
scio-config user
```

## API

To run the API, execute:

```bash
scio-api
```

This will set up the API on 127.0.0.1:3000. Use `--port` and `--host` to listen on another port and/or another interface. For documentation of the API endpoint see [API.md](API.md).

## Configuration

You can create a default configuration using this command (it should be run as the user running scio):

```bash
scio-config user
```

Common configuration can be found under `~/.config/scio/etc/scio.ini`.

## Running Manually

### Scio Tika Server

The Scio Tika server reads jobs from the beanstalk tube `scio_doc`, and the extracted text is sent to the tube `scio_analyze`. The first time the server runs, it will download tika using maven. It will use a proxy if `$https_proxy` is set.

```bash
scio-tika-server
```

`scio-tika-server` uses [tika-python](https://github.com/chrismattmann/tika-python), which depends on tika-server.jar. If your server has internet access, this will be downloaded automatically. If not, or if you need a proxy to connect to the internet, follow the instructions under "Airgap Environment Setup" here: [https://github.com/chrismattmann/tika-python](https://github.com/chrismattmann/tika-python). Currently only tested with tika-server version 2.7.0.

### Scio Analyze Server

Scio Analyze Server reads jobs (by default) from the beanstalk tube `scio_analyze`.

```bash
scio-analyze
```

You can also read directly from stdin like this:

```bash
echo "The companies in the Bus; Financial, Aviation and Automobile industry are large." \
  | scio-analyze --beanstalk= --elasticsearch=
```

### Scio Submit

Submit a document (from a file or a URI) to `scio_api`. Example:

```bash
scio-submit \
  --uri https://www2.fireeye.com/rs/848-DID-242/images/rpt-apt29-hammertoss.pdf \
  --scio-baseuri http://localhost:3000/submit \
  --tlp white
```

## Running as a service

Systemd-compatible service scripts can be found under examples/systemd. To install:

```bash
sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze
```

## scio-feed cron job

To continuously fetch new content from feeds, you can add scio-feeds to cron like this (make sure the directory `$HOME/logs` exists):

```
# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1

# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
```

## Local development

Use pip to install in [local development mode](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs). act-scio uses namespacing, so it is not compatible with `setup.py install` or `setup.py develop`.

In the repository, run:

```bash
pip3 install --user -e .
```

%package help
Summary:        Development documents and examples for act-scio
Provides:       python3-act-scio-doc

%description help
# act-scio2

Scio v2 is a reimplementation of [Scio](https://github.com/mnemonic-no/act-scio) in Python3.

Scio uses [tika](https://tika.apache.org) to extract text from documents (PDF, HTML, DOC, etc.). The result is sent to the Scio Analyzer, which extracts information using a combination of NLP (Natural Language Processing) and pattern matching.

## Changelog

### 0.0.42

SCIO now supports setting TLP on data upload, annotating documents with a `tlp` tag. Documents downloaded by feeds will have a default TLP of white, but this can be changed in the config for feeds.
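As a hedged sketch of what overriding the default TLP for a single feed might look like, the section and key names below are assumptions for illustration only; check the feed configuration installed by `scio-config user` for the actual format.

```ini
# Hypothetical feed entry -- section and key names are assumptions,
# not SCIO's actual schema; consult your installed feed configuration.
[vendor-blog]
url = https://example.com/feed.rss
tlp = amber
```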
## Source code

The source code for the workers is available on [github](https://github.com/mnemonic-no/act-scio2).

## Setup

To set up, first install from PyPI:

```bash
sudo pip3 install act-scio
```

You will also need to install [beanstalkd](https://beanstalkd.github.io/). On Debian/Ubuntu you can run:

```bash
sudo apt install beanstalkd
```

Configure beanstalkd to accept larger payloads with the `-z` option. For Red Hat derived setups this can be configured in `/etc/sysconfig/beanstalkd`:

```bash
MAX_JOB_SIZE=-z 524288
```

You then need to install the NLTK data files. A helper utility to do this is included:

```bash
scio-nltk-download
```

You will also need to create a default configuration:

```bash
scio-config user
```

## API

To run the API, execute:

```bash
scio-api
```

This will set up the API on 127.0.0.1:3000. Use `--port` and `--host` to listen on another port and/or another interface. For documentation of the API endpoint see [API.md](API.md).

## Configuration

You can create a default configuration using this command (it should be run as the user running scio):

```bash
scio-config user
```

Common configuration can be found under `~/.config/scio/etc/scio.ini`.

## Running Manually

### Scio Tika Server

The Scio Tika server reads jobs from the beanstalk tube `scio_doc`, and the extracted text is sent to the tube `scio_analyze`. The first time the server runs, it will download tika using maven. It will use a proxy if `$https_proxy` is set.

```bash
scio-tika-server
```

`scio-tika-server` uses [tika-python](https://github.com/chrismattmann/tika-python), which depends on tika-server.jar. If your server has internet access, this will be downloaded automatically. If not, or if you need a proxy to connect to the internet, follow the instructions under "Airgap Environment Setup" here: [https://github.com/chrismattmann/tika-python](https://github.com/chrismattmann/tika-python). Currently only tested with tika-server version 2.7.0.
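As described in the introduction, the analyzer combines NLP with pattern matching on the extracted text. A toy illustration of the pattern-matching half: the regex below is purely illustrative and is not one of SCIO's actual extraction rules.

```python
import re

# Illustrative only: a pattern for CVE identifiers, the kind of
# indicator a pattern-matching analyzer can pull out of report text.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def extract_cves(text: str) -> list:
    """Return all CVE identifiers found in the given text."""
    return CVE_PATTERN.findall(text)


# extract_cves("Exploits CVE-2021-44228 in the wild") -> ["CVE-2021-44228"]
```

SCIO's real analyzers combine many such rules with NLP to tag documents with extracted entities.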
### Scio Analyze Server

Scio Analyze Server reads jobs (by default) from the beanstalk tube `scio_analyze`.

```bash
scio-analyze
```

You can also read directly from stdin like this:

```bash
echo "The companies in the Bus; Financial, Aviation and Automobile industry are large." | scio-analyze --beanstalk= --elasticsearch=
```

### Scio Submit

Submit a document (from a file or a URI) to `scio_api`. Example:

```bash
scio-submit \
  --uri https://www2.fireeye.com/rs/848-DID-242/images/rpt-apt29-hammertoss.pdf \
  --scio-baseuri http://localhost:3000/submit \
  --tlp white
```

## Running as a service

Systemd-compatible service scripts can be found under examples/systemd. To install:

```bash
sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze
```

## scio-feed cron job

To continuously fetch new content from feeds, you can add scio-feeds to cron like this (make sure the directory `$HOME/logs` exists):

```
# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1

# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
```

## Local development

Use pip to install in [local development mode](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs). act-scio uses namespacing, so it is not compatible with `setup.py install` or `setup.py develop`.

In the repository, run:

```bash
pip3 install --user -e .
```

%prep
%autosetup -n act-scio-0.0.61

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-act-scio -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Thu Jun 08 2023 Python_Bot - 0.0.61-1
- Package Spec generated