%global _empty_manifest_terminate_build 0
Name: python-archivebox
Version: 0.6.2
Release: 1
Summary: The self-hosted internet archive.
License: MIT
URL: https://github.com/ArchiveBox/ArchiveBox
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/b2/01/37fdcb4bd60ec7187aa8196393d667478b0d1c97ba1b6b78cfd7e4501d69/archivebox-0.6.2.tar.gz
BuildArch: noarch
Requires: python3-requests
Requires: python3-mypy-extensions
Requires: python3-django
Requires: python3-django-extensions
Requires: python3-dateparser
Requires: python3-ipython
Requires: python3-youtube-dl
Requires: python3-crontab
Requires: python3-croniter
Requires: python3-w3lib
Requires: python3-setuptools
Requires: python3-twine
Requires: python3-wheel
Requires: python3-flake8
Requires: python3-ipdb
Requires: python3-mypy
Requires: python3-django-stubs
Requires: python3-sphinx
Requires: python3-sphinx-rtd-theme
Requires: python3-recommonmark
Requires: python3-pytest
Requires: python3-bottle
Requires: python3-stdeb
Requires: python3-django-debug-toolbar
Requires: python3-djdt-flamegraph
Requires: python3-sonic-client
%description
# Overview
## Input formats
ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, browser bookmarks, browser history, plain text, HTML, Markdown, and more!
*Click these links for instructions on how to prepare your links from these sources:*
- TXT, RSS, XML, JSON, CSV, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file)
- [Browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](http://i.imgur.com/AtcvUZA.png), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](http://help.opera.com/Windows/12.10/en/importexport.html), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive))
- [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user/export), [Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
```bash
# archivebox add --help
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'
echo 'http://example.com' | archivebox add
echo 'any_text_with [urls](https://example.com) in it' | archivebox add
# (if using docker add -i when piping stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
# (if using docker-compose add -T when piping stdin / stdout)
echo 'https://example.com' | docker-compose run -T archivebox add
```
See the [Usage: CLI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage) page for documentation and examples.
It also includes a built-in scheduled import feature (`archivebox schedule`) and a browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly or on-demand.
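For example, a recurring daily import from an RSS feed might look like the sketch below (the feed URL is a placeholder; check `archivebox schedule --help` for the exact flags supported by your version):
```bash
# show the available scheduling options
archivebox schedule --help
# example: pull new URLs from an RSS feed once a day, following outlinks one level deep
archivebox schedule --every=day --depth=1 'https://example.com/feed.rss'
```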
## Archive Layout
All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All `archivebox` CLI commands must be run from inside this folder, and you first create it by running `archivebox init`.
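For example, a fresh data folder might be created like this (the `~/archivebox` path is only an illustration; any empty directory works):
```bash
mkdir ~/archivebox && cd ~/archivebox   # create an empty directory to serve as the data folder
archivebox init                         # creates index.sqlite3, ArchiveBox.conf, and ./archive/
archivebox status                       # sanity-check that the data folder was set up correctly
```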
The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard `index.sqlite3` database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `./archive/` subfolder.
```bash
./
    index.sqlite3
    ArchiveBox.conf
    archive/
        1617687755/
            index.html
            index.json
            screenshot.png
            media/some_video.mp4
            warc/1617687755.warc.gz
            git/somerepo.git
```
Each snapshot subfolder `./archive/<timestamp>/` includes a static `index.json` and `index.html` describing its contents, and the snapshot extractor outputs are plain files within the folder.
## Output formats
Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
`./archive/<timestamp>/*`
- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
- **Title**, **Favicon**, **Headers:** Response headers, site favicon, and parsed site title
- **SingleFile:** `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile
- **Wget Clone:** `example.com/page-name.html` wget clone of the site, with the WARC saved to `warc/<timestamp>.gz`
- **Chrome Headless:**
  - **PDF:** `output.pdf` Printed PDF of the site using headless Chrome
  - **Screenshot:** `screenshot.png` 1440x900 screenshot of the site using headless Chrome
  - **DOM Dump:** `output.html` DOM dump of the HTML after rendering using headless Chrome
- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
- **Source Code:** `git/` clone of any repository found on github, bitbucket, or gitlab links
- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables / config.
```bash
# archivebox config --help
archivebox config # see all currently configured options
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
```
## Static Archive Exporting
You can export the main index to browse it statically without needing to run a server.
*Note about large exports: these exports are not paginated; exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
```bash
# archivebox list --help
archivebox list --html --with-headers > index.html # export to static html table
archivebox list --json --with-headers > index.json # export to json blob
archivebox list --csv=timestamp,url,title > index.csv # export to csv spreadsheet
# (if using docker-compose, add the -T flag when piping)
docker-compose run -T archivebox list --html --filter-type=search snozzberries > index.html
```
The paths in the static exports are relative; make sure to keep them next to your `./archive` folder when backing them up or viewing them.
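For example, an exported index can be browsed locally with any static file server; a minimal sketch using Python's built-in server (the port is arbitrary):
```bash
# run from the folder containing the exported index.html and the ./archive/ folder
python3 -m http.server 8001
# then open http://127.0.0.1:8001/index.html in a browser
```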
## Dependencies
For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything preinstalled for the best experience.
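For example, a typical Docker-based setup might look like the sketch below (the commands mirror the project's documented quickstart, but check the Docker wiki page above for the currently recommended invocation):
```bash
# create and initialize a data folder using the official image
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
# add a URL, then serve the web UI on port 8000
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000
```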
To achieve high fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools and libraries that specialize in extracting different types of content. These optional dependencies used for archiving sites include:
- `chromium` / `chrome` (for screenshots, PDF, DOM HTML, and headless JS scripts)
- `node` & `npm` (for readability, mercury, and singlefile)
- `wget` (for plain HTML, static files, and WARC saving)
- `curl` (for fetching headers, favicon, and posting to Archive.org)
- `youtube-dl` (for audio, video, and subtitles)
- `git` (for cloning git repos)
- and more as we grow...
You don't need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren't installed, based on what is configured and available in your `$PATH`.
*If using Docker, you don't have to install any of these manually; all dependencies are set up properly out-of-the-box.*
However, if you prefer not to use Docker, you *can* install ArchiveBox and its dependencies using your [system package manager](https://github.com/ArchiveBox/ArchiveBox/wiki/Install) or `pip` directly on any Linux/macOS system. Just make sure to keep the dependencies up-to-date and check that ArchiveBox isn't reporting any incompatibilities with the versions you install.
```bash
# install python3 and archivebox with your system package manager
# apt/brew/pip/etc install ... (see Quickstart instructions above)
archivebox setup # auto install all the extractors and extras
archivebox --version # see info and check validity of installed dependencies
```
Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not officially supported**, but some advanced users have reported getting it working.
%package -n python3-archivebox
Summary: The self-hosted internet archive.
Provides: python-archivebox
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-archivebox
# Overview
## Input formats
ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, browser bookmarks, browser history, plain text, HTML, Markdown, and more!
*Click these links for instructions on how to prepare your links from these sources:*
- TXT, RSS, XML, JSON, CSV, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file)
- [Browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](http://i.imgur.com/AtcvUZA.png), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](http://help.opera.com/Windows/12.10/en/importexport.html), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive))
- [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user/export), [Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
```bash
# archivebox add --help
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'
echo 'http://example.com' | archivebox add
echo 'any_text_with [urls](https://example.com) in it' | archivebox add
# (if using docker add -i when piping stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
# (if using docker-compose add -T when piping stdin / stdout)
echo 'https://example.com' | docker-compose run -T archivebox add
```
See the [Usage: CLI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage) page for documentation and examples.
It also includes a built-in scheduled import feature (`archivebox schedule`) and a browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly or on-demand.
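For example, a recurring daily import from an RSS feed might look like the sketch below (the feed URL is a placeholder; check `archivebox schedule --help` for the exact flags supported by your version):
```bash
# show the available scheduling options
archivebox schedule --help
# example: pull new URLs from an RSS feed once a day, following outlinks one level deep
archivebox schedule --every=day --depth=1 'https://example.com/feed.rss'
```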
## Archive Layout
All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All `archivebox` CLI commands must be run from inside this folder, and you first create it by running `archivebox init`.
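For example, a fresh data folder might be created like this (the `~/archivebox` path is only an illustration; any empty directory works):
```bash
mkdir ~/archivebox && cd ~/archivebox   # create an empty directory to serve as the data folder
archivebox init                         # creates index.sqlite3, ArchiveBox.conf, and ./archive/
archivebox status                       # sanity-check that the data folder was set up correctly
```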
The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard `index.sqlite3` database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `./archive/` subfolder.
```bash
./
    index.sqlite3
    ArchiveBox.conf
    archive/
        1617687755/
            index.html
            index.json
            screenshot.png
            media/some_video.mp4
            warc/1617687755.warc.gz
            git/somerepo.git
```
Each snapshot subfolder `./archive/<timestamp>/` includes a static `index.json` and `index.html` describing its contents, and the snapshot extractor outputs are plain files within the folder.
## Output formats
Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
`./archive/<timestamp>/*`
- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
- **Title**, **Favicon**, **Headers:** Response headers, site favicon, and parsed site title
- **SingleFile:** `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile
- **Wget Clone:** `example.com/page-name.html` wget clone of the site, with the WARC saved to `warc/<timestamp>.gz`
- **Chrome Headless:**
  - **PDF:** `output.pdf` Printed PDF of the site using headless Chrome
  - **Screenshot:** `screenshot.png` 1440x900 screenshot of the site using headless Chrome
  - **DOM Dump:** `output.html` DOM dump of the HTML after rendering using headless Chrome
- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
- **Source Code:** `git/` clone of any repository found on github, bitbucket, or gitlab links
- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables / config.
```bash
# archivebox config --help
archivebox config # see all currently configured options
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
```
## Static Archive Exporting
You can export the main index to browse it statically without needing to run a server.
*Note about large exports: these exports are not paginated; exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
```bash
# archivebox list --help
archivebox list --html --with-headers > index.html # export to static html table
archivebox list --json --with-headers > index.json # export to json blob
archivebox list --csv=timestamp,url,title > index.csv # export to csv spreadsheet
# (if using docker-compose, add the -T flag when piping)
docker-compose run -T archivebox list --html --filter-type=search snozzberries > index.html
```
The paths in the static exports are relative; make sure to keep them next to your `./archive` folder when backing them up or viewing them.
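For example, an exported index can be browsed locally with any static file server; a minimal sketch using Python's built-in server (the port is arbitrary):
```bash
# run from the folder containing the exported index.html and the ./archive/ folder
python3 -m http.server 8001
# then open http://127.0.0.1:8001/index.html in a browser
```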
## Dependencies
For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything preinstalled for the best experience.
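For example, a typical Docker-based setup might look like the sketch below (the commands mirror the project's documented quickstart, but check the Docker wiki page above for the currently recommended invocation):
```bash
# create and initialize a data folder using the official image
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
# add a URL, then serve the web UI on port 8000
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000
```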
To achieve high fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools and libraries that specialize in extracting different types of content. These optional dependencies used for archiving sites include:
- `chromium` / `chrome` (for screenshots, PDF, DOM HTML, and headless JS scripts)
- `node` & `npm` (for readability, mercury, and singlefile)
- `wget` (for plain HTML, static files, and WARC saving)
- `curl` (for fetching headers, favicon, and posting to Archive.org)
- `youtube-dl` (for audio, video, and subtitles)
- `git` (for cloning git repos)
- and more as we grow...
You don't need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren't installed, based on what is configured and available in your `$PATH`.
*If using Docker, you don't have to install any of these manually; all dependencies are set up properly out-of-the-box.*
However, if you prefer not to use Docker, you *can* install ArchiveBox and its dependencies using your [system package manager](https://github.com/ArchiveBox/ArchiveBox/wiki/Install) or `pip` directly on any Linux/macOS system. Just make sure to keep the dependencies up-to-date and check that ArchiveBox isn't reporting any incompatibilities with the versions you install.
```bash
# install python3 and archivebox with your system package manager
# apt/brew/pip/etc install ... (see Quickstart instructions above)
archivebox setup # auto install all the extractors and extras
archivebox --version # see info and check validity of installed dependencies
```
Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not officially supported**, but some advanced users have reported getting it working.
%package help
Summary: Development documents and examples for archivebox
Provides: python3-archivebox-doc
%description help
# Overview
## Input formats
ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, browser bookmarks, browser history, plain text, HTML, Markdown, and more!
*Click these links for instructions on how to prepare your links from these sources:*
- TXT, RSS, XML, JSON, CSV, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file)
- [Browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](http://i.imgur.com/AtcvUZA.png), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](http://help.opera.com/Windows/12.10/en/importexport.html), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive))
- [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user/export), [Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
```bash
# archivebox add --help
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'
echo 'http://example.com' | archivebox add
echo 'any_text_with [urls](https://example.com) in it' | archivebox add
# (if using docker add -i when piping stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
# (if using docker-compose add -T when piping stdin / stdout)
echo 'https://example.com' | docker-compose run -T archivebox add
```
See the [Usage: CLI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage) page for documentation and examples.
It also includes a built-in scheduled import feature (`archivebox schedule`) and a browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly or on-demand.
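For example, a recurring daily import from an RSS feed might look like the sketch below (the feed URL is a placeholder; check `archivebox schedule --help` for the exact flags supported by your version):
```bash
# show the available scheduling options
archivebox schedule --help
# example: pull new URLs from an RSS feed once a day, following outlinks one level deep
archivebox schedule --every=day --depth=1 'https://example.com/feed.rss'
```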
## Archive Layout
All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All `archivebox` CLI commands must be run from inside this folder, and you first create it by running `archivebox init`.
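For example, a fresh data folder might be created like this (the `~/archivebox` path is only an illustration; any empty directory works):
```bash
mkdir ~/archivebox && cd ~/archivebox   # create an empty directory to serve as the data folder
archivebox init                         # creates index.sqlite3, ArchiveBox.conf, and ./archive/
archivebox status                       # sanity-check that the data folder was set up correctly
```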
The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard `index.sqlite3` database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `./archive/` subfolder.
```bash
./
    index.sqlite3
    ArchiveBox.conf
    archive/
        1617687755/
            index.html
            index.json
            screenshot.png
            media/some_video.mp4
            warc/1617687755.warc.gz
            git/somerepo.git
```
Each snapshot subfolder `./archive/<timestamp>/` includes a static `index.json` and `index.html` describing its contents, and the snapshot extractor outputs are plain files within the folder.
## Output formats
Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
`./archive/<timestamp>/*`
- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
- **Title**, **Favicon**, **Headers:** Response headers, site favicon, and parsed site title
- **SingleFile:** `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile
- **Wget Clone:** `example.com/page-name.html` wget clone of the site, with the WARC saved to `warc/<timestamp>.gz`
- **Chrome Headless:**
  - **PDF:** `output.pdf` Printed PDF of the site using headless Chrome
  - **Screenshot:** `screenshot.png` 1440x900 screenshot of the site using headless Chrome
  - **DOM Dump:** `output.html` DOM dump of the HTML after rendering using headless Chrome
- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
- **Source Code:** `git/` clone of any repository found on github, bitbucket, or gitlab links
- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables / config.
```bash
# archivebox config --help
archivebox config # see all currently configured options
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
```
## Static Archive Exporting
You can export the main index to browse it statically without needing to run a server.
*Note about large exports: these exports are not paginated; exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
```bash
# archivebox list --help
archivebox list --html --with-headers > index.html # export to static html table
archivebox list --json --with-headers > index.json # export to json blob
archivebox list --csv=timestamp,url,title > index.csv # export to csv spreadsheet
# (if using docker-compose, add the -T flag when piping)
docker-compose run -T archivebox list --html --filter-type=search snozzberries > index.html
```
The paths in the static exports are relative; make sure to keep them next to your `./archive` folder when backing them up or viewing them.
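For example, an exported index can be browsed locally with any static file server; a minimal sketch using Python's built-in server (the port is arbitrary):
```bash
# run from the folder containing the exported index.html and the ./archive/ folder
python3 -m http.server 8001
# then open http://127.0.0.1:8001/index.html in a browser
```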
## Dependencies
For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything preinstalled for the best experience.
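For example, a typical Docker-based setup might look like the sketch below (the commands mirror the project's documented quickstart, but check the Docker wiki page above for the currently recommended invocation):
```bash
# create and initialize a data folder using the official image
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
# add a URL, then serve the web UI on port 8000
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000
```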
To achieve high fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools and libraries that specialize in extracting different types of content. These optional dependencies used for archiving sites include:
- `chromium` / `chrome` (for screenshots, PDF, DOM HTML, and headless JS scripts)
- `node` & `npm` (for readability, mercury, and singlefile)
- `wget` (for plain HTML, static files, and WARC saving)
- `curl` (for fetching headers, favicon, and posting to Archive.org)
- `youtube-dl` (for audio, video, and subtitles)
- `git` (for cloning git repos)
- and more as we grow...
You don't need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren't installed, based on what is configured and available in your `$PATH`.
*If using Docker, you don't have to install any of these manually; all dependencies are set up properly out-of-the-box.*
However, if you prefer not to use Docker, you *can* install ArchiveBox and its dependencies using your [system package manager](https://github.com/ArchiveBox/ArchiveBox/wiki/Install) or `pip` directly on any Linux/macOS system. Just make sure to keep the dependencies up-to-date and check that ArchiveBox isn't reporting any incompatibilities with the versions you install.
```bash
# install python3 and archivebox with your system package manager
# apt/brew/pip/etc install ... (see Quickstart instructions above)
archivebox setup # auto install all the extractors and extras
archivebox --version # see info and check validity of installed dependencies
```
Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not officially supported**, but some advanced users have reported getting it working.
%prep
%autosetup -n archivebox-0.6.2
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-archivebox -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Thu May 18 2023 Python_Bot - 0.6.2-1
- Package Spec generated