1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
|
%global _empty_manifest_terminate_build 0
Name: python-ipwb
Version: 0.2023.3.12.2026
Release: 1
Summary: InterPlanetary Wayback (ipwb): Web Archive integration with IPFS
License: MIT
URL: https://github.com/oduwsdl/ipwb
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/a0/3b/4bc1f77718b928af3ff7e72b174478654c1e46e29c1abd2e7c30e5882d4d/ipwb-0.2023.3.12.2026.tar.gz
BuildArch: noarch
Requires: python3-warcio
Requires: python3-ipfshttpclient
Requires: python3-Flask
Requires: python3-pycryptodome
Requires: python3-requests
Requires: python3-beautifulsoup4
Requires: python3-six
Requires: python3-surt
%description
[](https://pypi.python.org/pypi/ipwb)
# InterPlanetary Wayback (ipwb)
**Peer-To-Peer Permanence of Web Archives**
[](https://travis-ci.org/oduwsdl/ipwb) [](https://pypi.org/project/ipwb) [](https://codecov.io/gh/oduwsdl/ipwb)
InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717) files into the IPFS network. [IPFS](https://ipfs.io/) is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a [CDXJ index](https://github.com/oduwsdl/ORS/wiki/CDXJ) with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.
InterPlanetary Wayback primarily consists of two scripts:
- **ipwb/indexer.py** - archival indexing script that takes the path to a WARC input, extracts the HTTP headers, HTTP payload (response body), and relevant parts of the WARC-response record header from the WARC specified and creates byte string representations. The indexer then pushes the byte strings into IPFS using a locally running IPFS daemon then creates a [CDXJ](https://github.com/oduwsdl/ORS/wiki/CDXJ) file with this metadata for replay.py.
- **ipwb/replay.py** - rudimentary replay script to resolve requests for archival content contained in IPFS for replay in the browser.
A pictorial representation of the ipwb indexing and replay process:

An important aspect of archival replay systems is rewriting various resource references for proper memento reconstruction so that they are dereferenced properly from the archive from around the same datetime as of the root memento and not from the live site (in which case the resource might have changed or gone missing). Many archival replay systems perform server-side rewriting, but it has its limitations when URIs are generated using JavaScript. To handle this we use [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for rerouting requests on the client-side when they are dereferenced to avoid any server-side rewiring. For this, we have implemented a separate library, [Reconstructive](https://oduwsdl.github.io/Reconstructive/), which is reusable and extendable by any archival replay system.
Another important feature of archival replays is the inclusion of an archival banner in mementos. The purpose of an archival banner is to highlight that a replayed page is a memento and not a live page, to provide metadata about the memento and the archive, and to facilitate additional interactivity. Many archival banners used in different web archival replay systems are obtrusive in nature and have issues like style leakage. To eliminate both of these issues we have implemented a [Custom HTML Element](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_custom_elements), [<reconstructive-banner>](https://oduwsdl.github.io/Reconstructive/docs/class/Reconstructive/reconstructive-banner.js~ReconstructiveBanner.html) as part of the [Reconstructive](https://oduwsdl.github.io/Reconstructive/) library and used in the ipwb.
## Installing
InterPlanetary Wayback (ipwb) requires Python 3.7+. ipwb can also be used with Docker ([see below](#user-content-using-docker)).
For conventional usage, the latest release of ipwb can be installed using pip:
```
$ pip install ipwb
```
The latest development version containing changes not yet released can be installed from source:
```
$ git clone https://github.com/oduwsdl/ipwb
$ cd ipwb
$ pip install ./
```
## Setup
The InterPlanetary File System (ipfs) daemon must be installed and running before starting ipwb. See the [Install IPFS](https://ipfs.io/docs/install/) page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:
```
$ ipfs daemon
```
If you encounter a conflict with the default API port of 5001 when starting the daemon, running the following prior to launching the daemon will change the API port to access to one of your choosing (here, shown to be 5002):
```
$ ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
```
## Indexing
In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push contents of a WARC file into IPFS and create an index of records:
```
$ ipwb index (path to warc or warc.gz)
```
...for example, from the root of the ipwb repository:
```
$ ipwb index samples/warcs/salam-home.warc
```
The ipwb indexer partitions the WARC into WARC Records and extracts the WARC Response headers, HTTP response headers, and the HTTP response bodies (payloads). Relevant information is extracted from the WARC Response headers, temporary byte strings are created for the HTTP response headers and payload, and these two bytes strings are pushed into IPFS. The resulting CDXJ data is written to `STDOUT` by default but can be redirected to a file, e.g.,
```
$ ipwb index (path to warc or warc.gz) >> myArchiveIndex.cdxj
```
## Replaying
An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS. A CDXJ index needs to be provided and used by the ipwb replay system by specifying the path of the index file as a parameter to the replay system:
```
$ ipwb replay <path/to/cdxj>
```
ipwb also supports using an IPFS hash or any HTTP location as the source of the CDXJ:
```
$ ipwb replay http://myDomain/files/myIndex.cdxj
$ ipwb replay QmYwAPJzv5CZsnANOTaREALhashYgPpHdWEz79ojWnPbdG
```
Once started, the replay system's web interface can be accessed through a web browser, e.g., <http://localhost:2016/> by default.
To run it under a domain name other than `localhost`, the easiest approach is to use a reverse proxy that supports HTTPS. The replay system utilizes [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for URL rerouting/rewriting to prevent [live leakage (zombies)](http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html). However, for security reason many web browsers have mandated HTTPS for the Service Worker API with only exception if the domain is `localhost`. [Caddy Server](https://caddyserver.com/) and [Traefik](https://traefik.io/) can be used as a reverse-proxy server and are very easy to setup. They come with built-in HTTPS support and manage (install and update) TLS certificates transparently and automatically from [Let's Encrypt](https://letsencrypt.org/). However, any web server proxy that has HTTPS support on the front-end will work. To make ipwb replay aware of the proxy, use `--proxy` or `-P` flag to supply the proxy URL. This way the replay will yield the supplied proxy URL as a prefix when generating various fully qualified domain name (FQDN) URIs or absolute URIs (for example, those in the TimeMap or Link header) instead of the default `http://localhost:2016`. This can be necessary when the service is running in a private network or a container and only exposed via a reverse-proxy. Suppose a reverse-proxy server is running and ready to forward all traffic on the `https://ipwb.example.com` to the ipwb replay server then the replay can be started as following:
```
$ ipwb replay --proxy=https://ipwb.example.com <path/to/cdxj>
```
## Using Docker
A pre-built Docker image is made available that can be run as following:
```
$ docker container run -it --rm -p 2016:2016 oduwsdl/ipwb
```
The container will run an IPFS daemon, index a sample WARC file, and replay it using the newly created index. It will take a few seconds to be ready, then the replay will be accessible at <http://localhost:2016/> with a sample archived page.
To index and replay your own WARC file, bind mount your data folders inside the container using `-v` (or `--volume`) flag and run commands accordingly. The provided docker image has designated `/data` directory, inside which there are `warc`, `cdxj`, and `ipfs` folders where host folders can be mounted separately or as a single mount point at the parent `/data` directory. Assuming that the host machine has a `/path/to/data` folder under which there are `warc`, `cdxj`, and `ipfs` folders and a WARC file at `/path/to/data/warc/custom.warc.gz`.
```
$ docker container run -it --rm -v /path/to/data:/data oduwsdl/ipwb ipwb index -o /data/cdxj/custom.cdxj /data/warc/custom.warc.gz
$ docker container run -it --rm -v /path/to/data:/data -p 2016:2016 oduwsdl/ipwb ipwb replay /data/cdxj/custom.cdxj
```
If the host folder structure is something other than `/some/path/{warc,cdxj,ipfs}` then these volumes need to be mounted separately.
To build an image from the source, run the following command from the directory where the source code is checked out. The name of the locally built image could be anything, but we use `oduwsdl/ipwb` to be consistent with the above commands.
```
$ docker image build -t oduwsdl/ipwb .
```
By default, the image building process also performs tests, so it might take a while to build the image. It ensures that an image will not be created with failing tests. However, it is possible to skip tests by supplying a build-arg `--build-arg SKIPTEST=true` as illustrated below:
```
$ docker image build --build-arg SKIPTEST=true -t oduwsdl/ipwb .
```
## Help
Usage of sub-commands in ipwb can be accessed through providing the `-h` or `--help` flag, like any of the below.
```
$ ipwb -h
usage: ipwb [-h] [-d DAEMON_ADDRESS] [-v] [-u] {index,replay} ...
InterPlanetary Wayback (ipwb)
optional arguments:
-h, --help show this help message and exit
-d DAEMON_ADDRESS, --daemon DAEMON_ADDRESS
Multi-address of IPFS daemon (default
/dns/localhost/tcp/5001/http)
-v, --version Report the version of ipwb
-u, --update-check Check whether an updated version of ipwb is available
ipwb commands:
Invoke using "ipwb <command>", e.g., ipwb replay <cdxjFile>
{index,replay}
index Index a WARC file for replay in ipwb
replay Start the ipwb replay system
```
```
$ ipwb index -h
usage: ipwb [-h] [-e] [-c] [--compressFirst] [-o OUTFILE] [--debug]
index <warc_path> [index <warc_path> ...]
Index a WARC file for replay in ipwb
positional arguments:
index <warc_path> Path to a WARC[.gz] file
optional arguments:
-h, --help show this help message and exit
-e Encrypt WARC content prior to adding to IPFS
-c Compress WARC content prior to adding to IPFS
--compressFirst Compress data before encryption, where applicable
-o OUTFILE, --outfile OUTFILE
Path to an output CDXJ file, defaults to STDOUT
--debug Convenience flag to help with testing and debugging
```
```
$ ipwb replay -h
usage: ipwb replay [-h] [-P [<host:port>]] [index]
Start the ipwb relay system
positional arguments:
index path, URI, or multihash of file to use for replay
optional arguments:
-h, --help show this help message and exit
-P [<host:port>], --proxy [<host:port>]
Proxy URL
```
## Project History
This repo contains the code for integrating [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717)s and [IPFS](https://ipfs.io/) as developed at the [Archives Unleashed: Web Archive Hackathon]() in Toronto, Canada in March 2016. The project was also presented at:
- The [Joint Conference on Digital Libraries 2016](http://www.jcdl2016.org/) in Newark, NJ in June 2016.
- The [Web Archiving and Digital Libraries (WADL) 2016 workshop](http://fox.cs.vt.edu/wadl2016.html) in Newark, NJ in June 2016.
- The [Theory and Practice on Digital Libraries (TPDL) 2016](http://www.tpdl2016.org/) in Hannover, Germany in September 2016.
- The [Archives Unleashed 4.0: Web Archive Datathon](https://archivesunleashed.com/call-for-participation-au4/) in London, England in June 2017.
- The [International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017](http://netpreserve.org/wac2017/) in London, England in June 2017.
- The [Decentralized Web Summit 2018's](https://www.decentralizedweb.net/) IPFS Lab Day in San Francisco, CA in August 2018.
### Citing Project
We have numerous publications related to this project, but the most significant and primary one was published in TPDL 2016. ([Read the PDF](https://matkelly.com/papers/2016_tpdl_ipwb.pdf))
> Mat Kelly, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. __InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives__. In _Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries_, pages 411–416, Hamburg, Germany, June 2016.
```bib
@INPROCEEDINGS{ipwb-tpdl2016,
AUTHOR = {Mat Kelly and
Sawood Alam and
Michael L. Nelson and
Michele C. Weigle},
TITLE = {{InterPlanetary Wayback}: Peer-To-Peer Permanence of Web Archives},
BOOKTITLE = {Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries},
PAGES = {411--416},
MONTH = {June},
YEAR = {2016},
ADDRESS = {Hamburg, Germany},
DOI = {10.1007/978-3-319-43997-6_35}
}
```
# License
MIT
%package -n python3-ipwb
Summary: InterPlanetary Wayback (ipwb): Web Archive integration with IPFS
Provides: python-ipwb
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-ipwb
[](https://pypi.python.org/pypi/ipwb)
# InterPlanetary Wayback (ipwb)
**Peer-To-Peer Permanence of Web Archives**
[](https://travis-ci.org/oduwsdl/ipwb) [](https://pypi.org/project/ipwb) [](https://codecov.io/gh/oduwsdl/ipwb)
InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717) files into the IPFS network. [IPFS](https://ipfs.io/) is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a [CDXJ index](https://github.com/oduwsdl/ORS/wiki/CDXJ) with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.
InterPlanetary Wayback primarily consists of two scripts:
- **ipwb/indexer.py** - archival indexing script that takes the path to a WARC input, extracts the HTTP headers, HTTP payload (response body), and relevant parts of the WARC-response record header from the WARC specified and creates byte string representations. The indexer then pushes the byte strings into IPFS using a locally running IPFS daemon then creates a [CDXJ](https://github.com/oduwsdl/ORS/wiki/CDXJ) file with this metadata for replay.py.
- **ipwb/replay.py** - rudimentary replay script to resolve requests for archival content contained in IPFS for replay in the browser.
A pictorial representation of the ipwb indexing and replay process:

An important aspect of archival replay systems is rewriting various resource references for proper memento reconstruction so that they are dereferenced properly from the archive from around the same datetime as of the root memento and not from the live site (in which case the resource might have changed or gone missing). Many archival replay systems perform server-side rewriting, but it has its limitations when URIs are generated using JavaScript. To handle this we use [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for rerouting requests on the client-side when they are dereferenced to avoid any server-side rewiring. For this, we have implemented a separate library, [Reconstructive](https://oduwsdl.github.io/Reconstructive/), which is reusable and extendable by any archival replay system.
Another important feature of archival replays is the inclusion of an archival banner in mementos. The purpose of an archival banner is to highlight that a replayed page is a memento and not a live page, to provide metadata about the memento and the archive, and to facilitate additional interactivity. Many archival banners used in different web archival replay systems are obtrusive in nature and have issues like style leakage. To eliminate both of these issues we have implemented a [Custom HTML Element](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_custom_elements), [<reconstructive-banner>](https://oduwsdl.github.io/Reconstructive/docs/class/Reconstructive/reconstructive-banner.js~ReconstructiveBanner.html) as part of the [Reconstructive](https://oduwsdl.github.io/Reconstructive/) library and used in the ipwb.
## Installing
InterPlanetary Wayback (ipwb) requires Python 3.7+. ipwb can also be used with Docker ([see below](#user-content-using-docker)).
For conventional usage, the latest release of ipwb can be installed using pip:
```
$ pip install ipwb
```
The latest development version containing changes not yet released can be installed from source:
```
$ git clone https://github.com/oduwsdl/ipwb
$ cd ipwb
$ pip install ./
```
## Setup
The InterPlanetary File System (ipfs) daemon must be installed and running before starting ipwb. See the [Install IPFS](https://ipfs.io/docs/install/) page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:
```
$ ipfs daemon
```
If you encounter a conflict with the default API port of 5001 when starting the daemon, running the following prior to launching the daemon will change the API port to access to one of your choosing (here, shown to be 5002):
```
$ ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
```
## Indexing
In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push contents of a WARC file into IPFS and create an index of records:
```
$ ipwb index (path to warc or warc.gz)
```
...for example, from the root of the ipwb repository:
```
$ ipwb index samples/warcs/salam-home.warc
```
The ipwb indexer partitions the WARC into WARC Records and extracts the WARC Response headers, HTTP response headers, and the HTTP response bodies (payloads). Relevant information is extracted from the WARC Response headers, temporary byte strings are created for the HTTP response headers and payload, and these two bytes strings are pushed into IPFS. The resulting CDXJ data is written to `STDOUT` by default but can be redirected to a file, e.g.,
```
$ ipwb index (path to warc or warc.gz) >> myArchiveIndex.cdxj
```
## Replaying
An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS. A CDXJ index needs to be provided and used by the ipwb replay system by specifying the path of the index file as a parameter to the replay system:
```
$ ipwb replay <path/to/cdxj>
```
ipwb also supports using an IPFS hash or any HTTP location as the source of the CDXJ:
```
$ ipwb replay http://myDomain/files/myIndex.cdxj
$ ipwb replay QmYwAPJzv5CZsnANOTaREALhashYgPpHdWEz79ojWnPbdG
```
Once started, the replay system's web interface can be accessed through a web browser, e.g., <http://localhost:2016/> by default.
To run it under a domain name other than `localhost`, the easiest approach is to use a reverse proxy that supports HTTPS. The replay system utilizes [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for URL rerouting/rewriting to prevent [live leakage (zombies)](http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html). However, for security reason many web browsers have mandated HTTPS for the Service Worker API with only exception if the domain is `localhost`. [Caddy Server](https://caddyserver.com/) and [Traefik](https://traefik.io/) can be used as a reverse-proxy server and are very easy to setup. They come with built-in HTTPS support and manage (install and update) TLS certificates transparently and automatically from [Let's Encrypt](https://letsencrypt.org/). However, any web server proxy that has HTTPS support on the front-end will work. To make ipwb replay aware of the proxy, use `--proxy` or `-P` flag to supply the proxy URL. This way the replay will yield the supplied proxy URL as a prefix when generating various fully qualified domain name (FQDN) URIs or absolute URIs (for example, those in the TimeMap or Link header) instead of the default `http://localhost:2016`. This can be necessary when the service is running in a private network or a container and only exposed via a reverse-proxy. Suppose a reverse-proxy server is running and ready to forward all traffic on the `https://ipwb.example.com` to the ipwb replay server then the replay can be started as following:
```
$ ipwb replay --proxy=https://ipwb.example.com <path/to/cdxj>
```
## Using Docker
A pre-built Docker image is made available that can be run as following:
```
$ docker container run -it --rm -p 2016:2016 oduwsdl/ipwb
```
The container will run an IPFS daemon, index a sample WARC file, and replay it using the newly created index. It will take a few seconds to be ready, then the replay will be accessible at <http://localhost:2016/> with a sample archived page.
To index and replay your own WARC file, bind mount your data folders inside the container using `-v` (or `--volume`) flag and run commands accordingly. The provided docker image has designated `/data` directory, inside which there are `warc`, `cdxj`, and `ipfs` folders where host folders can be mounted separately or as a single mount point at the parent `/data` directory. Assuming that the host machine has a `/path/to/data` folder under which there are `warc`, `cdxj`, and `ipfs` folders and a WARC file at `/path/to/data/warc/custom.warc.gz`.
```
$ docker container run -it --rm -v /path/to/data:/data oduwsdl/ipwb ipwb index -o /data/cdxj/custom.cdxj /data/warc/custom.warc.gz
$ docker container run -it --rm -v /path/to/data:/data -p 2016:2016 oduwsdl/ipwb ipwb replay /data/cdxj/custom.cdxj
```
If the host folder structure is something other than `/some/path/{warc,cdxj,ipfs}` then these volumes need to be mounted separately.
To build an image from the source, run the following command from the directory where the source code is checked out. The name of the locally built image could be anything, but we use `oduwsdl/ipwb` to be consistent with the above commands.
```
$ docker image build -t oduwsdl/ipwb .
```
By default, the image building process also performs tests, so it might take a while to build the image. It ensures that an image will not be created with failing tests. However, it is possible to skip tests by supplying a build-arg `--build-arg SKIPTEST=true` as illustrated below:
```
$ docker image build --build-arg SKIPTEST=true -t oduwsdl/ipwb .
```
## Help
Usage of sub-commands in ipwb can be accessed through providing the `-h` or `--help` flag, like any of the below.
```
$ ipwb -h
usage: ipwb [-h] [-d DAEMON_ADDRESS] [-v] [-u] {index,replay} ...
InterPlanetary Wayback (ipwb)
optional arguments:
-h, --help show this help message and exit
-d DAEMON_ADDRESS, --daemon DAEMON_ADDRESS
Multi-address of IPFS daemon (default
/dns/localhost/tcp/5001/http)
-v, --version Report the version of ipwb
-u, --update-check Check whether an updated version of ipwb is available
ipwb commands:
Invoke using "ipwb <command>", e.g., ipwb replay <cdxjFile>
{index,replay}
index Index a WARC file for replay in ipwb
replay Start the ipwb replay system
```
```
$ ipwb index -h
usage: ipwb [-h] [-e] [-c] [--compressFirst] [-o OUTFILE] [--debug]
index <warc_path> [index <warc_path> ...]
Index a WARC file for replay in ipwb
positional arguments:
index <warc_path> Path to a WARC[.gz] file
optional arguments:
-h, --help show this help message and exit
-e Encrypt WARC content prior to adding to IPFS
-c Compress WARC content prior to adding to IPFS
--compressFirst Compress data before encryption, where applicable
-o OUTFILE, --outfile OUTFILE
Path to an output CDXJ file, defaults to STDOUT
--debug Convenience flag to help with testing and debugging
```
```
$ ipwb replay -h
usage: ipwb replay [-h] [-P [<host:port>]] [index]
Start the ipwb relay system
positional arguments:
index path, URI, or multihash of file to use for replay
optional arguments:
-h, --help show this help message and exit
-P [<host:port>], --proxy [<host:port>]
Proxy URL
```
## Project History
This repo contains the code for integrating [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717)s and [IPFS](https://ipfs.io/) as developed at the [Archives Unleashed: Web Archive Hackathon]() in Toronto, Canada in March 2016. The project was also presented at:
- The [Joint Conference on Digital Libraries 2016](http://www.jcdl2016.org/) in Newark, NJ in June 2016.
- The [Web Archiving and Digital Libraries (WADL) 2016 workshop](http://fox.cs.vt.edu/wadl2016.html) in Newark, NJ in June 2016.
- The [Theory and Practice on Digital Libraries (TPDL) 2016](http://www.tpdl2016.org/) in Hannover, Germany in September 2016.
- The [Archives Unleashed 4.0: Web Archive Datathon](https://archivesunleashed.com/call-for-participation-au4/) in London, England in June 2017.
- The [International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017](http://netpreserve.org/wac2017/) in London, England in June 2017.
- The [Decentralized Web Summit 2018's](https://www.decentralizedweb.net/) IPFS Lab Day in San Francisco, CA in August 2018.
### Citing Project
We have numerous publications related to this project, but the most significant and primary one was published in TPDL 2016. ([Read the PDF](https://matkelly.com/papers/2016_tpdl_ipwb.pdf))
> Mat Kelly, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. __InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives__. In _Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries_, pages 411–416, Hamburg, Germany, June 2016.
```bib
@INPROCEEDINGS{ipwb-tpdl2016,
AUTHOR = {Mat Kelly and
Sawood Alam and
Michael L. Nelson and
Michele C. Weigle},
TITLE = {{InterPlanetary Wayback}: Peer-To-Peer Permanence of Web Archives},
BOOKTITLE = {Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries},
PAGES = {411--416},
MONTH = {June},
YEAR = {2016},
ADDRESS = {Hamburg, Germany},
DOI = {10.1007/978-3-319-43997-6_35}
}
```
# License
MIT
%package help
Summary: Development documents and examples for ipwb
Provides: python3-ipwb-doc
%description help
[](https://pypi.python.org/pypi/ipwb)
# InterPlanetary Wayback (ipwb)
**Peer-To-Peer Permanence of Web Archives**
[](https://travis-ci.org/oduwsdl/ipwb) [](https://pypi.org/project/ipwb) [](https://codecov.io/gh/oduwsdl/ipwb)
InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717) files into the IPFS network. [IPFS](https://ipfs.io/) is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a [CDXJ index](https://github.com/oduwsdl/ORS/wiki/CDXJ) with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.
InterPlanetary Wayback primarily consists of two scripts:
- **ipwb/indexer.py** - archival indexing script that takes the path to a WARC input, extracts the HTTP headers, HTTP payload (response body), and relevant parts of the WARC-response record header from the WARC specified and creates byte string representations. The indexer then pushes the byte strings into IPFS using a locally running IPFS daemon then creates a [CDXJ](https://github.com/oduwsdl/ORS/wiki/CDXJ) file with this metadata for replay.py.
- **ipwb/replay.py** - rudimentary replay script to resolve requests for archival content contained in IPFS for replay in the browser.
A pictorial representation of the ipwb indexing and replay process:

An important aspect of archival replay systems is rewriting various resource references for proper memento reconstruction so that they are dereferenced properly from the archive from around the same datetime as of the root memento and not from the live site (in which case the resource might have changed or gone missing). Many archival replay systems perform server-side rewriting, but it has its limitations when URIs are generated using JavaScript. To handle this we use [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for rerouting requests on the client-side when they are dereferenced to avoid any server-side rewiring. For this, we have implemented a separate library, [Reconstructive](https://oduwsdl.github.io/Reconstructive/), which is reusable and extendable by any archival replay system.
Another important feature of archival replays is the inclusion of an archival banner in mementos. The purpose of an archival banner is to highlight that a replayed page is a memento and not a live page, to provide metadata about the memento and the archive, and to facilitate additional interactivity. Many archival banners used in different web archival replay systems are obtrusive in nature and have issues like style leakage. To eliminate both of these issues we have implemented a [Custom HTML Element](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_custom_elements), [<reconstructive-banner>](https://oduwsdl.github.io/Reconstructive/docs/class/Reconstructive/reconstructive-banner.js~ReconstructiveBanner.html) as part of the [Reconstructive](https://oduwsdl.github.io/Reconstructive/) library and used in the ipwb.
## Installing
InterPlanetary Wayback (ipwb) requires Python 3.7+. ipwb can also be used with Docker ([see below](#user-content-using-docker)).
For conventional usage, the latest release of ipwb can be installed using pip:
```
$ pip install ipwb
```
The latest development version containing changes not yet released can be installed from source:
```
$ git clone https://github.com/oduwsdl/ipwb
$ cd ipwb
$ pip install ./
```
## Setup
The InterPlanetary File System (ipfs) daemon must be installed and running before starting ipwb. See the [Install IPFS](https://ipfs.io/docs/install/) page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:
```
$ ipfs daemon
```
If you encounter a conflict with the default API port of 5001 when starting the daemon, running the following prior to launching the daemon will change the API port to access to one of your choosing (here, shown to be 5002):
```
$ ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
```
## Indexing
In a separate terminal session (or the same if you started the daemon in the background), instruct ipwb to push contents of a WARC file into IPFS and create an index of records:
```
$ ipwb index (path to warc or warc.gz)
```
...for example, from the root of the ipwb repository:
```
$ ipwb index samples/warcs/salam-home.warc
```
The ipwb indexer partitions the WARC into WARC Records and extracts the WARC Response headers, HTTP response headers, and the HTTP response bodies (payloads). Relevant information is extracted from the WARC Response headers, temporary byte strings are created for the HTTP response headers and payload, and these two bytes strings are pushed into IPFS. The resulting CDXJ data is written to `STDOUT` by default but can be redirected to a file, e.g.,
```
$ ipwb index (path to warc or warc.gz) >> myArchiveIndex.cdxj
```
## Replaying
An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS. A CDXJ index needs to be provided and used by the ipwb replay system by specifying the path of the index file as a parameter to the replay system:
```
$ ipwb replay <path/to/cdxj>
```
ipwb also supports using an IPFS hash or any HTTP location as the source of the CDXJ:
```
$ ipwb replay http://myDomain/files/myIndex.cdxj
$ ipwb replay QmYwAPJzv5CZsnANOTaREALhashYgPpHdWEz79ojWnPbdG
```
Once started, the replay system's web interface can be accessed through a web browser, e.g., <http://localhost:2016/> by default.
To run it under a domain name other than `localhost`, the easiest approach is to use a reverse proxy that supports HTTPS. The replay system utilizes [Service Worker](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for URL rerouting/rewriting to prevent [live leakage (zombies)](http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html). However, for security reason many web browsers have mandated HTTPS for the Service Worker API with only exception if the domain is `localhost`. [Caddy Server](https://caddyserver.com/) and [Traefik](https://traefik.io/) can be used as a reverse-proxy server and are very easy to setup. They come with built-in HTTPS support and manage (install and update) TLS certificates transparently and automatically from [Let's Encrypt](https://letsencrypt.org/). However, any web server proxy that has HTTPS support on the front-end will work. To make ipwb replay aware of the proxy, use `--proxy` or `-P` flag to supply the proxy URL. This way the replay will yield the supplied proxy URL as a prefix when generating various fully qualified domain name (FQDN) URIs or absolute URIs (for example, those in the TimeMap or Link header) instead of the default `http://localhost:2016`. This can be necessary when the service is running in a private network or a container and only exposed via a reverse-proxy. Suppose a reverse-proxy server is running and ready to forward all traffic on the `https://ipwb.example.com` to the ipwb replay server then the replay can be started as following:
```
$ ipwb replay --proxy=https://ipwb.example.com <path/to/cdxj>
```
## Using Docker
A pre-built Docker image is made available that can be run as following:
```
$ docker container run -it --rm -p 2016:2016 oduwsdl/ipwb
```
The container will run an IPFS daemon, index a sample WARC file, and replay it using the newly created index. It will take a few seconds to be ready, then the replay will be accessible at <http://localhost:2016/> with a sample archived page.
To index and replay your own WARC file, bind mount your data folders inside the container using `-v` (or `--volume`) flag and run commands accordingly. The provided docker image has designated `/data` directory, inside which there are `warc`, `cdxj`, and `ipfs` folders where host folders can be mounted separately or as a single mount point at the parent `/data` directory. Assuming that the host machine has a `/path/to/data` folder under which there are `warc`, `cdxj`, and `ipfs` folders and a WARC file at `/path/to/data/warc/custom.warc.gz`.
```
$ docker container run -it --rm -v /path/to/data:/data oduwsdl/ipwb ipwb index -o /data/cdxj/custom.cdxj /data/warc/custom.warc.gz
$ docker container run -it --rm -v /path/to/data:/data -p 2016:2016 oduwsdl/ipwb ipwb replay /data/cdxj/custom.cdxj
```
If the host folder structure is something other than `/some/path/{warc,cdxj,ipfs}` then these volumes need to be mounted separately.
To build an image from the source, run the following command from the directory where the source code is checked out. The name of the locally built image could be anything, but we use `oduwsdl/ipwb` to be consistent with the above commands.
```
$ docker image build -t oduwsdl/ipwb .
```
By default, the image building process also performs tests, so it might take a while to build the image. It ensures that an image will not be created with failing tests. However, it is possible to skip tests by supplying a build-arg `--build-arg SKIPTEST=true` as illustrated below:
```
$ docker image build --build-arg SKIPTEST=true -t oduwsdl/ipwb .
```
## Help
Usage of sub-commands in ipwb can be accessed through providing the `-h` or `--help` flag, like any of the below.
```
$ ipwb -h
usage: ipwb [-h] [-d DAEMON_ADDRESS] [-v] [-u] {index,replay} ...
InterPlanetary Wayback (ipwb)
optional arguments:
-h, --help show this help message and exit
-d DAEMON_ADDRESS, --daemon DAEMON_ADDRESS
Multi-address of IPFS daemon (default
/dns/localhost/tcp/5001/http)
-v, --version Report the version of ipwb
-u, --update-check Check whether an updated version of ipwb is available
ipwb commands:
Invoke using "ipwb <command>", e.g., ipwb replay <cdxjFile>
{index,replay}
index Index a WARC file for replay in ipwb
replay Start the ipwb replay system
```
```
$ ipwb index -h
usage: ipwb [-h] [-e] [-c] [--compressFirst] [-o OUTFILE] [--debug]
index <warc_path> [index <warc_path> ...]
Index a WARC file for replay in ipwb
positional arguments:
index <warc_path> Path to a WARC[.gz] file
optional arguments:
-h, --help show this help message and exit
-e Encrypt WARC content prior to adding to IPFS
-c Compress WARC content prior to adding to IPFS
--compressFirst Compress data before encryption, where applicable
-o OUTFILE, --outfile OUTFILE
Path to an output CDXJ file, defaults to STDOUT
--debug Convenience flag to help with testing and debugging
```
```
$ ipwb replay -h
usage: ipwb replay [-h] [-P [<host:port>]] [index]
Start the ipwb relay system
positional arguments:
index path, URI, or multihash of file to use for replay
optional arguments:
-h, --help show this help message and exit
-P [<host:port>], --proxy [<host:port>]
Proxy URL
```
## Project History
This repo contains the code for integrating [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717)s and [IPFS](https://ipfs.io/) as developed at the [Archives Unleashed: Web Archive Hackathon]() in Toronto, Canada in March 2016. The project was also presented at:
- The [Joint Conference on Digital Libraries 2016](http://www.jcdl2016.org/) in Newark, NJ in June 2016.
- The [Web Archiving and Digital Libraries (WADL) 2016 workshop](http://fox.cs.vt.edu/wadl2016.html) in Newark, NJ in June 2016.
- The [Theory and Practice on Digital Libraries (TPDL) 2016](http://www.tpdl2016.org/) in Hannover, Germany in September 2016.
- The [Archives Unleashed 4.0: Web Archive Datathon](https://archivesunleashed.com/call-for-participation-au4/) in London, England in June 2017.
- The [International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017](http://netpreserve.org/wac2017/) in London, England in June 2017.
- The [Decentralized Web Summit 2018's](https://www.decentralizedweb.net/) IPFS Lab Day in San Francisco, CA in August 2018.
### Citing Project
We have numerous publications related to this project, but the most significant and primary one was published in TPDL 2016. ([Read the PDF](https://matkelly.com/papers/2016_tpdl_ipwb.pdf))
> Mat Kelly, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. __InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives__. In _Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries_, pages 411–416, Hamburg, Germany, June 2016.
```bib
@INPROCEEDINGS{ipwb-tpdl2016,
AUTHOR = {Mat Kelly and
Sawood Alam and
Michael L. Nelson and
Michele C. Weigle},
TITLE = {{InterPlanetary Wayback}: Peer-To-Peer Permanence of Web Archives},
BOOKTITLE = {Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries},
PAGES = {411--416},
MONTH = {June},
YEAR = {2016},
ADDRESS = {Hamburg, Germany},
DOI = {10.1007/978-3-319-43997-6_35}
}
```
# License
MIT
%prep
%autosetup -n ipwb-0.2023.3.12.2026
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-ipwb -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 0.2023.3.12.2026-1
- Package Spec generated
|