%global _empty_manifest_terminate_build 0
Name:		python-cloud-files
Version:	4.15.1
Release:	1
Summary:	Fast access to cloud storage and local FS.
License:	BSD
URL:		https://github.com/seung-lab/cloud-files/
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/c8/73/0244deab8c26ab629bd9cec2675a8a019bdbadc0c29b66a6bb401f6e74e3/cloud-files-4.15.1.tar.gz
BuildArch:	noarch

Requires:	python3-boto3
Requires:	python3-brotli
Requires:	python3-crc32c
Requires:	python3-chardet
Requires:	python3-click
Requires:	python3-deflate
Requires:	python3-gevent
Requires:	python3-google-auth
Requires:	python3-google-cloud-core
Requires:	python3-google-cloud-storage
Requires:	python3-google-crc32c
Requires:	python3-orjson
Requires:	python3-pathos
Requires:	python3-protobuf
Requires:	python3-requests
Requires:	python3-six
Requires:	python3-tenacity
Requires:	python3-tqdm
Requires:	python3-urllib3
Requires:	python3-zstandard
Requires:	python3-rsa
Requires:	python3-numpy
Requires:	python3-pytest
Requires:	python3-moto

%description
```python
from cloudfiles import CloudFiles, CloudFile, dl
results = dl(["gs://bucket/file1", "gs://bucket2/file2", ... ]) # shorthand
cf = CloudFiles('gs://bucket', progress=True) # s3://, https://, and file:// also supported
results = cf.get(['file1', 'file2', 'file3', ..., 'fileN']) # threaded
results = cf.get(paths, parallel=2) # threaded and two processes
file1 = cf['file1']
part  = cf['file1', 0:30] # first 30 bytes of file1
cf.put('filename', content)
cf.put_json('filename', content)
cf.puts([{
    'path': 'filename',
    'content': content,
}, ... ]) # automatically threaded
cf.puts(content, parallel=2) # threaded + two processes
cf.puts(content, storage_class="NEARLINE") # apply vendor-specific storage class
cf.put_jsons(...) # same as puts
cf['filename'] = content
for fname in cf.list(prefix='abc123'):
    print(fname)
list(cf) # same as list(cf.list())
cf.delete('filename')
del cf['filename']
cf.delete([ 'filename_1', 'filename_2', ... ]) # threaded
cf.delete(paths, parallel=2) # threaded + two processes
boolean = cf.exists('filename')
results = cf.exists([ 'filename_1', ... ]) # threaded
# for single files
cf = CloudFile("gs://bucket/file1")
info = cf.head()
binary = cf.get()
cf.put(binary)
cf[:30] # get first 30 bytes of file
```
CloudFiles was developed to access files from object storage without ever touching disk. The goal was to reliably and rapidly access a petabyte of image data broken down into tens to hundreds of millions of files being accessed in parallel across thousands of cores. CloudFiles has been used to process dozens of images, many of which were in the hundreds of terabytes range. It has reliably read and written tens of billions of files to date.
## Highlights
1. Fast file access with transparent threading and optional multi-processing.
2. Google Cloud Storage, Amazon S3, local filesystems, and arbitrary web servers, making hybrid or multi-cloud easy.
3. Robust to flaky network connections. Uses exponential random window retries to avoid network collisions on a large cluster. Validates md5 for gcs and s3.
4. gzip, brotli, bz2, zstd, and xz compression.
5. Supports HTTP Range reads.
6. Supports green threads, which are important for achieving maximum performance on virtualized servers.
7. High efficiency transfers that avoid compression/decompression cycles.
8. High speed gzip decompression using libdeflate (compared with zlib).
9. Bundled CLI tool.
10. Accepts iterator and generator input.
11. Resumable bulk transfers.
12. Supports composite parallel upload for GCS and multi-part upload for AWS S3.
13. Supports s3 and GCS internal copies to avoid unnecessary data movement.
## Installation 
```bash
pip install cloud-files
pip install cloud-files[test] # to enable testing with pytest
```
If you run into trouble installing dependencies, make sure you're using at least Python 3.6 and that you have updated pip. On Linux, some dependencies require manylinux2010 or manylinux2014 binaries, which earlier versions of pip do not search for. macOS, Linux, and Windows are supported platforms.
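As a quick sanity check, upgrading pip before installing usually resolves the manylinux wheel issue described above:
```bash
python3 -m pip install --upgrade pip  # newer pip can locate manylinux2010/2014 wheels
pip install cloud-files
```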
### Credentials
You may wish to install credentials under `~/.cloudvolume/secrets`. CloudFiles is descended from CloudVolume, and for now we'll leave the same configuration structure in place. 
You need credentials only for the services you'll use. The local filesystem doesn't need any. Google Storage ([setup instructions here](https://github.com/seung-lab/cloud-volume/wiki/Setting-up-Google-Cloud-Storage)) will attempt to use default account credentials if no service account is provided.  
If neither of those two conditions apply, you need a service account credential. `google-secret.json` is a service account credential for Google Storage, `aws-secret.json` is a service account for S3, etc. You can support multiple projects at once by prefixing the name of the bucket you are planning to access to the credential filename. `google-secret.json` will be your default service account, but if you also want to access bucket ABC, you can provide `ABC-google-secret.json` and you'll have simultaneous access to your ordinary buckets and ABC. The secondary credentials are accessed on the basis of the bucket name, not the project name.
```bash
mkdir -p ~/.cloudvolume/secrets/
mv aws-secret.json ~/.cloudvolume/secrets/ # needed for Amazon
mv google-secret.json ~/.cloudvolume/secrets/ # needed for Google
mv matrix-secret.json ~/.cloudvolume/secrets/ # needed for Matrix
```
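For the per-bucket convention described above, the prefixed credential simply sits alongside the default one (the bucket name `ABC` is only an example):
```bash
mv ABC-google-secret.json ~/.cloudvolume/secrets/ # used only when accessing bucket ABC
```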
#### `aws-secret.json` and `matrix-secret.json`
Create an [IAM user service account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) that can read, write, and delete objects from at least one bucket.
```json
{
  "AWS_ACCESS_KEY_ID": "$MY_AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY": "$MY_SECRET_ACCESS_TOKEN",
  "AWS_DEFAULT_REGION": "$MY_AWS_REGION" // defaults to us-east-1
}
```
#### `google-secret.json`
You can create the `google-secret.json` file [here](https://console.cloud.google.com/iam-admin/serviceaccounts). You don't need to manually fill in JSON by hand, the below example is provided to show you what the end result should look like. You should be able to read, write, and delete objects from at least one bucket.
```json
{
  "type": "service_account",
  "project_id": "$YOUR_GOOGLE_PROJECT_ID",
  "private_key_id": "...",
  "private_key": "...",
  "client_email": "...",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": ""
}
```
## API Documentation  
Note that the "Cloud Costs" mentioned below are current as of June 2020 and are subject to change. As of this writing, S3 and Google use identical cost structures for these operations.  
### Constructor
```python
# import gevent.monkey # uncomment when using green threads
# gevent.monkey.patch_all(thread=False)
from cloudfiles import CloudFiles
cf = CloudFiles(
    cloudpath, progress=False, 
    green=None, secrets=None, num_threads=20,
    use_https=False, endpoint=None, request_payer=None,
    composite_upload_threshold = int(1e8)
)
# cloudpath examples:
cf = CloudFiles('gs://bucket/') # google cloud storage
cf = CloudFiles('s3://bucket/') # Amazon S3
cf = CloudFiles('s3://https://s3emulator.com/coolguy/') # alternate s3 endpoint
cf = CloudFiles('file:///home/coolguy/') # local filesystem
cf = CloudFiles('mem:///home/coolguy/') # in memory
cf = CloudFiles('https://website.com/coolguy/') # arbitrary web server
```
* cloudpath: The path to the bucket you are accessing. The path is formatted as `$PROTOCOL://BUCKET/PATH`. Files will then be accessed relative to the path. The protocols supported are `gs` (GCS), `s3` (AWS S3), `file` (local FS), `mem` (RAM), and `http`/`https`.
* progress: Whether to display a progress bar when processing multiple items simultaneously.
* green: Use green threads. For this to work properly, you must uncomment the top two lines. Green threads are used automatically upon monkey patching if green is None.
* secrets: Provide secrets dynamically rather than fetching from the credentials directory `$HOME/.cloudvolume/secrets`.
* num_threads: Number of simultaneous requests to make. Usually 20 per core is pretty close to optimal unless file sizes are extreme.
* use_https: `gs://` and `s3://` require credentials to access their files. However, each has a read-only https endpoint that sometimes requires no credentials. If True, automatically convert `gs://` to `https://storage.googleapis.com/` and `s3://` to `https://s3.amazonaws.com/`.
* endpoint: (s3 only) provide an endpoint other than the official Amazon servers. This is useful for accessing the various S3 emulators offered by on-premises deployments of object storage.
* request_payer: Specify the account that should be charged for requests towards the bucket, rather than the bucket owner.
  * `gs://`: `request_payer` can be any Google Cloud project id. Please refer to the documentation for [more information](https://cloud.google.com/storage/docs/requester-pays).
  * `s3://`: `request_payer` must be `"requester"`. The AWS account associated with the AWS_ACCESS_KEY_ID will be charged. Please refer to the documentation for [more information](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html)
Using `mem://` rather than a plain `dict` has two advantages: your code uses the identical CloudFiles interface, and compression is applied automatically.
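A minimal sketch of that drop-in pattern, assuming the in-memory backend behaves like the other protocols shown above (the path and contents are illustrative):
```python
from cloudfiles import CloudFiles

cf = CloudFiles('mem:///tmp/example/')              # in memory, same interface as gs:// or s3://
cf.put('greeting', b'hello world', compress='gzip') # compressed transparently
assert cf.get('greeting') == b'hello world'         # decompressed transparently on read
```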
### get / get_json
```python
# Let 'filename' be the file b'hello world'
binary = cf.get('filename')
binary = cf['filename']
>> b'hello world'
compressed_binary = cf.get('filename', raw=True) 
binaries = cf.get(['filename1', 'filename2'])
>> [ { 'path': 'filename1', 'content': b'...', 'byte_range': (None, None), 'error': None }, { 'path': 'filename2', 'content': b'...', 'byte_range': (None, None), 'error': None } ]
# total provides info for progress bar when using generators.
binaries = cf.get(generator(), total=N) 
binary = cf.get({ 'path': 'filename', 'start': 0, 'end': 5 }) # fetches 5 bytes
binary = cf['filename', 0:5] # only fetches 5 bytes
binary = cf['filename'][0:5] # same result, fetches 11 bytes
>> b'hello' # represents byte range 0-4 inclusive of filename
binaries = cf[:100] # download the first 100 files
```
`get` supports several different styles of input. The simplest takes a scalar filename and returns the contents of that file. However, you can also specify lists of filenames, a byte range request, and lists of byte range requests. You can provide a generator or iterator as input as well. Order is not guaranteed.
When more than one file is provided at once, the download will be threaded using preemptive or cooperative (green) threads depending on the `green` setting. If `progress` is set to true, a progress bar will be displayed that counts down the number of files to download.
`get_json` is the same as `get` except that it parses the returned binary as UTF-8 encoded JSON and returns the decoded result. Order is guaranteed.
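For instance, a hypothetical round trip using the JSON helpers (the filename `digits` is only an example):
```python
cf.put_json('digits', [1, 2, 3, 4, 5])
assert cf.get_json('digits') == [1, 2, 3, 4, 5]
```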
Cloud Cost: Usually about $0.40 per million requests.
### put / puts / put_json / put_jsons
```python 
cf.put('filename', b'content')
cf['filename'] = b'content'
cf.put_json('digits', [1,2,3,4,5])
cf.puts([{ 
   'path': 'filename',
   'content': b'...',
   'content_type': 'application/octet-stream',
   'compress': 'gzip',
   'compression_level': 6, # parameter for gzip or brotli compressor
   'cache_control': 'no-cache',
}])
cf.puts([ (path, content), (path, content) ], compression='gzip')
cf.put_jsons(...)
# Definition of put, put_json is identical
def put(
    self, 
    path, content,     
    content_type=None, compress=None, 
    compression_level=None, cache_control=None,
    raw=False
)
# Definition of puts, put_jsons is identical
def puts(
    self, files, 
    content_type=None, compress=None, 
    compression_level=None, cache_control=None,
    total=None, raw=False
)
```
The PUT operation is the most complex operation because it's so configurable. Sometimes you want one file, sometimes many. Sometimes you want to configure each file individually, sometimes you want to standardize a bulk upload. Sometimes it's binary data, but oftentimes it's JSON. We provide a simpler interface for uploading a single file `put` and `put_json` (singular) versus the interface for uploading possibly many files `puts` and `put_jsons` (plural). 
In order to upload many files at once (which is much faster due to threading), you need to minimally provide the `path` and `content` for each file. This can be done either as a dict containing those fields or as a tuple `(path, content)`. If dicts are used, the fields (if present) specified in the dict take precedence over the parameters of the function. You can mix tuples with dicts. The input to puts can be a scalar (a single dict or tuple) or an iterable such as a list, iterator, or generator.  
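A small sketch of that mixed input style, using only parameters from the `puts` signature shown above (paths and contents are placeholders):
```python
def files_gen():
    # dict fields, where present, override the keyword defaults of the call
    yield { 'path': 'a.bin', 'content': b'...', 'cache_control': 'no-cache' }
    yield ('b.bin', b'...') # a bare (path, content) tuple works too

cf.puts(files_gen(), compress='gzip', total=2) # total sizes the progress bar for generators
```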
Cloud Cost: Usually about $5 per million files.
### delete
```python 
cf.delete('filename')
cf.delete([ 'file1', 'file2', ... ])
del cf['filename']
```
This will issue a delete request for each file specified in a threaded fashion.
Cloud Cost: Usually free.
### exists 
```python 
cf.exists('filename') 
>> True # or False
cf.exists([ 'file1', 'file2', ... ]) 
>> { 'file1': True, 'file2': False, ... }
```
Scalar input results in a simple boolean output while iterable input returns a dictionary of input paths mapped to whether they exist. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance. 
Cloud Cost: Usually about $0.40 per million requests.
### size 
```python
cf.size('filename')
>>> 1337 # size in bytes
cf.size([ 'file1', 'nonexistent', 'empty_file', ... ])
>>> { 'file1': 392, 'nonexistent': None, 'empty_file': 0, ... }
```
The output is the size of each file as it is stored in bytes. If the file doesn't exist, `None` is returned. Scalar input results in a single integer (or `None`) while iterable input returns a dictionary of input paths mapped to their sizes. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance. 
Cloud Cost: Usually about $0.40 per million requests.
### list
```python 
cf.list() # returns generator
list(cf) # same as list(cf.list())
cf.list(prefix="abc")
cf.list(prefix="abc", flat=True)
```
Recall that in object storage, directories do not really exist and file paths are really a key-value mapping. The `list` operator will list everything under the `cloudpath` given in the constructor. The `prefix` operator allows you to efficiently filter some of the results. If `flat` is specified, the results will be filtered to return only a single "level" of the "directory" even though directories are fake. The entire set of all subdirectories will still need to be fetched.
Cloud Cost: Usually about $5 per million requests, but each request might list 1000 files. The list operation will continuously issue list requests lazily as needed.
### transfer_to / transfer_from
```python
cff = CloudFiles('file:///source_location')
cfg = CloudFiles('gs://dest_location')
# Transfer all files from local filesys to google cloud storage
cfg.transfer_from(cff, block_size=64) # in blocks of 64 files
cff.transfer_to(cfg, block_size=64)
cff.transfer_to(cfg, block_size=64, reencode='br') # change encoding to brotli
cfg[:] = cff # default block size 64
```
Transfer semantics provide a simple way to perform bulk file transfers. Use the `block_size` parameter to adjust the number of files handled in a given pass. This can be important for preventing memory blow-up and reducing latency between batches.
gs to gs and s3 to s3 transfers will occur within the cloud without looping through the executing client provided no reencoding is specified.
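A short sketch of such an in-cloud copy (bucket names are placeholders); because both sides are `gs://` and no `reencode` is given, the bytes should not loop through the client:
```python
src = CloudFiles('gs://bucket-a/dataset/')
dst = CloudFiles('gs://bucket-b/dataset/')
dst.transfer_from(src, block_size=64) # copies within GCS, 64 files per batch
```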
#### resumable transfer
```python
from cloudfiles import ResumableTransfer
# .db b/c this is a sqlite database
# that will be automatically created
rt = ResumableTransfer("NAME_OF_JOB.db") 
# init should only be called once
rt.init("file://source_location", "gs://dest_location")
# This part can be interrupted and resumed
rt.execute("NAME_OF_JOB.db")
# If multiple transfer clients, the lease_msec
# parameter must be specified to prevent conflicts.
rt.execute("NAME_OF_JOB.db", lease_msec=30000)
rt.close() # deletes NAME_OF_JOB.db
```
This is essentially a more durable version of `cp`. The transfer works by first loading a sqlite database with filenames, a "done" flag, and a lease time. Then clients can attach to the database and execute the transfer in batches. When multiple clients are used, a lease time must be set so that the database does not hand the same batch of files to every client and the transfer remains robust.
This transfer type can also be accessed via the CLI.
```bash
cloudfiles xfer init SOURCE DEST --db NAME_OF_JOB.db
cloudfiles xfer execute NAME_OF_JOB.db # deletes db when done
```
### composite upload (Google Cloud Storage)
If a file is larger than 100MB (default), CloudFiles will split the file into 100MB parts and upload them as individual part files using the STANDARD storage class to minimize deletion costs. Once uploaded, the part files will be recursively merged in a tree 32 files at a time. After each merge, the part files will be deleted. The final file will have the default storage class for the bucket.
If an upload is interrupted, the part files will remain and must be cleaned up. You can provide a file handle opened for binary reading instead of a bytes object so that large files can be uploaded without overwhelming RAM. You can also adjust the composite threshold using `CloudFiles(..., composite_upload_threshold=int(2e8))` to, for example, raise the threshold to 200MB.
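A minimal sketch of that streaming upload (the bucket and filename are illustrative):
```python
cf = CloudFiles('gs://bucket/', composite_upload_threshold=int(2e8)) # raise threshold to 200MB
with open('large_volume.bin', 'rb') as f:
    cf.put('large_volume.bin', f) # a file handle instead of bytes keeps RAM usage low
```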
### multi-part upload (S3)
If a file is larger than 100MB (default), the S3 service will use multi-part upload. You can provide a file handle opened for binary reading instead of a bytes object so that large files can be uploaded without overwhelming RAM. You can also adjust the composite threshold using `CloudFiles(..., composite_upload_threshold=int(2e8))` to, for example, raise the threshold to 200MB.
Unfinished upload parts remain on S3 (and cost money) unless you use a bucket lifecycle rule to remove them automatically.  
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
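One way to configure such a lifecycle rule, assuming the AWS CLI is available (the bucket name and retention period are placeholders):
```bash
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
  --lifecycle-configuration '{"Rules": [{"ID": "abort-incomplete-mpu", "Status": "Enabled",
    "Filter": {}, "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}}]}'
```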
### transcode
```python
from cloudfiles.compression import transcode
files = cf.get(...) 
for file in transcode(files, 'gzip'):
  file['content'] # gzipped file content regardless of source
transcode(files, 
  encoding='gzip', # any cf compatible compression scheme
  in_place=False, # modify the files in-place to save memory
  progress=True # progress bar
)
```
Sometimes we want to change the encoding type of a set of arbitrary files (often when moving them around to another storage system). `transcode` will take the output of `get` and transcode the resultant files into a new format. `transcode` respects the `raw` attribute which indicates that the contents are already compressed and will decompress them first before recompressing. If the input data are already compressed to the correct output encoding, it will simply pass it through without going through a decompression/recompression cycle.
`transcode` returns a generator so that the transcoding can be done in a streaming manner.
## Network Robustness
CloudFiles protects itself from network issues in several ways. 
First, it uses a connection pool to avoid needing to reestablish connections or exhausting the number of available sockets.  
Second, it uses an exponential random window backoff to retry failed connections and requests. The exponential backoff allows increasing breathing room for an overloaded server and the random window decorrelates independent attempts by a cluster. If the backoff did not grow, the retry attempts by a large cluster would be either too rapid-fire or inefficiently slow. If the attempts were not decorrelated, then regardless of the backoff, the clients would often all try again around the same time. We back off seven times, starting from 0.5 seconds up to 60 seconds, doubling the random window each time.
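An illustrative sketch of that retry strategy (not CloudFiles' actual implementation), doubling a random sleep window on each failure:
```python
import random
import time

def retry_with_backoff(request, attempts=7, base=0.5, cap=60.0):
    """Retry `request`, sleeping a random amount inside a doubling window between attempts."""
    for i in range(attempts):
        try:
            return request()
        except ConnectionError:
            window = min(base * (2 ** i), cap)     # 0.5s, 1s, 2s, ... capped at 60s
            time.sleep(random.uniform(0, window))  # random point in window decorrelates clients
    return request()  # final attempt; let any exception propagate
```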
Third, for Google Cloud Storage (GCS) and S3 endpoints, we compute the md5 digest both sending and receiving to ensure data corruption did not occur in transit and that the server sent the full response. We cannot validate the digest for partial ("Range") reads. For [composite objects](https://cloud.google.com/storage/docs/composite-objects) (GCS) we can check the [crc32c](https://pypi.org/project/crc32c/) check-sum which catches transmission errors but not tampering (though MD5 isn't secure at all anymore). We are unable to perform validation for [multi-part uploads](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) (S3). Using custom encryption keys may also create validation problems.
## CloudFiles CLI Tool
```bash
# list cloud and local directories
cloudfiles ls gs://bucket-folder/
# parallel file transfer, no decompression
cloudfiles -p 2 cp --progress -r s3://bkt/ gs://bkt2/
# change compression type to brotli
cloudfiles cp -c br s3://bkt/file.txt gs://bkt2/
# decompress
cloudfiles cp -c none s3://bkt/file.txt gs://bkt2/
# pass from stdin (use "-" for source argument)
find some_dir | cloudfiles cp - s3://bkt/
# resumable transfers
cloudfiles xfer init SRC DEST --db JOBNAME.db
cloudfiles xfer execute JOBNAME.db --progress # can quit and resume
# Get human readable file sizes from anywhere
cloudfiles du -shc ./tmp gs://bkt/dir s3://bkt/dir
# remove files
cloudfiles rm ./tmp gs://bkt/dir/file s3://bkt/dir/file
# cat across services, -r for range reads
cloudfiles cat ./tmp gs://bkt/dir/file s3://bkt/dir/file
# verify a transfer was successful by comparing bytes and hashes
cloudfiles verify ./my-data gs://bucket/my-data/ 
```
### `cp` Pros and Cons
For the cp command, the bundled CLI tool has a number of advantages vs. `gsutil` when it comes to transfers.
1. No decompression of file transfers (unless you want to).
2. Can shift compression format.
3. Easily control the number of parallel processes.
4. Green threads make core utilization more efficient.
5. Optionally uses libdeflate for faster gzip decompression.
It also has some disadvantages:  
1. Doesn't support all commands.
2. File suffixes may be added to signify compression type on the local filesystem (e.g. `.gz`, `.br`, or `.zstd`). `cloudfiles ls` will list them without the extension and they will be converted into `Content-Encoding` on cloud storage.
### `ls` Generative Expressions
For the `ls` command, we support (via the `-e` flag) simple generative expressions that enable querying multiple prefixes at once. A generative expression is denoted `[chars]` where `c`,`h`,`a`,`r`, & `s` will be inserted individually into the position where the expression appears. Multiple expressions are allowed and produce a cartesian product of resulting strings. This functionality is very limited at the moment but we intend to improve it.
```bash
cloudfiles ls -e "gs://bucket/prefix[ab]"
# equivalent to:
# cloudfiles ls gs://bucket/prefixa
# cloudfiles ls gs://bucket/prefixb
```
### `alias` for Alternative S3 Endpoints
You can set your own protocols for S3 compatible endpoints by creating dynamic or persistent aliases. CloudFiles comes with two official aliases that are important for the Seung Lab, `matrix://` and `tigerdata://`, which point to Princeton S3 endpoints. Official aliases can't be overridden.
To create a dynamic alias, you can use `cloudfiles.paths.add_alias` which will only affect the current process. To create a persistent alias that resides in `~/.cloudfiles/aliases.json`, you can use the CLI. 
```bash
cloudfiles alias add example s3://https://example.com/ # example://
cloudfiles alias ls # list all aliases
cloudfiles alias rm example # remove example://
```
The alias file is only accessed (and cached) if CloudFiles encounters an unknown protocol. If you stick to default protocols and use the syntax `s3://https://example.com/` for alternative endpoints, you can still use CloudFiles in environments without filesystem access.
## Credits
CloudFiles is derived from the [CloudVolume.Storage](https://github.com/seung-lab/cloud-volume/tree/master/cloudvolume/storage) system.  
Storage was initially created by William Silversmith and Ignacio Tartavull. It was maintained and improved by William Silversmith and includes improvements by Nico Kemnitz (extremely fast exists) and Ben Falk (brotli). Manuel Castro added the ability to choose the cloud storage class. Thanks to the anonymous author from https://teppen.io/ for their s3 etag validation code.

%package -n python3-cloud-files
Summary:	Fast access to cloud storage and local FS.
Provides:	python-cloud-files
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-cloud-files
```python
from cloudfiles import CloudFiles, CloudFile, dl
results = dl(["gs://bucket/file1", "gs://bucket2/file2", ... ]) # shorthand
cf = CloudFiles('gs://bucket', progress=True) # s3://, https://, and file:// also supported
results = cf.get(['file1', 'file2', 'file3', ..., 'fileN']) # threaded
results = cf.get(paths, parallel=2) # threaded and two processes
file1 = cf['file1']
part  = cf['file1', 0:30] # first 30 bytes of file1
cf.put('filename', content)
cf.put_json('filename', content)
cf.puts([{
    'path': 'filename',
    'content': content,
}, ... ]) # automatically threaded
cf.puts(content, parallel=2) # threaded + two processes
cf.puts(content, storage_class="NEARLINE") # apply vendor-specific storage class
cf.put_jsons(...) # same as puts
cf['filename'] = content
for fname in cf.list(prefix='abc123'):
    print(fname)
list(cf) # same as list(cf.list())
cf.delete('filename')
del cf['filename']
cf.delete([ 'filename_1', 'filename_2', ... ]) # threaded
cf.delete(paths, parallel=2) # threaded + two processes
boolean = cf.exists('filename')
results = cf.exists([ 'filename_1', ... ]) # threaded
# for single files
cf = CloudFile("gs://bucket/file1")
info = cf.head()
binary = cf.get()
cf.put(binary)
cf[:30] # get first 30 bytes of file
```
CloudFiles was developed to access files from object storage without ever touching disk. The goal was to reliably and rapidly access a petabyte of image data broken down into tens to hundreds of millions of files being accessed in parallel across thousands of cores. CloudFiles has been used to process dozens of images, many of which were in the hundreds of terabytes range. It has reliably read and written tens of billions of files to date.
## Highlights
1. Fast file access with transparent threading and optional multi-processing.
2. Google Cloud Storage, Amazon S3, local filesystems, and arbitrary web servers, making hybrid or multi-cloud easy.
3. Robust to flaky network connections. Uses exponential random window retries to avoid network collisions on a large cluster. Validates md5 for gcs and s3.
4. gzip, brotli, bz2, zstd, and xz compression.
5. Supports HTTP Range reads.
6. Supports green threads, which are important for achieving maximum performance on virtualized servers.
7. High efficiency transfers that avoid compression/decompression cycles.
8. High speed gzip decompression using libdeflate (compared with zlib).
9. Bundled CLI tool.
10. Accepts iterator and generator input.
11. Resumable bulk transfers.
12. Supports composite parallel upload for GCS and multi-part upload for AWS S3.
13. Supports s3 and GCS internal copies to avoid unnecessary data movement.
## Installation 
```bash
pip install cloud-files
pip install cloud-files[test] # to enable testing with pytest
```
If you run into trouble installing dependencies, make sure you're using at least Python 3.6 and that you have updated pip. On Linux, some dependencies require manylinux2010 or manylinux2014 binaries, which earlier versions of pip do not search for. macOS, Linux, and Windows are supported platforms.
### Credentials
You may wish to install credentials under `~/.cloudvolume/secrets`. CloudFiles is descended from CloudVolume, and for now we'll leave the same configuration structure in place. 
You need credentials only for the services you'll use. The local filesystem doesn't need any. Google Storage ([setup instructions here](https://github.com/seung-lab/cloud-volume/wiki/Setting-up-Google-Cloud-Storage)) will attempt to use default account credentials if no service account is provided.  
If neither of those two conditions apply, you need a service account credential. `google-secret.json` is a service account credential for Google Storage, `aws-secret.json` is a service account for S3, etc. You can support multiple projects at once by prefixing the name of the bucket you are planning to access to the credential filename. `google-secret.json` will be your default service account, but if you also want to access bucket ABC, you can provide `ABC-google-secret.json` and you'll have simultaneous access to your ordinary buckets and ABC. The secondary credentials are accessed on the basis of the bucket name, not the project name.
```bash
mkdir -p ~/.cloudvolume/secrets/
mv aws-secret.json ~/.cloudvolume/secrets/ # needed for Amazon
mv google-secret.json ~/.cloudvolume/secrets/ # needed for Google
mv matrix-secret.json ~/.cloudvolume/secrets/ # needed for Matrix
```
#### `aws-secret.json` and `matrix-secret.json`
Create an [IAM user service account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) that can read, write, and delete objects from at least one bucket.
```json
{
  "AWS_ACCESS_KEY_ID": "$MY_AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY": "$MY_SECRET_ACCESS_TOKEN",
  "AWS_DEFAULT_REGION": "$MY_AWS_REGION" // defaults to us-east-1
}
```
#### `google-secret.json`
You can create the `google-secret.json` file [here](https://console.cloud.google.com/iam-admin/serviceaccounts). You don't need to manually fill in JSON by hand, the below example is provided to show you what the end result should look like. You should be able to read, write, and delete objects from at least one bucket.
```json
{
  "type": "service_account",
  "project_id": "$YOUR_GOOGLE_PROJECT_ID",
  "private_key_id": "...",
  "private_key": "...",
  "client_email": "...",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": ""
}
```
## API Documentation  
Note that the "Cloud Costs" mentioned below are current as of June 2020 and are subject to change. As of this writing, S3 and Google use identical cost structures for these operations.  
### Constructor
```python
# import gevent.monkey # uncomment when using green threads
# gevent.monkey.patch_all(thread=False)
from cloudfiles import CloudFiles
cf = CloudFiles(
    cloudpath, progress=False, 
    green=None, secrets=None, num_threads=20,
    use_https=False, endpoint=None, request_payer=None,
    composite_upload_threshold = int(1e8)
)
# cloudpath examples:
cf = CloudFiles('gs://bucket/') # google cloud storage
cf = CloudFiles('s3://bucket/') # Amazon S3
cf = CloudFiles('s3://https://s3emulator.com/coolguy/') # alternate s3 endpoint
cf = CloudFiles('file:///home/coolguy/') # local filesystem
cf = CloudFiles('mem:///home/coolguy/') # in memory
cf = CloudFiles('https://website.com/coolguy/') # arbitrary web server
```
* cloudpath: The path to the bucket you are accessing. The path is formatted as `$PROTOCOL://BUCKET/PATH`. Files will then be accessed relative to the path. The protocols supported are `gs` (GCS), `s3` (AWS S3), `file` (local FS), `mem` (RAM), and `http`/`https`.
* progress: Whether to display a progress bar when processing multiple items simultaneously.
* green: Use green threads. For this to work properly, you must uncomment the top two lines. Green threads are used automatically upon monkey patching if green is None.
* secrets: Provide secrets dynamically rather than fetching from the credentials directory `$HOME/.cloudvolume/secrets`.
* num_threads: Number of simultaneous requests to make. Usually 20 per core is pretty close to optimal unless file sizes are extreme.
* use_https: `gs://` and `s3://` require credentials to access their files. However, each has a read-only https endpoint that sometimes requires no credentials. If True, automatically convert `gs://` to `https://storage.googleapis.com/` and `s3://` to `https://s3.amazonaws.com/`.
* endpoint: (s3 only) provide an endpoint other than the official Amazon servers. This is useful for accessing the various S3 emulators offered by on-premises deployments of object storage.
* request_payer: Specify the account that should be charged for requests towards the bucket, rather than the bucket owner.
  * `gs://`: `request_payer` can be any Google Cloud project id. Please refer to the documentation for [more information](https://cloud.google.com/storage/docs/requester-pays).
  * `s3://`: `request_payer` must be `"requester"`. The AWS account associated with the AWS_ACCESS_KEY_ID will be charged. Please refer to the documentation for [more information](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html)
Using `mem://` rather than a plain `dict` has two advantages: your code uses the identical CloudFiles interface, and compression is applied automatically.
### get / get_json
```python
# Let 'filename' be the file b'hello world'
binary = cf.get('filename')
binary = cf['filename']
>> b'hello world'
compressed_binary = cf.get('filename', raw=True) 
binaries = cf.get(['filename1', 'filename2'])
>> [ { 'path': 'filename1', 'content': b'...', 'byte_range': (None, None), 'error': None }, { 'path': 'filename2', 'content': b'...', 'byte_range': (None, None), 'error': None } ]
# total provides info for progress bar when using generators.
binaries = cf.get(generator(), total=N) 
binary = cf.get({ 'path': 'filename', 'start': 0, 'end': 5 }) # fetches 5 bytes
binary = cf['filename', 0:5] # only fetches 5 bytes
binary = cf['filename'][0:5] # same result, fetches 11 bytes
>> b'hello' # represents byte range 0-4 inclusive of filename
binaries = cf[:100] # download the first 100 files
```
`get` supports several different styles of input. The simplest takes a scalar filename and returns the contents of that file. However, you can also specify lists of filenames, a byte range request, and lists of byte range requests. You can provide a generator or iterator as input as well. Order is not guaranteed.
When more than one file is provided at once, the download will be threaded using preemptive or cooperative (green) threads depending on the `green` setting. If `progress` is set to true, a progress bar will be displayed that counts down the number of files to download.
`get_json` is the same as `get` except that it parses the returned binary as UTF-8 encoded JSON and returns the decoded result. Order is guaranteed.
Cloud Cost: Usually about $0.40 per million requests.
### put / puts / put_json / put_jsons
```python 
cf.put('filename', b'content')
cf['filename'] = b'content'
cf.put_json('digits', [1,2,3,4,5])
cf.puts([{ 
   'path': 'filename',
   'content': b'...',
   'content_type': 'application/octet-stream',
   'compress': 'gzip',
   'compression_level': 6, # parameter for gzip or brotli compressor
   'cache_control': 'no-cache',
}])
cf.puts([ (path, content), (path, content) ], compression='gzip')
cf.put_jsons(...)
# Definition of put, put_json is identical
def put(
    self, 
    path, content,     
    content_type=None, compress=None, 
    compression_level=None, cache_control=None,
    raw=False
)
# Definition of puts, put_jsons is identical
def puts(
    self, files, 
    content_type=None, compress=None, 
    compression_level=None, cache_control=None,
    total=None, raw=False
)
```
The PUT operation is the most complex operation because it's so configurable. Sometimes you want one file, sometimes many. Sometimes you want to configure each file individually, sometimes you want to standardize a bulk upload. Sometimes it's binary data, but oftentimes it's JSON. We provide a simpler interface for uploading a single file `put` and `put_json` (singular) versus the interface for uploading possibly many files `puts` and `put_jsons` (plural). 
In order to upload many files at once (which is much faster due to threading), you need to minimally provide the `path` and `content` for each file. This can be done either as a dict containing those fields or as a tuple `(path, content)`. If dicts are used, the fields (if present) specified in the dict take precedence over the parameters of the function. You can mix tuples with dicts. The input to puts can be a scalar (a single dict or tuple) or an iterable such as a list, iterator, or generator.  
Cloud Cost: Usually about $5 per million files.
### delete
```python 
cf.delete('filename')
cf.delete([ 'file1', 'file2', ... ])
del cf['filename']
```
This will issue a delete request for each file specified in a threaded fashion.
Cloud Cost: Usually free.
### exists 
```python 
cf.exists('filename') 
>> True # or False
cf.exists([ 'file1', 'file2', ... ]) 
>> { 'file1': True, 'file2': False, ... }
```
Scalar input results in a simple boolean output while iterable input returns a dictionary of input paths mapped to whether they exist. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance. 
Cloud Cost: Usually about $0.40 per million requests.
### size 
```python
cf.size('filename')
>>> 1337 # size in bytes
cf.size([ 'file1', 'nonexistent', 'empty_file', ... ])
>>> { 'file1': 392, 'nonexistent': None, 'empty_file': 0, ... }
```
The output is the size of each file as it is stored in bytes. If the file doesn't exist, `None` is returned. Scalar input results in a single integer (or `None`) while iterable input returns a dictionary of input paths mapped to their sizes. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance. 
Cloud Cost: Usually about $0.40 per million requests.
### list
```python 
cf.list() # returns generator
list(cf) # same as list(cf.list())
cf.list(prefix="abc")
cf.list(prefix="abc", flat=True)
```
Recall that in object storage, directories do not really exist and file paths are really a key-value mapping. The `list` operator will list everything under the `cloudpath` given in the constructor. The `prefix` operator allows you to efficiently filter some of the results. If `flat` is specified, the results will be filtered to return only a single "level" of the "directory" even though directories are fake. The entire set of all subdirectories will still need to be fetched.
Cloud Cost: Usually about $5 per million requests, but each request might list 1000 files. The list operation will continuously issue list requests lazily as needed.
### transfer_to / transfer_from
```python
cff = CloudFiles('file:///source_location')
cfg = CloudFiles('gs://dest_location')
# Transfer all files from local filesys to google cloud storage
cfg.transfer_from(cff, block_size=64) # in blocks of 64 files
cff.transfer_to(cfg, block_size=64)
cff.transfer_to(cfg, block_size=64, reencode='br') # change encoding to brotli
cfg[:] = cff # default block size 64
```
Transfer semantics provide a simple way to perform bulk file transfers. Use the `block_size` parameter to adjust the number of files handled in a given pass. This can be important for preventing memory blow-up and reducing latency between batches.
gs to gs and s3 to s3 transfers will occur within the cloud without looping through the executing client provided no reencoding is specified.
#### resumable transfer
```python
from cloudfiles import ResumableTransfer
# .db b/c this is a sqlite database
# that will be automatically created
rt = ResumableTransfer("NAME_OF_JOB.db") 
# init should only be called once
rt.init("file://source_location", "gs://dest_location")
# This part can be interrupted and resumed
rt.execute("NAME_OF_JOB.db")
# If multiple transfer clients, the lease_msec
# parameter must be specified to prevent conflicts.
rt.execute("NAME_OF_JOB.db", lease_msec=30000)
rt.close() # deletes NAME_OF_JOB.db
```
This is essentially a more durable version of `cp`. The transfer works by first loading a sqlite database with filenames, a "done" flag, and a lease time. Then clients can attach to the database and execute the transfer in batches. When multiple clients are used, a lease time must be set so that the database does not hand the same batch of files to every client and the transfer remains robust.
This transfer type can also be accessed via the CLI.
```bash
cloudfiles xfer init SOURCE DEST --db NAME_OF_JOB.db
cloudfiles xfer execute NAME_OF_JOB.db # deletes db when done
```
### composite upload (Google Cloud Storage)
If a file is larger than 100MB (default), CloudFiles will split the file into 100MB parts and upload them as individual part files using the STANDARD storage class to minimize deletion costs. Once uploaded, the part files will be recursively merged in a tree 32 files at a time. After each merge, the part files will be deleted. The final file will have the default storage class for the bucket.
If an upload is interrupted, the part files will remain and must be cleaned up. You can provide a file handle opened for binary reading instead of a bytes object so that large files can be uploaded without overwhelming RAM. You can also adjust the composite threshold using `CloudFiles(..., composite_upload_threshold=int(2e8))` to, for example, raise the threshold to 200MB.
### multi-part upload (S3)
If a file is larger than 100MB (default), the S3 service will use multi-part upload. You can provide a file handle opened for binary reading instead of a bytes object so that large files can be uploaded without overwhelming RAM. You can also adjust the composite threshold using `CloudFiles(..., composite_upload_threshold=int(2e8))` to, for example, raise the threshold to 200MB.
Unfinished upload parts remain on S3 (and cost money) unless you use a bucket lifecycle rule to remove them automatically.  
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
### transcode
```python
from cloudfiles.compression import transcode
files = cf.get(...) 
for file in transcode(files, 'gzip'):
  file['content'] # gzipped file content regardless of source
transcode(files, 
  encoding='gzip', # any cf compatible compression scheme
  in_place=False, # modify the files in-place to save memory
  progress=True # progress bar
)
```
Sometimes we want to change the encoding type of a set of arbitrary files (often when moving them around to another storage system). `transcode` will take the output of `get` and transcode the resultant files into a new format. `transcode` respects the `raw` attribute which indicates that the contents are already compressed and will decompress them first before recompressing. If the input data are already compressed to the correct output encoding, it will simply pass it through without going through a decompression/recompression cycle.
`transcode` returns a generator so that the transcoding can be done in a streaming manner.
## Network Robustness
CloudFiles protects itself from network issues in several ways. 
First, it uses a connection pool to avoid needing to reestablish connections or exhausting the number of available sockets.  
Second, it uses an exponential random window backoff to retry failed connections and requests. The exponential backoff allows increasing breathing room for an overloaded server and the random window decorrelates independent attempts by a cluster. If the backoff did not grow, the retry attempts by a large cluster would be either too rapid-fire or inefficiently slow. If the attempts were not decorrelated, then regardless of the backoff, the clients would often all try again around the same time. We back off seven times, starting from 0.5 seconds up to 60 seconds, doubling the random window each time.
Third, for Google Cloud Storage (GCS) and S3 endpoints, we compute the md5 digest both sending and receiving to ensure data corruption did not occur in transit and that the server sent the full response. We cannot validate the digest for partial ("Range") reads. For [composite objects](https://cloud.google.com/storage/docs/composite-objects) (GCS) we can check the [crc32c](https://pypi.org/project/crc32c/) check-sum which catches transmission errors but not tampering (though MD5 isn't secure at all anymore). We are unable to perform validation for [multi-part uploads](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) (S3). Using custom encryption keys may also create validation problems.
## CloudFiles CLI Tool
```bash
# list cloud and local directories
cloudfiles ls gs://bucket-folder/
# parallel file transfer, no decompression
cloudfiles -p 2 cp --progress -r s3://bkt/ gs://bkt2/
# change compression type to brotli
cloudfiles cp -c br s3://bkt/file.txt gs://bkt2/
# decompress
cloudfiles cp -c none s3://bkt/file.txt gs://bkt2/
# pass from stdin (use "-" for source argument)
find some_dir | cloudfiles cp - s3://bkt/
# resumable transfers
cloudfiles xfer init SRC DEST --db JOBNAME.db
cloudfiles xfer execute JOBNAME.db --progress # can quit and resume
# Get human readable file sizes from anywhere
cloudfiles du -shc ./tmp gs://bkt/dir s3://bkt/dir
# remove files
cloudfiles rm ./tmp gs://bkt/dir/file s3://bkt/dir/file
# cat across services, -r for range reads
cloudfiles cat ./tmp gs://bkt/dir/file s3://bkt/dir/file
# verify a transfer was successful by comparing bytes and hashes
cloudfiles verify ./my-data gs://bucket/my-data/ 
```
### `cp` Pros and Cons
For the cp command, the bundled CLI tool has a number of advantages vs. `gsutil` when it comes to transfers.
1. No decompression of file transfers (unless you want to).
2. Can shift compression format.
3. Easily control the number of parallel processes.
4. Green threads make core utilization more efficient.
5. Optionally uses libdeflate for faster gzip decompression.
It also has some disadvantages:  
1. Doesn't support all commands.
2. File suffixes may be added to signify compression type on the local filesystem (e.g. `.gz`, `.br`, or `.zstd`). `cloudfiles ls` will list them without the extension and they will be converted into `Content-Encoding` on cloud storage.
### `ls` Generative Expressions
For the `ls` command, we support (via the `-e` flag) simple generative expressions that enable querying multiple prefixes at once. A generative expression is denoted `[chars]` where `c`,`h`,`a`,`r`, & `s` will be inserted individually into the position where the expression appears. Multiple expressions are allowed and produce a cartesian product of resulting strings. This functionality is very limited at the moment but we intend to improve it.
```bash
cloudfiles ls -e "gs://bucket/prefix[ab]"
# equivalent to:
# cloudfiles ls gs://bucket/prefixa
# cloudfiles ls gs://bucket/prefixb
```
### `alias` for Alternative S3 Endpoints
You can set your own protocols for S3 compatible endpoints by creating dynamic or persistent aliases. CloudFiles comes with two official aliases that are important for the Seung Lab, `matrix://` and `tigerdata://`, which point to Princeton S3 endpoints. Official aliases can't be overridden.
To create a dynamic alias, you can use `cloudfiles.paths.add_alias` which will only affect the current process. To create a persistent alias that resides in `~/.cloudfiles/aliases.json`, you can use the CLI. 
```bash
cloudfiles alias add example s3://https://example.com/ # example://
cloudfiles alias ls # list all aliases
cloudfiles alias rm example # remove example://
```
The alias file is only accessed (and cached) if CloudFiles encounters an unknown protocol. If you stick to default protocols and use the syntax `s3://https://example.com/` for alternative endpoints, you can still use CloudFiles in environments without filesystem access.
## Credits
CloudFiles is derived from the [CloudVolume.Storage](https://github.com/seung-lab/cloud-volume/tree/master/cloudvolume/storage) system.  
Storage was initially created by William Silversmith and Ignacio Tartavull. It was maintained and improved by William Silversmith and includes improvements by Nico Kemnitz (extremely fast exists) and Ben Falk (brotli). Manuel Castro added the ability to choose the cloud storage class. Thanks to the anonymous author from https://teppen.io/ for their s3 etag validation code.

%package help
Summary:	Development documents and examples for cloud-files
Provides:	python3-cloud-files-doc
%description help
```python
from cloudfiles import CloudFiles, CloudFile, dl
results = dl(["gs://bucket/file1", "gs://bucket2/file2", ... ]) # shorthand
cf = CloudFiles('gs://bucket', progress=True) # s3://, https://, and file:// also supported
results = cf.get(['file1', 'file2', 'file3', ..., 'fileN']) # threaded
results = cf.get(paths, parallel=2) # threaded and two processes
file1 = cf['file1']
part  = cf['file1', 0:30] # first 30 bytes of file1
cf.put('filename', content)
cf.put_json('filename', content)
cf.puts([{
    'path': 'filename',
    'content': content,
}, ... ]) # automatically threaded
cf.puts(content, parallel=2) # threaded + two processes
cf.puts(content, storage_class="NEARLINE") # apply vendor-specific storage class
cf.put_jsons(...) # same as puts
cf['filename'] = content
for fname in cf.list(prefix='abc123'):
    print(fname)
list(cf) # same as list(cf.list())
cf.delete('filename')
del cf['filename']
cf.delete([ 'filename_1', 'filename_2', ... ]) # threaded
cf.delete(paths, parallel=2) # threaded + two processes
boolean = cf.exists('filename')
results = cf.exists([ 'filename_1', ... ]) # threaded
# for single files
cf = CloudFile("gs://bucket/file1")
info = cf.head()
binary = cf.get()
cf.put(binary)
cf[:30] # get first 30 bytes of file
```
CloudFiles was developed to access files from object storage without ever touching disk. The goal was to reliably and rapidly access a petabyte of image data broken down into tens to hundreds of millions of files being accessed in parallel across thousands of cores. CloudFiles has been used to process dozens of images, many of which were in the hundreds of terabytes range. It has reliably read and written tens of billions of files to date.
## Highlights
1. Fast file access with transparent threading and optional multi-processing.
2. Google Cloud Storage, Amazon S3, local filesystems, and arbitrary web servers, making hybrid or multi-cloud easy.
3. Robust to flaky network connections. Uses exponential random window retries to avoid network collisions on a large cluster. Validates md5 for gcs and s3.
4. gzip, brotli, bz2, zstd, and xz compression.
5. Supports HTTP Range reads.
6. Supports green threads, which are important for achieving maximum performance on virtualized servers.
7. High efficiency transfers that avoid compression/decompression cycles.
8. High speed gzip decompression using libdeflate (compared with zlib).
9. Bundled CLI tool.
10. Accepts iterator and generator input.
11. Resumable bulk transfers.
12. Supports composite parallel upload for GCS and multi-part upload for AWS S3.
13. Supports s3 and GCS internal copies to avoid unnecessary data movement.
## Installation 
```bash
pip install cloud-files
pip install cloud-files[test] # to enable testing with pytest
```
If you run into trouble installing dependencies, make sure you're using at least Python 3.6 and that you have updated pip. On Linux, some dependencies require manylinux2010 or manylinux2014 binaries, which earlier versions of pip do not search for. macOS, Linux, and Windows are supported platforms.
### Credentials
You may wish to install credentials under `~/.cloudvolume/secrets`. CloudFiles is descended from CloudVolume, and for now we'll leave the same configuration structure in place. 
You need credentials only for the services you'll use. The local filesystem doesn't need any. Google Storage ([setup instructions here](https://github.com/seung-lab/cloud-volume/wiki/Setting-up-Google-Cloud-Storage)) will attempt to use default account credentials if no service account is provided.  
If neither of those two conditions apply, you need a service account credential. `google-secret.json` is a service account credential for Google Storage, `aws-secret.json` is a service account for S3, etc. You can support multiple projects at once by prefixing the name of the bucket you are planning to access to the credential filename. `google-secret.json` will be your default service account, but if you also want to access bucket ABC, you can provide `ABC-google-secret.json` and you'll have simultaneous access to your ordinary buckets and ABC. The secondary credentials are accessed on the basis of the bucket name, not the project name.
```bash
mkdir -p ~/.cloudvolume/secrets/
mv aws-secret.json ~/.cloudvolume/secrets/ # needed for Amazon
mv google-secret.json ~/.cloudvolume/secrets/ # needed for Google
mv matrix-secret.json ~/.cloudvolume/secrets/ # needed for Matrix
```
#### `aws-secret.json` and `matrix-secret.json`
Create an [IAM user service account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) that can read, write, and delete objects from at least one bucket.
```json
{
  "AWS_ACCESS_KEY_ID": "$MY_AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY": "$MY_SECRET_ACCESS_TOKEN",
  "AWS_DEFAULT_REGION": "$MY_AWS_REGION" // defaults to us-east-1
}
```
#### `google-secret.json`
You can create the `google-secret.json` file [here](https://console.cloud.google.com/iam-admin/serviceaccounts). You don't need to fill in the JSON by hand; the example below shows what the end result should look like. The service account should be able to read, write, and delete objects from at least one bucket.
```json
{
  "type": "service_account",
  "project_id": "$YOUR_GOOGLE_PROJECT_ID",
  "private_key_id": "...",
  "private_key": "...",
  "client_email": "...",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": ""
}
```
## API Documentation  
Note that the "Cloud Costs" mentioned below are current as of June 2020 and are subject to change. As of this writing, S3 and Google use identical cost structures for these operations.  
### Constructor
```python
# import gevent.monkey # uncomment when using green threads
# gevent.monkey.patch_all(thread=False)
from cloudfiles import CloudFiles
cf = CloudFiles(
    cloudpath, progress=False, 
    green=None, secrets=None, num_threads=20,
    use_https=False, endpoint=None, request_payer=None,
    composite_upload_threshold=int(1e8)
)
# cloudpath examples:
cf = CloudFiles('gs://bucket/') # google cloud storage
cf = CloudFiles('s3://bucket/') # Amazon S3
cf = CloudFiles('s3://https://s3emulator.com/coolguy/') # alternate s3 endpoint
cf = CloudFiles('file:///home/coolguy/') # local filesystem
cf = CloudFiles('mem:///home/coolguy/') # in memory
cf = CloudFiles('https://website.com/coolguy/') # arbitrary web server
```
* cloudpath: The path to the bucket you are accessing. The path is formatted as `$PROTOCOL://BUCKET/PATH`. Files will then be accessed relative to the path. The protocols supported are `gs` (GCS), `s3` (AWS S3), `file` (local FS), `mem` (RAM), and `http`/`https`.
* progress: Whether to display a progress bar when processing multiple items simultaneously.
* green: Use green threads. For this to work properly, you must uncomment the two `gevent` monkey-patching lines at the top of the constructor example. Green threads are used automatically upon monkey patching if `green` is None.
* secrets: Provide secrets dynamically rather than fetching from the credentials directory `$HOME/.cloudvolume/secrets`.
* num_threads: Number of simultaneous requests to make. Usually 20 per core is pretty close to optimal unless file sizes are extreme.
* use_https: `gs://` and `s3://` require credentials to access their files. However, each has a read-only https endpoint that sometimes requires no credentials. If True, automatically convert `gs://` to `https://storage.googleapis.com/` and `s3://` to `https://s3.amazonaws.com/`.
* endpoint: (s3 only) provide an alternate endpoint than the official Amazon servers. This is useful for accessing the various S3 emulators offered by on-premises deployments of object storage.  
* request_payer: Specify the account that should be charged for requests towards the bucket, rather than the bucket owner.
  * `gs://`: `request_payer` can be any Google Cloud project id. Please refer to the documentation for [more information](https://cloud.google.com/storage/docs/requester-pays).
  * `s3://`: `request_payer` must be `"requester"`. The AWS account associated with the AWS_ACCESS_KEY_ID will be charged. Please refer to the documentation for [more information](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html)
Using `mem://` rather than a plain `dict` gives you the same interface as the other storage backends in your code, and compression is applied automatically.
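As a minimal sketch (the path and key names below are arbitrary examples), swapping an in-memory backend for a cloud one requires only changing the cloudpath:
```python
from cloudfiles import CloudFiles

cf = CloudFiles('mem:///tmp/example/')  # swap for 'gs://bucket/' in production
cf['greeting'] = b'hello world'         # stored (and, per the note above, compressed) in RAM
assert cf['greeting'] == b'hello world'
```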
### get / get_json
```python
# Let 'filename' be the file b'hello world'
binary = cf.get('filename')
binary = cf['filename']
>> b'hello world'
compressed_binary = cf.get('filename', raw=True) 
binaries = cf.get(['filename1', 'filename2'])
>> [ { 'path': 'filename1', 'content': b'...', 'byte_range': (None, None), 'error': None }, { 'path': 'filename2', 'content': b'...', 'byte_range': (None, None), 'error': None } ]
# total provides info for progress bar when using generators.
binaries = cf.get(generator(), total=N) 
binary = cf.get({ 'path': 'filename', 'start': 0, 'end': 5 }) # fetches 5 bytes
binary = cf['filename', 0:5] # only fetches 5 bytes
binary = cf['filename'][0:5] # same result, but downloads the whole 11-byte file first
>> b'hello' # represents byte range 0-4 inclusive of filename
binaries = cf[:100] # download the first 100 files
```
`get` supports several different styles of input. The simplest takes a scalar filename and returns the contents of that file. However, you can also specify lists of filenames, a byte range request, and lists of byte range requests. You can provide a generator or iterator as input as well. Order is not guaranteed.
When more than one file is provided at once, the download will be threaded using preemptive or cooperative (green) threads depending on the `green` setting. If `progress` is set to true, a progress bar will be displayed that counts down the number of files to download.
`get_json` is the same as `get` except that it parses the returned binary as UTF-8 encoded JSON and returns the decoded object. Unlike `get`, order is guaranteed.
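For example, a short round trip (the key name `digits` mirrors the `put_json` example in the next section):
```python
cf.put_json('digits', [1, 2, 3, 4, 5])
data = cf.get_json('digits')
# data == [1, 2, 3, 4, 5]
```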
Cloud Cost: Usually about $0.40 per million requests.
### put / puts / put_json / put_jsons
```python 
cf.put('filename', b'content')
cf['filename'] = b'content'
cf.put_json('digits', [1,2,3,4,5])
cf.puts([{ 
   'path': 'filename',
   'content': b'...',
   'content_type': 'application/octet-stream',
   'compress': 'gzip',
   'compression_level': 6, # parameter for gzip or brotli compressor
   'cache_control': 'no-cache',
}])
cf.puts([ (path, content), (path, content) ], compression='gzip')
cf.put_jsons(...)
# Definition of put, put_json is identical
def put(
    self, 
    path, content,     
    content_type=None, compress=None, 
    compression_level=None, cache_control=None,
    raw=False
)
# Definition of puts, put_jsons is identical
def puts(
    self, files, 
    content_type=None, compress=None, 
    compression_level=None, cache_control=None,
    total=None, raw=False
)
```
The PUT operation is the most complex because it is so configurable. Sometimes you want one file, sometimes many. Sometimes you want to configure each file individually, sometimes you want to standardize a bulk upload. Sometimes it's binary data, but oftentimes it's JSON. We therefore provide a simpler interface for uploading a single file (`put` and `put_json`, singular) alongside an interface for uploading possibly many files (`puts` and `put_jsons`, plural).
In order to upload many files at once (which is much faster due to threading), you need to minimally provide the `path` and `content` for each file. This can be done either as a dict containing those fields or as a tuple `(path, content)`. If dicts are used, the fields (if present) specified in the dict take precedence over the parameters of the function. You can mix tuples with dicts. The input to puts can be a scalar (a single dict or tuple) or an iterable such as a list, iterator, or generator.  
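For instance, a hedged sketch of a bulk upload from a generator (the filenames and contents are made up; `total` only feeds the progress bar since generators have no length):
```python
def files():
    # yield (path, content) tuples; dicts with per-file settings also work
    for i in range(100):
        yield (f"chunk/{i}", str(i).encode("utf8"))

cf.puts(files(), compress='gzip', total=100)
```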
Cloud Cost: Usually about $5 per million files.
### delete
```python 
cf.delete('filename')
cf.delete([ 'file1', 'file2', ... ])
del cf['filename']
```
This will issue a delete request for each file specified in a threaded fashion.
Cloud Cost: Usually free.
### exists 
```python 
cf.exists('filename') 
>> True # or False
cf.exists([ 'file1', 'file2', ... ]) 
>> { 'file1': True, 'file2': False, ... }
```
Scalar input results in a simple boolean output while iterable input returns a dictionary of input paths mapped to whether they exist. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance. 
Cloud Cost: Usually about $0.40 per million requests.
### size 
```python
cf.size('filename')
>>> 1337 # size in bytes
cf.size([ 'file1', 'nonexistent', 'empty_file', ... ])
>>> { 'file1': 392, 'nonexistent': None, 'empty_file': 0, ... }
```
The output is the size of each file as it is stored, in bytes. If the file doesn't exist, `None` is returned. Scalar input returns a single integer (or `None`) while iterable input returns a dictionary of input paths mapped to their sizes. In iterable mode, a progress bar may be displayed and threading is utilized to improve performance.
Cloud Cost: Usually about $0.40 per million requests.
### list
```python 
cf.list() # returns generator
list(cf) # same as list(cf.list())
cf.list(prefix="abc")
cf.list(prefix="abc", flat=True)
```
Recall that in object storage, directories do not really exist; file paths are really keys in a key-value mapping. The `list` operator will list everything under the `cloudpath` given in the constructor. The `prefix` argument allows you to efficiently filter the results. If `flat` is specified, the results are filtered to return only a single "level" of the pseudo-directory hierarchy, even though directories are simulated; the entire set of subdirectories will still need to be fetched.
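As a hedged illustration, assuming the bucket holds keys such as `abc/1`, `abc/sub/2`, and `xyz/3`:
```python
list(cf.list(prefix='abc'))             # every key starting with 'abc'
list(cf.list(prefix='abc', flat=True))  # only the top "level" under 'abc'
```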
Cloud Cost: Usually about $5 per million requests, but each request might list 1000 files. The list operation will continuously issue list requests lazily as needed.
### transfer_to / transfer_from
```python
cff = CloudFiles('file:///source_location')
cfg = CloudFiles('gs://dest_location')
# Transfer all files from local filesys to google cloud storage
cfg.transfer_from(cff, block_size=64) # in blocks of 64 files
cff.transfer_to(cfg, block_size=64)
cff.transfer_to(cfg, block_size=64, reencode='br') # change encoding to brotli
cfg[:] = cff # default block size 64
```
Transfer semantics provide a simple way to perform bulk file transfers. Use the `block_size` parameter to adjust the number of files handled in a given pass. This can be important for preventing memory blow-up and reducing latency between batches.
GCS-to-GCS and S3-to-S3 transfers occur within the cloud without routing data through the executing client, provided no reencoding is specified.
#### resumable transfer
```python
from cloudfiles import ResumableTransfer
# .db b/c this is a sqlite database
# that will be automatically created
rt = ResumableTransfer("NAME_OF_JOB.db") 
# init should only be called once
rt.init("file://source_location", "gs://dest_location")
# This part can be interrupted and resumed
rt.execute("NAME_OF_JOB.db")
# If multiple transfer clients, the lease_msec
# parameter must be specified to prevent conflicts.
rt.execute("NAME_OF_JOB.db", lease_msec=30000)
rt.close() # deletes NAME_OF_JOB.db
```
This is essentially a more durable version of `cp`. The transfer works by first loading a sqlite database with filenames, a "done" flag, and a lease time. Clients then attach to the database and execute the transfer in batches. When multiple clients are used, a lease time must be set so that the database does not hand the same batch of files to every client (and so the transfer remains robust if a client dies, since its leases eventually expire).
This transfer type can also be accessed via the CLI.
```bash
cloudfiles xfer init SOURCE DEST --db NAME_OF_JOB.db
cloudfiles xfer execute NAME_OF_JOB.db # deletes db when done
```
### composite upload (Google Cloud Storage)
If a file is larger than 100MB (default), CloudFiles will split the file into 100MB parts and upload them as individual part files using the STANDARD storage class to minimize deletion costs. Once uploaded, the part files will be recursively merged in a tree, 32 files at a time. After each merge, the part files are deleted. The final file will have the default storage class for the bucket.
If an upload is interrupted, the part files will remain and must be cleaned up. You can provide a file handle opened for binary reading instead of a bytes object so that large files can be uploaded without overwhelming RAM. You can also adjust the composite threshold using `CloudFiles(..., composite_upload_threshold=int(2e8))` to, for example, raise the threshold to 200MB.
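A hedged sketch of that pattern (the filename is illustrative) might look like:
```python
# Raise the composite threshold to 200MB and stream a large file from disk
# instead of loading it into RAM.
cf = CloudFiles('gs://bucket/', composite_upload_threshold=int(2e8))
with open('large_volume.bin', 'rb') as f:
    cf.put('large_volume.bin', f)  # file handle instead of a bytes object
```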
### multi-part upload (S3)
If a file is larger than 100MB (default), the S3 service will use multi-part upload. You can provide a file handle opened for binary reading instead of a bytes object so that large files can be uploaded without overwhelming RAM. You can also adjust the threshold using `CloudFiles(..., composite_upload_threshold=int(2e8))` to, for example, raise the threshold to 200MB.
Unfinished upload parts remain on S3 (and cost money) unless you use a bucket lifecycle rule to remove them automatically.  
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
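As noted above, such a lifecycle rule is configured on the bucket itself. A hedged sketch using boto3 directly (not CloudFiles); the bucket name and the 7-day window are assumptions:
```python
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',  # hypothetical bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'abort-stale-multipart-uploads',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},  # apply to the whole bucket
            # automatically clean up parts of uploads that never completed
            'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7},
        }]
    },
)
```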
### transcode
```python
from cloudfiles.compression import transcode
files = cf.get(...) 
for file in transcode(files, 'gzip'):
  file['content'] # gzipped file content regardless of source
transcode(files, 
  encoding='gzip', # any cf compatible compression scheme
  in_place=False, # if True, modify the files in-place to save memory
  progress=True # progress bar
)
```
Sometimes we want to change the encoding type of a set of arbitrary files (often when moving them to another storage system). `transcode` takes the output of `get` and transcodes the resultant files into a new format. `transcode` respects the `raw` attribute, which indicates the contents are already compressed; such files are decompressed before being recompressed into the target encoding. If the input data are already compressed with the target encoding, they are passed through without a decompression/recompression cycle.
`transcode` returns a generator so that the transcoding can be done in a streaming manner.
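One possible streaming pattern (a hedged, untested sketch; bucket names are hypothetical) pipes raw downloads through `transcode` and back into `puts` as `(path, content)` tuples:
```python
from cloudfiles import CloudFiles
from cloudfiles.compression import transcode

src = CloudFiles('gs://example-src/')
dst = CloudFiles('gs://example-dst/')

files = src.get(src.list(), raw=True)  # keep the stored encoding on download
gzipped = transcode(files, encoding='gzip', in_place=True)
dst.puts(
    ((f['path'], f['content']) for f in gzipped),
    compress='gzip', raw=True,  # content is already gzipped by transcode
)
```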
## Network Robustness
CloudFiles protects itself from network issues in several ways. 
First, it uses a connection pool to avoid needing to reestablish connections or exhausting the number of available sockets.  
Second, it uses an exponential random-window backoff to retry failed connections and requests. The exponential backoff gives an overloaded server increasing breathing room, and the random window decorrelates independent attempts by a cluster. If the backoff did not grow, the retry attempts of a large cluster would be either too rapid-fire or inefficiently slow. If the attempts were not decorrelated, then regardless of the backoff, the servers would often all retry around the same time. We back off up to seven times, starting from a 0.5 second window and doubling the random window each time, up to 60 seconds.
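The idea, as a minimal sketch (not CloudFiles' internal code; the retry count and window sizes mirror the figures above):
```python
import random
import time

def retry_with_random_backoff(request, attempts=7, base=0.5, cap=60.0):
    """Retry `request` with an exponentially growing random sleep window."""
    window = base
    for attempt in range(attempts):
        try:
            return request()
        except OSError:  # stand-in for transient network errors
            if attempt == attempts - 1:
                raise
            # sleep a random amount within the current window, then double it
            time.sleep(random.uniform(0, min(window, cap)))
            window *= 2
```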
Third, for Google Cloud Storage (GCS) and S3 endpoints, we compute the md5 digest both sending and receiving to ensure data corruption did not occur in transit and that the server sent the full response. We cannot validate the digest for partial ("Range") reads. For [composite objects](https://cloud.google.com/storage/docs/composite-objects) (GCS) we can check the [crc32c](https://pypi.org/project/crc32c/) check-sum which catches transmission errors but not tampering (though MD5 isn't secure at all anymore). We are unable to perform validation for [multi-part uploads](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) (S3). Using custom encryption keys may also create validation problems.
## CloudFiles CLI Tool
```bash
# list cloud and local directories
cloudfiles ls gs://bucket-folder/
# parallel file transfer, no decompression
cloudfiles -p 2 cp --progress -r s3://bkt/ gs://bkt2/
# change compression type to brotli
cloudfiles cp -c br s3://bkt/file.txt gs://bkt2/
# decompress
cloudfiles cp -c none s3://bkt/file.txt gs://bkt2/
# pass from stdin (use "-" for source argument)
find some_dir | cloudfiles cp - s3://bkt/
# resumable transfers
cloudfiles xfer init SRC DEST --db JOBNAME.db
cloudfiles xfer execute JOBNAME.db --progress # can quit and resume
# Get human readable file sizes from anywhere
cloudfiles du -shc ./tmp gs://bkt/dir s3://bkt/dir
# remove files
cloudfiles rm ./tmp gs://bkt/dir/file s3://bkt/dir/file
# cat across services, -r for range reads
cloudfiles cat ./tmp gs://bkt/dir/file s3://bkt/dir/file
# verify a transfer was successful by comparing bytes and hashes
cloudfiles verify ./my-data gs://bucket/my-data/ 
```
### `cp` Pros and Cons
For the `cp` command, the bundled CLI tool has a number of advantages over `gsutil` when it comes to transfers.
1. No decompression of file transfers (unless you want to).
2. Can shift compression format.
3. Easily control the number of parallel processes.
4. Green threads make core utilization more efficient.
5. Optionally uses libdeflate for faster gzip decompression.
It also has some disadvantages:  
1. Doesn't support all commands.
2. File suffixes may be added to signify compression type on the local filesystem (e.g. `.gz`, `.br`, or `.zstd`). `cloudfiles ls` will list them without the extension and they will be converted into `Content-Encoding` on cloud storage.
### `ls` Generative Expressions
For the `ls` command, we support (via the `-e` flag) simple generative expressions that enable querying multiple prefixes at once. A generative expression is denoted `[chars]` where `c`,`h`,`a`,`r`, & `s` will be inserted individually into the position where the expression appears. Multiple expressions are allowed and produce a cartesian product of resulting strings. This functionality is very limited at the moment but we intend to improve it.
```bash
cloudfiles ls -e "gs://bucket/prefix[ab]"
# equivalent to:
# cloudfiles ls gs://bucket/prefixa
# cloudfiles ls gs://bucket/prefixb
```
### `alias` for Alternative S3 Endpoints
You can set your own protocols for S3 compatible endpoints by creating dynamic or persistent aliases. CloudFiles comes with two official aliases that are important for the Seung Lab, `matrix://` and `tigerdata://`, which point to Princeton S3 endpoints. Official aliases can't be overridden.
To create a dynamic alias, you can use `cloudfiles.paths.add_alias` which will only affect the current process. To create a persistent alias that resides in `~/.cloudfiles/aliases.json`, you can use the CLI. 
```bash
cloudfiles alias add example s3://https://example.com/ # example://
cloudfiles alias ls # list all aliases
cloudfiles alias rm example # remove example://
```
The alias file is only accessed (and cached) if CloudFiles encounters an unknown protocol. If you stick to default protocols and use the syntax `s3://https://example.com/` for alternative endpoints, you can still use CloudFiles in environments without filesystem access.
## Credits
CloudFiles is derived from the [CloudVolume.Storage](https://github.com/seung-lab/cloud-volume/tree/master/cloudvolume/storage) system.  
Storage was initially created by William Silversmith and Ignacio Tartavull. It was maintained and improved by William Silversmith and includes improvements by Nico Kemnitz (extremely fast exists) and Ben Falk (brotli). Manuel Castro added the ability to choose the cloud storage class. Thanks to the anonymous author from https://teppen.io/ for their S3 ETag validation code.

%prep
%autosetup -n cloud-files-4.15.1

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-cloud-files -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 4.15.1-1
- Package Spec generated