%global _empty_manifest_terminate_build 0
Name: python-canine
Version: 0.9.0
Release: 1
Summary: A modular, high-performance computing solution to run jobs using SLURM
License: BSD3
URL: https://github.com/broadinstitute/canine
Source0: https://mirrors.aliyun.com/pypi/web/packages/f7/c0/79f51a88a7f06b6367666ca2a6d35208b6f5be35db5c927bdd64f6f5523c/canine-0.9.0.tar.gz
BuildArch: noarch
Requires: python3-paramiko
Requires: python3-pandas
Requires: python3-google-auth
Requires: python3-PyYAML
Requires: python3-agutil
Requires: python3-hound
Requires: python3-firecloud-dalmatian
Requires: python3-google-api-python-client
Requires: python3-docker
Requires: python3-psutil
Requires: python3-port-for
Requires: python3-tables
%description
## Usage
Canine operates by running jobs on a SLURM cluster. It is designed to take a bash
or WDL script and schedule jobs using data from a Firecloud workspace or with manually
provided inputs. API usage is documented at the bottom of this section.
Canine may be used in any of the following ways:
* Running a pipeline yaml file (e.g. `$ canine examples/example_pipeline.yaml`)
* Running a pipeline defined on the command line (e.g. `$ canine --backend type:TransientGCP --backend name:my-cluster (etc...)`)
* Building and running a pipeline in Python (e.g. `>>> canine.Orchestrator(pipeline_dict).run_pipeline()`)
* Using the [Canine API](https://broadinstitute.github.io/canine/) to execute custom
workflows in Slurm, which could not be configured as a pipeline object
## Anatomy of a pipeline
Canine can be natively configured to suit a vast range of setups.
Canine is modularized into three main components which can be mixed and matched as needed: Adapters, Backends, and Localizers.
A pipeline specifies which Adapter, Backend, and Localizer to use, along with any configuration options for each.
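To make the structure concrete, below is a minimal, hedged sketch of a pipeline dictionary as it might be passed to `canine.Orchestrator`. Only the top-level sections described in this document are used; the inner values (backend name, input names, paths) are illustrative placeholders, so consult [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md) for the authoritative schema.
```python
# Sketch only: top-level sections follow this README; inner values are
# illustrative placeholders, not a verified configuration.
import canine

pipeline = {
    'backend': {'type': 'Local'},             # run Slurm commands in the local shell
    'inputs': {
        'sample': ['sampleA', 'sampleB'],     # 1D list: one job per element
        'reference': 'gs://my-bucket/ref.fa'  # hypothetical GCS object, shared by all jobs
    },
    'script': [                               # a list of commands (a filepath also works)
        'echo "processing $sample against $reference"'
    ],
    'localization': {
        'overrides': {'reference': 'Common'}  # localize the reference once for all jobs
    },
    'outputs': {'log': '*.log'},              # glob collected from each job's workspace
    'resources': {'mem': '4G'}                # becomes --mem=4G on sbatch
}

# Uncomment to actually build and run the pipeline:
# orchestrator = canine.Orchestrator(pipeline)
# results = orchestrator.run_pipeline()
```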
### Adapters
The pipeline adapter is responsible for converting the provided list of inputs into an input specification for each job.
#### Choosing an Adapter
This is a list of available adapters. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Manual`: (Default) This is the primary input adapter responsible for determining the number of jobs and the inputs for each job, based on the raw inputs provided by the user.
* Inputs which have a single constant value will have the same value for all jobs
* Inputs which have a 1D list of values will have one of those values in each job. By default, all list inputs must have the same length, and there will be one job per element. The nth job will have the nth value of each input
* There are extra configuration options which can change how inputs are combined or how lists are interpreted (a sketch of the default rule follows this list)
* `Firecloud`/`Terra`: Choose this adapter if you are using data hosted in a FireCloud or Terra workspace.
Your inputs will be interpreted as entity expressions, similar to how FireCloud and Terra workflows interpret inputs. This adapter can also be configured to post results back to your workspace, if you choose. **Warning:** Reading from Workspace buckets is convenient, but you may encounter issues if your Slurm cluster is not logged in using your credentials
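As referenced in the `Manual` bullet above, the sketch below illustrates the default zipping rule (constants repeat, equal-length lists contribute one element per job). It is an illustration of the described behaviour, not the adapter's actual implementation.
```python
# Illustration of the default Manual-adapter rule described above; not the
# real implementation.
def expand_inputs(raw_inputs: dict) -> list:
    """Zip equal-length list inputs into per-job input dictionaries."""
    lists = {k: v for k, v in raw_inputs.items() if isinstance(v, list)}
    lengths = {len(v) for v in lists.values()}
    assert len(lengths) <= 1, "by default, all list inputs must have the same length"
    n_jobs = lengths.pop() if lengths else 1
    return [
        {k: (v[i] if isinstance(v, list) else v) for k, v in raw_inputs.items()}
        for i in range(n_jobs)
    ]

print(expand_inputs({'ref': 'hg38.fa', 'sample': ['s1', 's2', 's3']}))
# job n receives the nth value of 'sample' and the constant 'ref'
```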
### Backends
The pipeline backend is responsible for interfacing with the Slurm controller.
There are many different backends available depending on where SLURM is running (or for creating a Slurm cluster for you).
#### Choosing a Backend
This is a list of available backends; a minimal configuration sketch follows the list. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Local`: (Default) Choose this backend if you will be running Canine from the Slurm controller and your cluster is fully configured.
This backend will run Slurm commands through the local shell
* `Remote`: Choose this backend if you have a fully configured SLURM cluster, but you will be running Canine elsewhere.
This backend uses SSH and SFTP to interact with the Slurm controller
* `GCPTransient`: Choose this backend if you do not have a Slurm cluster.
This backend will create a cluster to your specifications in Google Cloud and then use SSH and SFTP to interact with the controller. The cluster will be deleted after Canine has finished
* `ImageTransient`: Choose this backend if you do not have a Slurm cluster, but want more control over its startup than `GCPTransient`.
This backend assumes that the current system has Slurm installed and has an NFS mount set up.
It then creates worker nodes from a Google Compute Image that you have set up and configured.
* `DockerTransient`: Choose this backend if you want the same control as `ImageTransient` but do not want to set up a Google Compute Image.
The Slurm daemons run inside docker containers on the worker nodes.
The Slurm controller daemon runs inside a docker container on the local machine
* `Dummy`: Choose this backend for developing or testing pipelines.
This backend simulates a Slurm cluster by running the controller and workers as docker containers on the local system. **This backend does not provision any cloud resources.**
It runs entirely through the local docker daemon.
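As noted before the backend list, a backend is selected in the pipeline configuration much like it is on the command line near the top of this README. The sketch below uses only the `type` and `name` keys that appear in that CLI example; other backend-specific options are documented in [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md).
```python
# Sketch mirroring the CLI example `--backend type:TransientGCP --backend name:my-cluster`;
# treat this key set as illustrative rather than exhaustive.
backend_config = {
    'type': 'TransientGCP',   # create a temporary Slurm cluster in Google Cloud
    'name': 'my-cluster'      # name for the transient cluster
}

pipeline = {
    'backend': backend_config,
    # ... inputs, script, localization, outputs, resources as in the sketch above
}
```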
### Localizers
The pipeline localizer is responsible for staging the pipeline on the SLURM controller and for transferring inputs/outputs as needed.
There are four different localizers to accommodate different needs.
#### Choosing a Localizer
This is a list of available localizers. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Batched`: (Default) This localizer is suitable for most situations.
It stages the canine pipeline workspace locally in a temporary directory, copying or symlinking local files into it before broadcasting the workspace directory structure over to the Slurm controller.
Files stored in Google Cloud Storage are downloaded at the end, directly onto the Slurm Controller (using credentials stored on the controller).
* `Local`: Choose this localizer if you have files in Google Cloud Storage which need to be localized but you are unable to save suitable credentials to the Slurm controller.
This is very similar to the `Batched` localizer, except that Google Cloud Storage files are staged locally and broadcast to the Slurm Controller along with the rest of the pipeline files
* `Remote`: Choose this localizer for small pipelines with few local files.
This localizer stages the pipeline directory directly on the Slurm controller using SFTP. It is often less efficient than the bulk directory copy used by the `Batched` and `Local` localizers (especially if you provide a `transfer_bucket` to them) but can outperform other localizers for small pipelines which consist entirely of files from Google Cloud Storage.
* `NFS`: Choose this localizer if the current system has an active NFS mount to the Slurm controller.
The canine pipeline will be staged locally, within the NFS mount point, allowing NFS to take care of transferring the pipeline directory to the controller.
### Examples
There are a few examples in the `examples/` directory which can be run out of the box.
To run one of these pipelines, use any of the following methods:
#### Command Line
```
$ canine examples/example_pipeline.yaml
```
#### Python (using filepath)
```python
import canine
orchestrator = canine.Orchestrator('examples/example_pipeline.yaml')
results = orchestrator.run_pipeline()
```
#### Python (using dictionary)
```python
import canine
import yaml
with open('examples/example_pipeline.yaml') as r:
    config = yaml.safe_load(r)
orchestrator = canine.Orchestrator(config)
results = orchestrator.run_pipeline()
```
### Other pipeline components
Hopefully you've run an example or two and have a better understanding of what a pipeline looks like.
This section describes the remaining parts of a pipeline configuration not covered above.
#### inputs
Inputs describe both the number of jobs and the inputs to each job.
The `inputs` section of the pipeline should be a dictionary.
Each key is an input name, mapped to either a string or a list of strings.
As described above, the adapter is responsible for parsing the raw, user-provided inputs into the set of inputs for each job that will be run.
* Raw inputs which were lists of 2 or more dimensions are interpreted by the adapter as if the user wished to provide one of the nested lists to each job. The array is flattened to 2 dimensions and interpreted as if it were a regular list input (with one nested list passed to each job). The contents of these arrays are handled using the localization rules described in the overrides section below
* Raw inputs which were lists of any dimensions, but marked as `common` in the overrides, are flattened to 1 dimension, and the whole list is provided as an input to each job. The contents of the array are handled as `common` files (see the overrides section below); a short sketch of both rules follows this list
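A short, hedged sketch of the list rules above (input names are hypothetical):
```python
# Hypothetical inputs illustrating the list-handling rules described above.
inputs = {
    'sample': ['s1', 's2'],                               # 1D list: two jobs
    'read_groups': [['rg1a', 'rg1b'], ['rg2a', 'rg2b']],  # 2D list: one nested list per job
    'panel': ['p1.bed', 'p2.bed', 'p3.bed']               # marked common below: whole list to every job
}

localization = {
    'overrides': {'panel': 'Common'}                      # see the overrides section below
}
```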
#### script
The pipeline script is the heart of the pipeline. This is the actual bash script which will be run. The `script` key can either be a filepath to a bash script to run, or a list of strings, each of which is a command to run.
Either way, the script gets executed by each job of the pipeline.
#### overrides
Localization overrides, defined in `localization.overrides`, allow the user to change the localizer's default handling for a specific input.
The overrides section should be a dictionary mapping input names to strings describing the desired handling, as follows (a sketch follows this list):
* Default rules (no override):
* Strings which exist as a local filepath are treated as files and will be localized to the Slurm controller
* Strings which start with `gs://` are interpreted to be files/directories within Google Cloud Storage and will be localized to the Slurm controller
* Any file or Google Storage object which appears as an input to multiple jobs is considered `common` and will be localized once to a common directory, visible to all jobs
* If an input to any job is a list, the contents of the list are interpreted using the same rules
* `Common`: Inputs marked as common will be considered common to all jobs and localized once, to a directory which is visible to all jobs. Inputs marked as common which cannot be interpreted as a filepath or a Google Cloud Storage object are ignored and treated as strings
* `Stream`: Inputs marked as `Stream` will be streamed into a FIFO pipe, and the path to the pipe will be exported to the job. The `Stream` override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failure will always restart the stream
* `Delayed`: Inputs marked as `Delayed` will be downloaded by the job once it starts, instead of upfront during localization. The `Delayed` override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failures will only re-download delayed inputs if the job failed before the download completed
* `Localize`: Inputs marked as `Localize` will be treated as files and localized to job-specific input directories. This can be used to force files which would otherwise be handled as common to be localized separately for each job. The `Localize` override is ignored for inputs which are not valid filepaths or Google Cloud Storage objects, causing those inputs to be treated as strings
* `Null` or `None`: Inputs marked this way are treated as strings, and no localization will be applied.
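As noted above, a sketch of an overrides block combining these handling modes (input names are hypothetical):
```python
# Hypothetical localization.overrides block; keys are input names, values are
# the handling modes described above.
localization = {
    'overrides': {
        'reference': 'Common',       # localize once into the shared common directory
        'big_bam': 'Stream',         # stream the gs:// object through a FIFO pipe
        'annotation': 'Delayed',     # let the job download this itself after it starts
        'scratch_file': 'Localize',  # force per-job localization even if shared
        'sample_name': None          # plain string; no localization
    }
}
```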
#### outputs
The outputs section defines a mapping of output names to file patterns which should be grabbed for output. File patterns may be raw filenames or globs, and may include shell variables (including job inputs).
These patterns are always relative to each job's initial cwd (`$CANINE_JOB_ROOT`). Patterns _may_ match files above the workspace directory, but this is not recommended.
By default, `stdout` and `stderr` are included in the outputs, which will grab the job's stdout/err streams.
You may override this behavior by providing your own pattern for `stdout` or `stderr`.
**Warning:** the outputs `stdout` and `stderr` have special handling, which expects their patterns to match exactly one file.
If you provide a custom pattern for `stdout` or `stderr` and it matches more than one file, the output dataframe will only show the first file matched.
All files which match a provided output pattern will be delocalized from the Slurm controller back to the current system in the following directory structure:
```
output_dir/
    {job id}/
        stdout
        stderr
        {other output names}/
            {matched files/directories}
```
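A hedged sketch of an outputs block following these rules (names and patterns are illustrative):
```python
# Hypothetical outputs block; patterns are globs relative to $CANINE_JOB_ROOT
# and may reference job inputs as shell variables.
outputs = {
    'vcf': '*.vcf',               # every VCF produced in the job workspace
    'log': 'logs/${sample}.log'   # uses the (hypothetical) "sample" input
}
# stdout and stderr are collected automatically; only provide your own
# 'stdout'/'stderr' patterns if needed, and ensure each matches exactly one file.
```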
#### resources
The `resources` section allows you to define additional arguments to `sbatch` to control resource allocation or other scheduling parameters. The `resources` dictionary is converted to command-line arguments as follows (a sketch follows this list):
* Single-letter keys are converted to short (`-x`) options.
* Multi-letter keys are converted to long (`--xx`) options.
* Keys with a value of `True` are converted to flags (no value)
* Keys with any other value are converted to parameters (`--key=val`)
* Underscores in keys are converted to hyphens (`foo_bar` becomes `--foo-bar`)
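As referenced above, a sketch of how a `resources` dictionary would translate to `sbatch` arguments under these rules:
```python
# Hypothetical resources block and the sbatch arguments it implies under the
# conversion rules listed above.
resources = {
    'cpus_per_task': 2,   # multi-letter key, underscore -> hyphen: --cpus-per-task=2
    'mem': '8G',          # multi-letter key with a value:          --mem=8G
    'requeue': True       # True becomes a bare flag:               --requeue
}
# expected sbatch invocation (sketch): sbatch --cpus-per-task=2 --mem=8G --requeue ...
```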
%package -n python3-canine
Summary: A modular, high-performance computing solution to run jobs using SLURM
Provides: python-canine
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-canine
## Usage
Canine operates by running jobs on a SLURM cluster. It is designed to take a bash
or WDL script and schedule jobs using data from a Firecloud workspace or with manually
provided inputs. API usage is documented at the bottom of this section.
Canine may be used in any of the following ways:
* Running a pipeline yaml file (e.g. `$ canine examples/example_pipeline.yaml`)
* Running a pipeline defined on the command line (e.g. `$ canine --backend type:TransientGCP --backend name:my-cluster (etc...)`)
* Building and running a pipeline in Python (e.g. `>>> canine.Orchestrator(pipeline_dict).run_pipeline()`)
* Using the [Canine API](https://broadinstitute.github.io/canine/) to execute custom
workflows in Slurm, which could not be configured as a pipeline object
## Anatomy of a pipeline
Canine can be natively configured to suit a vast range of setups.
Canine is modularized into three main components which can be mixed and matched as needed: Adapters, Backends, and Localizers.
A pipeline specifies which Adapter, Backend, and Localizer to use, along with any configuration options for each.
### Adapters
The pipeline adapter is responsible for converting the provided list of inputs into an input specification for each job.
#### Choosing an Adapter
This is a list of available adapters. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Manual`: (Default) This is the primary input adapter responsible for determining the number of jobs and the inputs for each job, based on the raw inputs provided by the user.
* Inputs which have a single constant value will have the same value for all jobs
* Inputs which have a 1D list of values will have one of those values in each job. By default, all list inputs must have the same length, and there will be one job per element. The nth job will have the nth value of each input
* There are extra configuration options which can change how inputs are combined or how lists are interpreted
* `Firecloud`/`Terra`: Choose this adapter if you are using data hosted in a FireCloud or Terra workspace.
Your inputs will be interpreted as entity expressions, similar to how FireCloud and Terra workflows interpret inputs. This adapter can also be configured to post results back to your workspace, if you choose. **Warning:** Reading from Workspace buckets is convenient, but you may encounter issues if your Slurm cluster is not logged in using your credentials
### Backends
The pipeline backend is responsible for interfacing with the Slurm controller.
There are many different backends available depending on where SLURM is running (or for creating a Slurm cluster for you).
#### Choosing a Backend
This is a list of available backends. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Local`: (Default) Choose this backend if you will be running Canine from the Slurm controller and your cluster is fully configured.
This backend will run Slurm commands through the local shell
* `Remote`: Choose this backend if you have a fully configured SLURM cluster, but you will be running Canine elsewhere.
This backend uses SSH and SFTP to interact with the Slurm controller
* `GCPTransient`: Choose this backend if you do not have a Slurm cluster.
This backend will create a cluster to your specifications in Google Cloud and then use SSH and SFTP to interact with the controller. The cluster will be deleted after Canine has finished
* `ImageTransient`: Choose this backend if you do not have a Slurm cluster, but want more control over its startup than `GCPTransient`.
This backend assumes that the current system has Slurm installed and has an NFS mount set up.
It then creates worker nodes from a Google Compute Image that you have set up and configured.
* `DockerTransient`: Choose this backend if you want the same control as `ImageTransient` but do not want to set up a Google Compute Image.
The Slurm daemons run inside docker containers on the worker nodes.
The Slurm controller daemon runs inside a docker container on the local machine
* `Dummy`: Choose this backend for developing or testing pipelines.
This backend simulates a Slurm cluster by running the controller and workers as docker containers on the local system. **This backend does not provision any cloud resources.**
It runs entirely through the local docker daemon.
### Localizers
The pipeline localizer is responsible for staging the pipeline on the SLURM controller and for transferring inputs/outputs as needed.
There are four different localizers to accommodate different needs.
#### Choosing a Localizer
This is a list of available localizers. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Batched`: (Default) This localizer is suitable for most situations.
It stages the canine pipeline workspace locally in a temporary directory, copying or symlinking local files into it before broadcasting the workspace directory structure over to the Slurm controller.
Files stored in Google Cloud Storage are downloaded at the end, directly onto the Slurm Controller (using credentials stored on the controller).
* `Local`: Choose this localizer if you have files in Google Cloud Storage which need to be localized but you are unable to save suitable credentials to the Slurm controller.
This is very similar to the `Batched` localizer, except that Google Cloud Storage files are staged locally and broadcast to the Slurm Controller along with the rest of the pipeline files
* `Remote`: Choose this localizer for small pipelines with few local files.
This localizer stages the pipeline directory directly on the Slurm controller using SFTP. It is often less efficient than the bulk directory copy used by the `Batched` and `Local` localizers (especially if you provide a `transfer_bucket` to them) but can outperform other localizers for small pipelines which consist entirely of files from Google Cloud Storage.
* `NFS`: Choose this localizer if the current system has an active NFS mount to the Slurm controller.
The canine pipeline will be staged locally, within the NFS mount point, allowing NFS to take care of transferring the pipeline directory to the controller.
### Examples
There are a few examples in the `examples/` directory which can be run out of the box.
To run one of these pipelines, use any of the following methods:
#### Command Line
```
$ canine examples/example_pipeline.yaml
```
#### Python (using filepath)
```python
import canine
orchestrator = canine.Orchestrator('examples/example_pipeline.yaml')
results = orchestrator.run_pipeline()
```
#### Python (using dictionary)
```python
import canine
import yaml
with open('examples/example_pipeline.yaml') as r:
    config = yaml.safe_load(r)
orchestrator = canine.Orchestrator(config)
results = orchestrator.run_pipeline()
```
### Other pipeline components
Hopefully you've run an example or two and have a better understanding of what a pipeline looks like.
This section describes the remaining parts of a pipeline configuration not covered above.
#### inputs
Inputs describe both the number of jobs and the inputs to each job.
The `inputs` section of the pipeline should be a dictionary.
Each key is an input name, mapped to either a string or a list of strings.
As described above, the adapter is responsible for parsing the raw, user-provided inputs into the set of inputs for each job that will be run.
* Raw inputs which were lists of 2 or more dimensions are interpreted by the adapter as if the user wished to provide one of the nested lists to each job. The array is flattened to 2 dimensions and interpreted as if it were a regular list input (with one nested list passed to each job). The contents of these arrays are handled using the localization rules described in the overrides section below
* Raw inputs which were lists of any dimensions, but marked as `common` in the overrides, are flattened to 1 dimension, and the whole list is provided as an input to each job. The contents of the array are handled as `common` files (see the overrides section below)
#### script
The pipeline script is the heart of the pipeline. This is the actual bash script which will be run. The `script` key can either be a filepath to a bash script to run, or a list of strings, each of which is a command to run.
Either way, the script gets executed by each job of the pipeline.
#### overrides
Localization overrides, defined in `localization.overrides`, allow the user to change the localizer's default handling for a specific input.
The overrides section should be a dictionary mapping input names to strings describing the desired handling, as follows:
* Default rules (no override):
* Strings which exist as a local filepath are treated as files and will be localized to the Slurm controller
* Strings which start with `gs://` are interpreted to be files/directories within Google Cloud Storage and will be localized to the Slurm controller
* Any file or Google Storage object which appears as an input to multiple jobs is considered `common` and will be localized once to a common directory, visible to all jobs
* If an input to any job is a list, the contents of the list are interpreted using the same rules
* `Common`: Inputs marked as common will be considered common to all jobs and localized once, to a directory which is visible to all jobs. Inputs marked as common which cannot be interpreted as a filepath or a Google Cloud Storage object are ignored and treated as strings
* `Stream`: Inputs marked as `Stream` will be streamed into a FIFO pipe, and the path to the pipe will be exported to the job. The `Stream` override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failure will always restart the stream
* `Delayed`: Inputs marked as `Delayed` will be downloaded by the job once it starts, instead of upfront during localization. The `Delayed` override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failures will only re-download delayed inputs if the job failed before the download completed
* `Localize`: Inputs marked as `Localize` will be treated as files and localized to job-specific input directories. This can be used to force files which would otherwise be handled as common to be localized separately for each job. The `Localize` override is ignored for inputs which are not valid filepaths or Google Cloud Storage objects, causing those inputs to be treated as strings
* `Null` or `None`: Inputs marked this way are treated as strings, and no localization will be applied.
#### outputs
The outputs section defines a mapping of output names to file patterns which should be grabbed for output. File patterns may be raw filenames or globs, and may include shell variables (including job inputs).
These patterns are always relative to each job's initial cwd (`$CANINE_JOB_ROOT`). Patterns _may_ match files above the workspace directory, but this is not recommended.
By default, `stdout` and `stderr` are included in the outputs, which will grab the job's stdout/err streams.
You may override this behavior by providing your own pattern for `stdout` or `stderr`.
**Warning:** the outputs `stdout` and `stderr` have special handling, which expects their patterns to match exactly one file.
If you provide a custom pattern for `stdout` or `stderr` and it matches more than one file, the output dataframe will only show the first file matched.
All files which match a provided output pattern will be delocalized from the Slurm controller back to the current system in the following directory structure:
```
output_dir/
    {job id}/
        stdout
        stderr
        {other output names}/
            {matched files/directories}
```
#### resources
The `resources` section allows you to define additional arguments to `sbatch` to control resource allocation or other scheduling parameters. The `resources` dictionary is converted to command-line arguments as follows:
* Single-letter keys are converted to short (`-x`) options.
* Multi-letter keys are converted to long (`--xx`) options.
* Keys with a value of `True` are converted to flags (no value)
* Keys with any other value are converted to parameters (`--key=val`)
* Underscores in keys are converted to hyphens (`foo_bar` becomes `--foo-bar`)
%package help
Summary: Development documents and examples for canine
Provides: python3-canine-doc
%description help
## Usage
Canine operates by running jobs on a SLURM cluster. It is designed to take a bash
or WDL script and schedule jobs using data from a Firecloud workspace or with manually
provided inputs. API usage is documented at the bottom of this section.
Canine may be used in any of the following ways:
* Running a pipeline yaml file (e.g. `$ canine examples/example_pipeline.yaml`)
* Running a pipeline defined on the command line (e.g. `$ canine --backend type:TransientGCP --backend name:my-cluster (etc...)`)
* Building and running a pipeline in Python (e.g. `>>> canine.Orchestrator(pipeline_dict).run_pipeline()`)
* Using the [Canine API](https://broadinstitute.github.io/canine/) to execute custom
workflows in Slurm, which could not be configured as a pipeline object
## Anatomy of a pipeline
Canine can be natively configured to suit a vast range of setups.
Canine is modularized into three main components which can be mixed and matched as needed: Adapters, Backends, and Localizers.
A pipeline specifies which Adapter, Backend, and Localizer to use, along with any configuration options for each.
### Adapters
The pipeline adapter is responsible for converting the provided list of inputs into an input specification for each job.
#### Choosing an Adapter
This is a list of available adapters. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Manual`: (Default) This is the primary input adapter responsible for determining the number of jobs and the inputs for each job, based on the raw inputs provided by the user.
* Inputs which have a single constant value will have the same value for all jobs
* Inputs which have a 1D list of values will have one of those values in each job. By default, all list inputs must have the same length, and there will be one job per element. The nth job will have the nth value of each input
* There are extra configuration options which can change how inputs are combined or how lists are interpreted
* `Firecloud`/`Terra`: Choose this adapter if you are using data hosted in a FireCloud or Terra workspace.
Your inputs will be interpreted as entity expressions, similar to how FireCloud and Terra workflows interpret inputs. This adapter can also be configured to post results back to your workspace, if you choose. **Warning:** Reading from Workspace buckets is convenient, but you may encounter issues if your Slurm cluster is not logged in using your credentials
### Backends
The pipeline backend is responsible for interfacing with the Slurm controller.
There are many different backends available depending on where SLURM is running (or for creating a Slurm cluster for you).
#### Choosing a Backend
This is a list of available backends. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Local`: (Default) Choose this backend if you will be running Canine from the Slurm controller and your cluster is fully configured.
This backend will run Slurm commands through the local shell
* `Remote`: Choose this backend if you have a fully configured SLURM cluster, but you will be running Canine elsewhere.
This backend uses SSH and SFTP to interact with the Slurm controller
* `GCPTransient`: Choose this backend if you do not have a Slurm cluster.
This backend will create a cluster to your specifications in Google Cloud and then use SSH and SFTP to interact with the controller. The cluster will be deleted after Canine has finished
* `ImageTransient`: Choose this backend if you do not have a Slurm cluster, but want more control over its startup than `GCPTransient`.
This backend assumes that the current system has Slurm installed and has an NFS mount set up.
It then creates worker nodes from a Google Compute Image that you have set up and configured.
* `DockerTransient`: Choose this backend if you want the same control as `ImageTransient` but do not want to set up a Google Compute Image.
The Slurm daemons run inside docker containers on the worker nodes.
The Slurm controller daemon runs inside a docker container on the local machine
* `Dummy`: Choose this backend for developing or testing pipelines.
This backend simulates a Slurm cluster by running the controller and workers as docker containers on the local system. **This backend does not provision any cloud resources.**
It runs entirely through the local docker daemon.
### Localizers
The pipeline localizer is responsible for staging the pipeline on the SLURM controller and for transferring inputs/outputs as needed.
There are four different localizers to accommodate different needs.
#### Choosing a Localizer
This is a list of available localizers. For more details, see [pipeline_options.md](https://github.com/broadinstitute/canine/blob/master/pipeline_options.md)
* `Batched`: (Default) This localizer is suitable for most situations.
It stages the canine pipeline workspace locally in a temporary directory, copying or symlinking local files into it before broadcasting the workspace directory structure over to the Slurm controller.
Files stored in Google Cloud Storage are downloaded at the end, directly onto the Slurm Controller (using credentials stored on the controller).
* `Local`: Choose this localizer if you have files in Google Cloud Storage which need to be localized but you are unable to save suitable credentials to the Slurm controller.
This is very similar to the `Batched` localizer, except that Google Cloud Storage files are staged locally and broadcast to the Slurm Controller along with the rest of the pipeline files
* `Remote`: Choose this localizer for small pipelines with few local files.
This localizer stages the pipeline directory directly on the Slurm controller using SFTP. It is often less efficient than the bulk directory copy used by the `Batched` and `Local` localizers (especially if you provide a `transfer_bucket` to them) but can outperform other localizers for small pipelines which consist entirely of files from Google Cloud Storage.
* `NFS`: Choose this localizer if the current system has an active NFS mount to the Slurm controller.
The canine pipeline will be staged locally, within the NFS mount point, allowing NFS to take care of transferring the pipeline directory to the controller.
### Examples
There are a few examples in the `examples/` directory which can be run out of the box.
To run one of these pipelines, use any of the following methods:
#### Command Line
```
$ canine examples/example_pipeline.yaml
```
#### Python (using filepath)
```python
import canine
orchestrator = canine.Orchestrator('examples/example_pipeline.yaml')
results = orchestrator.run_pipeline()
```
#### Python (using dictionary)
```python
import canine
import yaml
with open('examples/example_pipeline.yaml') as r:
    config = yaml.safe_load(r)
orchestrator = canine.Orchestrator(config)
results = orchestrator.run_pipeline()
```
### Other pipeline components
Hopefully you've run an example or two and have a better understanding of what a pipeline looks like.
This section describes the remaining parts of a pipeline configuration not covered above.
#### inputs
Inputs describe both the number of jobs and the inputs to each job.
The `inputs` section of the pipeline should be a dictionary.
Each key is an input name, mapped to either a string or a list of strings.
As described above, the adapter is responsible for parsing the raw, user-provided inputs into the set of inputs for each job that will be run.
* Raw inputs which were lists of 2 or more dimensions are interpreted by the adapter as if the user wished to provide one of the nested lists to each job. The array is flattened to 2 dimensions and interpreted as if it were a regular list input (with one nested list passed to each job). The contents of these arrays are handled using the localization rules described in the overrides section below
* Raw inputs which were lists of any dimensions, but marked as `common` in the overrides, are flattened to 1 dimension, and the whole list is provided as an input to each job. The contents of the array are handled as `common` files (see the overrides section below)
#### script
The pipeline script is the heart of the pipeline. This is the actual bash script which will be run. The `script` key can either be a filepath to a bash script to run, or a list of strings, each of which is a command to run.
Either way, the script gets executed by each job of the pipeline.
#### overrides
Localization overrides, defined in `localization.overrides`, allow the user to change the localizer's default handling for a specific input.
The overrides section should be a dictionary mapping input names to strings describing the desired handling, as follows:
* Default rules (no override):
* Strings which exist as a local filepath are treated as files and will be localized to the Slurm controller
* Strings which start with `gs://` are interpreted to be files/directories within Google Cloud Storage and will be localized to the Slurm controller
* Any file or Google Storage object which appears as an input to multiple jobs is considered `common` and will be localized once to a common directory, visible to all jobs
* If an input to any job is a list, the contents of the list are interpreted using the same rules
* `Common`: Inputs marked as common will be considered common to all jobs and localized once, to a directory which is visible to all jobs. Inputs marked as common which cannot be interpreted as a filepath or a Google Cloud Storage object are ignored and treated as strings
* `Stream`: Inputs marked as `Stream` will be streamed into a FIFO pipe, and the path to the pipe will be exported to the job. The `Stream` override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failure will always restart the stream
* `Delayed`: Inputs marked as `Delayed` will be downloaded by the job once it starts, instead of upfront during localization. The `Delayed` override is ignored for inputs which are not Google Cloud Storage objects, causing those inputs to be localized under default rules. Jobs which are requeued due to node failures will only re-download delayed inputs if the job failed before the download completed
* `Localize`: Inputs marked as `Localize` will be treated as files and localized to job-specific input directories. This can be used to force files which would otherwise be handled as common to be localized separately for each job. The `Localize` override is ignored for inputs which are not valid filepaths or Google Cloud Storage objects, causing those inputs to be treated as strings
* `Null` or `None`: Inputs marked this way are treated as strings, and no localization will be applied.
#### outputs
The outputs section defines a mapping of output names to file patterns which should be grabbed for output. File patterns may be raw filenames or globs, and may include shell variables (including job inputs).
These patterns are always relative to each job's initial cwd (`$CANINE_JOB_ROOT`). Patterns _may_ match files above the workspace directory, but this is not recommended.
By default, `stdout` and `stderr` are included in the outputs, which will grab the job's stdout/err streams.
You may override this behavior by providing your own pattern for `stdout` or `stderr`.
**Warning:** the outputs `stdout` and `stderr` have special handling, which expects their patterns to match exactly one file.
If you provide a custom pattern for `stdout` or `stderr` and it matches more than one file, the output dataframe will only show the first file matched.
All files which match a provided output pattern will be delocalized from the Slurm controller back to the current system in the following directory structure:
```
output_dir/
    {job id}/
        stdout
        stderr
        {other output names}/
            {matched files/directories}
```
#### resources
The `resources` section allows you to define additional arguments to `sbatch` to control resource allocation or other scheduling parameters. The `resources` dictionary is converted to command-line arguments as follows:
* Single-letter keys are converted to short (`-x`) options.
* Multi-letter keys are converted to long (`--xx`) options.
* Keys with a value of `True` are converted to flags (no value)
* Keys with any other value are converted to parameters (`--key=val`)
* Underscores in keys are converted to hyphens (`foo_bar` becomes `--foo-bar`)
%prep
%autosetup -n canine-0.9.0
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-canine -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 0.9.0-1
- Package Spec generated