%global _empty_manifest_terminate_build 0 Name: python-mlpug Version: 0.0.57 Release: 1 Summary: A machine learning library agnostic framework for model training and evaluation License: Apache Software License URL: https://github.com/nuhame/mlpug Source0: https://mirrors.nju.edu.cn/pypi/web/packages/99/a0/fb7ed09cabdf20f37c1ee4f1ac2a74e81c63d02eb06d14080e2134f8bd3b/mlpug-0.0.57.tar.gz BuildArch: noarch Requires: python3-visionscaper-pybase Requires: python3-tensorboardX %description # MLPug MLPug is a Machine Learning library agnostic framework for model training and evaluation. A lot of the functionality you need to train and evaluate your model is independent of the ML library you're using, e.g. PyTorch or Tensorflow. MLPug provides a single framework with a unified API for all such training and evaluation functionality, independent of the ML library you are using. **Thus, when switching ML library, you don't have to learn a new training API and you can reuse your own training code with no, or minimal, changes! 🤩🎉** ## Dive right in! ### Run the repository examples You can find the example code [here](mlpug/examples/documentation/). How MLPug is used in the examples is explained further [here](#hello-world-with-pytorch). Clone the MLPug repo: ``` git clone https://github.com/nuhame/mlpug.git ``` #### MLPug with PyTorch To run the PyTorch examples, install PyTorch first and use Python >= 3.7. ``` cd mlpug # MLPug Hello World example python mlpug/examples/documentation/pytorch/hello_world.py # MLPug Fashion MNIST example # Run `fashion_mnist.py -h` for options python mlpug/examples/documentation/pytorch/fashion_mnist.py ``` There are similar [examples for using MLPug with PyTorch/XLA](mlpug/examples/documentation/pytorch/xla) (Training with Pytorch on TPUs). #### MLPug with Tensorflow To run the Tensorflow examples, install Tensorflow first and use Python >= 3.7. ``` cd mlpug # MLPug Hello World example # Run hello_world.py or hello_world_not_eager.py python mlpug/examples/documentation/tensorflow/hello_world.py # MLPug Fashion MNIST example # Run `fashion_mnist.py -h` for options python mlpug/examples/documentation/tensorflow/fashion_mnist.py ``` ### Use MLPug in your own project ``` pip install mlpug ``` ```Python # Using MLPug with PyTorch import mlpug.pytorch as mlp ``` ```Python # Using MLPug with PyTorch/XLA (Training with Pytorch on TPUs) import mlpug.pytorch.xla as mlp ``` ```Python # Using MLPug with Tensorflow import mlpug.tensorflow as mlp ``` # What is MLPug? MLPug is a machine learning library agnostic framework for model training and evaluation. A lot of the functionality you need to train and evaluate your machine learning model is independent of the machine learning library you're using, e.g. PyTorch and Tensorflow. For instance, * checkpoint management, * evaluation of validation set loss and other custom metrics, * progress logging, * progress visualization using Tensorboard, * the use of gradient accumulation to train with large batch sizes using limited GPU memory, etc. You need such functionality no matter what machine learning framework you are using. MLPug provides a single framework with a unified API for all such training and evaluation functionality, independent of the machine learning library you are using. This also implies that when you switch libraries you can reuse your training code with no, or minimal, changes.
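To make this concrete, below is a minimal sketch of the shared training skeleton, using only the MLPug classes introduced in the Hello World section further down; `training_dataset`, `classifier`, `train_model`, `optimizer` and `num_epochs` are placeholders for your own, library-specific objects and settings, not part of MLPug.

```python
# Minimal sketch of the library-agnostic training skeleton.
# Only this import changes when you switch libraries:
import mlpug.pytorch as mlp   # or: mlpug.pytorch.xla / mlpug.tensorflow

# Placeholders (your own code): training_dataset, classifier (the model),
# train_model (a wrapper that returns the loss), optimizer, num_epochs.

trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer,
                                      model_components=classifier)
trainer.set_training_model(train_model)

callbacks = [
    # Log the batch loss and its moving average during training
    mlp.callbacks.TrainingMetricsLogger(
        metric_evaluator=mlp.evaluation.MetricEvaluator(trainer=trainer)),
]

manager = mlp.trainers.TrainingManager(trainer,
                                       training_dataset,
                                       num_epochs=num_epochs,
                                       callbacks=callbacks)
manager.start_training()
```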
## Supported deep learning libraries Currently, MLPug supports the following deep learning/machine learning libraries: * PyTorch * PyTorch/XLA (Training with Pytorch on TPUs) * Tensorflow (in development, some features not available yet) ## MLPug focus Although MLPug should be able to deal with any training job, its functionality is mostly focused on training large models on large datasets, using limited hardware (GPU or TPU) resources and memory. ## Almost at version 0.1! MLPug is still in development. If you are having trouble using MLPug for your use case, or have found a bug, please file an issue. ## Contents [Installing MLPug](#installing-mlpug) \ \ [Hello World](#hello-world) ([PT](#hello-world-with-pytorch) | [XLA](#hello-world-with-pytorchxla) | [TF](#hello-world-with-tensorflow)) [Feature parity list](#feature-parity-list) \ \ \ The following sections are documentation **ToDo's**, but provide insight into MLPug's features: \ [The `logs` object](#the-logs-object) \ \ [Callbacks and the training life cycle](#callbacks-and-the-training-life-cycle) \ \ [Progress Logging](#progress-logging) \ \ [Model components vs Training model](#model-components-vs-training-model) \ \ [Distributed training](#distributed-training) \ \ [Checkpoint management](#checkpoint-management) \       [Using the CheckpointManager](#using-the-checkpointmanager) \       [Using training checkpoints](#using-training-checkpoints) \       [Using model checkpoints](#using-model-checkpoints) \       [Checkpointing on error or interrupt](#checkpointing-on-error-or-interrupt) \ \ [MLPug metric evaluators](#mlpug-metric-evaluators) \       [Auxiliary batch training results](#auxiliary-batch-training-results) \       [Calculating custom metrics](#calculating-custom-metrics) \       [Conditional computation of metrics](#conditional-computation-of-metrics) \ \ [Batch chunking, dealing with GPU memory limits](#batch-chunking-dealing-with-gpu-memory-limits) \       [Gradient Accumulation](#gradient-accumulation) \       [Chunked Metric Computation](#chunked-metric-computation) \ \ [Using Tensorboard](#using-tensorboard) \       [Tensorboard made easy with AutoTensorboard](#tensorboard-made-easy-with-auto-tensorboard) \       [More fine grained control](#more-fine-grained-control) \ \ [Learning Rate Scheduling](#learning-rate-scheduling) \ \ [Multi GPU training](#multi-gpu-training) \ \ [Mixed Precision Training](#mixed-precision-training) \ \ [CUDA Memory tools](#cuda-memory-tools) \ \ [Using multiple optimizers](#using-multiple-optimizers) ## Installing MLPug Please ensure that you are using Python 3.7+. Install as follows: ``` pip install mlpug ``` ### Usage with PyTorch When you want to use MLPug with PyTorch, you will need to install it: ``` pip install torch torchvision ``` ### Usage with Tensorflow When you want to use MLPug with Tensorflow, you will need to install it: ``` pip install tensorflow ``` ## Hello World! This is the Hello World of training with MLPug. You will see that the usage of MLPug with Pytorch, Pytorch/XLA and Tensorflow is very similar.
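Because only the import differs between backends, you can even pick the backend at runtime; the `load_mlpug` helper below is a hypothetical convenience sketch, not part of the MLPug API.

```python
# Hypothetical helper: select the MLPug backend at runtime so the rest of the
# training script stays identical across libraries.
import importlib

_BACKENDS = {
    "pytorch": "mlpug.pytorch",
    "xla": "mlpug.pytorch.xla",        # PyTorch/XLA (training on TPUs)
    "tensorflow": "mlpug.tensorflow",
}

def load_mlpug(backend: str):
    """Import and return the MLPug module for the given backend name."""
    return importlib.import_module(_BACKENDS[backend])

mlp = load_mlpug("pytorch")            # equivalent to: import mlpug.pytorch as mlp
```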
For details please see: * [pytorch/hello_world.py](mlpug/examples/documentation/pytorch/hello_world.py), * [pytorch/xla/hello_world.py](mlpug/examples/documentation/pytorch/xla/hello_world.py), * [tensorflow/hello_world.py](mlpug/examples/documentation/tensorflow/hello_world.py) and [tensorflow/hello_world_not_eager.py](mlpug/examples/documentation/tensorflow/hello_world_not_eager.py) You can download and run these examples (for XLA you need a TPU on Google Cloud, or Google Colab). After reading the explanation below you may still have questions about the why and how of training with MLPug; I will expand the MLPug documentation soon to give you better insight. ### 'Hello World' with PyTorch To use MLPug with Pytorch ```python import mlpug.pytorch as mlp ``` Before we can start training we need an iterable dataset that can provide our training batches. ```python training_dataset = torch.utils.data.DataLoader(training_data, batch_size=batch_size, shuffle=False, num_workers=3) ``` ... and a model we want to train ```python classifier = torch.nn.Sequential( torch.nn.Flatten(), torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)) ``` MLPug needs a way to evaluate the loss of the model. One way to do that is to define a `TrainModel` that outputs the loss ```python class TrainModel(torch.nn.Module): def __init__(self, classifier): super(TrainModel, self).__init__() self.classifier = classifier self.loss_func = torch.nn.CrossEntropyLoss() def forward(self, batch_data, evaluate_settings, inference_mode=None): images, true_labels = batch_data logits = self.classifier(images) return self.loss_func(logits, true_labels) train_model = TrainModel(classifier) ``` To train the model we will also need an optimizer ```python optimizer = torch.optim.Adam(classifier.parameters(), eps=1e-7) ``` To start training with MLPug, we need to create a `Trainer`, which will be used by a `TrainingManager`. ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier) ``` MLPug uses a callback system that allows you to customize and extend the training functionality. The list of callback instances you provide to the `TrainingManager` will be called, through hooks, at different stages of the training process. ```python # At minimum you want to log the loss in the training progress # By default the batch loss and the moving average of the loss are calculated and logged loss_evaluator = mlp.evaluation.MetricEvaluator(trainer=trainer) callbacks = [ mlp.callbacks.TrainingMetricsLogger(metric_evaluator=loss_evaluator), # Calculate validation loss only once per epoch over the whole dataset mlp.callbacks.TestMetricsLogger(validation_dataset, 'validation', metric_evaluator=loss_evaluator, batch_level=False), mlp.callbacks.LogProgress(log_period=progress_log_period, set_names=['training', 'validation']), ] ``` The `TrainingMetricsLogger` and the `TestMetricsLogger` callback instances log training and validation set loss values in a `logs` object that is passed through all callbacks during training. The `LogProgress` callback instance logs the metric values stored in the received `logs` object. We can now instantiate the `TrainingManager` and pass it the `trainer`. ```python manager = mlp.trainers.TrainingManager(trainer, training_dataset, num_epochs=num_epochs, callbacks=callbacks) ``` Before we can start training we still have to provide the `train_model` to the trainer.
```python trainer.set_training_model(train_model) ``` The final step is to actually start training: ```python manager.start_training() ``` Running `pytorch/hello_world.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:08 Moving average: training : loss 0.238. Computed over dataset: validation : loss 0.346. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` ### 'Hello World' with PyTorch/XLA The Hello World example with PyTorch/XLA, is largely the same as with [PyTorch](#hello-world-with-pytorch). There are only two small differences. To use MLPug with Pytorch/XLA, load the correct backend ```python import mlpug.pytorch.xla as mlp ``` Load your model on a TPU core: ```python import torch_xla.core.xla_model as xm ... device = xm.xla_device() train_model = TrainModel(classifier, device) classifier.to(device) ``` ### 'Hello World' with Tensorflow Below we will focus only on the minor differences between using MLPug with [PyTorch](#hello-world-with-pytorch) and Tensorflow. To use MLPug with Tensorflow ```python import mlpug.tensorflow as mlp ``` The only real difference is that, for Tensorflow, you can specify if the trainer needs to run in eager mode or not. If not, you need to specify the input `batch_data_signature`. ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier, eager_mode=True) ``` ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier, batch_data_signature=(tf.TensorSpec(shape=(None, 28, 28), dtype=tf.float64), tf.TensorSpec(shape=(None,), dtype=tf.uint8),)) ``` When you run [tensorflow/hello_world.py](mlpug/examples/documentation/tensorflow/hello_world.py) and [tensorflow/hello_world_not_eager.py](mlpug/examples/documentation/tensorflow/hello_world_not_eager.py) you will see that when not running in eager mode, training is much faster. Running `tensorflow/hello_world.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:15 Moving average: training : loss 0.229. Computed over dataset: validation : loss 0.370. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` Running `tensorflow/hello_world_not_eager.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:06 Moving average: training : loss 0.229. Computed over dataset: validation : loss 0.370. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` Note the difference in epoch duration! ## Feature parity list | Feature | PyTorch | PyTorch/XLA | Tensorflow | JAX | Comments | |-----------------------------------------------|-------------|-------------|-------------|-------------|----------------------------------| | Callbacks and training life cycle | ✓ | ✓ | ✓ | | | | Progress Logging | ✓ | ✓ | ✓ | | | | Distributed training | ✓ | ✓ | ✓ | | Both multi-GPU and multi-TPU support for PyTorch and TF. TPU training with TF is untested | | Distributed evaluation | ✓ | ✓ | ✓ | | Both multi-GPU and multi-TPU support for PyTorch and TF. 
evaluation on TPU with TF is untested | | Model and training checkpoint management | ✓ | ✓ | ✓ | | | | Custom metric evaluation | ✓ | ✓ | ✓ | | | | Conditional evaluation of metrics | ✓ | ✓ | ✓ | | | | Batch Chunking: gradient accumulation | ✓ | ✓ | ✓ | | | | Batch Chunking: chunked evaluation of metrics | ✓ | ✓ | ✓ | | | | Tensorboard support | ✓ | ✓ | ✓ | | Might be refactored | | Learning Rate scheduling | ✓ | ✓ | ✓ | | Might be refactored | | Mixed Precision Training | ✓ | ❌ | ~ | | Should work with TF, but no specific support | | Using multiple optimizers | ✓ | ✓ | ✓ | | | | Multi-task training | ~ | ~ | ~ | | No support yet, but can be done when only one DataLoader is required | %package -n python3-mlpug Summary: A machine learning library agnostic framework for model training and evaluation Provides: python-mlpug BuildRequires: python3-devel BuildRequires: python3-setuptools BuildRequires: python3-pip %description -n python3-mlpug # MLPug MLPug is a Machine Learning library agnostic framework for model training and evaluation. A lot of the functionality you need to train and evaluate your model is independent of the ML library you're using, e.g. PyTorch or Tensorflow. MLPug provides a single framework with a unified API for all such training and evaluation functionality, independent of the ML library you are using. **Thus, when switching ML library, you don't have to learn a new training API and you can reuse your own training code with no, or minimal, change! 🤩🎉** ## Dive right in! ### Run the repository examples You can find the example code [here](mlpug/examples/documentation/). How MLPug is used in the examples is explained further [here](#hello-world-with-pytorch). Clone the MLPug repo: ``` git clone https://github.com/nuhame/mlpug.git ``` #### MLPug with PyTorch To run the PyTorch examples, install PyTorch first, further use Python >= 3.7. ``` cd mlpug # MLPug Hello World example python mlpug/examples/documentation/pytorch/hello_world.py # MLPug Fashion MNIST example # Run `fashion_mnist.py -h` for options python mlpug/examples/documentation/pytorch/fashion_mnist.py ``` There are similar [examples for using MLPug with PyTorch/XLA](mlpug/examples/documentation/pytorch/xla) (Training with Pytorch on TPUs). #### MLPug with Tensorflow To run the Tensorflow examples, install Tensorflow first, further use Python >= 3.7. ``` cd mlpug # MLPug Hello World example # Run hello_world.py or hello_world_not_eager.py python mlpug/examples/documentation/tensorflow/hello_world.py # MLPug Fashion MNIST example # Run `fashion_mnist.py -h` for options python mlpug/examples/documentation/pytorch/fashion_mnist.py ``` ### Use MLPug in your own project ``` pip install mlpug ``` ```Python # Using MLPug with PyTorch import mlpug.pytorch as mlp ``` ```Python # Using MLPug with PyTorch/XLA (Training with Pytorch on TPUs) import mlpug.pytorch.xla as mlp ``` ```Python # Using MLPug with Tensorflow import mlpug.tensorflow as mlp ``` # What is MLPug? MLPug is a machine learning library agnostic framework for model training and evaluation. A lot of the functionality you need to train and evaluate your machine learning model is independent of the machine learning library you're using, e.g. PyTorch and Tensorflow. For instance, * checkpoint management, * evaluation of validation set loss and other custom metrics, * progress logging, * progress visualization using Tensorboard, * the use of gradient accumulation to train with large batch sizes using limited GPU memory, etc.. 
You need such functionality no matter what machine learning framework you are using. MLPug provides a single framework with a unified API for all such training and evaluation functionality, independent of the machine learning library you are using. This also implies that when you switch library you can reuse your training code with no, or minimal, changes. ## Supported deep learning libraries Currently, MLPug supports the following deep learning/machine learning libraries: * PyTorch * PyTorch/XLA (Training with Pytorch on TPUs) * Tensorflow (in development, some features not available yet) ## MLPug focus Although MLPug should be able to deal with any training job, its functionality is mostly focussed on dealing with training large models on large datasets, using limited hardware (GPU or TPU) resources and memory. ## Almost at version 0.1! MLPug is still in development. If you are having trouble using MLPug for your use case, or when you have found a bug, please file an issue. ## Contents [Installing MLPug](#installing-mlpug) \ \ [Hello World](#hello-world) ([PT](#hello-world-with-pytorch) | [XLA](#hello-world-with-pytorchxla) | [TF](#hello-world-with-tensorflow)) [Feature parity list](#feature-parity-list) \ \ \ The following sections are documentation **ToDo's**, but provide insight in to MLPug's features: \ [The `logs` object](#the-logs-object) \ \ [Callbacks and the training life cycle](#callbacks-and-the-training-life-cycle) \ \ [Progress Logging](#progress-logging) \ \ [Model components vs Training model](#model-components-vs-training-model) \ \ [Distributed training](#distributed-training) \ \ [Checkpoint management](#checkpoint-management) \       [Using the CheckpointManager](#using-the-checkpointmanager) \       [Using training checkpoints](#using-training-checkpoints) \       [Using model checkpoints](#using-model-checkpoints) \       [Checkpointing on error or interrupt](#checkpointing-on-error-or-interrupt) \ \ [MLPug metric evaluators](#mlpug-metric-evaluators) \       [Auxiliary batch training results](#auxiliary-batch-training-results) \       [Calculating custom metrics](#calculating-custom-metrics) \       [Conditional computation of metrics](#conditional-computation-of-metrics) \ \ [Batch chunking, dealing with GPU memory limits](#batch-chunking-dealing-with-gpu-memory-limits) \       [Gradient Accumulation](#gradient-accumulation) \       [Chunked Metric Computation](#chunked-metric-computation) \ \ [Using Tensorboard](#using-tensorboard) \       [Tensorboard made easy with AutoTensorboard](#tensorboard-made-easy-with-auto-tensorboard) \       [More fine grained control](#more-fine-grained-control) \ \ [Learning Rate Scheduling](#learning-rate-scheduling) \ \ [Multi GPU training](#multi-gpu-training) \ \ [Mixed Precision Training](#mixed-precision-training) \ \ [CUDA Memory tools](#cuda-memory-tools) \ \ [Using multiple optimizers](#using-multiple-optimizers) ## Installing MLPug Please ensure that you are using Python3.7+. Install as follows: ``` pip install mlpug ``` ### Usage with PyTorch When you want to use MLPug with PyTorch, you will need to install it: ``` pip install torch torchvision ``` ### Usage with Tensorflow When you want to use MLPug with Tensorflow, you will need to install it: ``` pip install tensorflow ``` ## Hello World! This is the Hello World of training with MLPug. You will see that the usage of MLPug with Pytorch, Pytorch/XLA and Tensorflow is very similar. 
For details please see : * [pytorch/hello_world.py](mlpug/examples/documentation/pytorch/hello_world.py), * [pytorch/xla/hello_world.py](mlpug/examples/documentation/pytorch/xla/hello_world.py), * [tensorflow/hello_world.py](mlpug/examples/documentation/tensorflow/hello_world.py) and [tensorflow/hello_world_not_eager.py](mlpug/examples/documentation/tensorflow/hello_world_not_eager.py) You can download and run these examples (for XLA you need to use a TPU on Google Cloud, or use Google Colab). When reading through the explanation below it might be that you still have a lot of questions about the why and how of training with MLPug, however I will expand the MLPug documentation soon, so you will get better insight. ### 'Hello World' with PyTorch To use MLPug with Pytorch ```python import mlpug.pytorch as mlp ``` Before we can start training we need an iterable dataset that can provide our training batches. ```python training_dataset = torch.utils.data.DataLoader(training_data, batch_size=batch_size, shuffle=False, num_workers=3) ``` ... and a model we want to train ```python classifier = torch.nn.Sequential( torch.nn.Flatten(), torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)) ``` MLPug needs a way to evaluate the loss of the model. One way to do that is to define a `TrainModel` that outputs the loss ```python class TrainModel(torch.nn.Module): def __init__(self, classifier): super(TrainModel, self).__init__() self.classifier = classifier self.loss_func = torch.nn.CrossEntropyLoss() def forward(self, batch_data, evaluate_settings, inference_mode=None): images, true_labels = batch_data logits = self.classifier(images) return self.loss_func(logits, true_labels) train_model = TrainModel(classifier) ``` To train the model we will also need an optimizer ```python optimizer = torch.optim.Adam(classifier.parameters(), eps=1e-7) ``` To now use MLPug to start training, we need to create a `Trainer` which will be used by a `TrainingManager`. ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier) ``` MLPug uses a callback system allowing you to customize and extend the training functionality. The list of callback instances you provide the `TrainingManager` will be called using hooks at different stages of the training process. ```python # At minimum you want to log the loss in the training progress # By default the batch loss and the moving average of the loss are calculated and logged loss_evaluator = mlp.evaluation.MetricEvaluator(trainer=trainer) callbacks = [ mlp.callbacks.TrainingMetricsLogger(metric_evaluator=loss_evaluator), # Calculate validation loss only once per epoch over the whole dataset mlp.callbacks.TestMetricsLogger(validation_dataset, 'validation', metric_evaluator=loss_evaluator, batch_level=False), mlp.callbacks.LogProgress(log_period=progress_log_period, set_names=['training', 'validation']), ] ``` The `TrainingMetricsLogger` and the `TestMetricsLogger` callback instances log training and validation set loss values in a `logs` object that is passed through all callbacks during training. The `LogProgress` callback instance logs the metric values stored in the received `logs` object. We can now instantiate the `TrainingManager` and pass it the `trainer`. ```python manager = mlp.trainers.TrainingManager(trainer, training_dataset, num_epochs=num_epochs, callbacks=callbacks) ``` Before we can start training we still have to provide the `train_model` to the trainer. 
```python trainer.set_training_model(train_model) ``` The final step is to actually start training: ```python manager.start_training() ``` Running `pytorch/hello_world.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:08 Moving average: training : loss 0.238. Computed over dataset: validation : loss 0.346. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` ### 'Hello World' with PyTorch/XLA The Hello World example with PyTorch/XLA, is largely the same as with [PyTorch](#hello-world-with-pytorch). There are only two small differences. To use MLPug with Pytorch/XLA, load the correct backend ```python import mlpug.pytorch.xla as mlp ``` Load your model on a TPU core: ```python import torch_xla.core.xla_model as xm ... device = xm.xla_device() train_model = TrainModel(classifier, device) classifier.to(device) ``` ### 'Hello World' with Tensorflow Below we will focus only on the minor differences between using MLPug with [PyTorch](#hello-world-with-pytorch) and Tensorflow. To use MLPug with Tensorflow ```python import mlpug.tensorflow as mlp ``` The only real difference is that, for Tensorflow, you can specify if the trainer needs to run in eager mode or not. If not, you need to specify the input `batch_data_signature`. ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier, eager_mode=True) ``` ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier, batch_data_signature=(tf.TensorSpec(shape=(None, 28, 28), dtype=tf.float64), tf.TensorSpec(shape=(None,), dtype=tf.uint8),)) ``` When you run [tensorflow/hello_world.py](mlpug/examples/documentation/tensorflow/hello_world.py) and [tensorflow/hello_world_not_eager.py](mlpug/examples/documentation/tensorflow/hello_world_not_eager.py) you will see that when not running in eager mode, training is much faster. Running `tensorflow/hello_world.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:15 Moving average: training : loss 0.229. Computed over dataset: validation : loss 0.370. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` Running `tensorflow/hello_world_not_eager.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:06 Moving average: training : loss 0.229. Computed over dataset: validation : loss 0.370. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` Note the difference in epoch duration! ## Feature parity list | Feature | PyTorch | PyTorch/XLA | Tensorflow | JAX | Comments | |-----------------------------------------------|-------------|-------------|-------------|-------------|----------------------------------| | Callbacks and training life cycle | ✓ | ✓ | ✓ | | | | Progress Logging | ✓ | ✓ | ✓ | | | | Distributed training | ✓ | ✓ | ✓ | | Both multi-GPU and multi-TPU support for PyTorch and TF. TPU training with TF is untested | | Distributed evaluation | ✓ | ✓ | ✓ | | Both multi-GPU and multi-TPU support for PyTorch and TF. 
evaluation on TPU with TF is untested | | Model and training checkpoint management | ✓ | ✓ | ✓ | | | | Custom metric evaluation | ✓ | ✓ | ✓ | | | | Conditional evaluation of metrics | ✓ | ✓ | ✓ | | | | Batch Chunking: gradient accumulation | ✓ | ✓ | ✓ | | | | Batch Chunking: chunked evaluation of metrics | ✓ | ✓ | ✓ | | | | Tensorboard support | ✓ | ✓ | ✓ | | Might be refactored | | Learning Rate scheduling | ✓ | ✓ | ✓ | | Might be refactored | | Mixed Precision Training | ✓ | ❌ | ~ | | Should work with TF, but no specific support | | Using multiple optimizers | ✓ | ✓ | ✓ | | | | Multi-task training | ~ | ~ | ~ | | No support yet, but can be done when only one DataLoader is required | %package help Summary: Development documents and examples for mlpug Provides: python3-mlpug-doc %description help # MLPug MLPug is a Machine Learning library agnostic framework for model training and evaluation. A lot of the functionality you need to train and evaluate your model is independent of the ML library you're using, e.g. PyTorch or Tensorflow. MLPug provides a single framework with a unified API for all such training and evaluation functionality, independent of the ML library you are using. **Thus, when switching ML library, you don't have to learn a new training API and you can reuse your own training code with no, or minimal, change! 🤩🎉** ## Dive right in! ### Run the repository examples You can find the example code [here](mlpug/examples/documentation/). How MLPug is used in the examples is explained further [here](#hello-world-with-pytorch). Clone the MLPug repo: ``` git clone https://github.com/nuhame/mlpug.git ``` #### MLPug with PyTorch To run the PyTorch examples, install PyTorch first, further use Python >= 3.7. ``` cd mlpug # MLPug Hello World example python mlpug/examples/documentation/pytorch/hello_world.py # MLPug Fashion MNIST example # Run `fashion_mnist.py -h` for options python mlpug/examples/documentation/pytorch/fashion_mnist.py ``` There are similar [examples for using MLPug with PyTorch/XLA](mlpug/examples/documentation/pytorch/xla) (Training with Pytorch on TPUs). #### MLPug with Tensorflow To run the Tensorflow examples, install Tensorflow first, further use Python >= 3.7. ``` cd mlpug # MLPug Hello World example # Run hello_world.py or hello_world_not_eager.py python mlpug/examples/documentation/tensorflow/hello_world.py # MLPug Fashion MNIST example # Run `fashion_mnist.py -h` for options python mlpug/examples/documentation/pytorch/fashion_mnist.py ``` ### Use MLPug in your own project ``` pip install mlpug ``` ```Python # Using MLPug with PyTorch import mlpug.pytorch as mlp ``` ```Python # Using MLPug with PyTorch/XLA (Training with Pytorch on TPUs) import mlpug.pytorch.xla as mlp ``` ```Python # Using MLPug with Tensorflow import mlpug.tensorflow as mlp ``` # What is MLPug? MLPug is a machine learning library agnostic framework for model training and evaluation. A lot of the functionality you need to train and evaluate your machine learning model is independent of the machine learning library you're using, e.g. PyTorch and Tensorflow. For instance, * checkpoint management, * evaluation of validation set loss and other custom metrics, * progress logging, * progress visualization using Tensorboard, * the use of gradient accumulation to train with large batch sizes using limited GPU memory, etc.. You need such functionality no matter what machine learning framework you are using. 
MLPug provides a single framework with a unified API for all such training and evaluation functionality, independent of the machine learning library you are using. This also implies that when you switch library you can reuse your training code with no, or minimal, changes. ## Supported deep learning libraries Currently, MLPug supports the following deep learning/machine learning libraries: * PyTorch * PyTorch/XLA (Training with Pytorch on TPUs) * Tensorflow (in development, some features not available yet) ## MLPug focus Although MLPug should be able to deal with any training job, its functionality is mostly focussed on dealing with training large models on large datasets, using limited hardware (GPU or TPU) resources and memory. ## Almost at version 0.1! MLPug is still in development. If you are having trouble using MLPug for your use case, or when you have found a bug, please file an issue. ## Contents [Installing MLPug](#installing-mlpug) \ \ [Hello World](#hello-world) ([PT](#hello-world-with-pytorch) | [XLA](#hello-world-with-pytorchxla) | [TF](#hello-world-with-tensorflow)) [Feature parity list](#feature-parity-list) \ \ \ The following sections are documentation **ToDo's**, but provide insight in to MLPug's features: \ [The `logs` object](#the-logs-object) \ \ [Callbacks and the training life cycle](#callbacks-and-the-training-life-cycle) \ \ [Progress Logging](#progress-logging) \ \ [Model components vs Training model](#model-components-vs-training-model) \ \ [Distributed training](#distributed-training) \ \ [Checkpoint management](#checkpoint-management) \       [Using the CheckpointManager](#using-the-checkpointmanager) \       [Using training checkpoints](#using-training-checkpoints) \       [Using model checkpoints](#using-model-checkpoints) \       [Checkpointing on error or interrupt](#checkpointing-on-error-or-interrupt) \ \ [MLPug metric evaluators](#mlpug-metric-evaluators) \       [Auxiliary batch training results](#auxiliary-batch-training-results) \       [Calculating custom metrics](#calculating-custom-metrics) \       [Conditional computation of metrics](#conditional-computation-of-metrics) \ \ [Batch chunking, dealing with GPU memory limits](#batch-chunking-dealing-with-gpu-memory-limits) \       [Gradient Accumulation](#gradient-accumulation) \       [Chunked Metric Computation](#chunked-metric-computation) \ \ [Using Tensorboard](#using-tensorboard) \       [Tensorboard made easy with AutoTensorboard](#tensorboard-made-easy-with-auto-tensorboard) \       [More fine grained control](#more-fine-grained-control) \ \ [Learning Rate Scheduling](#learning-rate-scheduling) \ \ [Multi GPU training](#multi-gpu-training) \ \ [Mixed Precision Training](#mixed-precision-training) \ \ [CUDA Memory tools](#cuda-memory-tools) \ \ [Using multiple optimizers](#using-multiple-optimizers) ## Installing MLPug Please ensure that you are using Python3.7+. Install as follows: ``` pip install mlpug ``` ### Usage with PyTorch When you want to use MLPug with PyTorch, you will need to install it: ``` pip install torch torchvision ``` ### Usage with Tensorflow When you want to use MLPug with Tensorflow, you will need to install it: ``` pip install tensorflow ``` ## Hello World! This is the Hello World of training with MLPug. You will see that the usage of MLPug with Pytorch, Pytorch/XLA and Tensorflow is very similar. 
For details please see : * [pytorch/hello_world.py](mlpug/examples/documentation/pytorch/hello_world.py), * [pytorch/xla/hello_world.py](mlpug/examples/documentation/pytorch/xla/hello_world.py), * [tensorflow/hello_world.py](mlpug/examples/documentation/tensorflow/hello_world.py) and [tensorflow/hello_world_not_eager.py](mlpug/examples/documentation/tensorflow/hello_world_not_eager.py) You can download and run these examples (for XLA you need to use a TPU on Google Cloud, or use Google Colab). When reading through the explanation below it might be that you still have a lot of questions about the why and how of training with MLPug, however I will expand the MLPug documentation soon, so you will get better insight. ### 'Hello World' with PyTorch To use MLPug with Pytorch ```python import mlpug.pytorch as mlp ``` Before we can start training we need an iterable dataset that can provide our training batches. ```python training_dataset = torch.utils.data.DataLoader(training_data, batch_size=batch_size, shuffle=False, num_workers=3) ``` ... and a model we want to train ```python classifier = torch.nn.Sequential( torch.nn.Flatten(), torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)) ``` MLPug needs a way to evaluate the loss of the model. One way to do that is to define a `TrainModel` that outputs the loss ```python class TrainModel(torch.nn.Module): def __init__(self, classifier): super(TrainModel, self).__init__() self.classifier = classifier self.loss_func = torch.nn.CrossEntropyLoss() def forward(self, batch_data, evaluate_settings, inference_mode=None): images, true_labels = batch_data logits = self.classifier(images) return self.loss_func(logits, true_labels) train_model = TrainModel(classifier) ``` To train the model we will also need an optimizer ```python optimizer = torch.optim.Adam(classifier.parameters(), eps=1e-7) ``` To now use MLPug to start training, we need to create a `Trainer` which will be used by a `TrainingManager`. ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier) ``` MLPug uses a callback system allowing you to customize and extend the training functionality. The list of callback instances you provide the `TrainingManager` will be called using hooks at different stages of the training process. ```python # At minimum you want to log the loss in the training progress # By default the batch loss and the moving average of the loss are calculated and logged loss_evaluator = mlp.evaluation.MetricEvaluator(trainer=trainer) callbacks = [ mlp.callbacks.TrainingMetricsLogger(metric_evaluator=loss_evaluator), # Calculate validation loss only once per epoch over the whole dataset mlp.callbacks.TestMetricsLogger(validation_dataset, 'validation', metric_evaluator=loss_evaluator, batch_level=False), mlp.callbacks.LogProgress(log_period=progress_log_period, set_names=['training', 'validation']), ] ``` The `TrainingMetricsLogger` and the `TestMetricsLogger` callback instances log training and validation set loss values in a `logs` object that is passed through all callbacks during training. The `LogProgress` callback instance logs the metric values stored in the received `logs` object. We can now instantiate the `TrainingManager` and pass it the `trainer`. ```python manager = mlp.trainers.TrainingManager(trainer, training_dataset, num_epochs=num_epochs, callbacks=callbacks) ``` Before we can start training we still have to provide the `train_model` to the trainer. 
```python trainer.set_training_model(train_model) ``` The final step is to actually start training: ```python manager.start_training() ``` Running `pytorch/hello_world.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:08 Moving average: training : loss 0.238. Computed over dataset: validation : loss 0.346. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` ### 'Hello World' with PyTorch/XLA The Hello World example with PyTorch/XLA, is largely the same as with [PyTorch](#hello-world-with-pytorch). There are only two small differences. To use MLPug with Pytorch/XLA, load the correct backend ```python import mlpug.pytorch.xla as mlp ``` Load your model on a TPU core: ```python import torch_xla.core.xla_model as xm ... device = xm.xla_device() train_model = TrainModel(classifier, device) classifier.to(device) ``` ### 'Hello World' with Tensorflow Below we will focus only on the minor differences between using MLPug with [PyTorch](#hello-world-with-pytorch) and Tensorflow. To use MLPug with Tensorflow ```python import mlpug.tensorflow as mlp ``` The only real difference is that, for Tensorflow, you can specify if the trainer needs to run in eager mode or not. If not, you need to specify the input `batch_data_signature`. ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier, eager_mode=True) ``` ```python trainer = mlp.trainers.DefaultTrainer(optimizers=optimizer, model_components=classifier, batch_data_signature=(tf.TensorSpec(shape=(None, 28, 28), dtype=tf.float64), tf.TensorSpec(shape=(None,), dtype=tf.uint8),)) ``` When you run [tensorflow/hello_world.py](mlpug/examples/documentation/tensorflow/hello_world.py) and [tensorflow/hello_world_not_eager.py](mlpug/examples/documentation/tensorflow/hello_world_not_eager.py) you will see that when not running in eager mode, training is much faster. Running `tensorflow/hello_world.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:15 Moving average: training : loss 0.229. Computed over dataset: validation : loss 0.370. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` Running `tensorflow/hello_world_not_eager.py` finishes like this: ```text ############################################################################### Epoch 9/9 READY - Duration 0:00:06 Moving average: training : loss 0.229. Computed over dataset: validation : loss 0.370. INFO : TrainingManager::_train : Training completed. All good! ❤️ Using the classifier ... real label = 9, predicted label = 9 ``` Note the difference in epoch duration! ## Feature parity list | Feature | PyTorch | PyTorch/XLA | Tensorflow | JAX | Comments | |-----------------------------------------------|-------------|-------------|-------------|-------------|----------------------------------| | Callbacks and training life cycle | ✓ | ✓ | ✓ | | | | Progress Logging | ✓ | ✓ | ✓ | | | | Distributed training | ✓ | ✓ | ✓ | | Both multi-GPU and multi-TPU support for PyTorch and TF. TPU training with TF is untested | | Distributed evaluation | ✓ | ✓ | ✓ | | Both multi-GPU and multi-TPU support for PyTorch and TF. 
evaluation on TPU with TF is untested | | Model and training checkpoint management | ✓ | ✓ | ✓ | | | | Custom metric evaluation | ✓ | ✓ | ✓ | | | | Conditional evaluation of metrics | ✓ | ✓ | ✓ | | | | Batch Chunking: gradient accumulation | ✓ | ✓ | ✓ | | | | Batch Chunking: chunked evaluation of metrics | ✓ | ✓ | ✓ | | | | Tensorboard support | ✓ | ✓ | ✓ | | Might be refactored | | Learning Rate scheduling | ✓ | ✓ | ✓ | | Might be refactored | | Mixed Precision Training | ✓ | ❌ | ~ | | Should work with TF, but no specific support | | Using multiple optimizers | ✓ | ✓ | ✓ | | | | Multi-task training | ~ | ~ | ~ | | No support yet, but can be done when only one DataLoader is required | %prep %autosetup -n mlpug-0.0.57 %build %py3_build %install %py3_install install -d -m755 %{buildroot}/%{_pkgdocdir} if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi pushd %{buildroot} if [ -d usr/lib ]; then find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst fi if [ -d usr/lib64 ]; then find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst fi if [ -d usr/bin ]; then find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst fi if [ -d usr/sbin ]; then find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst fi touch doclist.lst if [ -d usr/share/man ]; then find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst fi popd mv %{buildroot}/filelist.lst . mv %{buildroot}/doclist.lst . %files -n python3-mlpug -f filelist.lst %dir %{python3_sitelib}/* %files help -f doclist.lst %{_docdir}/* %changelog * Wed May 10 2023 Python_Bot - 0.0.57-1 - Package Spec generated