# Multi-node Training

Using several Gaudi servers to perform multi-node training can be done easily. This guide shows how to:
- set up several Gaudi instances
- set up your computing environment
- launch a multi-node run

## Setting up several Gaudi instances

Two types of configurations are possible:
- scale-out using Gaudi NICs or Host NICs (on-premises)
- scale-out using Intel® Tiber™ AI Cloud instances

### On premises

To set up your servers on premises, check out the [installation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html) and [distributed training](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Scaling_Guide/index.html) pages of Intel® Gaudi® AI Accelerator's documentation.

### Intel Tiber AI Cloud instances

Follow the steps on [creating an account and getting an instance](https://docs.habana.ai/en/latest/Intel_DevCloud_Quick_Start/Intel_DevCloud_Quick_Start.html#creating-an-account-and-getting-an-instance) pages of Intel® Gaudi® AI Accelerator's documentation.

## Launching a Multi-node Run

Once your Intel Gaudi instances are ready, follow the steps for [setting up a multi-server environment](https://docs.habana.ai/en/latest/Intel_DevCloud_Quick_Start/Intel_DevCloud_Quick_Start.html#setting-up-a-multi-server-environment) pages of Intel® Gaudi® AI Accelerator's documentation.

Finally, there are two possible ways to run your training script on several nodes:

1. With the [`gaudi_spawn.py`](https://github.com/huggingface/optimum-habana/blob/main/examples/gaudi_spawn.py) script, you can run the following command:
```bash
python gaudi_spawn.py \
    --hostfile path_to_my_hostfile --use_deepspeed \
    path_to_my_script.py --args1 --args2 ... --argsN \
    --deepspeed path_to_my_deepspeed_config
```
where `--argX` is an argument of the script to run.

2. With the `DistributedRunner`, you can add this code snippet to a script:
```python
from optimum.habana.distributed import DistributedRunner

distributed_runner = DistributedRunner(
    command_list=["path_to_my_script.py --args1 --args2 ... --argsN"],
    hostfile=path_to_my_hostfile,
    use_deepspeed=True,
)
```

## Environment Variables

If you need to set environment variables for all nodes, you can specify them in a [`.deepspeed_env`](https://www.deepspeed.ai/getting-started/#multi-node-environment-variables) file which should be located in the local path you are executing from or in your home directory. The format is the following:
```
env_variable_1_name=value
env_variable_2_name=value
...
```

## Recommendations

- It is strongly recommended to use gradient checkpointing for multi-node runs to get the highest speedups. You can enable it with `--gradient_checkpointing` in [these examples](/examples) or with `gradient_checkpointing=True` in your `GaudiTrainingArguments`.
- Larger batch sizes should lead to higher speedups.
- Multi-node inference is not recommended and can provide inconsistent results.
- On Intel Tiber AI Cloud instances, run your Docker containers with the `--privileged` flag so that EFA devices are visible.

## Example

In this example, we fine-tune a pre-trained GPT2-XL model on the [WikiText dataset](https://huggingface.co/datasets/wikitext).
We are going to use the [causal language modeling example which is given in the Github repository](/examples/language-modeling#gpt-2gpt-and-causal-language-modeling).

The first step consists in training the model on several nodes with this command:
```bash
PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
    --hostfile path_to_hostfile --use_deepspeed run_clm.py \
    --model_name_or_path gpt2-xl \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --output_dir /tmp/gpt2_xl_multi_node \
    --learning_rate 4e-04 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --num_train_epochs 1 \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --deepspeed path_to_deepspeed_config
```

Evaluation is not performed in the same command because we do not recommend performing multi-node inference at the moment.

Once the model is trained, we can evaluate it with the following command.
The argument `--model_name_or_path` should be equal to the argument `--output_dir` of the previous command.
```bash
PT_HPU_LAZY_MODE=1 python run_clm.py \
    --model_name_or_path /tmp/gpt2_xl_multi_node \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_eval \
    --output_dir /tmp/gpt2_xl_multi_node \
    --per_device_eval_batch_size 8 \
    --use_habana \
    --use_lazy_mode
```

