Training LLaMA2-7B

Create a GPU Instance

  1. Log in to the HPC-AI.com website:

    • After logging in, navigate to the Instances page.
  2. Create an instance:

    • (Optional) Configure shared storage by going to the shared storage page, clicking Create Storage, and creating the necessary space.
    • Click the "Create GPU Instance" button to start the creation process.
  3. Select the configuration:

    • Select the GPU type.
    • Select the server location.
    • Select the number of cards (e.g., 8 cards per machine).
    • (Optional) Mount the previously created shared storage.
  4. Configure instance information:

    • Instance Name: colossal
    • Image: Select the CUDA (12.1) image (ubuntu==20.04, cuda==12.1, python==3.11, conda), or upload a custom image.
    • External storage: Not needed in this case because the dataset is small.
  5. Initialize the instance:

    • Wait for the instance to initialize after configuration.
    • Connect to the instance via SSH once initialized.

Environment Configuration

  1. Create a Conda environment:

    conda create --name myenv python=3.11
    conda activate myenv
  2. Install PyTorch:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  3. Verify installation:

    import torch
    print(f"PyTorch version: {torch.__version__}")
    print(f"Is CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"Number of GPU devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")
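As a cross-check that does not depend on PyTorch, the driver's own view of the GPUs can be compared with the number of cards selected for the instance. The following is a minimal sketch, assuming nvidia-smi is available on the PATH; the helper names parse_gpu_names and list_gpus are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_names(smi_output: str) -> list[str]:
    """Split `nvidia-smi --query-gpu=name --format=csv,noheader` output into GPU names."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]


def list_gpus() -> list[str]:
    """Return the GPU names reported by the driver, or [] if nvidia-smi cannot be used."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_names(result.stdout)


gpus = list_gpus()
print(f"Driver reports {len(gpus)} GPU(s)")
for index, name in enumerate(gpus):
    print(f"Device {index}: {name}")
```

If the count printed here differs from the count PyTorch reports, the mismatch is worth resolving before moving on to the Colossal-AI installation.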

Installing Colossal-AI

Note: Run the following commands inside the Conda environment created earlier so that all Colossal-AI dependencies are installed into it.

  1. Clone the official repository:

    cd ~
    git clone https://github.com/hpcaitech/ColossalAI.git
    cd ColossalAI
  2. Compile Kernels:

    BUILD_EXT=1 pip install .
  3. Install Flash Attention:

    pip install flash-attn --no-build-isolation
  4. Install NVIDIA Apex:

    cd ~
    git clone https://github.com/NVIDIA/apex
    cd apex
    git checkout 22.04-dev
    • Disable the CUDA version check in setup.py (line 36) by commenting it out:

      #if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
      # pass
    • Install dependencies:

      pip install -r requirements.txt
    • Build and install:

      pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
      python setup.py install
    • In apex/amp/_initialize.py, comment out the line from torch._six import string_classes, and replace string_classes with str on line 42. (The torch._six module is no longer available in recent PyTorch releases.)
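The manual edit to _initialize.py can also be scripted. The sketch below is one possible way to do it, not part of Apex itself: patch_initialize_source is a hypothetical helper that comments out the torch._six import and swaps every remaining string_classes for str, operating on the file contents as a string.

```python
import re

# The import line that needs to be disabled.
IMPORT_LINE = "from torch._six import string_classes"


def patch_initialize_source(source: str) -> str:
    """Comment out the torch._six import and replace string_classes with str."""
    patched = []
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip()
        if IMPORT_LINE in stripped:
            # Comment the import out once; leave it alone if already commented.
            if not stripped.startswith("#"):
                indent = line[: len(line) - len(stripped)]
                line = f"{indent}# {stripped}"
            patched.append(line)
        else:
            # Replace whole-word occurrences only.
            patched.append(re.sub(r"\bstring_classes\b", "str", line))
    return "".join(patched)
```

To apply it, read the file with pathlib.Path("apex/amp/_initialize.py").read_text(), pass the text through the function, and write the result back; the relative path assumes the current directory is the Apex checkout. Running the function twice leaves the text unchanged after the first pass.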

  5. Install TensorNVMe:

    cd ~
    git clone https://github.com/hpcaitech/TensorNVMe.git
    cd TensorNVMe
    apt-get update
    apt-get install -y cmake
    pip install -v --no-cache-dir .
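Before starting a training run, it can help to confirm that everything installed above is visible from the active environment. The following is a small sketch using only the standard library; the module names colossalai, flash_attn, apex, and tensornvme are assumed to be the import names of the packages installed in this section, and missing_modules is a hypothetical helper.

```python
import importlib.util

# Assumed import names for the packages installed in this section.
REQUIRED_MODULES = ["torch", "colossalai", "flash_attn", "apex", "tensornvme"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules were found.")
```

This only checks that each module can be located; it does not import it, so a package that is present but was built against the wrong CUDA version would still pass.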

Training the LLaMA2-7B Model

  1. Prepare the training script:

    • Modify /root/ColossalAI/examples/language/llama/scripts/benchmark_7B/gemini.sh:
      #!/bin/bash
      cd /root/ColossalAI/examples/language/llama/scripts/
      export OMP_NUM_THREADS=64
      colossalai run --nproc_per_node 8 benchmark.py -g -x -b 6 -s 10 --shard_param_frac 0
  2. Execute the training script:

    bash /root/ColossalAI/examples/language/llama/scripts/benchmark_7B/gemini.sh
  3. Result Plot: img_training1.png

  4. Try 3D parallelism:

    • dp8+tp1+pp1+zero2:

      colossalai run --nproc_per_node 8 benchmark.py -p 3d -g -x -b 16 -s 10 --tp 4 --zero 2

      img_training2.png

    • dp4+tp1+pp2+zero1:

      colossalai run --nproc_per_node 8 benchmark.py -p 3d -g -x -b 128 -s 10 --pp 2 --zero 1

      img_training3.png
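The dpN+tpN+ppN labels above describe how the 8 processes are divided. As a rough sketch, assuming the data-parallel degree is whatever remains after dividing the process count by the tensor-parallel and pipeline-parallel sizes (the usual convention for hybrid parallelism), the label can be derived as follows. The function names are illustrative and are not part of the Colossal-AI API.

```python
def data_parallel_size(num_gpus: int, tp: int = 1, pp: int = 1) -> int:
    """Number of data-parallel replicas left after tensor and pipeline parallelism."""
    if tp < 1 or pp < 1:
        raise ValueError("tp and pp must be at least 1")
    model_parallel = tp * pp
    if num_gpus % model_parallel != 0:
        raise ValueError(f"{num_gpus} GPUs cannot be split evenly into groups of {model_parallel}")
    return num_gpus // model_parallel


def layout_label(num_gpus: int, tp: int = 1, pp: int = 1, zero: int = 0) -> str:
    """Format a layout in the dp/tp/pp/zero style used in this tutorial."""
    dp = data_parallel_size(num_gpus, tp=tp, pp=pp)
    return f"dp{dp}+tp{tp}+pp{pp}+zero{zero}"


# 8 processes with 2 pipeline stages and ZeRO stage 1.
print(layout_label(8, tp=1, pp=2, zero=1))  # dp4+tp1+pp2+zero1
```

A check like this is a quick way to confirm that the --tp and --pp values passed to benchmark.py produce the layout you intend for a given --nproc_per_node.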


Note: This tutorial was written against the Colossal-AI main branch as of 2024.12.8. If later changes to the main branch cause packages to be missing or installations to fail, install the affected packages manually with pip.