Getting Started with the NVIDIA NeMo Chainguard Image

Get started with the Chainguard Image for NVIDIA's NeMo framework for generative deep learning

NVIDIA NeMo is a deep learning framework for building conversational AI models. It provides standalone module collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) tasks. The NeMo Chainguard Image is a comparatively lightweight NeMo environment with low to no CVEs, making it ideal for both training and production inference. It is designed to work with the CUDA 12 parallel computing platform and is suited to workloads that take advantage of connected GPUs.

What is Deep Learning?

Deep learning is a subset of machine learning that leverages a flexible computational architecture, the neural network, to address a wide variety of tasks. Neural networks emulate the structure of the brain and consist of interconnected nodes (neurons) that each contain an associated weight and threshold. In concert with an activation function, these values determine whether data is propagated within the network, producing an output layer corresponding to a classification, regression, or other result.
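
To make the neuron description concrete, here's a minimal sketch of a single neuron with a simple step-threshold activation, using PyTorch (which is included in the NeMo Chainguard Image). The values are arbitrary and purely illustrative:

import torch

inputs = torch.tensor([0.5, -1.0, 2.0])   # incoming signals
weights = torch.tensor([0.8, 0.2, 0.4])   # learned weights
threshold = 0.9

# Weighted sum of the inputs; the neuron "fires" (propagates a 1)
# only if the sum exceeds its threshold.
weighted_sum = torch.dot(inputs, weights).item()
output = 1.0 if weighted_sum > threshold else 0.0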

By technical convention, a deep neural network (DNN) has at least three layers: an input layer, an output layer, and one or more hidden layers. In practice, DNNs often have many layers.

Deep neural networks underpin many common computational tasks in modern applications, such as speech to text and generative AI.
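
The following sketch defines a small DNN in PyTorch with an input layer, one hidden layer, and an output layer; the layer sizes are arbitrary and chosen only for demonstration:

import torch
import torch.nn as nn

# A minimal deep neural network: input -> hidden -> output.
model = nn.Sequential(
    nn.Linear(16, 32),  # input layer (16 features) to hidden layer
    nn.ReLU(),          # activation function
    nn.Linear(32, 2),   # hidden layer to output layer (2 classes)
)

prediction = model(torch.randn(1, 16))  # forward pass on random input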

In this getting started guide, we will use the NeMo Chainguard Image to generate speech from plain text using models provided by NeMo’s text-to-speech (TTS) collection. In doing so, we’ll compare the security and footprint of the NeMo Chainguard Image to the official runtime image distributed by NVIDIA and consider further approaches and resources for applying the NeMo Chainguard Image to additional tasks in conversational AI.

This guide is primarily designed for use in an environment with access to one or more NVIDIA GPUs. However, NVIDIA NeMo is built on PyTorch Lightning, which supports a variety of accelerators (interfaces to categories of processing units such as CPU, GPU, and TPU) as well as distributed strategies such as Distributed Data Parallel. This tutorial also gives some consideration to alternative computing environments, such as running on CPU.
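
For example, a PyTorch Lightning Trainer, the training interface NeMo builds on, selects its accelerator with a single argument. A minimal sketch:

from pytorch_lightning import Trainer

# "auto" selects the best available accelerator (GPU if present,
# otherwise CPU); explicit values such as "gpu" or "cpu" pin the choice.
trainer = Trainer(accelerator="auto", devices=1, max_epochs=1)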

Prerequisites

If Docker Engine (or Docker Desktop) is not already installed, follow the instructions for installing Docker Engine on your host machine.

To take advantage of connected GPUs, you’ll need to install CUDA Toolkit on your host machine.

Installing CUDA Toolkit

Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by NVIDIA. To take advantage of connected NVIDIA GPUs, you’ll need to follow the setup instructions for your local machine or create a CUDA-enabled instance on a cloud provider.

To set up CUDA on your local machine, follow the installation instructions for Linux or Windows. CUDA is not currently supported on macOS.

Google Cloud Platform provides CUDA-ready deep learning instances, with PyTorch-specific setup instructions. Amazon Web Services also provides CUDA-ready deep learning instances.

This tutorial can be followed without connected GPUs or CUDA Toolkit. To run commands in this tutorial on CPU, omit the --gpus all flag when executing container commands. Keep in mind that some functionality within NeMo (such as training models) will take significantly longer on CPU.

Testing Access to GPUs

We’ll start by running the NeMo Chainguard Image interactively to determine whether the environment has access to connected GPUs.

Use the following command to pull the image, run it with GPU access, and open a shell inside the running container.

docker run -it --rm \
  --gpus all \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  cgr.dev/chainguard/nemo:latest

These options allow access to all available GPUs, allocate 8 GB of shared memory to the container, remove the limit on locked-in-memory address space, and raise the maximum stack size to 64 MB.

Running this command for the first time may take a few minutes, since it will download the NeMo Chainguard Image to your host machine. Once the image is pulled and the command runs successfully, you will be interacting with a bash shell in the running container. Enter the following commands at the prompt to check the availability of your GPU.

$ python
Python 3.11.9 (main, May  1 2024, 21:48:03) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nemo.core import pytorch_lightning
>>> len(pytorch_lightning.accelerators.find_usable_cuda_devices())
1

The above output shows that one GPU is connected and available. Since PyTorch is also accessible within the NeMo Chainguard Image, you can use it to retrieve more granular information on CUDA and attached GPUs.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'Tesla V100-SXM2-16GB'

Once you’ve determined that your environment has access to CUDA and connected GPUs, exit the Python interpreter by typing exit() and pressing Enter (or typing Control-d), then exit the container shell. You should be returned to the prompt of your host machine.

NeMo Overview

NeMo is a generative AI toolkit and framework with a focus on conversational AI tasks such as NLP, ASR, and TTS, as well as large language models (LLM) and multimodal (MM) models. NeMo uses a system of neural modules, an abstraction over a variety of common elements in model training and inference such as encoders, decoders, loss functions, layers, or models. NeMo also provides collections of modules targeting specific areas of concern in conversational and generative AI, such as LLMs, speech AI / NLP, and TTS.
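
As an illustration of this structure, modules in each collection can be imported and used directly. The following sketch loads one of NeMo’s published pretrained ASR checkpoints; the model name is only an example, and the audio path is hypothetical:

import nemo.collections.asr as nemo_asr

# Download and instantiate a pretrained ASR model from the NGC catalog.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Transcribe a local audio file (path is hypothetical).
print(asr_model.transcribe(["/path/to/audio.wav"]))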

NeMo is built on PyTorch Lightning, a high-level interface to PyTorch with a focus on scalability, and uses the Hydra library for configuration management.
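
Hydra appears in NeMo mostly through YAML-driven configuration of training and inference scripts. As a generic (not NeMo-specific) sketch of the pattern, a Hydra entry point composes a YAML config and hands it to the script:

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Hydra composes conf/config.yaml (plus any command-line
    # overrides) into a single structured config object.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()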

Since NeMo is a framework with many collections of modules suitable for a wide variety of projects, we’ve chosen an example task, generative text to speech, that requires two models from the TTS collection. This is representative of a task that might be run as part of a larger production application.

Text to Speech (TTS) Example

In this section, we’ll run a script that uses the NeMo Chainguard Image to do the following (a sketch of the pipeline appears after this list):

  • Start with a message in plain text
  • Transform it into a set of phonemes
  • Generate a spectrogram (a time-frequency representation of the audio) using a NeMo-provided spectrogram model
  • Transform the spectrogram into audio using a NeMo-provided vocoder model
  • Write the resulting audio to a .wav file at a set rate
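
The tts.py script downloaded below carries out these steps. As a rough sketch of what such a pipeline looks like, assuming the FastPitch spectrogram model and HiFi-GAN vocoder (the script’s actual model choices may differ):

import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Load pretrained models from the NGC catalog (model choices assumed).
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")

# Tokenize the text, generate a spectrogram, then synthesize audio.
tokens = spec_generator.parse("Hello, Chainguard!")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Write the audio to disk at a 22.05 kHz sample rate.
sf.write("test.wav", audio.to("cpu").detach().numpy()[0], 22050)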

First, let’s create a folder to work in on your host machine:

mkdir -p ~/nemo-tts && cd ~/nemo-tts

Next, let’s download our tts.py script:

curl https://raw.githubusercontent.com/chainguard-dev/nemo-examples/main/tts.py > tts.py

You should now be in a working directory containing only one file, tts.py.

We’ll be mounting this folder in our container as a volume, which will allow us to both pass in our script and extract our output.

We’ll now start a container based on our NeMo Chainguard Image, mount the current working directory containing our tts.py script inside the container as a volume, and run the script in the container:

docker run -it --rm \
  --gpus all \
  --user root \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v $PWD:/home/nonroot/nemo-test \
  cgr.dev/chainguard/nemo:latest \
  -c "python /home/nonroot/nemo-test/tts.py"

Note that we ran the above container as root. This allows the script and output .wav file to be shared between the host and container through the mounted volume. Remember not to run your image as root in a production environment.

If your host machine does not have attached GPUs and you’d like to run the above on your CPU, omit the --gpus all \ line. The script tests for the availability of the CUDA platform and falls back to the CPU accelerator if CUDA is not detected, so it will also function on CPU.
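
That fallback presumably resembles the following check (a sketch, not the script’s exact code):

import torch

# Select the accelerator based on whether CUDA devices are visible.
accelerator = "gpu" if torch.cuda.is_available() else "cpu"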

Since we’re using pretrained models to perform text to speech, this example takes only a few minutes even on CPU. However, other tasks such as model training and finetuning may take significantly longer without connected GPUs.

Note that NeMo collections are large, and initial imports can take up to a minute depending on your environment. The script may appear to hang during that time.

After imports are complete, you should see a large amount of output as NeMo pulls models and works through the steps in the script (tokenizing, generating a spectrogram, generating audio, and writing audio to disk). On completion, the script outputs a test.wav file. Because we mounted a volume, this file should now be present in the working directory of your host machine.

ls
test.wav  tts.py

The test.wav file should contain the synthesized speech generated by the script:

(Audio sample: output from the TTS script)

Final Considerations and Next Steps

This section will consider next steps for applying the NVIDIA NeMo Chainguard Image to other tasks in conversational AI.

In the tts.py script run above, we used two models provided by NeMo, both contained within the TTS collection.

The first model converts plain text into a spectrogram, a time-frequency representation of the audio; the second generates audio from that spectrogram. Note that NVIDIA’s model overview pages provide useful background information, tags, and sample code. You can search the full NGC model catalog to find pretrained models for use with NeMo.
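
To see which pretrained checkpoints a given model class can download, NeMo model classes expose a list_available_models class method. A quick sketch using the TTS collection’s FastPitchModel class (one possible choice of spectrogram model):

from nemo.collections.tts.models import FastPitchModel

# Each entry describes a pretrained checkpoint hosted on NGC.
for model_info in FastPitchModel.list_available_models():
    print(model_info.pretrained_model_name)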

In this script, we used pretrained models to create the phonemes and audio output. These models can be finetuned with your own speech data to customize the results. NVIDIA hosts a tutorial on finetuning TTS models with NeMo.

The following resources may give a starting point for further explorations with the NeMo Chainguard Image:

  • NVIDIA provides a wide variety of NeMo Tutorials that serve as a strong entry point for working with the framework on specific tasks.
  • NVIDIA’s NeMo Playbooks provide a basis for more advanced tasks and configurations and address running workloads on different platforms and orchestration tooling.
  • The NeMo Collections documentation organizes reference material for NeMo collections and modules.
  • The NVIDIA NGC model catalog can be searched to find models suitable for specific tasks, and each model’s overview page provides a useful reference with sample code.
  • The NVIDIA Conversational AI publications page collects papers that use the NeMo framework, showcasing cutting-edge generative deep learning research.