How to Create a Full Model Optimization Pipeline with NVIDIA Model Optimizer and FastNAS Pruning Plus Fine-Tuning

Posted on April 7, 2026 by Mark Harrell

If you have ever trained a deep learning model and thought, “this thing is way too heavy to actually run anywhere useful,” you are not alone. Most models that come out of training look great on a benchmark but become a headache the moment you try to run them on real hardware. They are slow, power-hungry, and use more memory than most deployment environments can handle.

That is exactly the problem that model optimization solves. And with NVIDIA's Model Optimizer library and a technique called FastNAS pruning, you can go from a big, bloated model to a lean, deployment-ready one without throwing away all the accuracy you worked so hard to train.

This guide walks you through the entire pipeline, step by step, from training a baseline model on CIFAR-10 to pruning it with FastNAS and fine-tuning it back to a solid accuracy level. All of it runs in Google Colab. No fancy hardware setup required.

What Even Is Model Optimization and Why Should You Care

Let's start from scratch. When you train a neural network, you are basically teaching it to recognize patterns by adjusting millions of small numbers called weights. A bigger network with more weights can often learn more, but it also costs more to run.

Running a large model at inference time (meaning when you actually use it to make predictions) can be expensive in several ways:

It needs more GPU or CPU memory
It runs more slowly
It consumes more energy
It is harder to deploy on devices with limited resources like smartphones, edge devices, or embedded systems

Model optimization is the process of making a trained model faster and smaller without losing too much of what it learned. There are different ways to do this, and pruning is one of the most popular.

What Is Pruning

Think of pruning like trimming a tree. A tree with too many branches is heavy and harder to manage. You cut the branches that are not contributing much, and the tree grows back healthier and more efficient.

In neural networks, pruning removes filters, channels, or weights that contribute the least to the model's output. You end up with a smaller network that is faster and cheaper to run, while still being reasonably accurate.

The challenge is figuring out which parts to prune. Remove the wrong ones and your model falls apart. Remove the right ones and you barely notice the difference.
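To make "contributes the least" concrete, here is a minimal sketch of one common heuristic: rank filters by the L1 norm of their weights and keep the largest. It only illustrates the general idea and is not how FastNAS scores sub-networks. The filter values and the helper name `keep_strongest_filters` are made up for the example.

```python
def l1_norm(weights):
    """Sum of absolute values of a filter's weights."""
    return sum(abs(w) for w in weights)


def keep_strongest_filters(filters, keep):
    """Return indices of the `keep` filters with the largest L1 norm, in original order."""
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]), reverse=True)
    return sorted(ranked[:keep])


# Four toy "filters", each flattened to a short list of weights (hypothetical values).
filters = [
    [0.9, -0.8, 0.7],     # strong filter
    [0.01, 0.02, -0.01],  # nearly dead filter
    [0.5, 0.4, -0.6],
    [0.05, -0.03, 0.02],  # weak filter
]

kept = keep_strongest_filters(filters, keep=2)
print(kept)  # [0, 2]
```

Filters 1 and 3 have tiny weights, so dropping them changes the layer's output very little. That is the intuition behind removing the "right" parts.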

Where FastNAS Comes In

FastNAS is a specific pruning strategy from NVIDIA that makes this process smarter. Instead of guessing which filters to remove, it does a structured search across the possible architectures that can be derived from your original model.

It scores different sub-networks (smaller versions of your original model) using actual validation data. Then it picks the one that gives the best balance between accuracy and compute cost, measured in FLOPs (floating point operations, a count of how much arithmetic one forward pass requires; not to be confused with FLOPS, which means operations per second).

You give it a FLOPs budget, and it finds the best model that fits inside that budget. That is the core idea.
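The selection step can be pictured with a toy sketch. It assumes a short, already-scored list of candidates, whereas the real search explores a much larger space and handles scoring itself. The candidate names, FLOPs, and scores below are invented.

```python
def pick_best_under_budget(candidates, flops_budget):
    """Return the highest-scoring candidate whose FLOPs fit the budget, or None."""
    feasible = [c for c in candidates if c["flops"] <= flops_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["score"])


# Hypothetical sub-networks: FLOPs and validation accuracy are invented numbers.
candidates = [
    {"name": "full",   "flops": 80e6, "score": 91.0},
    {"name": "medium", "flops": 58e6, "score": 89.5},
    {"name": "small",  "flops": 40e6, "score": 86.0},
    {"name": "tiny",   "flops": 20e6, "score": 74.0},
]

best = pick_best_under_budget(candidates, flops_budget=60e6)
print(best["name"])  # medium
```

The full model scores highest but does not fit the budget, so the best feasible candidate wins.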

Setting Up the Environment

Before writing any model code, you need to install the right packages. The NVIDIA Model Optimizer library is called nvidia-modelopt, and it integrates directly with PyTorch.

# Install required packages
pip install -q nvidia-modelopt torchvision torchprofile tqdm

Once installed, you bring in the core imports. These include PyTorch, torchvision for the dataset and model components, and the modelopt library itself.

import math
import os
import random
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from torch.utils.data import DataLoader, Subset
from torchvision.models.resnet import BasicBlock
from tqdm.auto import tqdm

import modelopt.torch.opt as mto
import modelopt.torch.prune as mtp

Fixing Seeds for Reproducibility

Reproducibility is one of those things that sounds boring until you are three hours into debugging why your model gives different results every run. Fixing your random seeds means that if you run the same code twice, you get the same results.

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

Key Experiment Settings

These parameters control the scale of the experiment. The FAST_MODE flag is useful when you want to run a quick test to make sure everything works before committing to a full training run.

FAST_MODE = True
batch_size = 256 if FAST_MODE else 512
baseline_epochs = 20 if FAST_MODE else 120
finetune_epochs = 12 if FAST_MODE else 120
train_subset_size = 12000 if FAST_MODE else None
val_subset_size = 2000 if FAST_MODE else None
test_subset_size = 4000 if FAST_MODE else None

# Target FLOPs for the pruned model
target_flops = 60e6

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

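Setting target_flops = 60e6 means you are asking FastNAS to find a sub-network that needs 60 million FLOPs or fewer per forward pass. This is your compute budget. The budget only forces pruning if it is lower than the cost of the unpruned model as measured by the library's profiler, so check the baseline's reported FLOPs before settling on a number.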
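To get a feel for where a FLOPs number comes from, the sketch below counts multiply-accumulate operations (MACs) for a single convolution. Profilers differ on whether one MAC counts as one or two FLOPs, so treat this as a rough estimate and not as the figure Model Optimizer will report. The helper name `conv2d_macs` is an assumption for this example.

```python
def conv2d_macs(in_channels, out_channels, kernel_size, out_height, out_width):
    """Multiply-accumulate count for a bias-free 2D convolution (groups=1)."""
    per_output_value = in_channels * kernel_size * kernel_size
    return per_output_value * out_channels * out_height * out_width


# First layer of the ResNet20 used later in this guide: 3 -> 16 channels, 3x3 kernel,
# stride 1 and padding 1, so a 32x32 input gives a 32x32 output.
first_conv = conv2d_macs(3, 16, 3, 32, 32)
print(first_conv)  # 442368

# Halving both the input and output channels of a layer cuts its cost to a quarter.
print(conv2d_macs(16, 16, 3, 32, 32))  # 2359296
print(conv2d_macs(8, 8, 3, 32, 32))    # 589824
```

Because cost scales with the product of input and output channels, removing channels is an effective way to hit a compute budget.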

Building the CIFAR-10 Data Pipeline

CIFAR-10 is a classic image classification dataset with 60,000 images across 10 categories. The categories include things like airplanes, dogs, ships, and trucks, all at a tiny 32×32 pixel resolution. It is a perfect benchmark for testing models without needing huge compute.

Data Augmentation and Normalization

When training, you want to augment your data to help the model generalize better. Random flips and crops create slightly different versions of each image, which gives the model more variety to learn from.

def build_cifar10_loaders(
    train_batch_size=256,
    train_subset_size=None,
    val_subset_size=None,
    test_subset_size=None,
):
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616],
    )

    train_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, padding=4),
        normalize,
    ])

    eval_transform = transforms.Compose([
        transforms.ToTensor(),
        normalize,
    ])

The normalization values are the channel-wise mean and standard deviation for the CIFAR-10 dataset. Using these keeps the pixel values in a range that neural networks train better on.
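As a quick illustration of what the normalization does, this dependency-free sketch applies the same per-channel formula, (value - mean) / std, to single RGB pixels already scaled to the [0, 1] range. The helper name `normalize_pixel` is hypothetical.

```python
CIFAR10_MEAN = [0.4914, 0.4822, 0.4465]
CIFAR10_STD = [0.2470, 0.2435, 0.2616]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, CIFAR10_MEAN, CIFAR10_STD)]


black = normalize_pixel([0.0, 0.0, 0.0])
white = normalize_pixel([1.0, 1.0, 1.0])
average = normalize_pixel(CIFAR10_MEAN)

print([round(v, 2) for v in black])  # negative values, roughly two std below the mean
print([round(v, 2) for v in white])  # positive values, roughly two std above the mean
print(average)                       # [0.0, 0.0, 0.0]
```

After normalization, an average-colored pixel maps to zero and the extremes land at roughly plus or minus two, a range that is easier to optimize over than raw pixel intensities.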

Splitting the Dataset

CIFAR-10 comes with 50,000 training images and 10,000 test images. You split the training data further into a train set and a validation set. The validation set lets you monitor performance during training without touching the test set.

    train_full = torchvision.datasets.CIFAR10(
        root="./data", train=True, transform=train_transform, download=True
    )
    val_full = torchvision.datasets.CIFAR10(
        root="./data", train=True, transform=eval_transform, download=True
    )
    test_full = torchvision.datasets.CIFAR10(
        root="./data", train=False, transform=eval_transform, download=True
    )

    ids = np.arange(len(train_full))
    np.random.shuffle(ids)
    n_train = int(len(train_full) * 0.9)
    train_ids = ids[:n_train]
    val_ids = ids[n_train:]

    # Optional subsetting for faster experiments
    if train_subset_size is not None:
        train_ids = train_ids[:min(train_subset_size, len(train_ids))]
    if val_subset_size is not None:
        val_ids = val_ids[:min(val_subset_size, len(val_ids))]
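The splitting logic can be checked in isolation. This sketch mirrors the 90/10 split but uses Python's built-in random module instead of NumPy so that it runs without extra dependencies. The function name `split_indices` is an assumption.

```python
import random


def split_indices(n_items, train_fraction=0.9, seed=42):
    """Shuffle 0..n_items-1 with a fixed seed and split into train and validation ids."""
    ids = list(range(n_items))
    random.Random(seed).shuffle(ids)
    n_train = int(n_items * train_fraction)
    return ids[:n_train], ids[n_train:]


train_ids, val_ids = split_indices(50_000)
print(len(train_ids), len(val_ids))  # 45000 5000

# No image appears in both sets, and a second call with the same seed gives the same split.
assert not set(train_ids) & set(val_ids)
assert split_indices(50_000) == (train_ids, val_ids)
```

Keeping the two index sets disjoint is what makes the validation score an honest estimate, and the fixed seed keeps the split stable between runs.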

Creating the DataLoaders

DataLoaders handle the batching, shuffling, and parallelism involved in feeding data to your model during training.

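    num_workers = min(2, os.cpu_count() or 1)

    train_loader = DataLoader(
        Subset(train_full, train_ids.tolist()),
        batch_size=train_batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )

    val_loader = DataLoader(
        Subset(val_full, val_ids.tolist()),
        batch_size=512,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )

    # Test loader: the full test split, or its first test_subset_size images
    test_ids = np.arange(len(test_full))
    if test_subset_size is not None:
        test_ids = test_ids[:min(test_subset_size, len(test_ids))]

    test_loader = DataLoader(
        Subset(test_full, test_ids.tolist()),
        batch_size=512,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )

    return train_loader, val_loader, test_loader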

train_loader, val_loader, test_loader = build_cifar10_loaders(
    train_batch_size=batch_size,
    train_subset_size=train_subset_size,
    val_subset_size=val_subset_size,
    test_subset_size=test_subset_size,
)

Defining the ResNet20 Architecture

ResNet (Residual Network) is one of the most well-known architectures in deep learning. The key idea is the skip connection, where the input to a layer gets added to its output. This helps with training deeper networks because gradients can flow through both paths during backpropagation.

ResNet20 is a smaller version of ResNet specifically designed for CIFAR-10. It has just 20 layers, which makes it fast to train while still being expressive enough to achieve solid accuracy.

Custom Weight Initialization

Kaiming normal initialization sets up the weights in a way that keeps the variance of activations stable across layers. This leads to faster, more stable training.
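For intuition about the scale involved, the sketch below computes the standard deviation that Kaiming normal initialization would use for a convolution. It assumes the PyTorch defaults for nn.init.kaiming_normal_ (fan-in mode and a gain of sqrt(2)). The helper name `kaiming_normal_std` is made up for illustration.

```python
import math


def kaiming_normal_std(in_channels, kernel_size, gain=math.sqrt(2.0)):
    """Standard deviation for fan-in Kaiming normal init of a conv layer.

    fan_in is the number of inputs feeding each output value. The default gain
    of sqrt(2) is the value used for ReLU-style activations.
    """
    fan_in = in_channels * kernel_size * kernel_size
    return gain / math.sqrt(fan_in)


# 3x3 convolutions with 16, 32, and 64 input channels, as in the three ResNet20 stages.
for channels in (16, 32, 64):
    print(channels, round(kaiming_normal_std(channels, 3), 4))
```

Layers with more inputs get proportionally smaller initial weights, which is how the variance of the activations stays roughly constant from layer to layer.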

def _weights_init(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(m.weight)

class LambdaLayer(nn.Module):
    def __init__(self, lambd):
        super().__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)

Building the ResNet Architecture

class ResNet(nn.Module):
    def __init__(self, num_blocks, num_classes=10):
        super().__init__()
        self.in_planes = 16

        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            self._make_layer(16, num_blocks, stride=1),
            self._make_layer(32, num_blocks, stride=2),
            self._make_layer(64, num_blocks, stride=2),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )
        self.apply(_weights_init)

    def _make_layer(self, planes, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            downsample = None
            if s != 1 or self.in_planes != planes:
                downsample = LambdaLayer(
                    lambda x: F.pad(
                        x[:, :, ::2, ::2],
                        (0, 0, 0, 0, planes // 4, planes // 4),
                        "constant",
                        0,
                    )
                )
            layers.append(BasicBlock(self.in_planes, planes, s, downsample))
            self.in_planes = planes
        return nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

def resnet20():
    return ResNet(num_blocks=3).to(device)

The LambdaLayer handles the shortcut connection when the dimensions change between blocks. This is the CIFAR-adapted version of the classic ResNet shortcut, which uses zero-padding instead of a 1×1 convolution for efficiency.
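A small shape calculation shows why the zero-padding shortcut lines up with the main branch. This sketch only tracks tensor shapes and involves no PyTorch. It mirrors the slicing `x[:, :, ::2, ::2]` and the channel padding of `planes // 4` on each side. The function name `shortcut_output_shape` is hypothetical.

```python
def shortcut_output_shape(input_shape, planes):
    """Shape produced by the zero-padding shortcut for an (N, C, H, W) input.

    Mirrors x[:, :, ::2, ::2] followed by padding planes // 4 channels on each side.
    """
    n, c, h, w = input_shape
    h_out = (h + 1) // 2           # every second row
    w_out = (w + 1) // 2           # every second column
    c_out = c + 2 * (planes // 4)  # zero channels added before and after
    return (n, c_out, h_out, w_out)


# Stage transitions in ResNet20: 16 -> 32 channels, then 32 -> 64 channels.
print(shortcut_output_shape((1, 16, 32, 32), planes=32))  # (1, 32, 16, 16)
print(shortcut_output_shape((1, 32, 16, 16), planes=64))  # (1, 64, 8, 8)
```

In both transitions the shortcut output matches the shape of the strided convolution branch, so the two can be added without any learned parameters.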

Writing the Training Loop

A solid training loop does several things: it runs the model on batches of data, computes the loss, updates the weights using backpropagation, and tracks the best model so you can restore it later.

The Learning Rate Scheduler

A cosine learning rate schedule gradually decreases the learning rate over training. The warmup phase at the start gives the model time to settle before the learning rate hits its peak.

class CosineLRwithWarmup(torch.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, warmup_steps, decay_steps, warmup_lr=0.0, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.warmup_lr = warmup_lr
        self.decay_steps = max(decay_steps, 1)
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        if self.last_epoch < self.warmup_steps:
            return [
                (base_lr - self.warmup_lr) * self.last_epoch / max(self.warmup_steps, 1) + self.warmup_lr
                for base_lr in self.base_lrs
            ]
        current_steps = self.last_epoch - self.warmup_steps
        return [
            0.5 * base_lr * (1 + math.cos(math.pi * current_steps / self.decay_steps))
            for base_lr in self.base_lrs
        ]
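To see the shape of the schedule without running a training job, the same formula can be written as a plain function and evaluated at a few steps. This is a standalone sketch of the arithmetic in get_lr, and the step counts below are illustrative numbers.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, decay_steps, warmup_lr=0.0):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (base_lr - warmup_lr) * step / max(warmup_steps, 1) + warmup_lr
    progress = (step - warmup_steps) / decay_steps
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


# Illustrative numbers: peak rate 0.2, 100 warmup steps, 1000 decay steps.
for step in (0, 50, 100, 600, 1100):
    print(step, round(lr_at_step(step, 0.2, 100, 1000), 4))
```

The rate climbs linearly to 0.2 at step 100, falls to half of that midway through the decay, and reaches zero at step 1100. In train_model, decay_steps is set to the total number of training steps while the cosine phase only begins after warmup, so the rate ends slightly above zero.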

The Core Training Functions

def train_one_epoch(model, loader, optimizer, scheduler, loss_fn=None):
    # Fall back to cross-entropy when no custom loss is supplied
    loss_fn = loss_fn or F.cross_entropy
    model.train()
    running_loss, total = 0.0, 0
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        outputs = model(images)
        loss = loss_fn(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        scheduler.step()  # the scheduler advances once per batch, not once per epoch
        running_loss += loss.item() * labels.size(0)
        total += labels.size(0)
    # Average loss per sample over the epoch
    return running_loss / max(total, 1)

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / max(total, 1)

The Full Training Pipeline

def train_model(model, train_loader, val_loader, epochs, ckpt_path, lr=None, weight_decay=1e-4):
    if lr is None:
        lr = 0.1 * batch_size / 128
    steps_per_epoch = len(train_loader)
    warmup_steps = max(1, 2 * steps_per_epoch if FAST_MODE else 5 * steps_per_epoch)
    decay_steps = max(1, epochs * steps_per_epoch)

    optimizer = torch.optim.SGD(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=lr, momentum=0.9, weight_decay=weight_decay,
    )
    scheduler = CosineLRwithWarmup(optimizer, warmup_steps, decay_steps)

    best_val = -1.0
    for epoch in tqdm(range(1, epochs + 1)):
        train_loss = train_one_epoch(model, train_loader, optimizer, scheduler)
        val_acc = evaluate(model, val_loader)
        if val_acc >= best_val:
            best_val = val_acc
            torch.save(model.state_dict(), ckpt_path)
        if epoch % max(1, epochs // 4) == 0 or epoch == epochs:
            print(f"Epoch {epoch:03d} | loss={train_loss:.4f} | val_acc={val_acc:.2f}%")

    model.load_state_dict(torch.load(ckpt_path, map_location=device))
    print(f"Best val accuracy: {best_val:.2f}%")
    return model, best_val

This function wraps everything together: optimizer setup, scheduling, training, validation, and checkpoint saving. The best_val tracking ensures you always restore the best version of the model, not just the last one.
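The numbers that train_model derives are easy to work out by hand. The sketch below repeats the same formulas for the fast-mode settings, assuming the DataLoader keeps the last partial batch (the default), so that its length is the ceiling of samples over batch size. The helper name `schedule_sizes` is an assumption.

```python
import math


def schedule_sizes(n_train, batch_size, epochs, warmup_epochs):
    """Peak learning rate and step counts, following the formulas in train_model."""
    lr = 0.1 * batch_size / 128                        # linear scaling with batch size
    steps_per_epoch = math.ceil(n_train / batch_size)  # DataLoader length without drop_last
    warmup_steps = max(1, warmup_epochs * steps_per_epoch)
    decay_steps = max(1, epochs * steps_per_epoch)
    return lr, steps_per_epoch, warmup_steps, decay_steps


# Fast-mode settings from this guide: 12,000 training images, batch size 256, 20 epochs,
# and a 2-epoch warmup.
print(schedule_sizes(12_000, 256, epochs=20, warmup_epochs=2))  # (0.2, 47, 94, 940)
```

So in fast mode the scheduler is stepped 940 times in total, and the first 94 of those steps are warmup.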

Training the Baseline Model

Now you put everything together and train the full model from scratch. This is your reference point, the model before any pruning happens.

baseline_model = resnet20()
baseline_ckpt = "resnet20_baseline.pth"

start = time.time()
baseline_model, baseline_val = train_model(
    baseline_model,
    train_loader,
    val_loader,
    epochs=baseline_epochs,
    ckpt_path=baseline_ckpt,
    lr=0.1 * batch_size / 128,
    weight_decay=1e-4,
)

baseline_test = evaluate(baseline_model, test_loader)
baseline_time = time.time() - start

print(f"Baseline validation accuracy: {baseline_val:.2f}%")
print(f"Baseline test accuracy:       {baseline_test:.2f}%")
print(f"Baseline training time:       {baseline_time/60:.2f} min")

When this finishes, you have a working ResNet20 with a solid accuracy number. In fast mode with 20 epochs, you can expect somewhere around 80 to 85 percent test accuracy on the subset. With full training at 120 epochs, it gets significantly higher.

This checkpoint is the starting point for pruning.

Applying FastNAS Pruning

Here is where the real optimization work happens. FastNAS performs a structured search over possible sub-networks of your trained model, guided by a scoring function and constrained by your FLOPs budget.

Configuring FastNAS

fastnas_cfg = mtp.fastnas.FastNASConfig()
fastnas_cfg["nn.Conv2d"]["*"]["channel_divisor"] = 16
fastnas_cfg["nn.BatchNorm2d"]["*"]["feature_divisor"] = 16

The channel_divisor and feature_divisor settings control how granular the search is. Setting them to 16 means the number of channels in each layer can only be pruned in multiples of 16. This keeps the resulting architecture hardware-friendly.
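The effect of the divisor on the search space can be sketched as follows. This assumes the candidate widths for a layer are simply the multiples of the divisor up to the original width. The library's exact candidate set may differ, and `channel_choices` is a hypothetical helper.

```python
def channel_choices(original_channels, divisor):
    """Multiples of `divisor` up to the original width (assumed candidate set)."""
    return list(range(divisor, original_channels + 1, divisor))


# Widths of the three ResNet20 stages.
for width in (16, 32, 64):
    print(width, channel_choices(width, divisor=16))
```

Under that assumption the 64-channel layers have four possible widths and the 32-channel layers have two, while the 16-channel layers cannot shrink at all. A smaller divisor would give the search more options at the cost of less regular layer shapes.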

Fixing a FLOPs Profiling Compatibility Issue

NVIDIA Model Optimizer uses torchprofile to measure FLOPs. Depending on your version, you might hit an attribute error. This patch prevents that:

import torchprofile.profile as tp_profile
from torchprofile.handlers import HANDLER_MAP

if not hasattr(tp_profile, "handlers"):
    tp_profile.handlers = tuple(
        (tuple([op_name]), handler)
        for op_name, handler in HANDLER_MAP.items()
    )

This is one of those things that is not in the official docs but will save you 20 minutes of frustrated Googling.

Running the Pruning Search

dummy_input = torch.randn(1, 3, 32, 32, device=device)

def score_func(model):
    return evaluate(model, val_loader)

model_for_prune = resnet20()
model_for_prune.load_state_dict(torch.load(baseline_ckpt, map_location=device))

search_ckpt = "modelopt_search_checkpoint_fastnas.pth"
pruned_ckpt = "modelopt_pruned_model_fastnas.pth"

prune_start = time.time()

pruned_model, pruned_metadata = mtp.prune(
    model=model_for_prune,
    mode=[("fastnas", fastnas_cfg)],
    constraints={"flops": target_flops},
    dummy_input=dummy_input,
    config={
        "data_loader": train_loader,
        "score_func": score_func,
        "checkpoint": search_ckpt,
    },
)

mto.save(pruned_model, pruned_ckpt)

prune_elapsed = time.time() - prune_start
pruned_test_before_ft = evaluate(pruned_model, test_loader)

print(f"Pruned model test accuracy before fine-tune: {pruned_test_before_ft:.2f}%")
print(f"Pruning/search time: {prune_elapsed/60:.2f} min")

A few things to note here:

You load the baseline model weights before pruning. FastNAS needs a trained model to score sub-networks meaningfully.
The score_func is a function that takes a model and returns a score. Here you use validation accuracy, which is the most direct measure of how good a sub-network is.
The dummy_input is just a sample tensor that helps the library profile FLOPs before the search starts.
mto.save saves the pruned model in NVIDIA's format, which preserves the model's pruned structure for later restoration.

After this step, the model's accuracy will typically drop a bit. That is expected. The fine-tuning step brings it back up.

Restoring and Fine-Tuning the Pruned Model

The mto.save and mto.restore pair is important because the pruned model has a modified architecture. You cannot just load it with standard PyTorch state dict loading. You need to restore it properly.

restored_pruned_model = resnet20()
restored_pruned_model = mto.restore(restored_pruned_model, pruned_ckpt)

restored_test = evaluate(restored_pruned_model, test_loader)
print(f"Restored pruned model test accuracy: {restored_test:.2f}%")

Running Fine-Tuning

Fine-tuning re-trains the pruned model for a few more epochs. The key difference from the original training is the lower starting learning rate. Since the model already has some knowledge baked in from the baseline training, you do not want to overwrite it with large gradient updates.

finetune_ckpt = "resnet20_pruned_finetuned.pth"
start_ft = time.time()

restored_pruned_model, pruned_val_after_ft = train_model(
    restored_pruned_model,
    train_loader,
    val_loader,
    epochs=finetune_epochs,
    ckpt_path=finetune_ckpt,
    lr=0.05 * batch_size / 128,
    weight_decay=1e-4,
)

pruned_test_after_ft = evaluate(restored_pruned_model, test_loader)
ft_time = time.time() - start_ft

print(f"Fine-tuned pruned validation accuracy: {pruned_val_after_ft:.2f}%")
print(f"Fine-tuned pruned test accuracy:       {pruned_test_after_ft:.2f}%")
print(f"Fine-tuning time:                      {ft_time/60:.2f} min")

Half the base learning rate is a reasonable starting point for fine-tuning. The cosine scheduler will still anneal the rate over the fine-tuning epochs, so the model adapts smoothly.

Comparing and Saving the Results

At the end of all of this, you want to see how the pruned model stacks up against the baseline. Two numbers matter most: accuracy and parameter count.

def count_params(model):
    return sum(p.numel() for p in model.parameters())

def count_nonzero_params(model):
    return sum((p.detach() != 0).sum().item() for p in model.parameters())

baseline_params = count_params(baseline_model)
pruned_params = count_params(restored_pruned_model)

print("=" * 60)
print("FINAL SUMMARY")
print("=" * 60)
print(f"Baseline test accuracy:                 {baseline_test:.2f}%")
print(f"Pruned test accuracy before finetune:   {pruned_test_before_ft:.2f}%")
print(f"Pruned test accuracy after finetune:    {pruned_test_after_ft:.2f}%")
print("-" * 60)
print(f"Baseline total params:                  {baseline_params:,}")
print(f"Pruned total params:                    {pruned_params:,}")
print("-" * 60)
print(f"Baseline train time:                    {baseline_time/60:.2f} min")
print(f"Pruning/search time:                    {prune_elapsed/60:.2f} min")
print(f"Pruned finetune time:                   {ft_time/60:.2f} min")
print("=" * 60)
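Raw counts are easier to compare once they are turned into relative changes. The sketch below uses placeholder numbers only, and they are not results from this pipeline. Substitute the values your own run prints. The helper name `relative_reduction` is an assumption.

```python
def relative_reduction(baseline, pruned):
    """Percentage by which `pruned` is smaller than `baseline`."""
    return 100.0 * (baseline - pruned) / baseline


# Placeholder numbers for illustration only; substitute the values printed by your run.
baseline_params_example = 270_000
pruned_params_example = 180_000
baseline_acc_example = 85.0
pruned_acc_example = 84.0

print(f"Params removed: {relative_reduction(baseline_params_example, pruned_params_example):.1f}%")
print(f"Accuracy change: {pruned_acc_example - baseline_acc_example:+.1f} points")
```

Reporting the size reduction next to the accuracy change in points makes the trade-off visible at a glance.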

Saving Your Final Artifacts

torch.save(baseline_model.state_dict(), "baseline_resnet20_final_state_dict.pth")
mto.save(restored_pruned_model, "pruned_resnet20_final_modelopt.pth")

print("Saved files:")
print(" - baseline_resnet20_final_state_dict.pth")
print(" - modelopt_pruned_model_fastnas.pth")
print(" - pruned_resnet20_final_modelopt.pth")
print(" - modelopt_search_checkpoint_fastnas.pth")

Keep all of these files. The baseline is useful for comparison, and the fine-tuned pruned model (pruned_resnet20_final_modelopt.pth) is what you actually deploy.

Validating Predictions on Real Samples

It is always a good idea to inspect a few individual predictions rather than just looking at aggregate accuracy numbers. This function grabs a batch from the test loader and prints the predicted versus actual labels.

@torch.no_grad()
def show_sample_predictions(model, loader, n=8):
    model.eval()
    class_names = [
        "airplane", "automobile", "bird", "cat", "deer",
        "dog", "frog", "horse", "ship", "truck"
    ]
    images, labels = next(iter(loader))
    images = images[:n].to(device)
    labels = labels[:n]
    preds = model(images).argmax(dim=1).cpu()
    print("\nSample predictions:")
    for i in range(len(preds)):
        status = "OK" if preds[i] == labels[i] else "WRONG"
        print(f"{i:02d} | pred={class_names[preds[i]]:<12} | true={class_names[labels[i]]} | {status}")

show_sample_predictions(restored_pruned_model, test_loader, n=8)

Seeing the model correctly tell airplanes from horses after pruning is genuinely satisfying.

What You Actually Learned Here

Let's take a step back and look at what this pipeline taught you beyond just the code.

Pruning Is Not Destruction

A lot of people assume pruning will ruin a model. But structured pruning, done correctly with a good scoring function and real validation data guiding the search, is surprisingly gentle. You trim out the parts the model barely uses, and what remains is often nearly as capable.

FLOPs Are a Useful Proxy for Speed

FLOPs do not perfectly predict inference time on every piece of hardware, but they are a reliable proxy. A model with half the FLOPs will generally be faster. Combined with structured pruning (which removes entire channels rather than individual weights), the speedup is real and measurable.
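Since FLOPs are only a proxy, it is worth timing the models directly. The sketch below is a generic wall-clock harness built on the standard library. The two workloads are stand-ins for a forward pass of the baseline and the pruned model, and the helper name `median_latency_ms` is an assumption.

```python
import time


def median_latency_ms(fn, warmup=3, repeats=20):
    """Median wall-clock time of calling fn(), in milliseconds."""
    for _ in range(warmup):      # let caches and lazy initialization settle
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]


# Stand-in workloads; in practice fn would run one forward pass of each model.
def heavy():
    return sum(i * i for i in range(200_000))


def light():
    return sum(i * i for i in range(50_000))


print(f"heavy: {median_latency_ms(heavy):.2f} ms")
print(f"light: {median_latency_ms(light):.2f} ms")
```

When timing a model on a GPU, the function being timed should also wait for the device to finish (for example with torch.cuda.synchronize()) before the clock is read. Otherwise the measurement only captures the time needed to launch the work.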

Fine-Tuning After Pruning Is Not Optional

If you skip fine-tuning, you lose accuracy. The pruning search finds a good architecture, but the weights inside that architecture were optimized for the original, larger network. A few epochs of fine-tuning lets the remaining weights re-adapt to their new structure.

The Workflow Generalizes

The specific model here is ResNet20 on CIFAR-10, but the same pipeline structure carries over to other PyTorch models and datasets, as long as Model Optimizer supports pruning the layer types involved. You replace the model definition, the data loaders, and the score function, and everything else stays the same. That is a genuinely useful reusable pattern.

Wrapping It All Up

Building an end-to-end model optimization pipeline sounds complicated before you have done it, but it breaks down into a series of manageable steps. Train a solid baseline. Run the pruning search with a FLOPs budget. Fine-tune the result. Compare the numbers.

NVIDIA Model Optimizer handles a lot of the hard decisions for you. FastNAS does the architecture search automatically. All you need to do is wire it up correctly, and this guide gives you exactly that.

The most important thing is to trust the process. Pruning drops accuracy at first. That is expected. Fine-tuning brings it back. The final model is smaller, faster, and still good at its job.

That is the whole point of model optimization, and now you know how to do it from scratch.

Common Questions When You First Try This

Why Does My Accuracy Drop So Much After Pruning?

The drop in accuracy immediately after pruning is normal, and it can feel alarming if you are not expecting it. The pruned model has fewer channels than the original, but its weights were never trained in that configuration. Fine-tuning is essentially letting the model catch up. Think of it like moving to a smaller apartment. You have less space, but once you arrange things properly, it still feels like home.

If the accuracy drop is extreme (say, more than 20 percentage points below baseline), there are a few things to check:

Your FLOPs target might be too aggressive. Try relaxing it a bit.
Your score function might not be capturing performance accurately. Make sure val_loader covers a representative sample.
The channel_divisor might be too large relative to your model's channel counts, causing some layers to get pruned to near-zero width.

Can I Use This With Pretrained Models?

Yes. In fact, pruning a pretrained model is often more effective than pruning one trained from scratch on your specific dataset. The pretrained model starts with rich representations, and FastNAS finds which parts of those representations are actually useful for your task.

The process is the same: define your score function on your validation data, set your FLOPs constraint, and run the search. The key difference is that you might want to use a lower fine-tuning learning rate (try 0.01 * batch_size / 128) to preserve the pretrained features.

What Happens If the Search Runs Out of Memory?

FastNAS needs to evaluate many sub-networks during the search phase. If you run out of GPU memory, try reducing the batch size in the score function. You can define a separate smaller data loader just for scoring:

# Reuse the validation subset, but feed it in smaller batches
score_loader = DataLoader(val_loader.dataset, batch_size=64, shuffle=False)

def score_func(model):
    return evaluate(model, score_loader)

This uses less memory per evaluation step. The search will take a little longer, but it will not crash.

Going Further: What You Can Explore Next

Once you are comfortable with the basic pipeline, there are a few directions worth exploring.

Quantization After Pruning

NVIDIA Model Optimizer supports quantization alongside pruning. Quantization reduces the precision of the model's weights and activations, typically from 32-bit floats to 8-bit integers. Combined with pruning, this can cut model size by 4x to 8x compared to the original baseline.

The library exposes quantization through its modelopt.torch.quantization module (commonly imported as mtq), whose mtq.quantize function follows a pattern similar to the pruning API. After you fine-tune your pruned model, you can apply post-training quantization or quantization-aware training on top.
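To show what reducing precision means in practice, here is a minimal sketch of symmetric 8-bit quantization on a handful of toy weights. It is a generic textbook scheme written in plain Python, not necessarily the scheme Model Optimizer applies. The function names are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 0, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # small rounding error
```

Each weight is stored as a one-byte integer plus one shared scale, which is where the roughly 4x size reduction relative to 32-bit floats comes from.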

Trying Different Architectures

ResNet20 is a small model. The same pipeline works on MobileNet, EfficientNet, and larger ResNets. Larger models have more headroom for pruning, which means FastNAS can find sub-networks with much bigger FLOPs reductions while still maintaining high accuracy.

If you want to push the limits, try starting with ResNet50 pretrained on ImageNet and pruning it down to 30 to 40 percent of its original FLOPs. The accuracy recovery during fine-tuning is genuinely impressive.

Automating the FLOPs Target Selection

Rather than picking a FLOPs target manually, you can run a short sweep across a few different targets and plot accuracy versus FLOPs. This gives you a Pareto curve that shows the trade-off clearly. From that curve, you pick the operating point that matches your deployment constraints.

This is especially useful when you are working with hardware that has a specific latency budget. You can measure inference time at different FLOPs levels on your target device, then pick the FLOPs target that keeps you under your latency limit.
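Once a sweep has produced a set of (FLOPs, accuracy) pairs, extracting the Pareto front and choosing an operating point takes only a few lines. The sweep results below are hypothetical numbers, and the function names are assumptions for this sketch.

```python
def pareto_front(points):
    """Keep (flops, accuracy) points that no cheaper-or-equal point matches or beats."""
    front = []
    # Cheapest first; at equal FLOPs, the more accurate point comes first.
    for flops, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or acc > front[-1][1]:
            front.append((flops, acc))
    return front


def best_within(front, flops_limit):
    """Most accurate point on the front whose FLOPs do not exceed the limit."""
    feasible = [p for p in front if p[0] <= flops_limit]
    return max(feasible, key=lambda p: p[1]) if feasible else None


# Hypothetical sweep results: (FLOPs target, test accuracy after fine-tuning).
sweep = [(20e6, 78.0), (30e6, 84.5), (40e6, 84.0), (50e6, 87.0), (60e6, 88.0)]

front = pareto_front(sweep)
print(front)
print(best_within(front, flops_limit=45e6))  # (30000000.0, 84.5)
```

The 40M point drops out because the 30M model is both cheaper and more accurate. With a 45M limit, the 30M model is the best available choice.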

The bottom line is that model optimization is one of the most practical skills you can develop in machine learning right now. Models trained in research settings almost always need to be made smaller before they can run in production. The tools exist to do this well, and after working through this pipeline, you have a concrete foundation to build on.
