How to Create a Full Model Optimization Pipeline with NVIDIA Model Optimizer and FastNAS Pruning Plus Fine-Tuning

If you have ever trained a deep learning model and thought, “this thing is way too heavy to actually run anywhere useful,” you are not alone. Most models that come out of training look great on a benchmark but become a headache the moment you try to run them on real hardware. They are slow, power-hungry, and use more memory than most deployment environments can handle.
That is exactly the problem that model optimization solves. And with NVIDIA's Model Optimizer library and a technique called FastNAS pruning, you can go from a big, bloated model to a lean, deployment-ready one without throwing away all the accuracy you worked so hard to train.
This guide walks you through the entire pipeline, step by step, from training a baseline model on CIFAR-10 to pruning it with FastNAS and fine-tuning it back to a solid accuracy level. All of it runs in Google Colab. No fancy hardware setup required.
What Even Is Model Optimization and Why Should You Care
Let's start from scratch. When you train a neural network, you are basically teaching it to recognize patterns by adjusting millions of small numbers called weights. A bigger network with more weights can often learn more, but it also costs more to run.
Running a large model at inference time (meaning when you actually use it to make predictions) can be expensive in several ways:
- It needs more GPU or CPU memory
- It runs more slowly
- It consumes more energy
- It is harder to deploy on devices with limited resources like smartphones, edge devices, or embedded systems
Model optimization is the process of making a trained model faster and smaller without losing too much of what it learned. There are different ways to do this, and pruning is one of the most popular.
What Is Pruning
Think of pruning like trimming a tree. A tree with too many branches is heavy and harder to manage. You cut the branches that are not contributing much, and the tree grows back healthier and more efficient.
In neural networks, pruning removes filters, channels, or weights that contribute the least to the model's output. You end up with a smaller network that is faster and cheaper to run, while still being reasonably accurate.
The challenge is figuring out which parts to prune. Remove the wrong ones and your model falls apart. Remove the right ones and you barely notice the difference.
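To make the idea of "contributing the least" concrete, here is a minimal sketch that ranks filters by the sum of their absolute weights (the L1 norm) and keeps the strongest ones. This is a simple, well-known heuristic used only for illustration; it is not how FastNAS scores sub-networks, and the filter values and helper names are made up.

```python
# Illustrative heuristic only: rank filters by L1 norm and keep the largest.
# The helper names and weights are hypothetical, not part of any library.

def l1_norm(filter_weights):
    """Sum of absolute weights in one (flattened) filter."""
    return sum(abs(w) for w in filter_weights)


def keep_strongest_filters(filters, keep):
    """Return the indices of the `keep` filters with the largest L1 norm, in original order."""
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]), reverse=True)
    return sorted(ranked[:keep])


# Four hypothetical filters, each flattened to a short list of weights.
filters = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.5, -0.5],
    [0.0, 0.05, 0.0],
]

kept = keep_strongest_filters(filters, keep=2)
print(kept)  # [0, 2]
```

Filters 1 and 3 have weights close to zero, so dropping them changes the layer's output very little. That is the intuition behind pruning the "right" parts.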
Where FastNAS Comes In
FastNAS is a specific pruning strategy from NVIDIA that makes this process smarter. Instead of guessing which filters to remove, it does a structured search across the possible architectures that can be derived from your original model.
It scores different sub-networks (smaller versions of your original model) using actual validation data. Then it picks the one that gives the best balance between accuracy and compute cost, measured in FLOPs (floating point operations, a count of how much arithmetic a single forward pass requires; not to be confused with FLOPS, operations per second, which describes hardware speed).
You give it a FLOPs budget, it finds the best model that fits inside that budget. That is the core idea.
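To get a feel for what a FLOPs budget means, the sketch below counts the multiply-accumulate operations of a single bias-free convolution. Counting conventions vary between tools (some report multiply-accumulates, others count each one as two FLOPs), so treat the absolute numbers as an assumption. The ratio is what matters here.

```python
# A rough sketch of how compute scales with channel width.
# conv2d_macs is a hypothetical helper, not a library function.

def conv2d_macs(in_ch, out_ch, kernel, out_h, out_w):
    """Multiply-accumulate operations for one bias-free 2D convolution."""
    return in_ch * out_ch * kernel * kernel * out_h * out_w


full = conv2d_macs(64, 64, 3, 8, 8)  # a 64 -> 64 channel 3x3 conv on an 8x8 feature map
half = conv2d_macs(32, 32, 3, 8, 8)  # the same layer with half the channels on both sides

print(full)         # 2359296
print(half)         # 589824
print(full / half)  # 4.0
```

Halving the channels on both sides of a convolution cuts its cost by roughly four, which is why channel pruning can hit an aggressive compute budget without removing whole layers.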
Setting Up the Environment
Before writing any model code, you need to install the right packages. The NVIDIA Model Optimizer library is called nvidia-modelopt, and it integrates directly with PyTorch.
# Install required packages
pip install -q nvidia-modelopt torchvision torchprofile tqdm
Once installed, you bring in the core imports. These include PyTorch, torchvision for the dataset and model components, and the modelopt library itself.
import math
import os
import random
import time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset
from torchvision.models.resnet import BasicBlock
from tqdm.auto import tqdm
import modelopt.torch.opt as mto
import modelopt.torch.prune as mtp
Fixing Seeds for Reproducibility
Reproducibility is one of those things that sounds boring until you are three hours into debugging why your model gives different results every run. Fixing your random seeds means that if you run the same code twice, you get the same results.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
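If you want to convince yourself that seeding does what it promises, here is a tiny standard-library sketch. It uses an isolated random.Random generator rather than the global seeds set above, and the draw helper is hypothetical.

```python
import random


def draw(seed, n=3):
    """Draw n integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


first = draw(42)
second = draw(42)
print(first == second)  # True: the same seed gives the same sequence
```

The same principle applies to NumPy and PyTorch, although some GPU operations can still be non-deterministic even with fixed seeds.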
Key Experiment Settings
These parameters control the scale of the experiment. The FAST_MODE flag is useful when you want to run a quick test to make sure everything works before committing to a full training run.
FAST_MODE = True
batch_size = 256 if FAST_MODE else 512
baseline_epochs = 20 if FAST_MODE else 120
finetune_epochs = 12 if FAST_MODE else 120
train_subset_size = 12000 if FAST_MODE else None
val_subset_size = 2000 if FAST_MODE else None
test_subset_size = 4000 if FAST_MODE else None
# Target FLOPs for the pruned model
target_flops = 60e6
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Setting target_flops = 60e6 means you are asking FastNAS to find a sub-network that runs in 60 million FLOPs or fewer. This is your compute budget.
Building the CIFAR-10 Data Pipeline
CIFAR-10 is a classic image classification dataset with 60,000 images across 10 categories. The categories include things like airplanes, dogs, ships, and trucks, all at a tiny 32×32 pixel resolution. It is a perfect benchmark for testing models without needing huge compute.
Data Augmentation and Normalization
When training, you want to augment your data to help the model generalize better. Random flips and crops create slightly different versions of each image, which gives the model more variety to learn from.
def build_cifar10_loaders(
    train_batch_size=256,
    train_subset_size=None,
    val_subset_size=None,
    test_subset_size=None,
):
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616],
    )
    train_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, padding=4),
        normalize,
    ])
    eval_transform = transforms.Compose([
        transforms.ToTensor(),
        normalize,
    ])
The normalization values are the channel-wise mean and standard deviation for the CIFAR-10 dataset. Using these keeps the pixel values in a range that neural networks train better on.
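As a quick worked example, this is what the normalization does to a single pixel. The sketch reimplements the per-channel formula (value - mean) / std in plain Python for illustration; normalize_pixel is a hypothetical helper, not part of torchvision.

```python
# Per-channel statistics copied from the transform above.
MEAN = [0.4914, 0.4822, 0.4465]
STD = [0.2470, 0.2435, 0.2616]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to an RGB pixel scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


mid_gray = normalize_pixel([0.5, 0.5, 0.5])
average_pixel = normalize_pixel(MEAN)

print([round(v, 3) for v in mid_gray])
print(average_pixel)  # [0.0, 0.0, 0.0]
```

A pixel equal to the dataset mean lands exactly at zero, and everything else is expressed in units of standard deviation around it.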
Splitting the Dataset
CIFAR-10 comes with 50,000 training images and 10,000 test images. You split the training data further into a train set and a validation set. The validation set lets you monitor performance during training without touching the test set.
    train_full = torchvision.datasets.CIFAR10(
        root="./data", train=True, transform=train_transform, download=True
    )
    val_full = torchvision.datasets.CIFAR10(
        root="./data", train=True, transform=eval_transform, download=True
    )
    test_full = torchvision.datasets.CIFAR10(
        root="./data", train=False, transform=eval_transform, download=True
    )
    ids = np.arange(len(train_full))
    np.random.shuffle(ids)
    n_train = int(len(train_full) * 0.9)
    train_ids = ids[:n_train]
    val_ids = ids[n_train:]
    # Optional subsetting for faster experiments
    if train_subset_size is not None:
        train_ids = train_ids[:min(train_subset_size, len(train_ids))]
    if val_subset_size is not None:
        val_ids = val_ids[:min(val_subset_size, len(val_ids))]
Creating the DataLoaders
DataLoaders handle the batching, shuffling, and parallelism involved in feeding data to your model during training.
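    num_workers = min(2, os.cpu_count() or 1)
    train_loader = DataLoader(
        Subset(train_full, train_ids.tolist()),
        batch_size=train_batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )
    val_loader = DataLoader(
        Subset(val_full, val_ids.tolist()),
        batch_size=512,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )
    # Build the test loader, optionally on a subset for faster experiments
    test_set = test_full
    if test_subset_size is not None:
        n_test = min(test_subset_size, len(test_full))
        test_set = Subset(test_full, list(range(n_test)))
    test_loader = DataLoader(
        test_set,
        batch_size=512,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )
    return train_loader, val_loader, test_loader

train_loader, val_loader, test_loader = build_cifar10_loaders(
    train_batch_size=batch_size,
    train_subset_size=train_subset_size,
    val_subset_size=val_subset_size,
    test_subset_size=test_subset_size,
)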
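The 90/10 split performed inside build_cifar10_loaders can be checked with plain Python. This is a minimal sketch of the same index arithmetic using the standard library instead of NumPy; split_indices is a hypothetical helper.

```python
import random


def split_indices(n_items, train_fraction=0.9, seed=42):
    """Shuffle 0..n_items-1 and split it into train and validation index lists."""
    ids = list(range(n_items))
    random.Random(seed).shuffle(ids)
    n_train = int(n_items * train_fraction)
    return ids[:n_train], ids[n_train:]


train_ids, val_ids = split_indices(50_000)
print(len(train_ids), len(val_ids))     # 45000 5000
print(set(train_ids).isdisjoint(val_ids))  # True: no image appears in both sets
```

In fast mode the loaders then keep only the first 12,000 training indices and the first 2,000 validation indices, which is why the quick run finishes so much sooner.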
Defining the ResNet20 Architecture
ResNet (Residual Network) is one of the most well-known architectures in deep learning. The key idea is the skip connection, where the input to a layer gets added to its output. This helps with training deeper networks because gradients can flow through both paths during backpropagation.
ResNet20 is a smaller version of ResNet specifically designed for CIFAR-10. It has just 20 layers, which makes it fast to train while still being expressive enough to achieve solid accuracy.
Custom Weight Initialization
Kaiming normal initialization sets up the weights in a way that keeps the variance of activations stable across layers. This leads to faster, more stable training.
def _weights_init(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(m.weight)

class LambdaLayer(nn.Module):
    def __init__(self, lambd):
        super().__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)
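For reference, here is the arithmetic behind Kaiming normal initialization, assuming the PyTorch defaults (fan-in mode with a gain of sqrt(2)). The helper is a hypothetical sketch for intuition and only covers square convolution kernels.

```python
import math


def kaiming_normal_std(in_channels, kernel_size):
    """Standard deviation used for a conv weight: sqrt(2 / fan_in)."""
    fan_in = in_channels * kernel_size * kernel_size
    return math.sqrt(2.0 / fan_in)


# A 3x3 convolution with 16 input channels has fan_in = 16 * 3 * 3 = 144.
std_16 = kaiming_normal_std(16, 3)
std_64 = kaiming_normal_std(64, 3)
print(round(std_16, 4))
print(round(std_64, 4))
```

Wider layers get a smaller standard deviation, which is what keeps the activation variance from growing as signals pass through the network.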
Building the ResNet Architecture
class ResNet(nn.Module):
    def __init__(self, num_blocks, num_classes=10):
        super().__init__()
        self.in_planes = 16
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            self._make_layer(16, num_blocks, stride=1),
            self._make_layer(32, num_blocks, stride=2),
            self._make_layer(64, num_blocks, stride=2),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )
        self.apply(_weights_init)

    def _make_layer(self, planes, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            downsample = None
            if s != 1 or self.in_planes != planes:
                # Shortcut: subsample spatially, then zero-pad the channel dimension
                downsample = LambdaLayer(
                    lambda x: F.pad(
                        x[:, :, ::2, ::2],
                        (0, 0, 0, 0, planes // 4, planes // 4),
                        "constant",
                        0,
                    )
                )
            layers.append(BasicBlock(self.in_planes, planes, s, downsample))
            self.in_planes = planes
        return nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

def resnet20():
    return ResNet(num_blocks=3).to(device)
The LambdaLayer handles the shortcut connection when the dimensions change between blocks. This is the CIFAR-adapted version of the classic ResNet shortcut, which uses zero-padding instead of a 1×1 convolution for efficiency.
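The shape bookkeeping of that shortcut is easy to get wrong, so here is a small sketch that reproduces it without PyTorch. It mirrors the slicing x[:, :, ::2, ::2] and the channel padding of planes // 4 on each side; shortcut_shape is a hypothetical helper.

```python
def shortcut_shape(shape, planes):
    """Shape after taking every second row/column and zero-padding the channels."""
    n, c, h, w = shape
    pad = planes // 4                        # zero channels added on each side
    return (n, c + 2 * pad, (h + 1) // 2, (w + 1) // 2)


stage2 = shortcut_shape((1, 16, 32, 32), planes=32)
stage3 = shortcut_shape((1, 32, 16, 16), planes=64)
print(stage2)  # (1, 32, 16, 16)
print(stage3)  # (1, 64, 8, 8)
```

In both stage transitions the shortcut output matches the shape produced by the strided convolutions in the block, so the two paths can be added together.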
Writing the Training Loop
A solid training loop does several things: it runs the model on batches of data, computes the loss, updates the weights using backpropagation, and tracks the best model so you can restore it later.
The Learning Rate Scheduler
A cosine learning rate schedule gradually decreases the learning rate over training. The warmup phase at the start gives the model time to settle before the learning rate hits its peak.
class CosineLRwithWarmup(torch.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, warmup_steps, decay_steps, warmup_lr=0.0, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.warmup_lr = warmup_lr
        self.decay_steps = max(decay_steps, 1)
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # Linear warmup from warmup_lr up to the base learning rate
        if self.last_epoch < self.warmup_steps:
            return [
                (base_lr - self.warmup_lr) * self.last_epoch / max(self.warmup_steps, 1) + self.warmup_lr
                for base_lr in self.base_lrs
            ]
        # Cosine decay after warmup
        current_steps = self.last_epoch - self.warmup_steps
        return [
            0.5 * base_lr * (1 + math.cos(math.pi * current_steps / self.decay_steps))
            for base_lr in self.base_lrs
        ]
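To see the shape of this schedule without running a training job, the sketch below reimplements the same two formulas as a plain function. The name lr_at_step and the example step counts are assumptions chosen for illustration.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, decay_steps, warmup_lr=0.0):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (base_lr - warmup_lr) * step / max(warmup_steps, 1) + warmup_lr
    progress = (step - warmup_steps) / decay_steps
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


# 10 warmup steps, 100 decay steps, peak learning rate 0.2.
for step in (0, 5, 10, 60, 110):
    print(step, round(lr_at_step(step, 0.2, 10, 100), 4))
```

The rate climbs from zero to the peak over the warmup, then falls along a half cosine. Note that the cosine only reaches zero after warmup_steps + decay_steps steps, so when decay_steps equals the total number of training steps, training ends slightly above zero.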
The Core Training Functions
def train_one_epoch(model, loader, optimizer, scheduler, loss_fn=None):
    model.train()
    running_loss, total = 0.0, 0
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        outputs = model(images)
        loss = F.cross_entropy(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        scheduler.step()  # the scheduler advances once per batch, not per epoch
        running_loss += loss.item() * labels.size(0)
        total += labels.size(0)
    return running_loss / max(total, 1)

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / max(total, 1)
The Full Training Pipeline
def train_model(model, train_loader, val_loader, epochs, ckpt_path, lr=None, weight_decay=1e-4):
    if lr is None:
        lr = 0.1 * batch_size / 128
    steps_per_epoch = len(train_loader)
    warmup_steps = max(1, 2 * steps_per_epoch if FAST_MODE else 5 * steps_per_epoch)
    decay_steps = max(1, epochs * steps_per_epoch)
    optimizer = torch.optim.SGD(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=lr, momentum=0.9, weight_decay=weight_decay,
    )
    scheduler = CosineLRwithWarmup(optimizer, warmup_steps, decay_steps)
    best_val = -1.0
    for epoch in tqdm(range(1, epochs + 1)):
        train_loss = train_one_epoch(model, train_loader, optimizer, scheduler)
        val_acc = evaluate(model, val_loader)
        # Save a checkpoint whenever validation accuracy matches or beats the best so far
        if val_acc >= best_val:
            best_val = val_acc
            torch.save(model.state_dict(), ckpt_path)
        if epoch % max(1, epochs // 4) == 0 or epoch == epochs:
            print(f"Epoch {epoch:03d} | loss={train_loss:.4f} | val_acc={val_acc:.2f}%")
    # Restore the best checkpoint before returning
    model.load_state_dict(torch.load(ckpt_path, map_location=device))
    print(f"Best val accuracy: {best_val:.2f}%")
    return model, best_val
This function wraps everything together: optimizer setup, scheduling, training, validation, and checkpoint saving. The best_val tracking ensures you always restore the best version of the model, not just the last one.
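The checkpoint rule is worth a second look because it uses >= rather than >. The sketch below replays the same comparison on a made-up list of validation accuracies; best_epoch is a hypothetical helper and the numbers are invented.

```python
def best_epoch(val_accuracies):
    """Return (epoch, accuracy) of the checkpoint that would be on disk at the end."""
    best_val, best_ep = -1.0, None
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= best_val:  # same comparison as in train_model
            best_val, best_ep = acc, epoch
    return best_ep, best_val


saved = best_epoch([70.0, 75.0, 75.0, 74.0])
print(saved)  # (3, 75.0)
```

When two epochs tie, the later one overwrites the earlier checkpoint, so the restored weights come from the most recent epoch that reached the best accuracy.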
Training the Baseline Model
Now you put everything together and train the full model from scratch. This is your reference point, the model before any pruning happens.
baseline_model = resnet20()
baseline_ckpt = "resnet20_baseline.pth"
start = time.time()
baseline_model, baseline_val = train_model(
baseline_model,
train_loader,
val_loader,
epochs=baseline_epochs,
ckpt_path=baseline_ckpt,
lr=0.1 * batch_size / 128,
weight_decay=1e-4,
)
baseline_test = evaluate(baseline_model, test_loader)
baseline_time = time.time() - start
print(f"Baseline validation accuracy: {baseline_val:.2f}%")
print(f"Baseline test accuracy: {baseline_test:.2f}%")
print(f"Baseline training time: {baseline_time/60:.2f} min")
When this finishes, you have a working ResNet20 with a solid accuracy number. In fast mode with 20 epochs, you can expect somewhere around 80 to 85 percent test accuracy on the subset. With full training at 120 epochs, it gets significantly higher.
This checkpoint is the starting point for pruning.
Applying FastNAS Pruning
Here is where the real optimization work happens. FastNAS performs a structured search over possible sub-networks of your trained model, guided by a scoring function and constrained by your FLOPs budget.
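Stripped of all the details, a budget-constrained search boils down to "keep the best-scoring candidate that fits the budget". The toy sketch below shows only that selection step. It is not FastNAS's actual algorithm, and the candidate names, FLOPs, and scores are invented.

```python
def select_under_budget(candidates, max_flops):
    """Pick the highest-scoring candidate whose FLOPs fit the budget, or None."""
    feasible = [c for c in candidates if c["flops"] <= max_flops]
    return max(feasible, key=lambda c: c["score"]) if feasible else None


# Hypothetical sub-networks with made-up costs and validation scores.
candidates = [
    {"name": "full", "flops": 82e6, "score": 85.0},
    {"name": "medium", "flops": 58e6, "score": 83.1},
    {"name": "small", "flops": 40e6, "score": 79.4},
]

best = select_under_budget(candidates, max_flops=60e6)
print(best["name"])  # medium
```

The real search has to explore a far larger space than three candidates, which is where the library earns its keep.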
Configuring FastNAS
fastnas_cfg = mtp.fastnas.FastNASConfig()
fastnas_cfg["nn.Conv2d"]["*"]["channel_divisor"] = 16
fastnas_cfg["nn.BatchNorm2d"]["*"]["feature_divisor"] = 16
The channel_divisor and feature_divisor settings control how granular the search is. Setting them to 16 means the number of channels in each layer can only be pruned in multiples of 16. This keeps the resulting architecture hardware-friendly.
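As a rough mental model, you can list the widths a layer could be pruned to under a given divisor. The sketch assumes candidates are simply the multiples of the divisor up to the original width; the library's exact search space may differ, and candidate_widths is a hypothetical helper.

```python
def candidate_widths(original_channels, divisor):
    """Multiples of `divisor` up to the original channel count (an assumed search space)."""
    return list(range(divisor, original_channels + 1, divisor))


print(candidate_widths(64, 16))  # [16, 32, 48, 64]
print(candidate_widths(32, 16))  # [16, 32]
print(candidate_widths(16, 16))  # [16]
```

Under this assumption, the 16-channel layers in the first stage of ResNet20 have no room to shrink with a divisor of 16, so most of the savings would have to come from the 32-channel and 64-channel stages. A smaller divisor gives the search more options at the cost of less regular layer widths.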
Fixing a FLOPs Profiling Compatibility Issue
NVIDIA Model Optimizer uses torchprofile to measure FLOPs. Depending on your version, you might hit an attribute error. This patch prevents that:
import torchprofile.profile as tp_profile
from torchprofile.handlers import HANDLER_MAP
if not hasattr(tp_profile, "handlers"):
    tp_profile.handlers = tuple(
        (tuple([op_name]), handler)
        for op_name, handler in HANDLER_MAP.items()
    )
This is one of those things that is not in the official docs but will save you 20 minutes of frustrated Googling.
Running the Pruning Search
dummy_input = torch.randn(1, 3, 32, 32, device=device)
def score_func(model):
    return evaluate(model, val_loader)
model_for_prune = resnet20()
model_for_prune.load_state_dict(torch.load(baseline_ckpt, map_location=device))
search_ckpt = "modelopt_search_checkpoint_fastnas.pth"
pruned_ckpt = "modelopt_pruned_model_fastnas.pth"
prune_start = time.time()
pruned_model, pruned_metadata = mtp.prune(
model=model_for_prune,
mode=[("fastnas", fastnas_cfg)],
constraints={"flops": target_flops},
dummy_input=dummy_input,
config={
"data_loader": train_loader,
"score_func": score_func,
"checkpoint": search_ckpt,
},
)
mto.save(pruned_model, pruned_ckpt)
prune_elapsed = time.time() - prune_start
pruned_test_before_ft = evaluate(pruned_model, test_loader)
print(f"Pruned model test accuracy before fine-tune: {pruned_test_before_ft:.2f}%")
print(f"Pruning/search time: {prune_elapsed/60:.2f} min")
A few things to note here:
- You load the baseline model weights before pruning. FastNAS needs a trained model to score sub-networks meaningfully.
- The score_func is a function that takes a model and returns a score. Here you use validation accuracy, which is the most direct measure of how good a sub-network is.
- The dummy_input is just a sample tensor that helps the library profile FLOPs before the search starts.
- mto.save saves the pruned model in NVIDIA's format, which preserves the model's pruned structure for later restoration.
After this step, the model's accuracy will typically drop a bit. That is expected. The fine-tuning step brings it back up.
Restoring and Fine-Tuning the Pruned Model
The mto.save and mto.restore pair is important because the pruned model has a modified architecture. You cannot just load it with standard PyTorch state dict loading. You need to restore it properly.
restored_pruned_model = resnet20()
restored_pruned_model = mto.restore(restored_pruned_model, pruned_ckpt)
restored_test = evaluate(restored_pruned_model, test_loader)
print(f"Restored pruned model test accuracy: {restored_test:.2f}%")
Running Fine-Tuning
Fine-tuning re-trains the pruned model for a few more epochs. The key difference from the original training is the lower starting learning rate. Since the model already has some knowledge baked in from the baseline training, you do not want to overwrite it with large gradient updates.
finetune_ckpt = "resnet20_pruned_finetuned.pth"
start_ft = time.time()
restored_pruned_model, pruned_val_after_ft = train_model(
restored_pruned_model,
train_loader,
val_loader,
epochs=finetune_epochs,
ckpt_path=finetune_ckpt,
lr=0.05 * batch_size / 128,
weight_decay=1e-4,
)
pruned_test_after_ft = evaluate(restored_pruned_model, test_loader)
ft_time = time.time() - start_ft
print(f"Fine-tuned pruned validation accuracy: {pruned_val_after_ft:.2f}%")
print(f"Fine-tuned pruned test accuracy: {pruned_test_after_ft:.2f}%")
print(f"Fine-tuning time: {ft_time/60:.2f} min")
Half the base learning rate is a reasonable starting point for fine-tuning. The cosine scheduler will still anneal the rate over the fine-tuning epochs, so the model adapts smoothly.
Comparing and Saving the Results
At the end of all of this, you want to see how the pruned model stacks up against the baseline. Two numbers matter most: accuracy and parameter count.
def count_params(model):
    return sum(p.numel() for p in model.parameters())

def count_nonzero_params(model):
    return sum((p.detach() != 0).sum().item() for p in model.parameters())
baseline_params = count_params(baseline_model)
pruned_params = count_params(restored_pruned_model)
print("=" * 60)
print("FINAL SUMMARY")
print("=" * 60)
print(f"Baseline test accuracy: {baseline_test:.2f}%")
print(f"Pruned test accuracy before finetune: {pruned_test_before_ft:.2f}%")
print(f"Pruned test accuracy after finetune: {pruned_test_after_ft:.2f}%")
print("-" * 60)
print(f"Baseline total params: {baseline_params:,}")
print(f"Pruned total params: {pruned_params:,}")
print("-" * 60)
print(f"Baseline train time: {baseline_time/60:.2f} min")
print(f"Pruning/search time: {prune_elapsed/60:.2f} min")
print(f"Pruned finetune time: {ft_time/60:.2f} min")
print("=" * 60)
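A percentage is often easier to read than two raw parameter counts. The sketch below adds a hypothetical reduction_pct helper and tries it on a single convolution whose channels shrink from 64 to 48; the layer sizes are an example, not results from this run.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weight count of a bias-free 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel


def reduction_pct(baseline, pruned):
    """Percentage of parameters removed relative to the baseline."""
    return 100.0 * (baseline - pruned) / baseline


before = conv_params(64, 64, 3)
after = conv_params(48, 48, 3)
print(before, after)                 # 36864 20736
print(reduction_pct(before, after))  # 43.75
```

With the real models you would call reduction_pct(baseline_params, pruned_params) using the counts computed above.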
Saving Your Final Artifacts
torch.save(baseline_model.state_dict(), "baseline_resnet20_final_state_dict.pth")
mto.save(restored_pruned_model, "pruned_resnet20_final_modelopt.pth")
print("Saved files:")
print(" - baseline_resnet20_final_state_dict.pth")
print(" - modelopt_pruned_model_fastnas.pth")
print(" - pruned_resnet20_final_modelopt.pth")
print(" - modelopt_search_checkpoint_fastnas.pth")
Keep the baseline and the fine-tuned pruned checkpoints. The baseline is useful for comparison. The fine-tuned pruned model is what you actually deploy.
Validating Predictions on Real Samples
It is always a good idea to visually inspect a few predictions rather than just looking at aggregate accuracy numbers. This function grabs a batch from the test loader and prints the predicted versus actual labels.
@torch.no_grad()
def show_sample_predictions(model, loader, n=8):
    model.eval()
    class_names = [
        "airplane", "automobile", "bird", "cat", "deer",
        "dog", "frog", "horse", "ship", "truck"
    ]
    images, labels = next(iter(loader))
    images = images[:n].to(device)
    labels = labels[:n]
    preds = model(images).argmax(dim=1).cpu()
    print("\nSample predictions:")
    for i in range(len(preds)):
        status = "OK" if preds[i] == labels[i] else "WRONG"
        print(f"{i:02d} | pred={class_names[preds[i]]:<12} | true={class_names[labels[i]]} | {status}")

show_sample_predictions(restored_pruned_model, test_loader, n=8)
Seeing the model still tell airplanes apart from horses after pruning is genuinely satisfying.
What You Actually Learned Here
Let's take a step back and look at what this pipeline taught you beyond just the code.
Pruning Is Not Destruction
A lot of people assume pruning will ruin a model. But structured pruning, done correctly with a good scoring function and real validation data guiding the search, is surprisingly gentle. You trim out the parts the model barely uses, and what remains is often nearly as capable.
FLOPs Are a Useful Proxy for Speed
FLOPs do not perfectly predict inference time on every piece of hardware, but they are a reliable proxy. A model with half the FLOPs will generally be faster. Combined with structured pruning (which removes entire channels rather than individual weights), the speedup is real and measurable.
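Since FLOPs are only a proxy, it is worth timing the model on the hardware you care about. Here is a minimal, framework-agnostic timing sketch; measure_latency_ms is a hypothetical helper and the workload is a stand-in for a model's forward pass.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, runs=10):
    """Median wall-clock time of fn() in milliseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; in practice fn would run the model on a fixed input batch.
latency = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"{latency:.3f} ms")
```

When timing a PyTorch model on a GPU, remember that kernels run asynchronously, so you would need to call torch.cuda.synchronize() inside fn before the timer is read.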
Fine-Tuning After Pruning Is Not Optional
If you skip fine-tuning, you lose accuracy. The pruning search finds a good architecture, but the weights inside that architecture were optimized for the original, larger network. A few epochs of fine-tuning lets the remaining weights re-adapt to their new structure.
The Workflow Generalizes
The specific model here is ResNet20 on CIFAR-10, but the same pipeline carries over to other PyTorch models and datasets. You replace the model definition, the data loaders, and the score function, and the rest of the workflow stays largely the same. That is a genuinely useful reusable pattern.
Wrapping It All Up
Building an end-to-end model optimization pipeline sounds complicated before you have done it, but it breaks down into a series of manageable steps. Train a solid baseline. Run the pruning search with a FLOPs budget. Fine-tune the result. Compare the numbers.
NVIDIA Model Optimizer handles a lot of the hard decisions for you. FastNAS does the architecture search automatically. All you need to do is wire it up correctly, and this guide gives you exactly that.
The most important thing is to trust the process. Pruning drops accuracy at first. That is expected. Fine-tuning brings it back. The final model is smaller, faster, and still good at its job.
That is the whole point of model optimization, and now you know how to do it from scratch.
Common Questions When You First Try This
Why Does My Accuracy Drop So Much After Pruning?
The drop in accuracy immediately after pruning is normal, and it can feel alarming if you are not expecting it. The pruned model has fewer channels than the original, but its weights were never trained in that configuration. Fine-tuning is essentially letting the model catch up. Think of it like moving to a smaller apartment. You have less space, but once you arrange things properly, it still feels like home.
If the accuracy drop is extreme (say, more than 20 percentage points below baseline), there are a few things to check:
- Your FLOPs target might be too aggressive. Try relaxing it a bit.
- Your score function might not be capturing performance accurately. Make sure val_loader covers a representative sample.
- The channel_divisor might be too large relative to your model's channel counts, causing some layers to get pruned to near-zero width.
Can I Use This With Pretrained Models?
Yes. In fact, pruning a pretrained model is often more effective than pruning one trained from scratch on your specific dataset. The pretrained model starts with rich representations, and FastNAS finds which parts of those representations are actually useful for your task.
The process is the same: define your score function on your validation data, set your FLOPs constraint, and run the search. The key difference is that you might want to use a lower fine-tuning learning rate (try 0.01 * batch_size / 128) to preserve the pretrained features.
What Happens If the Search Runs Out of Memory?
FastNAS needs to evaluate many sub-networks during the search phase. If you run out of GPU memory, try reducing the batch size in the score function. You can define a separate smaller data loader just for scoring:
# Reuse the validation subset with a smaller batch size just for scoring
score_loader = DataLoader(val_loader.dataset, batch_size=64, shuffle=False)

def score_func(model):
    return evaluate(model, score_loader)
This uses less memory per evaluation step. The search will take a little longer, but it will not crash.
Going Further: What You Can Explore Next
Once you are comfortable with the basic pipeline, there are a few directions worth exploring.
Quantization After Pruning
NVIDIA Model Optimizer supports quantization alongside pruning. Quantization reduces the precision of the model's weights and activations, typically from 32-bit floats to 8-bit integers. Combined with pruning, this can cut model size by 4x to 8x compared to the original baseline.
The library exposes quantization through the modelopt.torch.quantization module (commonly imported as mtq), whose mtq.quantize function plays a role similar to mtp.prune on the pruning side. After you fine-tune your pruned model, you can apply post-training quantization or quantization-aware training on top.
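To show what "reducing precision" means in practice, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is a textbook scheme for illustration, not the method Model Optimizer applies, and the helper names and weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 0, 100]
print([round(v, 4) for v in restored])
```

Each weight now needs 8 bits instead of 32, and the rounding error per value is at most half the scale. That 4x storage saving is where the lower end of the size reduction mentioned above comes from.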
Trying Different Architectures
ResNet20 is a small model. The same pipeline works on MobileNet, EfficientNet, and larger ResNets. Larger models have more headroom for pruning, which means FastNAS can find sub-networks with much bigger FLOPs reductions while still maintaining high accuracy.
If you want to push the limits, try starting with ResNet50 pretrained on ImageNet and pruning it down to 30 to 40 percent of its original FLOPs. The accuracy recovery during fine-tuning is genuinely impressive.
Automating the FLOPs Target Selection
Rather than picking a FLOPs target manually, you can run a short sweep across a few different targets and plot accuracy versus FLOPs. This gives you a Pareto curve that shows the trade-off clearly. From that curve, you pick the operating point that matches your deployment constraints.
This is especially useful when you are working with hardware that has a specific latency budget. You can measure inference time at different FLOPs levels on your target device, then pick the FLOPs target that keeps you under your latency limit.
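Once you have sweep results, extracting the Pareto curve takes only a few lines. The sketch below keeps every point that no other point beats on both FLOPs and accuracy; the sweep numbers are invented and pareto_front is a hypothetical helper.

```python
def pareto_front(points):
    """Keep (flops, accuracy) points that are not dominated by any other point."""
    front = []
    for flops, acc in points:
        dominated = any(
            other_flops <= flops and other_acc >= acc
            and (other_flops < flops or other_acc > acc)
            for other_flops, other_acc in points
        )
        if not dominated:
            front.append((flops, acc))
    return sorted(front)


# Hypothetical sweep results: (FLOPs, test accuracy in percent).
sweep = [(30e6, 78.0), (45e6, 82.5), (60e6, 84.0), (60e6, 83.0), (75e6, 83.5)]
front = pareto_front(sweep)
print(front)  # [(30000000.0, 78.0), (45000000.0, 82.5), (60000000.0, 84.0)]
```

The two dropped points cost at least as much as the 60M-FLOPs model while being less accurate, so there is no reason to deploy them.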
The bottom line is that model optimization is one of the most practical skills you can develop in machine learning right now. Models trained in research settings almost always need to be made smaller before they can run in production. The tools exist to do this well, and after working through this pipeline, you have a concrete foundation to build on.