Transfer Learning and Pre-trained Models
Transfer learning lets a model reuse knowledge from a model pre-trained on a large dataset, applying it to a new, related task. Instead of training from scratch, you start from a model that has already learned useful features. This is especially powerful when you have limited data for your specific task, letting you build effective models with less data and computation. Transfer learning is widely used in computer vision, NLP, and other domains.
Pre-trained models such as BERT and GPT for language and ResNet and VGG for vision provide powerful starting points. They were trained on massive datasets (ImageNet for vision; Wikipedia and books for language) and learned general features useful for many tasks. Pre-trained models are available through model hubs (Hugging Face, TensorFlow Hub, PyTorch Hub), giving you access to state-of-the-art architectures while saving time and computational resources.
Fine-tuning adapts a pre-trained model to a specific task by training only some layers (typically the final ones) while keeping early layers frozen. Early layers learn general features (edges, textures, basic language patterns), while later layers learn task-specific features. Because the model already encodes useful features, fine-tuning requires far less data than training from scratch and converges faster.
Feature extraction uses a pre-trained model as a fixed feature extractor: you remove the final classification layers and feed the model's intermediate representations as features into a new classifier. This approach is useful when you have very little data or when the pre-trained model's features apply directly to your task. Feature extraction is simpler than fine-tuning but less flexible, making it a good fallback when fine-tuning isn't feasible.
Transfer learning reduces training time and improves performance compared to training from scratch, especially with limited data. A pre-trained model provides good initialization, leading to faster convergence and better final performance. This makes transfer learning essential for real-world applications where large labeled datasets aren't available, and it has become standard practice in deep learning.
Best practices include choosing a pre-trained model from a domain close to yours, freezing early layers during fine-tuning, using lower learning rates for pre-trained layers than for new ones, unfreezing layers gradually, and knowing when feature extraction suffices versus full fine-tuning. Applied well, transfer learning is one of the most practical techniques in modern deep learning.
Key Concepts
- Transfer learning leverages knowledge from pre-trained models.
- Pre-trained models (BERT, GPT, ResNet) provide powerful starting points.
- Fine-tuning adapts pre-trained models to specific tasks.
- Feature extraction uses pre-trained models as fixed feature extractors.
- Transfer learning reduces training time and improves performance.
Learning Objectives
Master
- Understanding transfer learning concepts and benefits
- Using pre-trained models for new tasks
- Fine-tuning pre-trained models effectively
- Choosing between feature extraction and fine-tuning
Develop
- Transfer learning thinking
- Understanding when to use pre-trained models
- Designing effective transfer learning workflows
Tips
- Use pre-trained models when you have limited data—they're powerful.
- Freeze early layers during fine-tuning to preserve learned features.
- Use lower learning rates for fine-tuning than training from scratch.
- Choose pre-trained models from your domain when possible.
Common Pitfalls
- Fine-tuning all layers when freezing early layers would suffice.
- Using learning rates that are too high, destroying pre-trained features.
- Not using pre-trained models when they would help.
- Using pre-trained models from different domains inappropriately.
Summary
- Transfer learning leverages pre-trained models for new tasks.
- Pre-trained models provide powerful starting points.
- Fine-tuning adapts models to specific tasks with less data.
- Understanding transfer learning enables efficient model development.
- Transfer learning is essential for many real-world applications.
Exercise
Use a pre-trained model for image classification with fine-tuning.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt

# Load a pre-trained ResNet model (the weights argument replaces the
# deprecated pretrained=True in torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for our task
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Standard ImageNet preprocessing (apply when loading real images)
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# For demonstration, we'll create synthetic data.
# In practice, you'd load real images and apply `transform` above.
class SyntheticDataset(Dataset):
    def __init__(self, num_samples=100):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Random tensor with the shape of a preprocessed image
        image = torch.randn(3, 224, 224)
        label = torch.randint(0, num_classes, (1,)).item()
        return image, label

# Create datasets and loaders
train_dataset = SyntheticDataset(100)
val_dataset = SyntheticDataset(20)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)

# Training setup: only the new final layer is optimized
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)

# Training loop
num_epochs = 5
train_losses = []
val_losses = []
val_accuracies = []

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    train_losses.append(running_loss / len(train_loader))
    val_losses.append(val_loss / len(val_loader))
    val_accuracies.append(100 * correct / total)
    print(f'Epoch {epoch+1}/{num_epochs}:')
    print(f'Training Loss: {train_losses[-1]:.4f}')
    print(f'Validation Loss: {val_losses[-1]:.4f}')
    print(f'Validation Accuracy: {val_accuracies[-1]:.2f}%')

# Plot training progress
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(val_accuracies, label='Accuracy')
plt.title('Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.tight_layout()
plt.show()
print("Transfer learning completed!")