What is deep learning?
Deep learning is a subfield of machine learning that uses multi-layer neural networks to learn hierarchical representations directly from raw data. Rather than manually engineering features (edges, textures, shapes), a deep network learns them automatically during training. In computer vision, this matters enormously:
- Traditional pipelines relied on hand-crafted descriptors (SIFT, HOG, LBP).
- Deep networks learn task-specific features end-to-end, from pixels to predictions.
- With enough data and compute, deep models consistently outperform classical methods.
Neural network basics
Layers and activations
A neural network is a composition of layers, each performing an affine transformation followed by a non-linear activation function: a = σ(Wx + b), where W and b are the layer's learned weights and bias, and σ is the activation. Common activation functions:

| Activation | Formula | Use case |
|---|---|---|
| ReLU | max(0, x) | Hidden layers (default choice) |
| Sigmoid | 1 / (1 + e^(−x)) | Binary outputs |
| Softmax | e^(z_i) / Σ_j e^(z_j) | Multi-class classification |
| Tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | Hidden layers, GANs |
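The activations in the table are one-liners in PyTorch. A small sketch, using a hand-picked input tensor for illustration:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

relu = torch.relu(x)             # max(0, x): negatives clipped to zero -> [0., 0., 0., 1., 3.]
sig = torch.sigmoid(x)           # 1 / (1 + e^(-x)): squashes each value into (0, 1)
tanh = torch.tanh(x)             # squashes each value into (-1, 1)
soft = torch.softmax(x, dim=0)   # normalized exponentials; sums to 1 (up to float precision)
```

Note that softmax operates over a whole vector (the `dim` argument picks the axis), while the others apply element-wise.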
Backpropagation
Training minimizes a loss function by computing gradients via the chain rule and updating weights with gradient descent: w ← w − η∇L(w), where η is the learning rate. Modern optimizers like Adam adapt the learning rate per parameter.

Overfitting and regularization
Common techniques to combat overfitting:
- Dropout: randomly zero out activations during training.
- Batch normalization: normalize layer inputs to stabilize training.
- Data augmentation: artificially expand the training set with transforms.
- Weight decay (L2 regularization): penalize large weights.
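A minimal PyTorch training step ties these pieces together: backpropagation, an Adam update, dropout, and weight decay. The model, batch, and hyperparameters below are illustrative, not from the course:

```python
import torch
import torch.nn as nn

# Tiny classifier with dropout as a regularizer (layer sizes are illustrative).
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training
    nn.Linear(64, 3),
)

# Adam adapts the learning rate per parameter; weight_decay adds L2 regularization.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 20)         # dummy batch of 8 feature vectors
y = torch.randint(0, 3, (8,))  # dummy class labels

model.train()                  # enables dropout
logits = model(x)
loss = loss_fn(logits, y)

optimizer.zero_grad()
loss.backward()                # backpropagation: gradients via the chain rule
optimizer.step()               # gradient-descent-style weight update
```

Calling `model.eval()` at inference time disables dropout; data augmentation would happen in the data-loading pipeline rather than in this loop.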
Course deep learning roadmap
This chapter follows a natural progression from the simplest deep learning tasks to the most advanced:
- Image classification with CNNs — the foundation. Learn convolutions, pooling, transfer learning.
- Object detection with YOLO — locate and classify multiple objects in a scene.
- Facial analysis — detection, recognition, and social attribute estimation.
- Segmentation with UNet — pixel-level labeling using encoder-decoder networks.
- Generative models — GANs and diffusion models for image synthesis.
- Vision Transformers — attention-based architectures and multimodal models.
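As a preview of the first topic, here is a minimal sketch of a convolutional classifier in PyTorch. The architecture and sizes are illustrative only, not a network used in the course:

```python
import torch
import torch.nn as nn

# Minimal CNN: two conv/pool stages, then a linear classifier (sizes illustrative).
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learns local filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))  # flatten spatial dims, then classify

out = TinyCNN()(torch.randn(4, 3, 32, 32))  # batch of 4 RGB 32x32 images
# out has shape (4, 10): one score per class per image
```

The later topics (detection, segmentation, generative models, transformers) build on exactly this kind of convolutional feature extractor.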
Key frameworks
PyTorch (primary)
PyTorch is the primary framework used in this course. It provides dynamic computation graphs, an intuitive Python API, and excellent GPU support.

TensorFlow / Keras
TensorFlow with the Keras API is a popular alternative, especially in production deployments. The course uses PyTorch, but many pretrained models and notebooks are available in both frameworks.

Supplementary resources
This chapter builds on neural network fundamentals from the Pattern Recognition course. Review those notes before starting if you are new to neural networks.
Neural Network Notes (Patrones)
Supplementary notes on neural networks from the patterns course — essential background reading.
Video: Introduction to Deep Learning
Recorded lecture (2021) covering deep learning foundations for computer vision.
Sub-topics in this chapter
Image Classification with CNNs
Convolutional architectures, transfer learning, and PyTorch training loops.
Object Detection with YOLO
Real-time detection, tracking, and anomaly detection.
Facial Analysis
Face detection, recognition with ArcFace/AdaFace, and social attribute estimation.
Image Segmentation with UNet
Semantic and instance segmentation using encoder-decoder architectures.
Generative Models
GANs, DCGAN, and diffusion-based image synthesis with Stable Diffusion.
Vision Transformers
Self-attention, ViT, CLIP, and HuggingFace models for vision tasks.
