1) Intuition & Formalism
Deep learning learns a function f(x; θ) that maps inputs to outputs by stacking many layers. Each layer performs an affine transform plus a non-linearity:
z^(l) = W^(l) · a^(l-1) + b^(l)
a^(l) = σ(z^(l))   (σ = ReLU, GELU, etc.)
With enough data and capacity, the network discovers useful representations to minimize a loss (e.g., cross-entropy) between predictions and targets. Early layers capture simple patterns; deeper layers capture semantics.
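To make the layer equations concrete, here is a minimal NumPy sketch of a forward pass through a two-layer stack; the shapes and the ReLU choice are illustrative, not prescriptive.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Run one forward pass through a stack of (W, b) layers."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)        # hidden layers: affine transform + non-linearity
    W_out, b_out = params[-1]
    return W_out @ a + b_out       # final layer left linear (logits)

# Toy two-layer network: 4 inputs -> 8 hidden units -> 3 outputs
rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(3, 8)), np.zeros(3))]
logits = forward(rng.normal(size=4), params)
print(logits.shape)  # (3,)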
Why depth? Many real-world functions are compositional. Depth lets networks reuse features and represent complex functions more efficiently than a single shallow layer.
2) Optimization & Backpropagation
Training minimizes a loss ℒ(θ) over data by iterative gradient updates:
θ ← θ − η · ∇_θ ℒ(θ)   (η = learning rate)
Backprop in plain steps (a minimal training-step sketch in code follows the list)
- Forward pass: compute predictions layer by layer.
- Loss: measure error between predictions and ground truth.
- Backward pass: use chain rule to compute gradients for each weight.
- Update: optimizer (SGD/Adam) nudges weights to reduce loss.
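One manual training step in TensorFlow ties these four steps together; the tiny model, random batch, and hyperparameters below are placeholders for illustration only.

import tensorflow as tf

# Tiny model and a synthetic batch, purely to show the mechanics.
model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(2)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9)

x_batch = tf.random.normal((32, 8))
y_batch = tf.random.uniform((32,), maxval=2, dtype=tf.int32)

with tf.GradientTape() as tape:
    logits = model(x_batch, training=True)                       # forward pass
    loss = loss_fn(y_batch, logits)                              # loss
grads = tape.gradient(loss, model.trainable_variables)           # backward pass (chain rule)
opt.apply_gradients(zip(grads, model.trainable_variables))       # update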
Popular optimizers
- SGD + Momentum: stable, strong generalization; needs LR schedules.
- Adam/AdamW: adaptive per-parameter learning rates; fast convergence; AdamW separates weight decay for better generalization.
- Schedulers: cosine decay, warmup, and One-Cycle improve stability.
Batch Normalization / LayerNorm stabilize activations, allowing higher learning rates and deeper networks.
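As a sketch of what a normalization layer computes, here is layer normalization in NumPy; γ and β stand for the learned scale and shift, and in practice you would use the framework's built-in layer.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature dimension, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(4, 16)                          # batch of 4 samples, 16 features
out = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(axis=-1).round(6))                   # ≈ 0 per sample after normalization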
3) Regularization & Generalization
- Data augmentation: flips/crops/color jitter (vision), masking/time-warping (audio), synonym replacement/back-translation (text).
- Dropout: randomly zero activations to prevent co-adaptation.
- Weight decay (L2): penalize large weights; often via AdamW.
- Early stopping: stop when validation metric degrades.
- MixUp/CutMix/Label smoothing: encourage smoother decision boundaries.
Bias–variance trade-off: Larger models fit better but can overfit; regularization and more data reduce variance.
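A minimal Keras sketch that combines several of these knobs (dropout, weight decay via AdamW, label smoothing, early stopping); the layer sizes and coefficients are illustrative, and x_train/y_train_onehot are assumed to exist.

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, losses, callbacks

model = models.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                                                 # dropout
    layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),   # weight decay
    loss=losses.CategoricalCrossentropy(label_smoothing=0.05),           # label smoothing
    metrics=["accuracy"],
)

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)          # early stopping
# model.fit(x_train, y_train_onehot, validation_split=0.1, epochs=100,
#           callbacks=[early_stop])   # x_train / y_train_onehot are hypothetical data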
4) Architectures: What Really Happens Inside
CNNs (Images & Video)
Convolutions slide kernels across an image to learn local features. Key knobs: kernel size, stride, and padding. Stacking layers grows the receptive field to capture global context. Modern blocks add residual connections to ease optimization.
Output size: ⌊(W − K + 2P)/S⌋ + 1
where W = input width, K = kernel size, P = padding, S = stride.
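A quick sanity check of the output-size formula with a Keras Conv2D layer; the kernel, stride, and input size below are arbitrary examples.

import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 128, 128, 3))                       # W = 128
y = layers.Conv2D(16, kernel_size=3, strides=2, padding="valid")(x)
print(y.shape)                                       # (1, 63, 63, 16)

# Formula: floor((W - K + 2P)/S) + 1 = floor((128 - 3 + 0)/2) + 1 = 63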
RNNs, LSTMs, GRUs (Sequences)
RNNs process inputs one timestep at a time; LSTMs/GRUs add gates to preserve long-range information.
i_t = σ(W_i x_t + U_i h_{t-1} + b_i) (input gate)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f) (forget gate)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g) (candidate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t (cell state)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o) (output gate)
h_t = o_t ⊙ tanh(c_t)   (hidden state)
Use for time-series, speech, and small/medium sequence tasks when Transformers are overkill.
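A minimal Keras sketch of an LSTM sequence classifier; the sequence length, feature count, and layer width are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

# Example: classify sequences of 100 timesteps with 8 features each into 3 classes.
model = models.Sequential([
    layers.Input(shape=(100, 8)),
    layers.LSTM(64),                                # gated recurrence over timesteps
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()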
Transformers (NLP, Vision, Audio, Multimodal)
Transformers replace recurrence with self-attention, letting each token attend to any other. Core operation:
Attention(Q, K, V) = softmax( (Q Kᵀ) / √d_k ) V
Q = X W_Q,  K = X W_K,  V = X W_V
Complexity: vanilla self-attention is O(n²) in sequence length; long-context models use sparse/windowed/linear attention to scale.
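A minimal NumPy sketch of single-head scaled dot-product attention (no masking or multi-head splitting, for clarity); the token count and dimensions are toy values.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) similarity of every token pair
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8                       # 6 tokens, toy dimensions
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                                 # (6, 8)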
5) Training Pipeline & Reproducibility
- Data management: version datasets; record provenance and splits (train/val/test, e.g., 80/10/10).
- Preprocessing: standardize/normalize; tokenize text; resample time-series; handle class imbalance (reweighting or focal loss).
- Experiment tracking: log configs, metrics, and artifacts; fix seeds for comparability.
- Validation: stratified K-fold when data is scarce; otherwise hold-out + early stopping.
- Deployment: export to ONNX/TensorRT; add monitoring for data/label drift and latency.
Tip: Always benchmark a simple baseline (e.g., linear/logistic or a shallow CNN). If your deep model doesn’t beat it, fix data or training, not just architecture.
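To make the "fix seeds" point above concrete, a minimal TensorFlow seeding sketch; note that full determinism may also require deterministic ops (TF 2.8+) and can cost throughput.

import os
import random
import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)                 # Python RNG
np.random.seed(SEED)              # NumPy RNG
tf.random.set_seed(SEED)          # TensorFlow RNG (also covers Keras initializers)

# Optional: force deterministic GPU kernels where available.
tf.config.experimental.enable_op_determinism()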
6) Task-Specific Metrics & Losses
Classification
Metrics: Accuracy, F1, ROC-AUC, Precision/Recall
Cross-entropy: ℒ = −Σ yᵢ log pᵢ
Focal loss (imbalanced): ℒ = −(1 − p_t)^γ log p_t
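A minimal NumPy sketch of binary focal loss; γ and the toy probabilities are illustrative.

import numpy as np

def binary_focal_loss(y_true, p, gamma=2.0, eps=1e-7):
    """Focal loss for binary labels: down-weights easy, well-classified examples."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)      # probability assigned to the true class
    return np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])
print(binary_focal_loss(y, p))   # small when predictions are confident and correct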
Detection & Segmentation
Metrics: mAP@IoU, IoU, Dice
Dice = 2|P ∩ G| / (|P| + |G|)
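A minimal NumPy sketch of the Dice coefficient for binary masks; the small smoothing term is an addition for numerical safety, not part of the formula above.

import numpy as np

def dice(pred_mask, true_mask, smooth=1e-6):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    return (2.0 * inter + smooth) / (pred.sum() + true.sum() + smooth)

pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
print(round(float(dice(pred, true)), 3))   # 2*2 / (3 + 3) ≈ 0.667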
Regression
Metrics: MAE, RMSE, R²
MSE loss: ℒ = (1/N) Σ (ŷᵢ − yᵢ)²
Representation Learning
Losses: Triplet, Contrastive, InfoNCE
Triplet: ℒ = max(0, d(a,p) − d(a,n) + margin)
7) Efficiency: Transfer, Distillation, Quantization
- Transfer learning: start from a pre-trained encoder; fine-tune last layers first, then unfreeze more.
- Knowledge distillation: train a smaller student to mimic a large teacher (soft targets improve calibration); a minimal loss sketch follows this list.
- Quantization & pruning: INT8/FP16 + structured pruning reduce latency/size with minor accuracy loss.
- Mixed precision: FP16/BF16 speeds training and inference on modern GPUs/TPUs.
- Distributed training: data parallel (DDP), gradient accumulation, gradient checkpointing for memory.
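As referenced in the distillation bullet above, a sketch of a combined distillation loss in TensorFlow; the temperature T and mixing weight α are assumed hyperparameters, and the random logits stand in for real teacher/student outputs.

import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the teacher's softened outputs."""
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    soft_student = tf.nn.log_softmax(student_logits / T)
    # KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    kl = tf.reduce_sum(soft_teacher * (tf.math.log(soft_teacher + 1e-9) - soft_student),
                       axis=-1) * (T ** 2)
    return tf.reduce_mean(alpha * hard + (1.0 - alpha) * kl)

# Toy usage: a batch of 8 examples and 5 classes.
s = tf.random.normal((8, 5))
t = tf.random.normal((8, 5))
y = tf.random.uniform((8,), maxval=5, dtype=tf.int32)
print(float(distillation_loss(s, t, y)))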
8) Interpretability, Robustness & Safety
- Attribution: saliency maps, Grad-CAM (vision), Integrated Gradients/DeepSHAP (tabular/text) reveal feature influence.
- Robustness: adversarial training, strong augmentations, out-of-distribution (OOD) detection.
- Monitoring: track data drift, confidence calibration (ECE), and failure cases; add human-in-the-loop for high-risk tasks (an ECE sketch follows this list).
- Fairness: compare metrics across groups; mitigate with reweighting/representation debiasing.
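As noted in the monitoring bullet, a minimal NumPy sketch of expected calibration error (ECE) with equal-width confidence bins; the bin count and toy values are illustrative.

import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Average |accuracy - confidence| over confidence bins, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = (predictions[mask] == labels[mask]).mean()
        conf = confidences[mask].mean()
        ece += (mask.sum() / len(confidences)) * abs(acc - conf)
    return ece

conf = np.array([0.95, 0.60, 0.80, 0.70])
pred = np.array([1, 0, 1, 1])
true = np.array([1, 1, 1, 0])
print(round(expected_calibration_error(conf, pred, true), 3))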
9) Quick-Start: From Zero to a Solid Baseline (Keras)
This wraps best practices: normalization, augmentation, dropout, and a cosine LR schedule.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
# Data pipeline (example: 128x128 images, 2 classes)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(128, 128), batch_size=64)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", image_size=(128, 128), batch_size=64)
# Caching & prefetch (throughput)
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
# Augment & normalize
data_aug = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),
    layers.RandomZoom(0.1),
])
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Rescaling(1./255),
    data_aug,
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(2, activation="softmax"),
])
# Optimizer + cosine decay LR schedule
base_lr = 3e-3
epochs = 20
steps_per_epoch = int(train_ds.cardinality().numpy())
decay = optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=base_lr,
    first_decay_steps=5 * steps_per_epoch)  # restart period is measured in optimizer steps (~5 epochs here)
opt = optimizers.AdamW(learning_rate=decay, weight_decay=1e-4)
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)
Upgrade path: swap the encoder with a pre-trained backbone (e.g., EfficientNet) and fine-tune; add label smoothing (0.05) and early stopping on validation F1.
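A sketch of that upgrade path with an EfficientNetB0 backbone; the specific backbone, freezing schedule, and learning rates are illustrative choices, not the only option.

import tensorflow as tf
from tensorflow.keras import layers, optimizers

backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(128, 128, 3))
backbone.trainable = False                       # stage 1: train only the new head

# Keras EfficientNet includes its own rescaling, so feed raw [0, 255] pixels
# (no extra Rescaling(1./255) layer here).
inputs = tf.keras.Input(shape=(128, 128, 3))
x = backbone(inputs, training=False)             # keep BatchNorm statistics frozen
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation="softmax")(x)
tl_model = tf.keras.Model(inputs, outputs)

tl_model.compile(optimizer=optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
# Stage 2: once the head converges, unfreeze the top of the backbone and
# continue training with a much lower learning rate (e.g., 1e-5).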
10) Deep Learning FAQ
How can I pick the right model quickly?
Start from a proven pre-trained backbone for your modality (CNN/ViT for images, Transformer for text), freeze most layers, fine-tune the head, then unfreeze progressively.
What’s the easiest way to avoid overfitting?
Augment data, use early stopping, apply dropout and weight decay, and monitor validation curves. If possible, get more diverse data.
How do I handle imbalanced classes?
Use class weights or focal loss, oversample minority classes, and report per-class metrics (recall/precision) not just accuracy.
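A minimal sketch of deriving class weights from label counts and passing them to Keras; the counts below are made up for illustration, and x_train/model refer to your own data and model.

import numpy as np

# Example label array with a 9:1 imbalance (hypothetical counts).
y_train = np.array([0] * 900 + [1] * 100)
classes, counts = np.unique(y_train, return_counts=True)
total = counts.sum()
class_weight = {int(c): total / (len(classes) * n) for c, n in zip(classes, counts)}
print(class_weight)   # {0: ~0.56, 1: ~5.0}

# model.fit(x_train, y_train, class_weight=class_weight, ...)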