Certainly! Here is a concise week-by-week roadmap (52 weeks) for studying multimodal deep learning, focusing on the key papers to read:
Week 1-2: Foundational Papers
- Deep Visual-Semantic Alignments
- Show and Tell
- DeViSE
- VQA
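To make the Show and Tell idea concrete, here is a minimal sketch of its encoder-decoder setup: a CNN image embedding conditions an LSTM that generates the caption token by token. The module names, dimensions, and off-by-one details below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ShowAndTellSketch(nn.Module):
    """Toy encoder-decoder captioner in the spirit of Show and Tell.
    Assumes image features were already extracted by a CNN; the image
    embedding is fed as the first step of an LSTM caption decoder."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # CNN feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)  # caption tokens
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-token logits

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim), captions: (B, T) token ids (teacher forcing)
        img_emb = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, E), decoder step 0
        tok_emb = self.embed(captions)                    # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img_emb, tok_emb], dim=1))
        # Output at each step predicts the next caption token.
        return self.out(hidden[:, :-1])

model = ShowAndTellSketch()
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # (4, 12, 10000)
```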
Week 3-4: Multimodal Representation Learning
- ViLBERT
- LXMERT
- VisualBERT
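These three differ mainly in stream design: VisualBERT concatenates text tokens and detected image regions into a single transformer, while ViLBERT and LXMERT keep two streams connected by cross-attention. Below is a minimal single-stream (VisualBERT-style) sketch; the dimensions and segment handling are my own simplifications.

```python
import torch
import torch.nn as nn

class SingleStreamVL(nn.Module):
    """Toy single-stream vision-language encoder (VisualBERT-style sketch)."""
    def __init__(self, vocab_size=30522, region_dim=2048, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)  # detector features -> model dim
        self.seg_embed = nn.Embedding(2, d_model)           # 0 = text, 1 = image region
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, Lt), region_feats: (B, Lr, region_dim)
        text = self.tok_embed(token_ids) + self.seg_embed.weight[0]
        regions = self.region_proj(region_feats) + self.seg_embed.weight[1]
        x = torch.cat([text, regions], dim=1)  # one joint sequence over both modalities
        return self.encoder(x)                 # contextualized text + region features

model = SingleStreamVL()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(out.shape)  # (2, 52, 256)
```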
Week 5-6: Unified Multimodal Models
- CLIP
- ALIGN
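Both CLIP and ALIGN train two encoders with a symmetric contrastive (InfoNCE-style) objective over a batch of matched image-text pairs. Here is a minimal sketch of that loss, assuming the image and text embeddings have already been computed by whatever encoders you like:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over matched image-text pairs
    (CLIP/ALIGN-style sketch). Row i of each matrix is one pair."""
    image_emb = F.normalize(image_emb, dim=-1)         # cosine similarity via dot product
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # and each caption to its image
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```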
Week 7-8: Multimodal Transformers
- M3P
- FLAVA
Week 9-10: Advanced Applications
- DALL-E
- VATT
- Perceiver IO
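Perceiver IO's central trick is to cross-attend from a small, learned latent array to an arbitrarily long multimodal input, so compute in the deep stack does not grow with input length. A minimal sketch of that pattern using PyTorch's built-in attention (sizes are illustrative, and the real model stacks many such blocks plus an output query step):

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Toy Perceiver-style encoder: a fixed set of learned latents
    queries the (possibly very long) input sequence via cross-attention."""
    def __init__(self, input_dim=128, latent_dim=128, n_latents=32, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads,
                                                kdim=input_dim, vdim=input_dim,
                                                batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)

    def forward(self, inputs):
        # inputs: (B, N, input_dim), where N can be large (flattened pixels, audio frames, ...)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        latents, _ = self.cross_attn(q, inputs, inputs)         # compress input into latents
        latents, _ = self.self_attn(latents, latents, latents)  # process in latent space
        return latents

enc = LatentCrossAttention()
print(enc(torch.randn(2, 5000, 128)).shape)  # (2, 32, 128)
```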
Week 11-12: Generalist and Few-Shot Models (2022)
- Flamingo
- Gato
Week 13-14: Early Multimodal Learning
- Multimodal Learning with Deep Boltzmann Machines
- Multimodal Deep Learning (Ngiam et al., 2011)
Week 15-16: Multimodal Representation Learning (Cont.)
- ImageBERT
- VideoBERT
Week 17-18: Audio-Visual Models
- AV-BERT
- L3-Net
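The "Look, Listen and Learn" (L3-Net) setup is a good one to internalize: train a binary classifier to decide whether a video frame and an audio clip come from the same moment, and both encoders get a supervision signal for free. A sketch with stand-in encoders (shapes and layer sizes are mine, not the paper's):

```python
import torch
import torch.nn as nn

class AVCorrespondence(nn.Module):
    """Toy audio-visual correspondence model: do this frame and this
    audio clip belong together? (L3-Net-style sketch, shapes illustrative)."""
    def __init__(self, dim=128):
        super().__init__()
        # Stand-ins for the image and audio sub-networks.
        self.vision = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(16, dim))
        self.audio = nn.Sequential(nn.Conv1d(1, 16, 7, stride=4), nn.ReLU(),
                                   nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                   nn.Linear(16, dim))
        self.classifier = nn.Linear(2 * dim, 1)  # correspond / do not correspond

    def forward(self, frames, waveforms):
        fused = torch.cat([self.vision(frames), self.audio(waveforms)], dim=-1)
        return self.classifier(fused)  # logits; train with BCEWithLogitsLoss

model = AVCorrespondence()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 16000))
print(logits.shape)  # (4, 1)
```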
Week 19-20: Self-Supervised Multimodal Learning
- MIL-NCE
- MMV
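MIL-NCE relaxes the standard contrastive objective by letting several candidate captions (e.g. narration from nearby timestamps) count as positives for one video clip, which copes with misaligned speech. A minimal loss sketch, assuming clip and caption embeddings are already computed:

```python
import torch

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """MIL-NCE-style loss sketch.
    video_emb: (B, D)     one embedding per video clip
    text_emb:  (B, K, D)  K candidate positive captions per clip
    Captions belonging to other clips in the batch act as negatives."""
    B, K, D = text_emb.shape
    sims = torch.einsum('bd,ckd->bck', video_emb, text_emb) / temperature  # (B, B, K)
    sims = sims.reshape(B, B * K)
    pos_mask = torch.zeros(B, B, K, dtype=torch.bool, device=video_emb.device)
    pos_mask[torch.arange(B), torch.arange(B)] = True  # a clip's own captions are positives
    pos_mask = pos_mask.reshape(B, B * K)
    # -log( sum(exp(pos)) / sum(exp(all)) ), computed stably in log-space
    pos = torch.logsumexp(sims.masked_fill(~pos_mask, float('-inf')), dim=1)
    all_ = torch.logsumexp(sims, dim=1)
    return (all_ - pos).mean()

loss = mil_nce_loss(torch.randn(4, 256), torch.randn(4, 3, 256))
print(loss.item())
```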
Week 21-22: Graph-Based Multimodal Learning
- MM-Graph
- MultiGraph
Week 23-24: Multimodal Generative Models
- StyleGAN-T
- Multimodal Variational Autoencoders
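A recurring idea in multimodal VAEs (for example the MVAE of Wu & Goodman) is to combine per-modality Gaussian posteriors with a product of experts, so inference still works when a modality is missing. Here is a sketch of just that combination step; the per-modality encoder outputs are assumed given:

```python
import torch

def product_of_experts(mus, logvars, prior_var=1.0):
    """Combine per-modality Gaussian posteriors N(mu_m, var_m) with a
    product of experts, including a standard-normal prior expert.
    mus, logvars: lists of (B, D) tensors, one entry per observed modality.
    Returns the mean and log-variance of the joint posterior."""
    # Precision-weighted combination: precision = 1 / variance.
    precisions = [torch.exp(-lv) for lv in logvars] + [torch.full_like(mus[0], 1.0 / prior_var)]
    means = mus + [torch.zeros_like(mus[0])]  # the prior expert has mean 0
    total_precision = sum(precisions)
    joint_mu = sum(m * p for m, p in zip(means, precisions)) / total_precision
    joint_logvar = -torch.log(total_precision)
    return joint_mu, joint_logvar

mu_img, lv_img = torch.randn(8, 16), torch.randn(8, 16)
mu_txt, lv_txt = torch.randn(8, 16), torch.randn(8, 16)
mu, logvar = product_of_experts([mu_img, mu_txt], [lv_img, lv_txt])
print(mu.shape, logvar.shape)  # (8, 16) (8, 16)
```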
Week 25-26: Multimodal Machine Translation
- Multimodal Transformer Networks for End-to-End Sign Language Production
- OpenNMT-Multi
Week 27-28: Multimodal Sentiment Analysis
- CMU-MOSEI
- Multimodal Transformer for Unaligned Multimodal Language Sequences
Week 29-30: Multimodal Neural Machine Translation
- NMT with Visual Attention
- Multimodal Neural Machine Translation with Embedding Prediction
Week 31-32: Healthcare and Biomedical Applications
- Multimodal Data Fusion for Healthcare Applications
- MedFuse: Multimodal Representation Learning for Medical Data
Week 33-34: Advanced Techniques
- Cross-Modal Attention
- Co-Attention Networks
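Cross-modal attention is the workhorse behind most of the models above: queries come from one modality, keys and values from another. A minimal block built on nn.MultiheadAttention (residual/norm placement and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One block of cross-modal attention: text queries attend over image
    features (swap the arguments for the image-to-text direction)."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, image):
        # text: (B, Lt, D) as queries; image: (B, Li, D) as keys/values
        attended, weights = self.attn(query=text, key=image, value=image)
        return self.norm(text + attended), weights  # residual + norm, plus the attention map

block = CrossModalAttention()
out, attn = block(torch.randn(2, 10, 256), torch.randn(2, 36, 256))
print(out.shape, attn.shape)  # (2, 10, 256) (2, 10, 36)
```

Co-attention simply runs this block in both directions (text attends to image and image attends to text), usually stacked over several layers.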
Week 35-36: Multimodal Fusion Techniques
- Tensor Fusion Network
- Dynamic Multimodal Fusion with BERT
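The Tensor Fusion Network fuses modalities with an outer product of each modality vector augmented by a constant 1, so unimodal, bimodal, and (with three inputs) trimodal interaction terms all appear explicitly in the fused tensor. A sketch of the fusion step for two modalities; three modalities work the same way at cubic cost:

```python
import torch

def tensor_fusion(a, b):
    """Outer-product fusion in the spirit of the Tensor Fusion Network.
    a: (B, Da), b: (B, Db). Appending a 1 to each vector keeps the
    unimodal terms alongside the bimodal interaction terms."""
    ones = torch.ones(a.size(0), 1, device=a.device)
    a1 = torch.cat([a, ones], dim=1)            # (B, Da + 1)
    b1 = torch.cat([b, ones], dim=1)            # (B, Db + 1)
    fused = torch.einsum('bi,bj->bij', a1, b1)  # (B, Da + 1, Db + 1)
    return fused.flatten(start_dim=1)           # feed this to a small MLP head

z = tensor_fusion(torch.randn(4, 32), torch.randn(4, 16))
print(z.shape)  # (4, 561) = (4, 33 * 17)
```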
Week 37-38: Temporal Multimodal Models
- Temporal Multimodal Learning with Attention
- MM-TCN: Multi-Modal Temporal Convolution Network
Week 39-40: Visual Question Answering (Advanced)
- MAC Networks
- BAN: Bilinear Attention Networks
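BAN's core object is a bilinear attention map over all question-word / image-region pairs, computed with a low-rank bilinear form. Here is a simplified single-glimpse sketch (no gating or residual learning, dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BilinearAttentionMap(nn.Module):
    """Simplified low-rank bilinear attention map between question words
    and image regions, in the spirit of BAN (single glimpse)."""
    def __init__(self, q_dim=300, v_dim=2048, rank=256):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank)
        self.v_proj = nn.Linear(v_dim, rank)
        self.p = nn.Linear(rank, 1)  # projects the elementwise product to a scalar score

    def forward(self, q_feats, v_feats):
        # q_feats: (B, Lq, q_dim), v_feats: (B, Lv, v_dim)
        q = self.q_proj(q_feats).unsqueeze(2)            # (B, Lq, 1, r)
        v = self.v_proj(v_feats).unsqueeze(1)            # (B, 1, Lv, r)
        scores = self.p(torch.tanh(q * v)).squeeze(-1)   # (B, Lq, Lv)
        # Softmax over all word-region pairs, giving one joint attention map.
        B, Lq, Lv = scores.shape
        return torch.softmax(scores.view(B, -1), dim=-1).view(B, Lq, Lv)

attn = BilinearAttentionMap()
A = attn(torch.randn(2, 14, 300), torch.randn(2, 36, 2048))
print(A.shape)  # (2, 14, 36)
```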
Week 41-42: Multimodal Dialogue Systems
- M2M: Towards Multimodal to Multimodal Dialogue Systems
- Multimodal Transformer for End-to-End Multimodal Dialog
Week 43-44: Multimodal Embeddings
- Unicoder-VL
- VisualBERT: A Simple and Performant Baseline for Vision and Language (Revisit)
Week 45-46: Multimodal Learning in Robotics
- Perception as Generative Reasoning
- Multimodal Sensor Fusion for Object Recognition in Robotic Systems
Week 47-48: Multimodal Video Understanding
- HERO: Hierarchical Encoder for Video+Language Tasks
- ActBERT: Learning Global-Local Video-Text Representations
Week 49-50: Evaluation and Benchmarking
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
(Both are language-only benchmarks, included here for how they approach benchmark design; read them alongside the multimodal benchmarks already covered, such as VQA.)
Week 51-52: Multimodal Learning Trends
- UniT: Unified Transformer for Multimodal Multitask Learning
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Revisit)
- FLAVA: A Foundational Language and Vision Alignment Model (Revisit)
- Flamingo: A Visual Language Model for Few-Shot Learning (Revisit)
- Gato: A Generalist Agent (Revisit)
- CLIP (Revisit)
- ALIGN (Revisit)
This roadmap walks you through the key milestones in multimodal deep learning research over 52 weeks, at a pace of roughly one to three papers every two weeks, with the final weeks reserved for revisiting the most influential models.