Certainly! Here is a concise week-by-week roadmap (52 weeks) for studying multimodal deep learning, focusing on the key papers to read:
Week 1-2: Foundational Papers
- Deep Visual-Semantic Alignments
- Show and Tell
- DeViSE
- VQA
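To make the Show and Tell idea concrete, here is a minimal sketch of its encoder-decoder setup: a CNN image embedding conditions an LSTM that generates the caption token by token. The module names, dimensions, and off-by-one details below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ShowAndTellSketch(nn.Module):
    """Toy encoder-decoder captioner in the spirit of Show and Tell.
    Assumes image features were already extracted by a CNN; the image
    embedding is fed as the first step of an LSTM caption decoder."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # CNN feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)  # caption tokens
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-token logits

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim), captions: (B, T) token ids (teacher forcing)
        img_emb = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, E), decoder step 0
        tok_emb = self.embed(captions)                    # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img_emb, tok_emb], dim=1))
        # Output at each step predicts the next caption token.
        return self.out(hidden[:, :-1])

model = ShowAndTellSketch()
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # (4, 12, 10000)
```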
Week 3-4: Multimodal Representation Learning
- ViLBERT
- LXMERT
- VisualBERT
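These three differ mainly in stream design: VisualBERT concatenates text tokens and detected image regions into a single transformer, while ViLBERT and LXMERT keep two streams connected by cross-attention. Below is a minimal single-stream (VisualBERT-style) sketch; the dimensions and segment handling are my own simplifications.

```python
import torch
import torch.nn as nn

class SingleStreamVL(nn.Module):
    """Toy single-stream vision-language encoder (VisualBERT-style sketch)."""
    def __init__(self, vocab_size=30522, region_dim=2048, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)  # detector features -> model dim
        self.seg_embed = nn.Embedding(2, d_model)           # 0 = text, 1 = image region
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, Lt), region_feats: (B, Lr, region_dim)
        text = self.tok_embed(token_ids) + self.seg_embed.weight[0]
        regions = self.region_proj(region_feats) + self.seg_embed.weight[1]
        x = torch.cat([text, regions], dim=1)  # one joint sequence over both modalities
        return self.encoder(x)                 # contextualized text + region features

model = SingleStreamVL()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(out.shape)  # (2, 52, 256)
```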
Week 5-6: Unified Multimodal Models
- CLIP
- ALIGN
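Both CLIP and ALIGN train two encoders with a symmetric contrastive (InfoNCE-style) objective over a batch of matched image-text pairs. Here is a minimal sketch of that loss, assuming the image and text embeddings have already been computed by whatever encoders you like:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over matched image-text pairs
    (CLIP/ALIGN-style sketch). Row i of each matrix is one pair."""
    image_emb = F.normalize(image_emb, dim=-1)         # cosine similarity via dot product
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # and each caption to its image
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```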
Week 7-8: Multimodal Transformers
- M3P
- FLAVA
Week 9-10: Advanced Applications
- DALL-E
- VATT
- Perceiver IO
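Perceiver IO's central trick is to cross-attend from a small, learned latent array to an arbitrarily long multimodal input, so compute in the deep stack does not grow with input length. A minimal sketch of that pattern using PyTorch's built-in attention (sizes are illustrative, and the real model stacks many such blocks plus an output query step):

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Toy Perceiver-style encoder: a fixed set of learned latents
    queries the (possibly very long) input sequence via cross-attention."""
    def __init__(self, input_dim=128, latent_dim=128, n_latents=32, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads,
                                                kdim=input_dim, vdim=input_dim,
                                                batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)

    def forward(self, inputs):
        # inputs: (B, N, input_dim), where N can be large (flattened pixels, audio frames, ...)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        latents, _ = self.cross_attn(q, inputs, inputs)         # compress input into latents
        latents, _ = self.self_attn(latents, latents, latents)  # process in latent space
        return latents

enc = LatentCrossAttention()
print(enc(torch.randn(2, 5000, 128)).shape)  # (2, 32, 128)
```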
Week 11-12: Generalist and Few-Shot Models (2022)
- Flamingo
- Gato
Week 13-14: Early Multimodal Learning
- Multimodal Learning with Deep Boltzmann Machines
- Multimodal Deep Learning (Ngiam et al., 2011)
Week 15-16: Multimodal Representation Learning (Cont.)
- ImageBERT
- VideoBERT
Week 17-18: Audio-Visual Models
- AV-BERT
- L3-Net
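The "Look, Listen and Learn" (L3-Net) setup is a good one to internalize: train a binary classifier to decide whether a video frame and an audio clip come from the same moment, and both encoders get a supervision signal for free. A sketch with stand-in encoders (shapes and layer sizes are mine, not the paper's):

```python
import torch
import torch.nn as nn

class AVCorrespondence(nn.Module):
    """Toy audio-visual correspondence model: do this frame and this
    audio clip belong together? (L3-Net-style sketch, shapes illustrative)."""
    def __init__(self, dim=128):
        super().__init__()
        # Stand-ins for the image and audio sub-networks.
        self.vision = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(16, dim))
        self.audio = nn.Sequential(nn.Conv1d(1, 16, 7, stride=4), nn.ReLU(),
                                   nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                   nn.Linear(16, dim))
        self.classifier = nn.Linear(2 * dim, 1)  # correspond / do not correspond

    def forward(self, frames, waveforms):
        fused = torch.cat([self.vision(frames), self.audio(waveforms)], dim=-1)
        return self.classifier(fused)  # logits; train with BCEWithLogitsLoss

model = AVCorrespondence()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 16000))
print(logits.shape)  # (4, 1)
```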
Week 19-20: Self-Supervised Multimodal Learning
- MIL-NCE
- MMV
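MIL-NCE relaxes the standard contrastive objective by letting several candidate captions (e.g. narration from nearby timestamps) count as positives for one video clip, which copes with misaligned speech. A minimal loss sketch, assuming clip and caption embeddings are already computed:

```python
import torch

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """MIL-NCE-style loss sketch.
    video_emb: (B, D)     one embedding per video clip
    text_emb:  (B, K, D)  K candidate positive captions per clip
    Captions belonging to other clips in the batch act as negatives."""
    B, K, D = text_emb.shape
    sims = torch.einsum('bd,ckd->bck', video_emb, text_emb) / temperature  # (B, B, K)
    sims = sims.reshape(B, B * K)
    pos_mask = torch.zeros(B, B, K, dtype=torch.bool, device=video_emb.device)
    pos_mask[torch.arange(B), torch.arange(B)] = True  # a clip's own captions are positives
    pos_mask = pos_mask.reshape(B, B * K)
    # -log( sum(exp(pos)) / sum(exp(all)) ), computed stably in log-space
    pos = torch.logsumexp(sims.masked_fill(~pos_mask, float('-inf')), dim=1)
    all_ = torch.logsumexp(sims, dim=1)
    return (all_ - pos).mean()

loss = mil_nce_loss(torch.randn(4, 256), torch.randn(4, 3, 256))
print(loss.item())
```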
Week 21-22: Graph-Based Multimodal Learning
- MM-Graph
- MultiGraph
Week 23-24: Multimodal Generative Models
- StyleGAN-T
- Multimodal Variational Autoencoders
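A recurring idea in multimodal VAEs (for example the MVAE of Wu & Goodman) is to combine per-modality Gaussian posteriors with a product of experts, so inference still works when a modality is missing. Here is a sketch of just that combination step; the per-modality encoder outputs are assumed given:

```python
import torch

def product_of_experts(mus, logvars, prior_var=1.0):
    """Combine per-modality Gaussian posteriors N(mu_m, var_m) with a
    product of experts, including a standard-normal prior expert.
    mus, logvars: lists of (B, D) tensors, one entry per observed modality.
    Returns the mean and log-variance of the joint posterior."""
    # Precision-weighted combination: precision = 1 / variance.
    precisions = [torch.exp(-lv) for lv in logvars] + [torch.full_like(mus[0], 1.0 / prior_var)]
    means = mus + [torch.zeros_like(mus[0])]  # the prior expert has mean 0
    total_precision = sum(precisions)
    joint_mu = sum(m * p for m, p in zip(means, precisions)) / total_precision
    joint_logvar = -torch.log(total_precision)
    return joint_mu, joint_logvar

mu_img, lv_img = torch.randn(8, 16), torch.randn(8, 16)
mu_txt, lv_txt = torch.randn(8, 16), torch.randn(8, 16)
mu, logvar = product_of_experts([mu_img, mu_txt], [lv_img, lv_txt])
print(mu.shape, logvar.shape)  # (8, 16) (8, 16)
```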
Week 25-26: Multimodal Machine Translation
- Multimodal Transformer Networks for End-to-End Sign Language Production
- OpenNMT-Multi
Week 27-28: Multimodal Sentiment Analysis
- CMU-MOSEI
- Multimodal Transformer for Unaligned Multimodal Language Sequences
Week 29-30: Multimodal Neural Machine Translation
- NMT with Visual Attention
- Multimodal Neural Machine Translation with Embedding Prediction
Week 31-32: Healthcare and Biomedical Applications
- Multimodal Data Fusion for Healthcare Applications
- MedFuse: Multimodal Representation Learning for Medical Data
Week 33-34: Advanced Techniques
- Cross-Modal Attention
- Co-Attention Networks
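Cross-modal attention is the workhorse behind most of the models above: queries come from one modality, keys and values from another. A minimal block built on nn.MultiheadAttention (residual/norm placement and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One block of cross-modal attention: text queries attend over image
    features (swap the arguments for the image-to-text direction)."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, image):
        # text: (B, Lt, D) as queries; image: (B, Li, D) as keys/values
        attended, weights = self.attn(query=text, key=image, value=image)
        return self.norm(text + attended), weights  # residual + norm, plus the attention map

block = CrossModalAttention()
out, attn = block(torch.randn(2, 10, 256), torch.randn(2, 36, 256))
print(out.shape, attn.shape)  # (2, 10, 256) (2, 10, 36)
```

Co-attention simply runs this block in both directions (text attends to image and image attends to text), usually stacked over several layers.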
Week 35-36: Multimodal Fusion Techniques
- Tensor Fusion Network
- Dynamic Multimodal Fusion with BERT
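The Tensor Fusion Network fuses modalities with an outer product of each modality vector augmented by a constant 1, so unimodal, bimodal, and (with three inputs) trimodal interaction terms all appear explicitly in the fused tensor. A sketch of the fusion step for two modalities; three modalities work the same way at cubic cost:

```python
import torch

def tensor_fusion(a, b):
    """Outer-product fusion in the spirit of the Tensor Fusion Network.
    a: (B, Da), b: (B, Db). Appending a 1 to each vector keeps the
    unimodal terms alongside the bimodal interaction terms."""
    ones = torch.ones(a.size(0), 1, device=a.device)
    a1 = torch.cat([a, ones], dim=1)            # (B, Da + 1)
    b1 = torch.cat([b, ones], dim=1)            # (B, Db + 1)
    fused = torch.einsum('bi,bj->bij', a1, b1)  # (B, Da + 1, Db + 1)
    return fused.flatten(start_dim=1)           # feed this to a small MLP head

z = tensor_fusion(torch.randn(4, 32), torch.randn(4, 16))
print(z.shape)  # (4, 561) = (4, 33 * 17)
```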
Week 37-38: Temporal Multimodal Models
- Temporal Multimodal Learning with Attention
- MM-TCN: Multi-Modal Temporal Convolution Network
Week 39-40: Visual Question Answering (Advanced)
- MAC Networks
- BAN: Bilinear Attention Networks
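BAN's core object is a bilinear attention map over all question-word / image-region pairs, computed with a low-rank bilinear form. Here is a simplified single-glimpse sketch (no gating or residual learning, dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BilinearAttentionMap(nn.Module):
    """Simplified low-rank bilinear attention map between question words
    and image regions, in the spirit of BAN (single glimpse)."""
    def __init__(self, q_dim=300, v_dim=2048, rank=256):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank)
        self.v_proj = nn.Linear(v_dim, rank)
        self.p = nn.Linear(rank, 1)  # projects the elementwise product to a scalar score

    def forward(self, q_feats, v_feats):
        # q_feats: (B, Lq, q_dim), v_feats: (B, Lv, v_dim)
        q = self.q_proj(q_feats).unsqueeze(2)            # (B, Lq, 1, r)
        v = self.v_proj(v_feats).unsqueeze(1)            # (B, 1, Lv, r)
        scores = self.p(torch.tanh(q * v)).squeeze(-1)   # (B, Lq, Lv)
        # Softmax over all word-region pairs, giving one joint attention map.
        B, Lq, Lv = scores.shape
        return torch.softmax(scores.view(B, -1), dim=-1).view(B, Lq, Lv)

attn = BilinearAttentionMap()
A = attn(torch.randn(2, 14, 300), torch.randn(2, 36, 2048))
print(A.shape)  # (2, 14, 36)
```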
Week 41-42: Multimodal Dialogue Systems
- M2M: Towards Multimodal to Multimodal Dialogue Systems
- Multimodal Transformer for End-to-End Multimodal Dialog
Week 43-44: Multimodal Embeddings
- Unicoder-VL
- VisualBERT: A Simple and Performant Baseline for Vision and Language (Revisit)
Week 45-46: Multimodal Learning in Robotics
- Perception as Generative Reasoning
- Multimodal Sensor Fusion for Object Recognition in Robotic Systems
Week 47-48: Multimodal Video Understanding
- HERO: Hierarchical Encoder for Video+Language Tasks
- ActBERT: Learning Global-Local Video-Text Representations
Week 49-50: Evaluation and Benchmarking
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
(Both are language-only benchmarks, included here for how they approach benchmark design; read them alongside the multimodal benchmarks already covered, such as VQA.)
Week 51-52: Multimodal Learning Trends
- UniT: Unified Transformer for Multimodal Multitask Learning
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Revisit)
- FLAVA: A Foundational Language and Vision Alignment Model (Revisit)
- Flamingo: A Visual Language Model for Few-Shot Learning (Revisit)
- Gato: A Generalist Agent (Revisit)
- CLIP (Revisit)
- ALIGN (Revisit)
This roadmap walks you through the key milestones in multimodal deep learning research over 52 weeks, at a pace of roughly one to three papers every two weeks, with the final weeks reserved for revisiting the most influential models.