Certainly! Here is a concise roadmap for studying multimodal deep learning, organized into two-week blocks and focused on the key papers you should read at each stage:

Week 1-2: Foundational Papers

  1. Deep Visual-Semantic Alignments
  2. Show and Tell
  3. DeViSE
  4. VQA

Week 3-4: Multimodal Representation Learning

  1. ViLBERT
  2. LXMERT
  3. VisualBERT

Week 5-6: Unified Multimodal Models

  1. CLIP
  2. ALIGN
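
CLIP and ALIGN both train dual image and text encoders with a symmetric contrastive (InfoNCE) objective, so it is worth having that loss in mind before reading them. Below is a minimal NumPy sketch, assuming pre-computed, L2-normalized image and text embeddings; the function name and the temperature value are illustrative, not taken from either paper's released code.

```python
import numpy as np

def clip_style_contrastive_loss(image_emb: np.ndarray,
                                text_emb: np.ndarray,
                                temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    # Cosine-similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.T / temperature          # (batch, batch)
    labels = np.arange(logits.shape[0])                    # matched pairs lie on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        # Row-wise softmax cross-entropy against the diagonal labels.
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The actual models learn the temperature and rely on very large batches; the sketch only shows the shape of the objective.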

Week 7-8: Multimodal Transformers

  1. M3P
  2. FLAVA

Week 9-10: Advanced Applications

  1. DALL-E
  2. VATT
  3. Perceiver IO

Week 11-12: Generalist Multimodal Models

  1. Flamingo
  2. Gato

Week 13-14: Early Multimodal Learning

  1. Multimodal Learning with Deep Boltzmann Machines
  2. Multimodal Deep Learning

Week 15-16: Multimodal Representation Learning (Cont.)

  1. ImageBERT
  2. VideoBERT

Week 17-18: Audio-Visual Models

  1. AV-BERT
  2. L3-Net

Week 19-20: Self-Supervised Multimodal Learning

  1. MIL-NCE
  2. MMV
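
MIL-NCE (and MMV, which builds on it) adapts the contrastive objective to noisy, weakly aligned video and narration by treating several nearby narration sentences as a bag of candidate positives. Here is a minimal sketch for a single clip; the shapes and names are illustrative assumptions, and the papers themselves operate on batches with learned encoders.

```python
import numpy as np

def mil_nce_loss(video_emb: np.ndarray,       # (dim,)       one clip embedding
                 pos_text_embs: np.ndarray,   # (n_pos, dim) candidate positive narrations
                 neg_text_embs: np.ndarray    # (n_neg, dim) negatives from other clips
                 ) -> float:
    pos_scores = pos_text_embs @ video_emb    # similarity to each candidate positive
    neg_scores = neg_text_embs @ video_emb
    all_scores = np.concatenate([pos_scores, neg_scores])
    m = all_scores.max()                      # numerical stability
    # Unlike standard InfoNCE, the numerator pools over the whole bag of positives.
    numerator = np.exp(pos_scores - m).sum()
    denominator = np.exp(all_scores - m).sum()
    return float(-np.log(numerator / denominator))
```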

Week 21-22: Graph-Based Multimodal Learning

  1. MM-Graph
  2. MultiGraph

Week 23-24: Multimodal Generative Models

  1. StyleGAN-T
  2. Multimodal Variational Autoencoders
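
A recurring design question in multimodal generative models is how to form a joint latent posterior from per-modality encoders; product-of-experts multimodal VAEs do this in closed form by multiplying Gaussian experts together with the prior. A minimal sketch of that combination follows; the function and variable names are illustrative, not any specific paper's API.

```python
import numpy as np

def product_of_gaussian_experts(mus, logvars):
    """Combine per-modality Gaussian posteriors N(mu_i, sigma_i^2) via a product of experts.

    mus, logvars: lists of (latent_dim,) arrays, one entry per available modality.
    """
    # Include the standard-normal prior N(0, I) as an extra expert.
    mus = [np.zeros_like(mus[0])] + list(mus)
    logvars = [np.zeros_like(logvars[0])] + list(logvars)

    precisions = [np.exp(-lv) for lv in logvars]            # 1 / sigma^2 per expert
    total_precision = np.sum(precisions, axis=0)
    joint_var = 1.0 / total_precision                       # product of Gaussians is Gaussian
    joint_mu = joint_var * np.sum([p * m for p, m in zip(precisions, mus)], axis=0)
    return joint_mu, np.log(joint_var)
```

Because the product is defined for any subset of experts, this style of posterior handles missing modalities at test time.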

Week 25-26: Multimodal Machine Translation

  1. Multimodal Transformer Networks for End-to-End Sign Language Production
  2. OpenNMT-Multi

Week 27-28: Multimodal Sentiment Analysis

  1. CMU-MOSEI
  2. Multimodal Transformer for Unaligned Multimodal Language Sequences

Week 29-30: Multimodal Neural Machine Translation

  1. NMT with Visual Attention
  2. Multimodal Neural Machine Translation with Embedding Prediction

Week 31-32: Healthcare and Biomedical Applications

  1. Multimodal Data Fusion for Healthcare Applications
  2. MedFuse: Multimodal Representation Learning for Medical Data

Week 33-34: Advanced Techniques

  1. Cross-Modal Attention
  2. Co-Attention Networks
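
Both of these build on the same primitive: attention in which the queries come from one modality and the keys and values come from the other. Here is a minimal single-head sketch with illustrative shapes (real models add learned projections, multiple heads, and masking).

```python
import numpy as np

def cross_modal_attention(queries: np.ndarray,  # (n_text_tokens, dim)  e.g. text features
                          keys: np.ndarray,     # (n_regions, dim)      e.g. image-region features
                          values: np.ndarray    # (n_regions, dim)
                          ) -> np.ndarray:
    dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(dim)                 # (n_text_tokens, n_regions)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over image regions
    return weights @ values                                  # each text token as a mixture of regions
```

Co-attention runs this in both directions, so text attends to image regions and image regions attend to text.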

Week 35-36: Multimodal Fusion Techniques

  1. Tensor Fusion Network
  2. Dynamic Multimodal Fusion with BERT
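
The Tensor Fusion Network (item 1 above) fuses modalities with an outer product of the unimodal embeddings, each first augmented with a constant 1 so that the result contains all unimodal, bimodal, and trimodal interaction terms. A minimal sketch with illustrative dimensions:

```python
import numpy as np

def tensor_fusion(z_text: np.ndarray, z_audio: np.ndarray, z_video: np.ndarray) -> np.ndarray:
    """Three-modality outer-product fusion; inputs are 1-D unimodal embeddings."""
    zt = np.concatenate([z_text,  [1.0]])         # appending 1 preserves the unimodal terms
    za = np.concatenate([z_audio, [1.0]])
    zv = np.concatenate([z_video, [1.0]])
    fused = np.einsum('i,j,k->ijk', zt, za, zv)   # all cross-modal interaction terms
    return fused.reshape(-1)                      # flattened fusion vector for a downstream MLP
```

Note that the fused dimensionality grows multiplicatively with the modality sizes, which is the cost that later low-rank fusion methods address.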

Week 37-38: Temporal Multimodal Models

  1. Temporal Multimodal Learning with Attention
  2. MM-TCN: Multi-Modal Temporal Convolution Network

Week 39-40: Visual Question Answering (Advanced)

  1. MAC Networks
  2. BAN: Bilinear Attention Networks

Week 41-42: Multimodal Dialogue Systems

  1. M2M: Towards Multimodal to Multimodal Dialogue Systems
  2. Multimodal Transformer for End-to-End Multimodal Dialog

Week 43-44: Multimodal Embeddings

  1. Unicoder-VL
  2. VisualBERT: A Simple and Performant Baseline for Vision and Language (Revisit)

Week 45-46: Multimodal Learning in Robotics

  1. Perception as Generative Reasoning
  2. Multimodal Sensor Fusion for Object Recognition in Robotic Systems

Week 47-48: Multimodal Video Understanding

  1. HERO: Hierarchical Encoder for Video+Language Tasks
  2. ActBERT: Learning Global-Local Video-Text Representations

Week 49-50: Evaluation and Benchmarking

  1. GLUE: A Multi-Task Benchmark and Analysis Platform
  2. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Week 51-52: Unified Models and Final Review

  1. UniT: Unified Transformer for Multimodal Multitask Learning
  2. LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Revisit)
  3. FLAVA: A Foundational Language and Vision Alignment Model (Revisit)
  4. Flamingo: A Visual Language Model for Few-Shot Learning (Revisit)
  5. Gato: A Generalist Agent (Revisit)
  6. CLIP (Revisit)
  7. ALIGN (Revisit)

This roadmap will guide you through the key milestones in multimodal deep learning research, covering two or three papers in each two-week block.