Late Fusion (2019)

Built on the Late Fusion architecture specified in our CVPR 2017 paper. The model was trained on VisDial v1.0 train+val. It uses ResNeXt detector features for images and Pythia for captioning. Code available here.

Hierarchical Recurrent Encoder (2017)

Built on the Hierarchical Recurrent Encoder architecture specified in our CVPR 2017 paper. The model was trained on VisDial v0.9 train+val. It uses VGG-16 to extract image features and NeuralTalk2 for captioning. Code available here.

Demo was live from 2017 to 2021.

150,000+ hits
75,000+ images uploaded
300,000+ questions asked