Search papers, labs, and topics across Lattice.
A Bidirectional Multimodal Fusion Network (BMFNet) for image captioning is proposed, which provides deep interaction and fusion of multiple features throughout the encoding and decoding process, and significantly enhances the model’s cross-modal reasoning capability.