Class-dependent and cross-modal memory network considering sentimental features for video-based captioning
The video-based commonsense captioning task aims to add multiple commonsense descriptions to video captions to better understand video content. This paper considers the importance of cross-modal mapping. We propose a combined framework called Class-dependent and Cross-modal Memory Network considering SENtimental features (CC