Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information is related to videos, this takes us into Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be mainly made by retrieving the most relevant descriptions from a corpus or generating a new one given a context video. These two ways represent essential tasks for the Computer Vision and Natural Language Processing communities, called the text retrieval from video task and the video captioning/description task. These two tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity regarding the visual content and the structure of associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths for the problem requirements.

The vast amount of videos on the Internet makes efficient and accurate text–video retrieval increasingly important. Current methods leverage a high-dimensional space to align video and text for these tasks. However, a single high-dimensional space cannot fully exploit the different levels of information in videos and text. In this paper, we put forward a method called level-wise aligned dual networks (LADN) for text–video retrieval. LADN uses four common latent spaces to improve the performance of text–video retrieval and utilizes a semantic concept space to increase the interpretability of the model. Specifically, LADN first extracts different levels of information, including global, local, temporal, and spatial–temporal information, from videos and text. Then, they are mapped into four different latent spaces and one semantic space. Finally, LADN aligns the different levels of information in the various spaces. Extensive experiments conducted on three widely used datasets, including MSR-VTT, VATEX, and TRECVID AVS 2016-2018, demonstrate that our proposed approach is superior to several state-of-the-art text–video retrieval approaches.
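The multi-space alignment idea behind LADN can be illustrated with a minimal sketch: each modality is projected into several latent spaces, and the retrieval score aggregates per-space cosine similarities. Everything below is an assumption for illustration (the space names, projection sizes, random linear projections, and uniform averaging of similarities are not LADN's actual design, which learns these projections and combines a semantic concept space as well).

```python
# Illustrative sketch only: space names, dimensions, and uniform averaging
# are assumptions for exposition, not the published LADN architecture.
import numpy as np

rng = np.random.default_rng(0)
SPACES = ["global", "local", "temporal", "spatio_temporal"]
DIM_IN, DIM_LATENT = 512, 128

# One (here randomly initialized, normally learned) linear projection
# per latent space, per modality.
video_proj = {s: rng.standard_normal((DIM_IN, DIM_LATENT)) for s in SPACES}
text_proj = {s: rng.standard_normal((DIM_IN, DIM_LATENT)) for s in SPACES}

def embed(feat, projections):
    """Map one feature vector into every latent space, L2-normalized."""
    out = {}
    for space, W in projections.items():
        z = feat @ W
        out[space] = z / np.linalg.norm(z)
    return out

def similarity(video_feat, text_feat):
    """Aggregate cosine similarity over all latent spaces (uniform mean)."""
    v = embed(video_feat, video_proj)
    t = embed(text_feat, text_proj)
    return float(np.mean([v[s] @ t[s] for s in SPACES]))

# Rank hypothetical candidate captions for one video by aggregated score.
video = rng.standard_normal(DIM_IN)
captions = rng.standard_normal((3, DIM_IN))
scores = [similarity(video, c) for c in captions]
best = int(np.argmax(scores))
```

In a trained model the projections would be optimized with a ranking loss so that matching video–text pairs score higher than mismatched ones in every space; the sketch only shows how per-space similarities combine into one retrieval score.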