VIDEO CAPTIONING WITH SPATIAL-TEMPORAL ATTENTION MECHANISM (STAT)

T. SARITHA

Authors

T. SARITHA

Abstract

Video captioning refers to automatically generating natural language sentences that summarize video content. Inspired by the human visual attention mechanism, temporal attention has been widely used in video description to selectively focus on important frames. However, most existing methods based on temporal attention suffer from recognition errors and missing details, because temporal attention alone cannot capture the significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures in a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
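The idea of attending first over regions within each frame and then over frames can be illustrated with a minimal sketch. This is not the paper's implementation: the dot-product scoring, the function name `spatial_temporal_attention`, and the toy feature values are all assumptions chosen for illustration, and the abstract does not specify how STAT computes its attention scores.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    # Convex combination of equal-length vectors.
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def spatial_temporal_attention(video, hidden):
    """video: list of frames, each a list of region feature vectors.
    hidden: decoder state vector (same dimension as the region features).
    Scoring by dot product is an assumption, not taken from the paper."""
    frame_vectors, spatial_weights = [], []
    for regions in video:
        # Spatial step: weight the regions of one frame against the decoder state.
        alpha = softmax([dot(r, hidden) for r in regions])
        spatial_weights.append(alpha)
        frame_vectors.append(weighted_sum(alpha, regions))
    # Temporal step: weight the attended frame vectors against the decoder state.
    beta = softmax([dot(f, hidden) for f in frame_vectors])
    context = weighted_sum(beta, frame_vectors)
    return context, spatial_weights, beta

# Toy input: 2 frames, 2 regions per frame, 2-dimensional features (hypothetical values).
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.5]],
]
hidden = [1.0, 0.0]
context, alpha, beta = spatial_temporal_attention(video, hidden)
```

In this toy case the decoder state points along the first feature dimension, so regions and frames with larger values in that dimension receive larger weights. In a captioning model the resulting `context` vector would be fed to the decoder when predicting the next word.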

Published

2022-12-31

How to Cite

SARITHA, T. (2022). VIDEO CAPTIONING WITH SPATIAL-TEMPORAL ATTENTION MECHANISM (STAT). The Journal of Contemporary Issues in Business and Government, 28(4), 2318–2335. Retrieved from https://cibgp.com/au/index.php/1323-6903/article/view/2773

Issue

Vol. 28 No. 4 (2022): The Journal of Contemporary Issues in Business and Government

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

You are free to:

Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:

You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.

No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
