INTERNATIONAL FOCUS ON IMAGE DESCRIPTION

Authors

  • Ms T Usha Durga

Abstract

In recent years, the task of automatically generating image descriptions has attracted considerable attention in the field of artificial intelligence. Benefiting from the development of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), many approaches based on the CNN-RNN framework have been proposed for this task and have achieved remarkable progress. However, two problems remain, because most existing methods use only an image-level representation. The first is object missing: some important objects may be omitted when generating the image description. The second is misprediction: an object may be recognized as belonging to the wrong category. In this paper, to address these two problems, we propose a new method called global-local attention (GLA) for generating image descriptions. The proposed GLA model uses an attention mechanism to integrate object-level features with the image-level feature. In this manner, our model can selectively attend to objects and context information concurrently. As a result, our GLA method generates more relevant image description sentences, and it achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular evaluation metrics: CIDEr, METEOR, ROUGE-L and BLEU-1,2,3,4.
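As a rough illustration of the idea described above, the following is a minimal, hypothetical PyTorch sketch of fusing object-level (local) detector features with an image-level (global) CNN feature through additive attention. The class name GlobalLocalAttention, the feature dimensions, and the additive-attention form are assumptions made for illustration; the paper's actual formulation may differ.

```python
import torch
import torch.nn as nn

class GlobalLocalAttention(nn.Module):
    # Hypothetical sketch: fuse one image-level (global) feature with
    # N object-level (local) features via additive attention.
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)   # projects each region feature
        self.query = nn.Linear(feat_dim, hidden_dim)  # conditions scores on the global feature
        self.score = nn.Linear(hidden_dim, 1)         # scalar attention score per region

    def forward(self, global_feat: torch.Tensor, object_feats: torch.Tensor):
        # global_feat:  (batch, feat_dim)      image-level CNN feature
        # object_feats: (batch, n, feat_dim)   object-level detector features
        # Treat the global feature as one extra region, so context
        # competes with the detected objects for attention weight.
        feats = torch.cat([global_feat.unsqueeze(1), object_feats], dim=1)
        scores = self.score(torch.tanh(
            self.proj(feats) + self.query(global_feat).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)        # (batch, n+1, 1)
        fused = (weights * feats).sum(dim=1)          # (batch, feat_dim)
        return fused, weights.squeeze(-1)

# Example usage with random stand-in features
gla = GlobalLocalAttention(feat_dim=2048, hidden_dim=512)
g = torch.randn(2, 2048)        # e.g. a pooled ResNet feature
o = torch.randn(2, 5, 2048)     # e.g. Faster R-CNN region features
fused, attn = gla(g, o)
print(fused.shape, attn.shape)  # torch.Size([2, 2048]) torch.Size([2, 6])
```

In a CNN-RNN captioning pipeline, a fused feature like this would typically be fed to the RNN decoder at each decoding step, letting the model attend to individual objects and the whole-image context at the same time.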


Published

2023-09-30

How to Cite

Durga, M. T. U. (2023). INTERNATIONAL FOCUS ON IMAGE DESCRIPTION. The Journal of Contemporary Issues in Business and Government, 29(3), 534–551. Retrieved from https://cibgp.com/au/index.php/1323-6903/article/view/2612