Title: Multimodal Machine Learning: Efficient Visual-Language Deep Learning Models for Image Captioning and Cross-Modal Retrieval
Time: February 22, 2022, 8:30 a.m.
Venue: 逸夫楼 (Yifu Building), Room 109
Speaker: Osolo Ian Raymond (苏罗)
Topic of Report 1: Making Images Matter More: A Fourier-Augmented Image Captioning Transformer
Summary:
Many vision-language models that generate natural language, such as image captioning models, use image features merely to ground the captions; most of the model's strong performance can be attributed to the language model, which does the heavy lifting. In this report, we propose a method to make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. Furthermore, we analyze and provide insight into the use of fast Fourier transform features as alternatives or supplements to regional features for self-attention in image captioning applications.
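To make the idea concrete, below is a minimal PyTorch sketch of one way FFT-derived features could supplement regional features for self-attention. This is an illustrative assumption, not the authors' implementation; the class name, dimensions, and the choice to append spectral magnitudes as extra tokens are all hypothetical.

    import torch
    import torch.nn as nn

    class FourierAugmentedEncoderLayer(nn.Module):
        """Hypothetical sketch: appends FFT magnitude features to regional
        image features before self-attention (not the authors' architecture)."""
        def __init__(self, d_model=512, nhead=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.proj = nn.Linear(d_model, d_model)  # project spectral features back to d_model
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            # x: (batch, num_regions, d_model) regional image features
            spectral = torch.fft.fft(x, dim=-1).abs()             # per-region frequency magnitudes
            tokens = torch.cat([x, self.proj(spectral)], dim=1)   # regional + spectral tokens
            attended, _ = self.attn(tokens, tokens, tokens)
            return self.norm(tokens + attended)

    feats = torch.randn(2, 49, 512)              # e.g., a 7x7 grid of image features
    out = FourierAugmentedEncoderLayer()(feats)  # (2, 98, 512)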
Topic of Report 2: An Analysis of the Use of Feed-Forward Sub-Modules in a Transformer-Based Visual-Language Multimodal Environment
Summary:
Transformers have become the go-to architecture for computer vision and natural language processing deep learning tasks because of their state-of-the-art performance on most of them. This strong performance is commonly attributed to the transformer's main feature, the self-attention mechanism, yet little research has investigated whether self-attention is indeed responsible for most of it. In this report, we use image captioning as the application of choice to perform a comprehensive analysis of the effect of replacing the self-attention mechanism with feed-forward layers in both the image encoder and the text decoder. We investigate the effect on memory usage and sequence length, and our experiments reveal several surprising results. The report provides a qualitative analysis of the resulting captions and an empirical analysis of the evaluation metrics and memory usage, offering practical insight into the effect of this substitution in vision-language tasks while also demonstrating competitive results with a much simpler architecture.
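As a rough illustration of the substitution under analysis, the following PyTorch sketch replaces the self-attention sub-module of a transformer block with a linear layer that mixes information across token positions. It is a sketch under assumed names and sizes, not the report's architecture; note that the token-mixing layer ties the block to a fixed maximum sequence length, one practical consequence of the swap.

    import torch
    import torch.nn as nn

    class FeedForwardOnlyBlock(nn.Module):
        """Hypothetical sketch: a transformer block whose self-attention is
        replaced by a position-mixing linear layer (illustrative only)."""
        def __init__(self, d_model=512, d_ff=2048, max_len=50):
            super().__init__()
            self.token_mix = nn.Linear(max_len, max_len)  # stands in for self-attention
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            # x: (batch, max_len, d_model); inputs must be padded/truncated to max_len
            mixed = self.token_mix(x.transpose(1, 2)).transpose(1, 2)  # mix across positions
            x = self.norm1(x + mixed)
            return self.norm2(x + self.ff(x))

    x = torch.randn(2, 50, 512)
    y = FeedForwardOnlyBlock()(x)  # same shape as x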
Topic of Report 3: A Nonlinear Supervised Discrete Hashing Framework for Large-Scale Cross-Modal Retrieval
Summary:
In cross-modal retrieval, the biggest issue is the large semantic gap between the feature distributions of heterogeneous data, which makes it very difficult to directly compute relationships between different modalities. To bridge this heterogeneous gap, many techniques have been proposed to learn an effective common latent representation for the heterogeneous modalities, so that cross-modal similarities can be computed efficiently using standard distance metrics. Some of the shortcomings of current supervised cross-modal hashing methods will be discussed. Then, a novel hashing-based cross-modal retrieval method that uses food ingredient retrieval as a proof of concept will be presented.
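For readers unfamiliar with hashing-based retrieval, the toy NumPy sketch below shows only the retrieval step: embeddings from two modalities are binarized into a shared Hamming space and database items are ranked by Hamming distance. The embeddings and names are fabricated for illustration, and the sketch deliberately omits the report's nonlinear supervised learning objective.

    import numpy as np

    def hamming_rank(query_code, db_codes):
        """Rank database items by Hamming distance to a query's binary code."""
        dists = (query_code[None, :] != db_codes).sum(axis=1)
        return np.argsort(dists)

    rng = np.random.default_rng(0)
    img_emb = rng.standard_normal((1000, 64))                  # placeholder image embeddings
    txt_emb = img_emb + 0.1 * rng.standard_normal((1000, 64))  # correlated text embeddings
    img_codes = img_emb > 0   # sign binarization into a shared Hamming space
    txt_codes = txt_emb > 0
    top10 = hamming_rank(txt_codes[0], img_codes)[:10]  # text-to-image retrieval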
Speaker Bio: Osolo Ian Raymond received his BTech degree in Electrical Engineering from Nelson Mandela University, South Africa, and his M.Eng. degree in Software Engineering from Central South University, China, where he is currently pursuing a PhD in Computer Science Application & Technology. He has published papers in reputed ESCI/SCI journals on image captioning and cross-modal retrieval. His research interests include machine learning, specifically deep learning for computer vision, natural language processing, and embedded systems.