Visual question answering (VQA) is a dynamic field of research that aims to generate textual answers to natural-language questions about given visual input. As a multimodal task, it has garnered significant interest from both the computer vision and natural language processing communities, and recent advances in these fields have yielded numerous achievements in VQA. Achieving balanced learning that avoids bias towards either the visual or the question modality is crucial, and the primary challenge lies in eliminating noise while exploiting the valuable and accurate information that each modality provides. Various research methodologies have been developed to address these issues. In this study, we classify these methods into three categories: Joint Embedding, Attention Mechanism, and Model-agnostic methods, and we analyze the advantages, disadvantages, and limitations of each approach. In addition, we trace the evolution of VQA datasets, categorizing them into three types: Real Image, Synthetic Image, and Unbiased datasets. We also provide an overview of evaluation metrics in light of future research directions. Finally, we discuss future research and application directions for VQA. We anticipate that this survey will offer useful perspectives and essential information to researchers and practitioners seeking to address visual question answering effectively.