Social media marketing has developed rapidly and become integrated into firm operations. On social media platforms, firms rely on a combination of verbal and visual elements to communicate with consumers and attract their attention. The present research investigates how the semantic relationship between text and image information affects consumer engagement, measured as forwards and comments. Leveraging a large-scale dataset of firm-generated messages, we use deep learning, large language models (LLMs), and topic models to quantify each text-image message along two theorized dimensions of text-image incongruency: relevancy and expectancy. Relevancy is how closely the information aligns with the main message. Expectancy is how predictable or surprising the information is relative to what people expect, and it draws on long-term affective and cognitive memories of past and present experiences. We find that the interaction of relevancy and expectancy, two distinct dimensions at the cognitive level, is a crucial antecedent of consumer engagement on social media. High-relevancy-high-expectancy (HRHE) content and low-relevancy-low-expectancy (LRLE) content are the most effective strategies, whereas high-relevancy-low-expectancy (HRLE) content and low-relevancy-high-expectancy (LRHE) content are less effective. This paper also uncovers the distinct nature of the two forms of consumer engagement in social media settings. In particular, HRHE content offers an exclusive benefit in boosting forwards, whereas HRHE and LRLE content are equally effective in eliciting comments. By highlighting the importance of multi-dimensional text-image incongruency, this research derives several operational implications for consumer engagement and social media marketing and contributes to the literature on the interface of operations management and marketing.