Abstract

In recent years, deep learning-based recommender systems have received increasing attention, as deep neural networks can detect important product features in images and text descriptions and capture them in semantic vector representations of items. This is especially relevant for outfit recommendation, since a variety of fashion product features play a role in creating outfits. This work is a comparative study of fusion methods for outfit recommendation that combine relevant product features extracted from visual and textual data into semantic, multimodal item representations. We compare traditional fusion methods with attention-based fusion methods, which are designed to focus on the fine-grained product features of items. We evaluate the fusion methods on four benchmark datasets for outfit recommendation and provide insights into the importance of the multimodality and granularity of the fashion item representations. We find that the visual and textual item data not only share product features but also contain complementary ones for the outfit recommendation task, confirming the need to effectively combine them into multimodal item representations. Furthermore, we show that the average performance of attention-based fusion methods surpasses that of traditional fusion methods on three of the four benchmark datasets, demonstrating the ability of attention to learn relevant correlations among fine-grained fashion attributes.
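To make the idea of attention-based multimodal fusion concrete, the following is a minimal sketch, not the paper's actual method: textual features act as queries that attend over visual features via scaled dot-product attention, producing fused item vectors. All names, dimensions, and toy values here are hypothetical, and the sketch uses pure Python for self-containment.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of attention scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(text_queries, visual_keys, visual_values):
    """Scaled dot-product attention: each textual feature (query)
    attends over the visual features (keys) and takes a weighted
    average of their values, yielding one fused vector per query."""
    d = len(visual_keys[0])  # feature dimensionality, used for scaling
    fused = []
    for q in text_queries:
        scores = [dot(q, k) / math.sqrt(d) for k in visual_keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[i] for w, v in zip(weights, visual_values))
            for i in range(len(visual_values[0]))
        ])
    return fused

# Toy example: 2 textual features and 3 visual region features, dim 4.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
vis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused = attention_fuse(text, vis, vis)
# The first textual feature aligns with the first visual feature,
# so that visual feature receives the largest attention weight.
```

The point of the sketch is the mechanism the abstract alludes to: rather than concatenating or averaging modalities (traditional fusion), attention lets each fine-grained feature in one modality weight the relevant features in the other.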