Accurately predicting drug-drug interaction events (DDIEs) is critical for optimizing combination therapies and ensuring drug safety. However, existing methods typically rely on either handcrafted molecular fingerprints or static embeddings from pretrained models, which limits their ability to jointly capture local chemical substructures and three-dimensional geometric features. To overcome these limitations, we propose MMFF-DDI, a multi-modal fusion framework based on contrastive learning for DDIE prediction. MMFF-DDI extracts drug representations from three modalities (Morgan fingerprints, canonical SMILES, and 3D molecular graphs) using an attention-augmented autoencoder, a MolFormer encoder, and an Equivariant Graph Neural Network (EGNN), respectively. Furthermore, a contrastive multi-modal integration submodule transforms multi-modal representation learning from a concatenation-based paradigm into an alignment-based paradigm, thereby achieving cross-modal consistency and complementary feature fusion. Experimental results show that MMFF-DDI outperforms the best competing method (MRGCDDI) in predicting DDIEs involving existing drugs, with improvements of 7.87% and 7.99% in Macro-F1 and Macro-precision, respectively. It also outperforms the best competing method (DSN-DDI) in predicting DDIEs involving new drugs, with improvements of 8.06% and 12.79% in Macro-F1 and Macro-precision, respectively. Visualization experiments and case studies further validate its practical applicability and superior predictive performance. The source code of MMFF-DDI is available at https://github.com/jianzhong123/MMFF-DDI.
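The alignment-based fusion described above can be sketched as a symmetric pairwise contrastive objective over the three modality embeddings. This is a minimal illustrative assumption, not the paper's exact loss: the function names (`info_nce`, `alignment_loss`), the InfoNCE formulation, and the equal weighting of modality pairs are all hypothetical choices for exposition.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """InfoNCE-style contrastive loss: rows of `a` and `b` are embeddings of
    the same drugs in two modalities; matched pairs lie on the diagonal."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalize rows
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # pull diagonal pairs together

def alignment_loss(fp_emb, smiles_emb, graph_emb):
    """Average pairwise alignment across the fingerprint, SMILES, and 3D-graph
    embeddings (hypothetical equal weighting of the three modality pairs)."""
    return (info_nce(fp_emb, smiles_emb)
            + info_nce(fp_emb, graph_emb)
            + info_nce(smiles_emb, graph_emb)) / 3.0

# Toy demonstration: three noisy views of a shared representation align well.
rng = np.random.default_rng(0)
n, d = 8, 16
shared = rng.normal(size=(n, d))
loss = alignment_loss(shared + 0.01 * rng.normal(size=(n, d)),
                      shared + 0.01 * rng.normal(size=(n, d)),
                      shared + 0.01 * rng.normal(size=(n, d)))
print(loss)
```

In this toy run, embeddings derived from a shared representation yield a much lower alignment loss than three unrelated random matrices, which is the behavior an alignment-based fusion paradigm relies on.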