Drug-induced hepatotoxicity is a leading cause of clinical trial failures and market withdrawals, and mitochondrial dysfunction is one of the mechanisms underlying it. Mitochondrial toxicity arises when a compound damages mitochondria or inhibits their function. This study introduces M3Hep, a novel multimodal framework that integrates SMILES, molecular graphs, and mitochondrial toxicity through a masking strategy to improve hepatotoxicity prediction. A total of 8,459 mitochondrial toxicity samples and 6,418 hepatotoxicity samples were collected to construct the mitochondrial toxicity prediction model and M3Hep, respectively. To make full use of the collected hepatotoxicity samples, we developed a mitochondrial toxicity prediction model, which achieved an AUC of 0.96, to predict mitochondrial toxicity for molecules lacking experimental mitochondrial toxicity data. The ablation study of M3Hep indicates that incorporating mitochondrial toxicity improves hepatotoxicity prediction, further supporting the connection between mitochondrial toxicity and hepatotoxicity. M3Hep outperforms most baseline models across all metrics, with an AUC of up to 0.81. Moreover, M3Hep achieves a Matthews correlation coefficient (MCC) of 0.49, surpassing all of the commonly used hepatotoxicity prediction tools included in the comparison. To better understand how M3Hep arrives at its predictions, we conducted an interpretability analysis based on GNNExplainer and SHAP.
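For readers less familiar with the MCC metric reported above, the following is a minimal sketch of how it can be computed for a binary toxic/non-toxic classifier. The function names and the toy labels are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = toxic, 0 = non-toxic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels, invented for illustration only (not from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

score = mcc(*confusion_counts(y_true, y_pred))
print(f"MCC = {score:.2f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it a useful complement to AUC when the toxic and non-toxic classes are imbalanced.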