Deep ensemble models have demonstrated promising generalization capability. A deep ensemble model comprises several deep neural networks as base-learners. Building such a model is challenging because it is difficult to maintain both the prediction performance of each base-learner and the diversity among base-learners. To address this problem, this paper proposes ELDE-TS, a two-stage optimization algorithm for deep ensemble model generation. ELDE-TS builds a weighted voting-based deep ensemble model for classification tasks end-to-end. The ensemble model consists of several convolutional neural network classifiers with different hyperparameters, each of which is assigned a weight. The first stage of ELDE-TS is a bi-objective algorithm that generates candidate classifiers for the ensemble model, taking the validation accuracy and the diversity among classifiers as its optimization objectives. A novel objective function is proposed for this stage to describe the diversity among the classifiers. The second stage is a single-objective algorithm that selects representative classifiers for the ensemble model and calculates a weight for each selected classifier. A tree-based non-repetitive evaluation mechanism is embedded in the second stage to accelerate the search process. Experimental results show that the ensemble model generated by ELDE-TS is competitive with state-of-the-art ensemble models and hand-designed deep models on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets. Further analysis demonstrates that the proposed ensemble selection method and the non-repetitive evaluation mechanism both contribute to the performance of the ensemble model.
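To make the weighted voting idea concrete, the following is a minimal sketch and not the implementation of ELDE-TS. It assumes that each classifier casts a hard vote (a predicted label) and that the ensemble returns the label with the largest total weight. The abstract does not specify the voting rule or the form of the proposed diversity objective, so the function names `weighted_vote` and `pairwise_disagreement` are hypothetical, and the disagreement rate is a generic diversity measure used only as a stand-in.

```python
from collections import defaultdict
from itertools import combinations


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight for one sample.

    predictions: one predicted label per classifier (hard votes, an assumption).
    weights: one non-negative weight per classifier, in the same order.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties are broken by the smallest label so the result is deterministic.
    return min(totals, key=lambda label: (-totals[label], label))


def pairwise_disagreement(label_matrix):
    """Mean fraction of samples on which two classifiers disagree.

    label_matrix[i][j] is classifier i's predicted label for sample j.
    This is a generic diversity measure, NOT the objective proposed in the paper.
    """
    pairs = list(combinations(label_matrix, 2))
    if not pairs:
        return 0.0
    rates = [sum(a != b for a, b in zip(p, q)) / len(p) for p, q in pairs]
    return sum(rates) / len(rates)
```

For example, with weights `[0.4, 0.3, 0.3]` and votes `[1, 2, 2]`, the two lower-weighted classifiers together outweigh the first, so `weighted_vote` returns `2`. In a two-stage scheme of the kind described above, a measure such as `pairwise_disagreement` would play the role of the diversity objective when generating candidates, while the weights passed to `weighted_vote` would be the quantities tuned during selection.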