A major challenge for modern transit systems that rely on traditional fixed-route designs is providing broad accessibility to users. Flex-route transit can enhance accessibility in low-density areas by combining the directness of fixed-route transit with the coverage of on-demand mobility. Although deviating for optional pickups can increase ridership and transit accessibility, it also degrades service performance for fixed-route riders. To balance this inherent trade-off, this paper proposes a reinforcement learning approach for deviation decisions. The model is applied in a case study of a planned flex-route service in the city of Boston. Performance on the competing objectives is evaluated for reward configurations that adapt to peak and off-peak scenarios. The analysis shows a significant improvement of our method over a baseline heuristic derived from industry practice. To evaluate robustness, we assess performance across scenarios with varying demand compositions (fixed-route and request riders). The results show that our method achieves greater improvements over the baseline in scenarios with higher request ridership, i.e., where decision-making is more complex. Our approach improves service performance under dynamic demand conditions and varying priorities, offering a valuable tool for smart cities operating flex-route services.