Sponsored-Products (SP) advertising is a popular way to promote products on Amazon. Etailers with a large catalog of products often create SP ad groups for products with similar attributes. An SP ad group consists of a set of products and a set of keywords, and all the products in the ad group share the same keyword set. These keywords are the terms that shoppers may search for when looking for products on Amazon. In addition to SP ads, etailers may advertise their products on external websites, which is referred to as off-Amazon (OA) advertising. This study focuses on optimizing sequential SP and OA (abbreviated as SSPOA) ad decisions for etailers. In practice, many etailers set sales targets for products because manufacturing and logistics are planned ahead of time. Hence, we take the etailer’s objective to be minimizing the expected long-run average cost incurred by advertising and the cumulative unmet sales target. We model the SSPOA optimization as a controlled Markovian multi-armed bandit (MAB) process. When the mean number of sales per unit time (i.e., the sales rate) for each product is known, we characterize the etailer’s optimal SSPOA policy for the products in an ad group. In reality, etailers may not know the exact mean sales rates. To learn the unknown parameters while simultaneously minimizing the long-run average cost, we develop a Thompson-sampling-based algorithm for the controlled Markovian MAB problem that couples the SP and OA ad decisions. We prove that the regret of the proposed algorithm is bounded by Õ(√T), where T is the total horizon length. Compared with the existing literature, our problem additionally accounts for the regret from applying the estimated control policy and for the impact of choosing non-optimal keyword sets on subsequent states. We also conduct numerical experiments that validate our theoretical results. Moreover, we extend the base model in several directions: considering unknown transition rates between different sales-rate levels, incorporating correlated keyword sets, and learning the optimal policy via Posterior Sampling for Reinforcement Learning under a discretized setting.
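To make the learning component concrete, the following is a minimal sketch of the generic Thompson-sampling idea for estimating unknown sales rates. It is not the paper's SSPOA algorithm (which couples SP and OA decisions within a controlled Markovian MAB); it only illustrates posterior sampling under the simplifying assumptions that each candidate keyword set generates Poisson sales per period and that each unknown rate carries a conjugate Gamma prior. The environment function `simulate_sales` and all numerical values are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([1.2, 2.5, 0.8])   # hypothetical unknown sales rates, one per keyword set
n_arms = len(true_rates)

# Gamma(alpha, beta) posterior parameters for each arm's Poisson sales rate.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

def simulate_sales(arm):
    """Hypothetical environment: sales observed in one period under keyword set `arm`."""
    return rng.poisson(true_rates[arm])

T = 1000
for t in range(T):
    # Sample a plausible rate for every arm from its posterior,
    # then act greedily with respect to the sampled rates.
    sampled = rng.gamma(alpha, 1.0 / beta)
    arm = int(np.argmax(sampled))

    sales = simulate_sales(arm)

    # Conjugate Gamma-Poisson update of the chosen arm's posterior.
    alpha[arm] += sales
    beta[arm] += 1.0

print("posterior mean rates:", alpha / beta)
```

In this toy version the posterior means concentrate around the true rates as more periods are observed; the paper's setting is richer because the keyword-set choice also affects subsequent sales-rate states, which is what the controlled Markovian MAB formulation and the Õ(√T) regret analysis address.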