Research on video smoke detection has become a hot topic in fire disaster prevention and control because it enables early detection. Conventional methods use handcrafted features that rely on prior knowledge to recognize whether a frame contains smoke. Such methods are often designed for a fixed fire scene and are sensitive to the environment, which results in false alarms. In this paper, we use convolutional neural networks (CNNs), which are state of the art for image recognition tasks, to identify smoke in video. We develop a joint detection framework based on Faster RCNN and 3D CNN. An improved Faster RCNN with non-maximum annexation locates the smoke target based on static spatial information. Then, a 3D CNN recognizes smoke by combining dynamic spatial–temporal information. Compared with common CNN methods that use images for smoke detection, the 3D CNN improves recognition accuracy significantly. Different network structures and data processing methods for the 3D CNN are compared, including Slow Fusion and optical flow. Tested on a dataset comprising smoke videos from multiple sources, the proposed frameworks perform well in both smoke location and recognition. The two-stream 3D CNN framework performs best, with a detection rate of 95.23% and a false alarm rate of 0.39% on smoke video sequences.
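
For readers unfamiliar with the two reported metrics, the following minimal sketch shows one common way a detection rate and a false alarm rate could be computed from per-sequence outcomes. It assumes the usual definitions, detection rate = TP / (TP + FN) and false alarm rate = FP / (FP + TN); the abstract does not state the exact formulas, so these definitions are an assumption. The names `SequenceCounts`, `detection_rate`, and `false_alarm_rate` are hypothetical, as are the counts in the usage lines.

```python
from dataclasses import dataclass


@dataclass
class SequenceCounts:
    """Outcome counts over video sequences (hypothetical container)."""
    true_positives: int   # smoke sequences flagged as smoke
    false_negatives: int  # smoke sequences missed
    false_positives: int  # non-smoke sequences flagged as smoke
    true_negatives: int   # non-smoke sequences correctly passed


def detection_rate(c: SequenceCounts) -> float:
    """Assumed definition: fraction of smoke sequences that are detected."""
    positives = c.true_positives + c.false_negatives
    if positives == 0:
        raise ValueError("no smoke sequences in the evaluation set")
    return c.true_positives / positives


def false_alarm_rate(c: SequenceCounts) -> float:
    """Assumed definition: fraction of non-smoke sequences flagged as smoke."""
    negatives = c.false_positives + c.true_negatives
    if negatives == 0:
        raise ValueError("no non-smoke sequences in the evaluation set")
    return c.false_positives / negatives


# Illustrative counts only; these are not the paper's data.
counts = SequenceCounts(true_positives=90, false_negatives=10,
                        false_positives=5, true_negatives=195)
print(f"detection rate:   {detection_rate(counts):.2%}")
print(f"false alarm rate: {false_alarm_rate(counts):.2%}")
```

Under these assumed definitions the two rates use different denominators (smoke versus non-smoke sequences), which is why a high detection rate and a low false alarm rate can be reported together.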