Generating a realistic 3D point cloud reconstruction is an ill-posed problem, and inferring 3D shape from a single image is particularly challenging. In this paper, a two-stage training network, named 3D-ARNet, is proposed to reconstruct a point cloud from a single image. 3D-ARNet uses a purpose-designed image encoder with an attention mechanism to extract image features and output a simple point cloud. To improve reconstruction accuracy, the network also contains a pre-trained point cloud auto-encoder, which takes the simple point cloud as input and produces the final, accurately reconstructed point cloud. The proposed approach is evaluated qualitatively and quantitatively on both synthetic and real-world datasets, and the experimental comparisons demonstrate clear improvements over existing state-of-the-art networks.
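The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative sketch only, not the actual 3D-ARNet architecture: the function names, layer choices (random linear projections standing in for the attention encoder and the pre-trained auto-encoder), image size, and point counts are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(img, n_points=256):
    # Hypothetical stand-in for the attention-based image encoder:
    # flatten the image, apply a linear projection (here random),
    # and reshape into a simple point cloud of shape (n_points, 3).
    feats = img.reshape(-1)
    W = rng.standard_normal((n_points * 3, feats.size)) * 0.01
    return (W @ feats).reshape(n_points, 3)

def point_autoencoder(simple_cloud, n_points=1024):
    # Hypothetical stand-in for the pre-trained point-cloud
    # auto-encoder: encode the simple cloud to a latent vector,
    # then decode it into a denser, refined point cloud.
    z = simple_cloud.reshape(-1)
    W_dec = rng.standard_normal((n_points * 3, z.size)) * 0.01
    return (W_dec @ z).reshape(n_points, 3)

img = rng.standard_normal((64, 64, 3))   # a single RGB image
simple_cloud = image_encoder(img)        # stage 1: simple point cloud
refined = point_autoencoder(simple_cloud)  # stage 2: refined point cloud
print(simple_cloud.shape, refined.shape)
```

In the paper's setting, both stages would be trained networks (the auto-encoder pre-trained separately), with the decoder of the auto-encoder refining the stage-one output.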