We present an early fusion framework for robust object detection in autonomous vehicles. The framework first employs Monodepth, a self-supervised learning method, to infer a dense depth image from a single color input image. The RGB image and its corresponding depth image are then processed jointly by a deep convolutional neural network (CNN) to predict multiple 2D bounding boxes. We conduct experiments on the challenging KITTI benchmark dataset. The results show that the features learnt by our fusion framework, when combined with the features learnt by depth-only and RGB-only architectures, outperform the state of the art on RGB-depth category recognition. We also investigate the performance of our fusion framework when the depth image is generated from different sources (monocular imagery, stereo imagery, or both).
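The core idea of early fusion is that color and depth are combined at the input level, before any convolutional processing. A minimal sketch of this input construction, assuming simple channel concatenation (array shapes and names are illustrative, not taken from the paper's implementation):

```python
import numpy as np

# Illustrative resolution close to KITTI's raw image size (assumption).
H, W = 375, 1242

rgb = np.random.rand(H, W, 3).astype(np.float32)   # color image, values in [0, 1]
depth = np.random.rand(H, W).astype(np.float32)    # dense depth map, e.g. predicted by Monodepth

# Early fusion: append depth as a fourth channel, so a single CNN
# sees color and depth jointly from its very first layer.
fused = np.concatenate([rgb, depth[..., None]], axis=-1)

print(fused.shape)  # (375, 1242, 4)
```

The fused 4-channel tensor would then be fed to a detection CNN whose first convolution accepts four input channels instead of three; this contrasts with late fusion, where separate RGB and depth networks are merged only at the feature or decision level.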