Audio scenes are often composed of a variety of sound events from different sources, and their content exhibits wide variation in both the frequency and time domains. Convolutional neural networks (CNNs) provide an effective way to extract spatial information from multidimensional data such as images, audio, and video, and they can learn hierarchical representations from the time-frequency features of audio signals. In this paper, we develop a convolutional neural network and employ multi-scale, multi-feature extraction methods for acoustic scene classification. We conduct experiments on the TUT Acoustic Scenes 2016 dataset. The results show that multi-scale, multi-feature extraction significantly improves system performance: our approach achieves an accuracy of 85.9%, outperforming the baseline by a large margin of 8.7 percentage points.
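The abstract does not specify the exact feature pipeline, but the idea of multi-scale extraction from a time-frequency representation can be sketched as follows. This is a minimal, hypothetical illustration in pure Python: it pools each frequency band of a spectrogram-like matrix at several temporal scales and concatenates the summaries; the paper's actual network and features may differ.

```python
def multi_scale_features(spec, scales=(1, 2, 4)):
    """Sketch of multi-scale feature extraction (illustrative only).

    spec: list of frequency bands, each a list of per-frame magnitudes.
    For each scale s, average-pool each band over windows of s frames,
    then summarize the pooled band by its mean, yielding one value per
    (band, scale) pair.
    """
    feats = []
    for s in scales:
        for band in spec:
            # Trim trailing frames so the band divides evenly into windows.
            t = len(band) - len(band) % s
            windows = [sum(band[i:i + s]) / s for i in range(0, t, s)]
            # One summary value per band at this scale.
            feats.append(sum(windows) / len(windows))
    return feats

# Toy "spectrogram": 3 frequency bands x 8 time frames.
spec = [[float(i + j) for j in range(8)] for i in range(3)]
features = multi_scale_features(spec)
print(len(features))  # 3 bands x 3 scales = 9 feature values
```

In practice such per-scale representations would feed a CNN rather than be flattened this way; the sketch only shows how pooling at multiple temporal scales yields complementary views of the same signal.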