This paper proposes a deep learning-based method for recognizing violin bowing actions. The method fuses sensing signals from a depth camera modality and inertial sensor modalities: the actions performed by a violinist are captured by a depth camera and recorded by wearable sensors on the violinist's forearm. A three-dimensional convolutional neural network (3D-CNN) and a long short-term memory (LSTM) network are adopted to generate action models from the depth camera and inertial sensor modalities. The features and models obtained from the multiple modalities are used to classify different violin bowing actions, and fusing the modalities achieves satisfactory recognition accuracy. We also generate a violin bowing action dataset for a preliminary study and for evaluating system performance.
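
The fusion of modalities described above can be illustrated with a minimal sketch. The abstract does not specify how the fusion is performed, so the code assumes a decision-level (late) fusion in which each modality's model outputs one raw score per bowing class and the two resulting probability vectors are combined by a weighted average. The function names, the `depth_weight` parameter, the three-class example, and the pairing of one score vector with the depth camera and the other with the inertial sensors are all hypothetical and serve only as an illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(depth_scores, inertial_scores, depth_weight=0.5):
    """Weighted average of per-modality class probabilities.

    Assumption: late fusion by weighted averaging. The source only states
    that the modalities are fused, not how.
    """
    if len(depth_scores) != len(inertial_scores):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= depth_weight <= 1.0:
        raise ValueError("depth_weight must lie in [0, 1]")
    p_depth = softmax(depth_scores)
    p_inertial = softmax(inertial_scores)
    return [
        depth_weight * d + (1.0 - depth_weight) * i
        for d, i in zip(p_depth, p_inertial)
    ]


def classify(depth_scores, inertial_scores, depth_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_scores(depth_scores, inertial_scores, depth_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for one clip over three hypothetical bowing classes.
depth_scores = [2.0, 1.0, 0.1]      # e.g. output of a depth-video model
inertial_scores = [0.5, 2.5, 0.3]   # e.g. output of an inertial-signal model
predicted = classify(depth_scores, inertial_scores)
```

In this sketch the two modalities disagree on their top class, and the fused decision depends on `depth_weight`; in practice such a weight would have to be chosen on validation data, and a learned fusion layer is an equally plausible alternative.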