CNN-Based Shot Boundary Detection and Video Annotation
Wenjing Tong (1), Li Song (1, 2), Xiaokang Yang (1, 2), Hui Qu (1, 2), Rong Xie (1, 2)
(1) Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University
(2) Cooperative Medianet Innovation Center, Shanghai, China
BMSB 2015, 19 June 2015

Outline
- Introduction
- Technical details
- Results
- Conclusion

1. Introduction
Problem: video shot boundary detection and video annotation.
- Video shot boundary detection covers cut transition (CT) detection and gradual transition (GT) detection; the latter is harder.
- Video annotation is a process that allocates pre-defined labels to video shots.
Importance: shot boundary detection is the first and fundamental step of many video analysis technologies, and video annotation is a general way of representing videos.

Motivation of our algorithm: use the interpretable labels learned by convolutional neural networks (CNNs) to perform both shot boundary detection and video annotation. Frames in the same shot tend to share similar labels, while frames in different shots do not; and since a CNN can learn labels for individual frames, the labels within one shot can be merged to obtain labels for that shot.

2. Technical details
Approach steps:
- Use candidate segment selection to eliminate non-boundary frames.
- Use a pre-trained CNN model to extract tags of frames.
- Use the tags to perform shot boundary detection.
- Use the tags to perform video annotation.

Candidate segment selection
The video sequence is cut into segments of 21 consecutive frames, so the n-th segment ranges from the 20n-th frame to the 20(n+1)-th frame.
Step 1: Calculate the distance of each segment (1), where F(x, y; k) denotes the luminance component of the pixel at position (x, y) in frame k.
Step 2: Calculate a local threshold (2) from the distance mean and standard deviation of the neighboring 10 segments and the distance mean of the neighboring 100 segments.
Step 3: Classify candidate segments. Segments whose distances are larger than the threshold are selected as candidate segments; in addition, segments whose distances are much larger than those of their neighboring segments, judged against the distance mean of the neighboring 100 segments (3)(4), are selected as well.
Step 4: First round of bisection-based comparisons. Each candidate segment is divided into two halves; the non-boundary half is discarded and the half suspected of containing a shot change is preserved (5)(6). In Type 1 the first half of the segment is kept as the candidate segment, in Type 2 the second half is kept, in Type 3 the whole segment is discarded, and in Type 4 the segment is regarded as containing a GT boundary.
Step 5: Second round of bisection-based comparisons, carried out in the same way as Step 4.

Feature extraction using CNN
Main architecture of the CNN model (slide figure).
Tags extracted by the CNN model (slide figure).

Shot boundary detection
A cut boundary occurs in a candidate segment if expressions (7) and (8) are satisfied, where T(i) denotes the tags of the i-th frame; the test compares the maximum and second-maximum tag-distance values within the segment, and C is a small constant that avoids a divide-by-zero error.
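Since expressions (7) and (8) are only referenced above, here is a minimal Python sketch of the cut test as described in the text: measure the dissimilarity between the tags of adjacent frames inside a candidate segment, and declare a cut where the largest distance is both large in absolute terms and clearly dominates the second-largest one, with C guarding against division by zero. The Jaccard-style tag distance, the threshold values and all function names are assumptions for illustration, not the paper's definitions.

def tag_distance(tags_a, tags_b):
    # Assumed frame dissimilarity: 1 minus the Jaccard overlap of the two tag sets.
    union = tags_a | tags_b
    return 0.0 if not union else 1.0 - len(tags_a & tags_b) / len(union)

def find_cut(frame_tags, dist_thresh=0.5, ratio_thresh=3.0, C=1e-6):
    # frame_tags: list of tag sets, one per frame of a candidate segment,
    # e.g. produced by a pre-trained CNN (the tag extractor is not shown here).
    dists = [tag_distance(frame_tags[i], frame_tags[i + 1])
             for i in range(len(frame_tags) - 1)]
    ranked = sorted(range(len(dists)), key=dists.__getitem__, reverse=True)
    d_max, d_second = dists[ranked[0]], dists[ranked[1]]
    # Declare a cut only if the largest adjacent-frame distance is big and
    # dominates the second-largest one; C avoids division by zero.
    if d_max > dist_thresh and d_max / (d_second + C) > ratio_thresh:
        return ranked[0]   # index of the frame pair that contains the cut
    return None            # no cut boundary found in this segment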
For candidate segments in which no cut boundary is found, assume the segment starts at frame s and ends at frame e; a gradual boundary occurs in the segment if expression (9) is satisfied.

Video annotation
After shot boundary detection, video annotation is applied to the resulting shots. The labels allocated to the k-th shot are given by expression (10), which derives the shot's semantic labels from the tags of its first, middle and last frames. ImageNet is an image database organized according to the WordNet hierarchy (a tree structure), so the tags are leaf nodes of the WordNet tree. For each tag assigned to a shot, the content of the grandfather node of that tag's node is selected as a semantic label for the shot (see the sketch at the end of this document).

3. Results
Ground truth and benchmark algorithms
- Ground truth
- State-of-the-art shot boundary detection algorithms:
  [3] Fast video shot boundary detection framework employing pre-processing techniques
  [4] Fast video shot boundary detection based on SVD and pattern matching

Evaluation standards
We take recall, precision and F1 as evaluation standards, as in other works:
Recall = N_c / (N_c + N_m)  (11)
Precision = N_c / (N_c + N_f)  (12)
F1 = 2 * Recall * Precision / (Recall + Precision)  (13)
where N_c is the number of correctly detected shot boundaries, N_m is the number of missed shot boundaries, and N_f is the number of falsely detected shot boundaries; F1 takes both recall and precision into account.

Recall, precision and F1 comparison (results slide).
Example of video shot boundary detection (results slide).
Example of video annotation (results slide).

4. Conclusion
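As a closing illustration of the annotation step in expression (10), the following Python sketch assumes that the shot-level tag set is obtained by merging, here by union, the CNN tags of the first, middle and last frames of a shot, and that the WordNet "grandfather" node is reached by walking two hypernym levels up with NLTK. The merge rule, the choice of the first synset and all names are illustrative assumptions, not the paper's exact procedure.

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus

def annotate_shot(first_tags, middle_tags, last_tags):
    # Merge the tags of the shot's first, middle and last frames (assumed union
    # rule), then replace each tag by the words of its "grandfather" node,
    # i.e. the node two hypernym levels up in the WordNet tree.
    shot_tags = set(first_tags) | set(middle_tags) | set(last_tags)
    labels = set()
    for tag in shot_tags:
        synsets = wn.synsets(tag.replace(' ', '_'), pos=wn.NOUN)
        if not synsets:
            continue                      # tag not found as a WordNet noun
        node = synsets[0]
        for _ in range(2):                # walk two hypernym levels up
            parents = node.hypernyms()
            if not parents:
                break
            node = parents[0]
        labels.update(lemma.name() for lemma in node.lemmas())
    return labels

In this sketch, leaf tags such as ImageNet class names are replaced by more general ancestor categories, which is the effect the slide describes for shot-level labels.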