Deep Learning Towards Intelligence and Security

Shuicheng YAN
yanshuicheng@360.cn, Artificial Intelligence Institute, Qihoo/360
eleyans@nus.edu.sg, National University of Singapore

Qihoo/360: Intelligence and Security
Diagram labels: IoT, Security, Intelligence

Unified Deep Learning Eco-system
- Infrastructure
  - Scalable: multi-machine, multi-CPU/GPU
  - Flexible: parallel scheme + network structure
- Core research
  - Brain-like: discriminating power + generalization power
  - Baby-like: self-learning + reinforcement learning
- Product/service support: Vision + Speech + NLP + Big Data
- Collaborations with universities and research institutes: source of fresh water
- Ads recommendation, malicious code, traffic identification
- Intelligence: feedback modeling
- Security: new applications

Part I: Deep Learning for Human Analytics
Intelligence: Feedback Modeling

Some Results on Human Analytics: Face Detection - FDDB

Some Results on Human Analytics: Face Alignment - 300W

                             300-W Dataset
Method                       Common   Challenging   Fullset
Zhu et al., 2012             8.22     18.33         1.20
RCPR (Burgos, 2013)          6.18     17.26         8.35
SDM (Xiong, 2013)            5.57     15.40         7.50
LBF (Ren, 2014)              4.95     11.98         6.32
LBF fast (Ren, 2014)         5.38     15.50         7.37
CFSS (Zhu, 2015)             4.73     9.98          5.76
cGPRT (Lee, 2015)            -        -             5.71
Linkface                     4.80     8.60          5.54
P2P-aware Face Alignment     4.19     8.42          5.02

Some Results on Human Analytics: Human Detection - Caltech Pedestrian

Task today: Human Parsing (by Feedback Modeling)
- Decompose a human photo into semantic fashion/body items
- Pixel-level semantic labeling
- Labels: sun-glass, upper-clothes, skirt, scarf, right-shoe, right-leg, right-arm, pants, left-shoe, left-leg, left-arm, hat, face, dress, belt, bag, hair, null

Human Parsing = Engine for Applications
- Often the face cannot be perceived well while the body can

Why Feedback/Contexts Are Important
Feedback inherently exists in the brain:
1. Cross-layer/area feedback
2. Continuous stimulus for visual perception (slow inference)
3. Cross-task feedback/contexts

Contexts + Fully Convolutional Neural Network
- Cross-layer context: multi-level feature fusion
- Global top-down context: coherence between pixel-wise labelling and image label prediction
- Local bottom-up context: within-superpixel consistency and cross-superpixel appearance consistency

Contextualized Network: Cross-layer context
- Four feature map fusions
- 5*5 convolutions
- Hierarchically combine the low-level local details and high-level semantic information

Contextualized Network: Global top-down context
- Incorporate global image label prediction
- Global top-down context helps distinguish locally ambiguous labels
- Figure labels: Skirt, Dress; Co-CNN w/o global label vs. Co-CNN; skirt, upper-clothes, dress; global image label

Contextualized Network: Local bottom-up context
- Integrate within-super-pixel smoothing and cross-super-pixel neighbourhood voting

Results
- Datasets: 7,700 images (6,000 for training, 1,000 for testing and 700 for validation)
- Training: manually decrease the learning rate according to the validation error

Results
Comparison of parsing performance with state-of-the-art methods on the ATR dataset:

Method          Accuracy   Foreground accuracy   Average precision   Average recall   Average F-1 score
Yamaguchi [1]   84.38      55.59                 37.54               51.05            41.80
Paperdoll [2]   88.96      62.18                 52.75               49.43            44.76
ATR [3]         91.11      71.04                 71.69               60.25            64.38
Co-CNN          95.23      80.90                 81.55               74.42            76.95

[1] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, and T.L. Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
[2] K. Yamaguchi, M.H. Kiapour, and T.L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
[3] X. Liang, et al. Deep Human Parsing with Active Template Regression. In TPAMI, 2015.

Results: Analysis of architectural variants of our model
- Cross-layer context
- Global image label context
- Local super-pixel context

Results: Adding 10,000 human pictures from "chictopia.com"

Method                   Accuracy   Foreground accuracy   Average precision   Average recall   Average F-1 score
Paperdoll                88.96      62.18                 52.75               49.43            44.76
ATR                      91.11      71.04                 71.69               60.25            64.38
Co-CNN                   95.23      80.90                 81.55               74.42            76.95
Co-CNN (+Chictopia10k)   96.02      83.57                 84.95               77.66            80.14

Cross-task Context from Semantic Edge Detection
- Semantic edge detection task: input image to semantic edge
- Within-item edge vs. cross-item edge
- Motivation: incorrect edges

Cross-task Context from Semantic Edge Detection: Semantic edge
- Integrate the semantic edge into the Co-CNN
- Multi-resolution fusion

Results
Comparison of parsing performance with state-of-the-art methods on the ATR dataset:

Method                   Accuracy   Foreground accuracy   Average precision   Average recall   Average F-1 score
Paperdoll                88.96      62.18                 52.75               49.43            44.76
ATR                      91.11      71.04                 71.69               60.25            64.38
Co-CNN                   95.23      80.90                 81.55               74.42            76.95
Co-CNN (+Chictopia10k)   96.02      83.57                 84.95               77.66            80.14
Above + Semantic-Edge    97.18      88.84                 87.12               84.05            85.36

Feedback with Local-Global-Aware LSTM (slow inference)
Diagram labels:
- Multi-resolution fusion
- Starting point for LSTM
- Input feature maps
- Initialized hidden cells and memory cells
- Local hidden cells
- Local-Global-Aware LSTM layers: LSTM1, LSTM2, LSTM3, LSTM4, LSTM5, LSTM6, LSTM7, LSTM8, LSTM9
- Global hidden cells
- 9 grid
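
The five columns reported on the Results slides (accuracy, foreground accuracy, average precision, average recall, average F-1 score) can be computed from a pixel-level confusion matrix. The sketch below is only an illustration: the slides do not define the metrics, so it assumes that accuracy is taken over all pixels, that foreground accuracy is taken over pixels whose ground-truth label is not the null/background label, and that precision, recall and F-1 are averaged per class with equal weight. The function name `parsing_metrics` and its parameters are hypothetical.

```python
import numpy as np

def parsing_metrics(gt, pred, num_classes, background=0):
    """Pixel-level parsing metrics from integer label maps (assumed definitions)."""
    gt = np.asarray(gt, dtype=np.int64).ravel()
    pred = np.asarray(pred, dtype=np.int64).ravel()

    # confusion[i, j] = number of pixels with ground truth i predicted as j
    confusion = np.bincount(
        gt * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    tp = np.diag(confusion).astype(float)          # correctly labelled pixels per class
    accuracy = tp.sum() / confusion.sum()

    # Accuracy restricted to pixels whose ground truth is not background
    fg = gt != background
    fg_accuracy = float((pred[fg] == gt[fg]).mean()) if fg.any() else 0.0

    pred_count = confusion.sum(axis=0).astype(float)  # pixels predicted as each class
    gt_count = confusion.sum(axis=1).astype(float)    # pixels truly in each class

    # Classes that never occur get precision/recall 0 instead of a division error
    precision = np.divide(tp, pred_count, out=np.zeros_like(tp), where=pred_count > 0)
    recall = np.divide(tp, gt_count, out=np.zeros_like(tp), where=gt_count > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    return {
        "accuracy": float(accuracy),
        "foreground_accuracy": fg_accuracy,
        "average_precision": float(precision.mean()),
        "average_recall": float(recall.mean()),
        "average_f1": float(f1.mean()),
    }

# Toy 2x4 label maps with three classes (0 = null/background)
gt = [[0, 0, 1, 1], [0, 2, 2, 1]]
pred = [[0, 0, 1, 2], [0, 2, 1, 1]]
metrics = parsing_metrics(gt, pred, num_classes=3)
```

On the toy maps, 6 of 8 pixels match overall and 3 of 5 foreground pixels match. Whether the reported averages include the null class is not stated on the slides, so the class-averaging choice here should be treated as an assumption.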