MACHINE LEARNING ON SPARK - UC BERKELEY AMP CAMP

Machine Learning on Spark
Shivaram Venkataraman
UC Berkeley

(Diagram labels: Computer Science, Machine learning, Statistics)

Machine learning
- Spam filters
- Recommendations
- Click prediction
- Search ranking

Machine learning techniques
- Classification
- Regression
- Clustering
- Active learning
- Collaborative filtering

Implementing Machine Learning
Machine learning algorithms are
- Complex, multi-stage
- Iterative
MapReduce/Hadoop unsuitable
Need efficient primitives for data sharing

Machine Learning using Spark
Spark RDDs: efficient data sharing
In-memory caching accelerates performance
- Up to 20x faster than Hadoop
Easy to use high-level programming interface
- Express complex algorithms in roughly 100 lines

K-Means Clustering using Spark
Focus: Implementation and Performance

Clustering
Grouping data according to similarity
E.g. archaeological dig
(Plot axes: Distance East, Distance North)

K-Means Algorithm
Benefits
- Popular
- Fast
- Conceptually straightforward

K-Means: preliminaries
(Plot axes: Feature 1, Feature 2)
Data: Collection of values
    data = lines.map(line => parseVector(line))
Dissimilarity: Squared Euclidean distance
    dist = p.squaredDist(q)
K = Number of clusters
Data assignments to clusters: S1, S2, ..., SK

K-Means Algorithm
(Plot axes: Feature 1, Feature 2)
- Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
- Repeat until convergence:
    while (dist(centers, newCenters) )
  - Assign each data point to the cluster with the closest center.
      closest = data.map(p => (closestPoint(p, centers), p))
  - Assign each cluster center to be the mean of its cluster's data points.
      pointsGroup = closest.groupByKey()
      newCenters = pointsGroup.mapValues(ps => average(ps))
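
The K-Means steps on the slides can be tried locally with a minimal sketch like the one below. It uses plain Scala collections rather than Spark RDDs, so `groupBy` stands in for `groupByKey` and a seeded shuffle stands in for `takeSample`. The object name `KMeansSketch`, the `Point` alias, the `epsilon` threshold, the `maxIterations` cap, and the bodies of `squaredDist`, `closestPoint`, and `average` are assumptions made for illustration; the slides name these helpers but do not show how they are written.

```scala
// Local, single-machine sketch of the K-Means loop (no Spark dependency).
// All definitions here are illustrative assumptions, not the deck's own code.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the center closest to p (ties go to the lowest index).
  def closestPoint(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def average(ps: Seq[Point]): Point =
    ps.transpose.map(_.sum / ps.size).toVector

  def kMeans(data: Seq[Point], k: Int, seed: Long,
             epsilon: Double = 1e-9, maxIterations: Int = 100): Seq[Point] = {
    // Initialize: pick K data points at random as the starting centers.
    var centers: Seq[Point] = new scala.util.Random(seed).shuffle(data).take(k)
    var moved = Double.MaxValue
    var iteration = 0
    while (moved > epsilon && iteration < maxIterations) {
      // Assignment step: key each point by the index of its closest center.
      val closest = data.map(p => (closestPoint(p, centers), p))
      // Group the points by cluster index (local stand-in for groupByKey).
      val pointsGroup = closest.groupBy(_._1).map { case (i, pairs) => (i, pairs.map(_._2)) }
      // Update step: each center becomes the mean of its points;
      // a cluster that received no points keeps its previous center.
      val newCenters = centers.indices.map { i =>
        pointsGroup.get(i).map(average).getOrElse(centers(i))
      }
      // Total squared movement of the centers, used as the convergence measure.
      moved = centers.zip(newCenters).map { case (c, n) => squaredDist(c, n) }.sum
      centers = newCenters
      iteration += 1
    }
    centers
  }
}
```

Calling `KMeansSketch.kMeans(points, 2, seed)` on a small `Seq[Vector[Double]]` returns the final centers. In a Spark version, `data` would be a cached RDD, so each pass of the loop reuses the in-memory data instead of rereading it.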