StandardScaler#
- class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)[source]#
- Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
- New in version 1.2.0.
- Parameters
- withMean : bool, optional
- False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input; a brief sketch of this behavior appears after the parameter list.
- withStd : bool, optional
- True by default. Scales the data to unit standard deviation. 
 
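The following is a minimal sketch, not part of the official documentation, of the sparse-input caveat noted above. It assumes an existing SparkContext `sc` (as in the Examples below) and uses hypothetical SparseVector inputs: centering (withMean=True) subtracts the per-column means, so the transformed vectors come back dense, while scaling alone (withMean=False) can preserve the sparse representation.

>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.feature import StandardScaler
>>> sparse_rdd = sc.parallelize([Vectors.sparse(3, {0: -2.0, 1: 2.3}),
...                              Vectors.sparse(3, {0: 3.8, 2: 1.9})])
>>> # Scaling only (withMean=False) can keep the sparse representation.
>>> scaled = StandardScaler(withMean=False, withStd=True).fit(sparse_rdd).transform(sparse_rdd)
>>> # Centering (withMean=True) subtracts the column means and densifies the result.
>>> centered = StandardScaler(withMean=True, withStd=True).fit(sparse_rdd).transform(sparse_rdd)
>>> type(centered.first()).__name__
'DenseVector'

If memory is a concern with wide sparse features, keeping withMean=False avoids densifying the output.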
- Examples

>>> from pyspark.mllib.linalg import Vectors
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True

- Methods

- fit(dataset)
- Computes the mean and variance and stores them as a model to be used for later scaling.

- Methods Documentation

- fit(dataset)[source]#
- Computes the mean and variance and stores them as a model to be used for later scaling.
- New in version 1.2.0.
- Parameters
- dataset : pyspark.RDD
- The data used to compute the mean and variance to build the transformation model. 
 
- Returns