pyspark.ml package¶
ML Pipeline APIs¶
-
class pyspark.ml.Transformer[source]¶
Abstract class for transformers that transform one dataset into another.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
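For illustration, a minimal sketch of that ordering using the Binarizer transformer documented in pyspark.ml.feature below. It assumes a running SparkContext, as the doctests on this page do, and that Binarizer's default threshold is the 0.0 shown in its setParams signature.
>>> from pyspark.ml.feature import Binarizer
>>> binarizer = Binarizer(inputCol="values", outputCol="features")
>>> binarizer.extractParamMap()[binarizer.threshold]      # only the default is present
0.0
>>> binarizer.extractParamMap({binarizer.threshold: 2.0})[binarizer.threshold]   # extra wins
2.0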
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)[source]¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
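Before moving on to Estimator, a minimal usage sketch of the Transformer contract above, using the Tokenizer transformer documented later on this page; it assumes a sqlContext is available, as in the doctests below. A param map passed to transform() overrides the embedded params for that call only.
>>> from pyspark.ml.feature import Tokenizer
>>> df = sqlContext.createDataFrame([("a b c",)], ["text"])
>>> tokenizer = Tokenizer(inputCol="text", outputCol="words")
>>> tokenizer.transform(df).head().words                                     # embedded params
[u'a', u'b', u'c']
>>> tokenizer.transform(df, {tokenizer.outputCol: "tokens"}).head().tokens   # one-off override
[u'a', u'b', u'c']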
class pyspark.ml.Estimator[source]¶
Abstract class for estimators that fit models to data.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)[source]¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
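A minimal sketch of fitting with a list of param maps, using the IDF estimator documented below; it assumes a sqlContext, as in the doctests on this page.
>>> from pyspark.mllib.linalg import DenseVector
>>> from pyspark.ml.feature import IDF
>>> df = sqlContext.createDataFrame([(DenseVector([1.0, 2.0]),), (DenseVector([0.0, 1.0]),)], ["tf"])
>>> idf = IDF(inputCol="tf", outputCol="idf")
>>> models = idf.fit(df, [{idf.minDocFreq: 1}, {idf.minDocFreq: 2}])
>>> len(models)   # one fitted IDFModel per param map
2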
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
-
class pyspark.ml.Model[source]¶
Abstract class for models that are fitted by estimators.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.Pipeline(*args, **kwargs)[source]¶
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers corresponding to the pipeline stages. If there are no stages, the pipeline acts as an identity transformer. A usage sketch appears after this class's method list.
-
copy
(extra=None)[source]¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
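The usage sketch referenced in the Pipeline class description above: chaining two transformers and one estimator from pyspark.ml.feature. This is an illustrative sketch, assuming a sqlContext as in the doctests on this page; the resulting feature values are not shown.
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import Tokenizer, HashingTF, IDF
>>> df = sqlContext.createDataFrame([("a b c",), ("a a b",)], ["text"])
>>> tokenizer = Tokenizer(inputCol="text", outputCol="words")
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="tf")
>>> idf = IDF(inputCol="tf", outputCol="features")
>>> pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])
>>> model = pipeline.fit(df)               # fits the IDF stage; the transformers pass through unchanged
>>> rows = model.transform(df).collect()   # each stage appends its output column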
-
class pyspark.ml.PipelineModel(stages)[source]¶
Represents a compiled pipeline with transformers and fitted models.
-
copy
(extra=None)[source]¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
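A PipelineModel can also be assembled directly from ready-made transformers — a sketch, assuming Python Transformer instances (here Tokenizer and HashingTF, documented below) can be composed without a fitting step and that a sqlContext is available:
>>> from pyspark.ml import PipelineModel
>>> from pyspark.ml.feature import Tokenizer, HashingTF
>>> df = sqlContext.createDataFrame([("a b c",)], ["text"])
>>> stages = [Tokenizer(inputCol="text", outputCol="words"),
...           HashingTF(numFeatures=10, inputCol="words", outputCol="features")]
>>> PipelineModel(stages).transform(df).head().features
SparseVector(10, {7: 1.0, 8: 1.0, 9: 1.0})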
-
pyspark.ml.param module¶
-
class pyspark.ml.param.Params[source]¶
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
-
copy
(extra=None)[source]¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)[source]¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()[source]¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)[source]¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)[source]¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
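A short introspection sketch using the Binarizer transformer from pyspark.ml.feature (documented below); it assumes a running SparkContext and that Binarizer declares the default threshold of 0.0 shown in its setParams signature:
>>> from pyspark.ml.feature import Binarizer
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> [p.name for p in binarizer.params]            # all params, ordered by name
['inputCol', 'outputCol', 'threshold']
>>> binarizer.isSet(binarizer.threshold)          # explicitly supplied by the user
True
>>> binarizer.hasDefault(binarizer.threshold)     # a default (0.0) also exists
True
>>> binarizer.getOrDefault(binarizer.outputCol)
'features'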
-
pyspark.ml.feature module¶
-
class pyspark.ml.feature.Binarizer(*args, **kwargs)[source]¶
Binarize a column of continuous features given a threshold.
>>> df = sqlContext.createDataFrame([(0.5,)], ["values"])
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> binarizer.transform(df).head().features
0.0
>>> binarizer.setParams(outputCol="freqs").transform(df).head().freqs
0.0
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setParams
(self, threshold=0.0, inputCol=None, outputCol=None)[source]¶ Sets params for this Binarizer.
-
threshold
= Param(parent='undefined', name='threshold', doc='threshold in binary classification prediction, in range [0, 1]')¶
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.Bucketizer(*args, **kwargs)[source]¶
Maps a column of continuous features to a column of feature buckets.
>>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
>>> bucketizer = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
...     inputCol="values", outputCol="buckets")
>>> bucketed = bucketizer.transform(df).collect()
>>> bucketed[0].buckets
0.0
>>> bucketed[1].buckets
0.0
>>> bucketed[2].buckets
1.0
>>> bucketed[3].buckets
2.0
>>> bucketizer.setParams(outputCol="b").transform(df).head().b
0.0
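As described under the splits param below, each bucket is half-open, [x, y). A small sketch of the boundary behaviour; this is an illustration drawn from that description, not an additional doctest from the source:
>>> edge = sqlContext.createDataFrame([(0.5,)], ["values"])
>>> b2 = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
...     inputCol="values", outputCol="buckets")
>>> b2.transform(edge).head().buckets   # 0.5 falls in [0.5, 1.4), i.e. bucket 1
1.0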
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setParams
(self, splits=None, inputCol=None, outputCol=None)[source]¶ Sets params for this Bucketizer.
-
splits
= Param(parent='undefined', name='splits', doc='Split points for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. The splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.')¶
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.ElementwiseProduct(*args, **kwargs)[source]¶
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector. In other words, it scales each column of the dataset by a scalar multiplier.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),)], ["values"])
>>> ep = ElementwiseProduct(scalingVec=Vectors.dense([1.0, 2.0, 3.0]),
...     inputCol="values", outputCol="eprod")
>>> ep.transform(df).head().eprod
DenseVector([2.0, 2.0, 9.0])
>>> ep.setParams(scalingVec=Vectors.dense([2.0, 3.0, 5.0])).transform(df).head().eprod
DenseVector([4.0, 3.0, 15.0])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
scalingVec
= Param(parent='undefined', name='scalingVec', doc='vector for hadamard product, it must be MLlib Vector type.')¶
-
setParams
(self, scalingVec=None, inputCol=None, outputCol=None)[source]¶ Sets params for this ElementwiseProduct.
-
setScalingVec
(value)[source]¶ Sets the value of
scalingVec
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.HashingTF(*args, **kwargs)[source]¶
Maps a sequence of terms to their term frequencies using the hashing trick.
>>> df = sqlContext.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
>>> hashingTF.transform(df).head().features
SparseVector(10, {7: 1.0, 8: 1.0, 9: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {7: 1.0, 8: 1.0, 9: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {2: 1.0, 3: 1.0, 4: 1.0})
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getNumFeatures
()¶ Gets the value of numFeatures or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
numFeatures
= Param(parent='undefined', name='numFeatures', doc='number of features')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setNumFeatures
(value)¶ Sets the value of
numFeatures
.
-
setParams
(self, numFeatures=1 << 18, inputCol=None, outputCol=None)[source]¶ Sets params for this HashingTF.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.IDF(*args, **kwargs)[source]¶
Compute the Inverse Document Frequency (IDF) given a collection of documents.
>>> from pyspark.mllib.linalg import DenseVector
>>> df = sqlContext.createDataFrame([(DenseVector([1.0, 2.0]),),
...     (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
>>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf")
>>> idf.fit(df).transform(df).head().idf
DenseVector([0.0, 0.0])
>>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
DenseVector([0.0, 0.0])
>>> params = {idf.minDocFreq: 1, idf.outputCol: "vector"}
>>> idf.fit(df, params).transform(df).head().vector
DenseVector([0.2877, 0.0])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
minDocFreq
= Param(parent='undefined', name='minDocFreq', doc='minimum of documents in which a term should appear for filtering')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setMinDocFreq
(value)[source]¶ Sets the value of
minDocFreq
.
-
-
class pyspark.ml.feature.IDFModel(java_model)[source]¶
Model fitted by IDF.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.NGram(*args, **kwargs)[source]¶
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words. When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
>>> df = sqlContext.createDataFrame([Row(inputTokens=["a", "b", "c", "d", "e"])])
>>> ngram = NGram(n=2, inputCol="inputTokens", outputCol="nGrams")
>>> ngram.transform(df).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], nGrams=[u'a b', u'b c', u'c d', u'd e'])
>>> # Change n-gram length
>>> ngram.setParams(n=4).transform(df).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], nGrams=[u'a b c d', u'b c d e'])
>>> # Temporarily modify output column.
>>> ngram.transform(df, {ngram.outputCol: "output"}).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], output=[u'a b c d', u'b c d e'])
>>> ngram.transform(df).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], nGrams=[u'a b c d', u'b c d e'])
>>> # Must use keyword arguments to specify params.
>>> ngram.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
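A small sketch of the short-input case described above (fewer tokens than n yields no n-grams); this is an illustrative assumption drawn from that description, not a doctest from the source:
>>> from pyspark.sql import Row
>>> short = sqlContext.createDataFrame([Row(inputTokens=["a", "b"])])
>>> NGram(n=3, inputCol="inputTokens", outputCol="nGrams").transform(short).head().nGrams
[]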
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
n
= Param(parent='undefined', name='n', doc='number of elements per n-gram (>=1)')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.Normalizer(*args, **kwargs)[source]¶
Normalize a vector to have unit norm using the given p-norm.
>>> from pyspark.mllib.linalg import Vectors
>>> svec = Vectors.sparse(4, {1: 4.0, 3: 3.0})
>>> df = sqlContext.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"])
>>> normalizer = Normalizer(p=2.0, inputCol="dense", outputCol="features")
>>> normalizer.transform(df).head().features
DenseVector([0.6, -0.8])
>>> normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).head().freqs
SparseVector(4, {1: 0.8, 3: 0.6})
>>> params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"}
>>> normalizer.transform(df, params).head().vector
DenseVector([0.4286, -0.5714])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
p
= Param(parent='undefined', name='p', doc='the p norm value.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.OneHotEncoder(*args, **kwargs)[source]¶
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
See also StringIndexer for converting categorical values into category indices.
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> encoder = OneHotEncoder(inputCol="indexed", outputCol="features")
>>> encoder.transform(td).head().features
SparseVector(2, {0: 1.0})
>>> encoder.setParams(outputCol="freqs").transform(td).head().freqs
SparseVector(2, {0: 1.0})
>>> params = {encoder.dropLast: False, encoder.outputCol: "test"}
>>> encoder.transform(td, params).head().test
SparseVector(3, {0: 1.0})
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
dropLast
= Param(parent='undefined', name='dropLast', doc='whether to drop the last category')¶
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setParams
(self, dropLast=True, inputCol=None, outputCol=None)[source]¶ Sets params for this OneHotEncoder.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.PolynomialExpansion(*args, **kwargs)[source]¶
Perform feature expansion in a polynomial space. As described on the Wikipedia page on polynomial expansion (http://en.wikipedia.org/wiki/Polynomial_expansion), “In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition”. Take a 2-variable feature vector (x, y) as an example: expanding it with degree 2 gives (x, x * x, y, x * y, y * y).
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(Vectors.dense([0.5, 2.0]),)], ["dense"])
>>> px = PolynomialExpansion(degree=2, inputCol="dense", outputCol="expanded")
>>> px.transform(df).head().expanded
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> px.setParams(outputCol="test").transform(df).head().test
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
degree
= Param(parent='undefined', name='degree', doc='the polynomial degree to expand (>= 1)')¶
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setParams
(self, degree=2, inputCol=None, outputCol=None)[source]¶ Sets params for this PolynomialExpansion.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.RegexTokenizer(*args, **kwargs)[source]¶
A regex-based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (the default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimum length. It returns an array of strings that can be empty.
>>> df = sqlContext.createDataFrame([("a b c",)], ["text"])
>>> reTokenizer = RegexTokenizer(inputCol="text", outputCol="words")
>>> reTokenizer.transform(df).head()
Row(text=u'a b c', words=[u'a', u'b', u'c'])
>>> # Change a parameter.
>>> reTokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text=u'a b c', tokens=[u'a', u'b', u'c'])
>>> # Temporarily modify a parameter.
>>> reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head()
Row(text=u'a b c', words=[u'a', u'b', u'c'])
>>> reTokenizer.transform(df).head()
Row(text=u'a b c', tokens=[u'a', u'b', u'c'])
>>> # Must use keyword arguments to specify params.
>>> reTokenizer.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
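A sketch of the token-matching mode described above (gaps=False makes the pattern match tokens rather than split on gaps); the pattern and data here are illustrative assumptions, not a doctest from the source:
>>> df2 = sqlContext.createDataFrame([("a, b; c",)], ["text"])
>>> matcher = RegexTokenizer(gaps=False, pattern="[a-z]+", inputCol="text", outputCol="words")
>>> matcher.transform(df2).head().words
[u'a', u'b', u'c']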
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
gaps
= Param(parent='undefined', name='gaps', doc='whether regex splits on gaps (True) or matches tokens')¶
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
minTokenLength
= Param(parent='undefined', name='minTokenLength', doc='minimum token length (>= 0)')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
pattern
= Param(parent='undefined', name='pattern', doc='regex pattern (Java dialect) used for tokenizing')¶
-
setMinTokenLength
(value)[source]¶ Sets the value of
minTokenLength
.
-
setParams
(self, minTokenLength=1, gaps=True, pattern="\s+", inputCol=None, outputCol=None)[source]¶ Sets params for this RegexTokenizer.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.StandardScaler(*args, **kwargs)[source]¶
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> standardScaler = StandardScaler(inputCol="a", outputCol="scaled")
>>> model = standardScaler.fit(df)
>>> model.mean
DenseVector([1.0])
>>> model.std
DenseVector([1.4142])
>>> model.transform(df).collect()[1].scaled
DenseVector([1.4142])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setParams
(self, withMean=False, withStd=True, inputCol=None, outputCol=None)[source]¶ Sets params for this StandardScaler.
-
withMean
= Param(parent='undefined', name='withMean', doc='Center data with mean')¶
-
withStd
= Param(parent='undefined', name='withStd', doc='Scale to unit standard deviation')¶
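A sketch of centering with withMean=True, reusing the two-row DataFrame from the doctest above; the expected value follows from (x − mean)/std with mean 1.0 and std 1.4142. This is an illustration rather than a doctest from the source:
>>> centering = StandardScaler(withMean=True, inputCol="a", outputCol="centered")
>>> centering.fit(df).transform(df).collect()[0].centered
DenseVector([-0.7071])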
-
-
class pyspark.ml.feature.StandardScalerModel(java_model)[source]¶
Model fitted by StandardScaler.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
mean
¶ Mean of the StandardScalerModel.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
std
¶ Standard deviation of the StandardScalerModel.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class pyspark.ml.feature.StringIndexer(*args, **kwargs)[source]¶
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
...     key=lambda x: x[0])
[(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
-
class
pyspark.ml.feature.
StringIndexerModel
(java_model)[source]¶ Model fitted by StringIndexer.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
-
class
pyspark.ml.feature.
Tokenizer
(*args, **kwargs)[source]¶ A tokenizer that converts the input string to lowercase and then splits it by white spaces.
>>> df = sqlContext.createDataFrame([("a b c",)], ["text"]) >>> tokenizer = Tokenizer(inputCol="text", outputCol="words") >>> tokenizer.transform(df).head() Row(text=u'a b c', words=[u'a', u'b', u'c']) >>> # Change a parameter. >>> tokenizer.setParams(outputCol="tokens").transform(df).head() Row(text=u'a b c', tokens=[u'a', u'b', u'c']) >>> # Temporarily modify a parameter. >>> tokenizer.transform(df, {tokenizer.outputCol: "words"}).head() Row(text=u'a b c', words=[u'a', u'b', u'c']) >>> tokenizer.transform(df).head() Row(text=u'a b c', tokens=[u'a', u'b', u'c']) >>> # Must use keyword arguments to specify params. >>> tokenizer.setParams("text") Traceback (most recent call last): ... TypeError: Method setParams forces keyword arguments.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
-
class
pyspark.ml.feature.
VectorAssembler
(*args, **kwargs)[source]¶ A feature transformer that merges multiple columns into a vector column.
>>> df = sqlContext.createDataFrame([(1, 0, 3)], ["a", "b", "c"]) >>> vecAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features") >>> vecAssembler.transform(df).head().features DenseVector([1.0, 0.0, 3.0]) >>> vecAssembler.setParams(outputCol="freqs").transform(df).head().freqs DenseVector([1.0, 0.0, 3.0]) >>> params = {vecAssembler.inputCols: ["b", "a"], vecAssembler.outputCol: "vector"} >>> vecAssembler.transform(df, params).head().vector DenseVector([0.0, 1.0])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getInputCols
()¶ Gets the value of inputCols or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCols
= Param(parent='undefined', name='inputCols', doc='input column names')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
-
class
pyspark.ml.feature.
VectorIndexer
(*args, **kwargs)[source]¶ Class for indexing categorical feature columns in a dataset of [[Vector]].
- This has 2 usage modes:
- Automatically identify categorical features (default behavior)
- This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
- Set maxCategories to the maximum number of categorical any categorical feature should have.
- E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical
- If maxCategories is set to be very large, then this will build an index of unique values for all features.
- Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
- E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.
This returns a model which can transform categorical features to use 0-based indices.
- Index stability:
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
- More stability may be added in the future.
- TODO: Future extensions: The following functionality is planned for the future:
- Preserve metadata in transform; if a feature’s metadata is already present, do not recompute.
- Specify certain features to not index, either via a parameter or via existing metadata.
- Add warning if a categorical feature has only 1 category.
- Add option for allowing unknown categories.
>>> from pyspark.mllib.linalg import Vectors >>> df = sqlContext.createDataFrame([(Vectors.dense([-1.0, 0.0]),), ... (Vectors.dense([0.0, 1.0]),), (Vectors.dense([0.0, 2.0]),)], ["a"]) >>> indexer = VectorIndexer(maxCategories=2, inputCol="a", outputCol="indexed") >>> model = indexer.fit(df) >>> model.transform(df).head().indexed DenseVector([1.0, 0.0]) >>> indexer.setParams(outputCol="test").fit(df).transform(df).collect()[1].test DenseVector([0.0, 1.0]) >>> params = {indexer.maxCategories: 3, indexer.outputCol: "vector"} >>> model2 = indexer.fit(df, params) >>> model2.transform(df).head().vector DenseVector([1.0, 0.0])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
maxCategories
= Param(parent='undefined', name='maxCategories', doc='Threshold for the number of values a categorical feature can take (>= 2). If a feature is found to have > maxCategories values, then it is declared continuous.')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
setMaxCategories
(value)[source]¶ Sets the value of
maxCategories
.
-
class
pyspark.ml.feature.
Word2Vec
(*args, **kwargs)[source]¶ Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
>>> sent = ("a b " * 100 + "a c " * 10).split(" ") >>> doc = sqlContext.createDataFrame([(sent,), (sent,)], ["sentence"]) >>> model = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model").fit(doc) >>> model.getVectors().show() +----+--------------------+ |word| vector| +----+--------------------+ | a|[-0.3511952459812...| | b|[0.29077222943305...| | c|[0.02315592765808...| +----+--------------------+ ... >>> model.findSynonyms("a", 2).show() +----+-------------------+ |word| similarity| +----+-------------------+ | b|0.29255685145799626| | c|-0.5414068302988307| +----+-------------------+ ... >>> model.transform(doc).head().model DenseVector([-0.0422, -0.5138, -0.2546, 0.6885, 0.276])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
getSeed
()¶ Gets the value of seed or its default value.
-
getStepSize
()¶ Gets the value of stepSize or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
minCount
= Param(parent='undefined', name='minCount', doc="the minimum number of times a token must appear to be included in the word2vec model's vocabulary")¶
-
numPartitions
= Param(parent='undefined', name='numPartitions', doc='number of partitions for sentences of words')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
seed
= Param(parent='undefined', name='seed', doc='random seed')¶
-
setNumPartitions
(value)[source]¶ Sets the value of
numPartitions
.
-
setParams
(self, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None)[source]¶ Sets params for this Word2Vec.
-
setVectorSize
(value)[source]¶ Sets the value of
vectorSize
.
-
stepSize
= Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization.')¶
-
vectorSize
= Param(parent='undefined', name='vectorSize', doc='the dimension of codes after transforming from words')¶
-
-
class
pyspark.ml.feature.
Word2VecModel
(java_model)[source]¶ Model fitted by Word2Vec.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
findSynonyms
(word, num)[source]¶ Find “num” number of words closest in similarity to “word”. word can be a string or vector representation. Returns a dataframe with two fields word and similarity (which gives the cosine similarity).
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getVectors
()[source]¶ Returns the vector representation of the words as a dataframe with two fields, word and vector.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
-
class
pyspark.ml.feature.
PCA
(*args, **kwargs)[source]¶ PCA trains a model to project vectors to a low-dimensional space using PCA.
>>> from pyspark.mllib.linalg import Vectors >>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),), ... (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), ... (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)] >>> df = sqlContext.createDataFrame(data,["features"]) >>> pca = PCA(k=2, inputCol="features", outputCol="pca_features") >>> model = pca.fit(df) >>> model.transform(df).collect()[0].pca_features DenseVector([1.648..., -4.013...])
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
k
= Param(parent='undefined', name='k', doc='the number of principal components')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
-
class
pyspark.ml.feature.
PCAModel
(java_model)[source]¶ Model fitted by PCA.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
-
class
pyspark.ml.feature.
RFormula
(*args, **kwargs)[source]¶ Note
Experimental
Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘+’, ‘-‘, and ‘.’. Also see the R formula docs: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html
>>> df = sqlContext.createDataFrame([ ... (1.0, 1.0, "a"), ... (0.0, 2.0, "b"), ... (0.0, 0.0, "a") ... ], ["y", "x", "s"]) >>> rf = RFormula(formula="y ~ x + s") >>> rf.fit(df).transform(df).show() +---+---+---+---------+-----+ | y| x| s| features|label| +---+---+---+---------+-----+ |1.0|1.0| a|[1.0,1.0]| 1.0| |0.0|2.0| b|[2.0,0.0]| 0.0| |0.0|0.0| a|[0.0,1.0]| 0.0| +---+---+---+---------+-----+ ... >>> rf.fit(df, {rf.formula: "y ~ . - s"}).transform(df).show() +---+---+---+--------+-----+ | y| x| s|features|label| +---+---+---+--------+-----+ |1.0|1.0| a| [1.0]| 1.0| |0.0|2.0| b| [2.0]| 0.0| |0.0|0.0| a| [0.0]| 0.0| +---+---+---+--------+-----+ ...
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
formula
= Param(parent='undefined', name='formula', doc='R model formula')¶
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
-
class
pyspark.ml.feature.
RFormulaModel
(java_model)[source]¶ Model fitted by
RFormula
.-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
pyspark.ml.classification module¶
-
class
pyspark.ml.classification.
LogisticRegression
(*args, **kwargs)[source]¶ Logistic regression. Currently, this class only supports binary classification.
>>> from pyspark.sql import Row >>> from pyspark.mllib.linalg import Vectors >>> df = sc.parallelize([ ... Row(label=1.0, features=Vectors.dense(1.0)), ... Row(label=0.0, features=Vectors.sparse(1, [], []))]).toDF() >>> lr = LogisticRegression(maxIter=5, regParam=0.01) >>> model = lr.fit(df) >>> model.weights DenseVector([5.5...]) >>> model.intercept -2.68... >>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0))]).toDF() >>> result = model.transform(test0).head() >>> result.prediction 0.0 >>> result.probability DenseVector([0.99..., 0.00...]) >>> result.rawPrediction DenseVector([8.22..., -8.22...]) >>> test1 = sc.parallelize([Row(features=Vectors.sparse(1, [0], [1.0]))]).toDF() >>> model.transform(test1).head().prediction 1.0 >>> lr.setParams("vector") Traceback (most recent call last): ... TypeError: Method setParams forces keyword arguments.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
elasticNetParam
= Param(parent='undefined', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.')¶ param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
fitIntercept
= Param(parent='undefined', name='fitIntercept', doc='whether to fit an intercept term.')¶ param for whether to fit an intercept term.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getProbabilityCol
()¶ Gets the value of probabilityCol or its default value.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
getRegParam
()¶ Gets the value of regParam or its default value.
-
getThresholds
()[source]¶ If
thresholds
is set, return its value. Otherwise, ifthreshold
is set, return the equivalent thresholds for binary classification: (1-threshold, threshold). If neither are set, throw an error.
-
getTol
()¶ Gets the value of tol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
probabilityCol
= Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')¶
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')¶
-
regParam
= Param(parent='undefined', name='regParam', doc='regularization parameter (>= 0)')¶
-
setElasticNetParam
(value)[source]¶ Sets the value of
elasticNetParam
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setFitIntercept
(value)[source]¶ Sets the value of
fitIntercept
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.1, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol="probability", rawPredictionCol="rawPrediction")[source]¶ Sets params for logistic regression. If the threshold and thresholds Params are both set, they must be equivalent.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setProbabilityCol
(value)¶ Sets the value of
probabilityCol
.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
setThreshold
(value)[source]¶ Sets the value of
threshold
. Clears value ofthresholds
if it has been set.
-
setThresholds
(value)[source]¶ Sets the value of
thresholds
. Clears value ofthreshold
if it has been set.
-
threshold
= Param(parent='undefined', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.')¶ param for threshold in binary classification, in range [0, 1].
-
thresholds
= Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")¶ param for thresholds or cutoffs in binary or multiclass classification
-
tol
= Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms')¶
-
-
class
pyspark.ml.classification.
LogisticRegressionModel
(java_model)[source]¶ Model fitted by LogisticRegression.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
intercept
¶ Model intercept.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
weights
¶ Model weights.
-
-
class
pyspark.ml.classification.
DecisionTreeClassifier
(*args, **kwargs)[source]¶ http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors >>> from pyspark.ml.feature import StringIndexer >>> df = sqlContext.createDataFrame([ ... (1.0, Vectors.dense(1.0)), ... (0.0, Vectors.sparse(1, [], []))], ["label", "features"]) >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") >>> si_model = stringIndexer.fit(df) >>> td = si_model.transform(df) >>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed") >>> model = dt.fit(td) >>> model.numNodes 3 >>> model.depth 1 >>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"]) >>> result = model.transform(test0).head() >>> result.prediction 0.0 >>> result.probability DenseVector([1.0, 0.0]) >>> result.rawPrediction DenseVector([1.0, 0.0]) >>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"]) >>> model.transform(test1).head().prediction 1.0
-
cacheNodeIds
= Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
getCacheNodeIds
()¶ Gets the value of cacheNodeIds or its default value.
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxBins
()¶ Gets the value of maxBins or its default value.
-
getMaxDepth
()¶ Gets the value of maxDepth or its default value.
-
getMaxMemoryInMB
()¶ Gets the value of maxMemoryInMB or its default value.
-
getMinInfoGain
()¶ Gets the value of minInfoGain or its default value.
-
getMinInstancesPerNode
()¶ Gets the value of minInstancesPerNode or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getProbabilityCol
()¶ Gets the value of probabilityCol or its default value.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
impurity
= Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini')¶ param for Criterion used for information gain calculation (case-insensitive).
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
maxBins
= Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')¶
-
maxDepth
= Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')¶
-
maxMemoryInMB
= Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.')¶
-
minInfoGain
= Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')¶
-
minInstancesPerNode
= Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
probabilityCol
= Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')¶
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')¶
-
setCacheNodeIds
(value)¶ Sets the value of
cacheNodeIds
.
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setMaxMemoryInMB
(value)¶ Sets the value of
maxMemoryInMB
.
-
setMinInfoGain
(value)¶ Sets the value of
minInfoGain
.
-
setMinInstancesPerNode
(value)¶ Sets the value of
minInstancesPerNode
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini")[source]¶ Sets params for the DecisionTreeClassifier.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setProbabilityCol
(value)¶ Sets the value of
probabilityCol
.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
-
class
pyspark.ml.classification.
DecisionTreeClassificationModel
(java_model)[source]¶ Model fitted by DecisionTreeClassifier.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
depth
¶ Return depth of the decision tree.
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
numNodes
¶ Return number of nodes of the decision tree.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
-
class
pyspark.ml.classification.
GBTClassifier
(*args, **kwargs)[source]¶ http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features. Note: Multiclass labels are not currently supported.
>>> from numpy import allclose >>> from pyspark.mllib.linalg import Vectors >>> from pyspark.ml.feature import StringIndexer >>> df = sqlContext.createDataFrame([ ... (1.0, Vectors.dense(1.0)), ... (0.0, Vectors.sparse(1, [], []))], ["label", "features"]) >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") >>> si_model = stringIndexer.fit(df) >>> td = si_model.transform(df) >>> gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed") >>> model = gbt.fit(td) >>> allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1]) True >>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"]) >>> model.transform(test0).head().prediction 0.0 >>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"]) >>> model.transform(test1).head().prediction 1.0
-
cacheNodeIds
= Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
- dataset – input dataset, which is an instance of
-
getCacheNodeIds
()¶ Gets the value of cacheNodeIds or its default value.
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxBins
()¶ Gets the value of maxBins or its default value.
-
getMaxDepth
()¶ Gets the value of maxDepth or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getMaxMemoryInMB
()¶ Gets the value of maxMemoryInMB or its default value.
-
getMinInfoGain
()¶ Gets the value of minInfoGain or its default value.
-
getMinInstancesPerNode
()¶ Gets the value of minInstancesPerNode or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
lossType
= Param(parent='undefined', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic')¶ param for Loss function which GBT tries to minimize (case-insensitive).
-
maxBins
= Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')¶
-
maxDepth
= Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
maxMemoryInMB
= Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.')¶
-
minInfoGain
= Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')¶
-
minInstancesPerNode
= Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
setCacheNodeIds
(value)¶ Sets the value of
cacheNodeIds
.
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setMaxMemoryInMB
(value)¶ Sets the value of
maxMemoryInMB
.
-
setMinInfoGain
(value)¶ Sets the value of
minInfoGain
.
-
setMinInstancesPerNode
(value)¶ Sets the value of
minInstancesPerNode
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic", maxIter=20, stepSize=0.1)[source]¶ Sets params for Gradient Boosted Tree Classification.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setSubsamplingRate
(value)[source]¶ Sets the value of
subsamplingRate
.
-
stepSize
= Param(parent='undefined', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator')¶ Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.
-
subsamplingRate
= Param(parent='undefined', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].')¶ Fraction of the training data used for learning each decision tree, in range (0, 1].
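As an illustration of how the GBT classification params above fit together (this sketch is not part of the original reference; the toy data, column names, and chosen values are assumptions, and an active sqlContext is assumed as in the doctests elsewhere in this page):

from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import GBTClassifier

# Toy binary-label DataFrame; assumes an active sqlContext.
df = sqlContext.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))], ["label", "features"])

# Illustrative values for the boosting params documented above.
gbt = GBTClassifier(maxIter=10, maxDepth=3, lossType="logistic", stepSize=0.1)
model = gbt.fit(df)
predictions = model.transform(df).select("features", "prediction")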
-
-
class
pyspark.ml.classification.
GBTClassificationModel
(java_model)[source]¶ Model fitted by GBTClassifier.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
- dataset – input dataset, which is an instance of
-
treeWeights
¶ Return the weights for each tree
-
-
class
pyspark.ml.classification.
RandomForestClassifier
(*args, **kwargs)[source]¶ Random Forest learning algorithm for classification (http://en.wikipedia.org/wiki/Random_forest). It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
>>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
True
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
0.0
>>> numpy.argmax(result.probability)
0
>>> numpy.argmax(result.rawPrediction)
0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
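As a brief, non-authoritative illustration of tweaking the forest-specific params listed below (the values are assumptions, and td is the indexed DataFrame from the example above):

from pyspark.ml.classification import RandomForestClassifier

# Illustrative settings only.
rf = RandomForestClassifier(numTrees=10, maxDepth=4, impurity="entropy",
                            featureSubsetStrategy="sqrt",
                            labelCol="indexed", seed=42)
rf = rf.setSubsamplingRate(0.8)  # subsamplingRate has a dedicated setter (see below)
model = rf.fit(td)               # td: the indexed DataFrame built above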
-
cacheNodeIds
= Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featureSubsetStrategy
= Param(parent='undefined', name='featureSubsetStrategy', doc='The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2')¶ param for The number of features to consider for splits at each tree node
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getCacheNodeIds
()¶ Gets the value of cacheNodeIds or its default value.
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxBins
()¶ Gets the value of maxBins or its default value.
-
getMaxDepth
()¶ Gets the value of maxDepth or its default value.
-
getMaxMemoryInMB
()¶ Gets the value of maxMemoryInMB or its default value.
-
getMinInfoGain
()¶ Gets the value of minInfoGain or its default value.
-
getMinInstancesPerNode
()¶ Gets the value of minInstancesPerNode or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getProbabilityCol
()¶ Gets the value of probabilityCol or its default value.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
getSeed
()¶ Gets the value of seed or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
impurity
= Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini')¶ param for Criterion used for information gain calculation (case-insensitive).
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
maxBins
= Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')¶
-
maxDepth
= Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')¶
-
maxMemoryInMB
= Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.')¶
-
minInfoGain
= Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')¶
-
minInstancesPerNode
= Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')¶
-
numTrees
= Param(parent='undefined', name='numTrees', doc='Number of trees to train (>= 1)')¶ param for Number of trees to train (>= 1)
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
probabilityCol
= Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')¶
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')¶
-
seed
= Param(parent='undefined', name='seed', doc='random seed')¶
-
setCacheNodeIds
(value)¶ Sets the value of
cacheNodeIds
.
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setFeatureSubsetStrategy
(value)[source]¶ Sets the value of
featureSubsetStrategy
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setMaxMemoryInMB
(value)¶ Sets the value of
maxMemoryInMB
.
-
setMinInfoGain
(value)¶ Sets the value of
minInfoGain
.
-
setMinInstancesPerNode
(value)¶ Sets the value of
minInstancesPerNode
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, seed=None, impurity="gini", numTrees=20, featureSubsetStrategy="auto")[source]¶ Sets params for linear classification.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setProbabilityCol
(value)¶ Sets the value of
probabilityCol
.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
setSubsamplingRate
(value)[source]¶ Sets the value of
subsamplingRate
.
-
subsamplingRate
= Param(parent='undefined', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].')¶ param for Fraction of the training data used for learning each decision tree, in range (0, 1].
-
-
class
pyspark.ml.classification.
RandomForestClassificationModel
(java_model)[source]¶ Model fitted by RandomForestClassifier.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
treeWeights
¶ Return the weights for each tree
-
-
class
pyspark.ml.classification.
NaiveBayes
(*args, **kwargs)[source]¶ Naive Bayes Classifiers. It supports both Multinomial and Bernoulli NB. Multinomial NB (http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html) can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB (http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html). The input feature values must be nonnegative.
>>> from pyspark.sql import Row
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
...     Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
...     Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
>>> nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
>>> model = nb.fit(df)
>>> model.pi
DenseVector([-0.51..., -0.91...])
>>> model.theta
DenseMatrix(2, 2, [-1.09..., -0.40..., -0.40..., -1.09...], 1)
>>> test0 = sc.parallelize([Row(features=Vectors.dense([1.0, 0.0]))]).toDF()
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.42..., 0.57...])
>>> result.rawPrediction
DenseVector([-1.60..., -1.32...])
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
>>> model.transform(test1).head().prediction
1.0
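The class description above also mentions the Bernoulli variant; a minimal sketch (assuming binary 0/1 feature vectors and an active sqlContext, with illustrative values only) might be:

from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import NaiveBayes

# Binary (0/1) features, as required for the Bernoulli model.
bin_df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(bin_df)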
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getProbabilityCol
()¶ Gets the value of probabilityCol or its default value.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
modelType
= Param(parent='undefined', name='modelType', doc='The model type which is a string (case-sensitive). Supported options: multinomial (default) and bernoulli.')¶ param for the model type.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
probabilityCol
= Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')¶
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')¶
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", smoothing=1.0, modelType="multinomial")[source]¶ Sets params for Naive Bayes.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setProbabilityCol
(value)¶ Sets the value of
probabilityCol
.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
smoothing
= Param(parent='undefined', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0')¶ param for the smoothing parameter.
-
-
class
pyspark.ml.classification.
NaiveBayesModel
(java_model)[source]¶ Model fitted by NaiveBayes.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
pi
¶ log of class priors.
-
theta
¶ log of class conditional probabilities.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
pyspark.ml.clustering module¶
-
class
pyspark.ml.clustering.
KMeans
(*args, **kwargs)[source]¶ K-means clustering with support for multiple parallel runs and a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al). When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency.
>>> from pyspark.mllib.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
...         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
>>> df = sqlContext.createDataFrame(data, ["features"])
>>> kmeans = KMeans(k=2, seed=1)
>>> model = kmeans.fit(df)
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[0].prediction == rows[1].prediction
True
>>> rows[2].prediction == rows[3].prediction
True
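As a small sketch of the initialization params documented below (the values are illustrative, not recommendations; df is the features DataFrame from the example above):

from pyspark.ml.clustering import KMeans

# Keep the default k-means|| initialization but give it more init steps
# and a larger iteration budget.
kmeans = KMeans(k=2, seed=1, maxIter=50)
kmeans = kmeans.setInitMode("k-means||").setInitSteps(10)
model = kmeans.fit(df)
centers = model.clusterCenters()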
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getSeed
()¶ Gets the value of seed or its default value.
-
getTol
()¶ Gets the value of tol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
initMode
= Param(parent='undefined', name='initMode', doc='the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++')¶
-
initSteps
= Param(parent='undefined', name='initSteps', doc='steps for k-means initialization mode')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
k
= Param(parent='undefined', name='k', doc='number of clusters to create')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
seed
= Param(parent='undefined', name='seed', doc='random seed')¶
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setInitMode
(value)[source]¶ Sets the value of
initMode
.
>>> algo = KMeans()
>>> algo.getInitMode()
'k-means||'
>>> algo = algo.setInitMode("random")
>>> algo.getInitMode()
'random'
-
setInitSteps
(value)[source]¶ Sets the value of
initSteps
.
>>> algo = KMeans().setInitSteps(10)
>>> algo.getInitSteps()
10
-
setParams
(self, featuresCol="features", predictionCol="prediction", k=2, initMode="k-means||", initSteps=5, tol=1e-4, maxIter=20, seed=None)[source]¶ Sets params for KMeans.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
tol
= Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms')¶
-
-
class
pyspark.ml.clustering.
KMeansModel
(java_model)[source]¶ Model fitted by KMeans.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
pyspark.ml.recommendation module¶
-
class
pyspark.ml.recommendation.
ALS
(*args, **kwargs)[source]¶ Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called ‘factor’ matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as “users” and “products”) into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user’s feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the “out-links” of each user (which blocks of products it will contribute to) and “in-link” information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users’ ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on “Collaborative Filtering for Implicit Feedback Datasets”, available at http://dx.doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r > 0 and 0 if r <= 0. The ratings then act as ‘confidence’ values related to strength of indicated user preferences rather than explicit ratings given to items.
>>> df = sqlContext.createDataFrame(
...     [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
...     ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)
>>> model.rank
10
>>> model.userFactors.orderBy("id").collect()
[Row(id=0, features=[...]), Row(id=1, ...), Row(id=2, ...)]
>>> test = sqlContext.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
>>> predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
>>> predictions[0]
Row(user=0, item=2, prediction=0.39...)
>>> predictions[1]
Row(user=1, item=0, prediction=3.19...)
>>> predictions[2]
Row(user=2, item=0, prediction=-1.15...)
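For the implicit-preference formulation described above, a minimal sketch (interaction counts standing in for explicit ratings; all data and parameter values are assumptions) could look like:

from pyspark.ml.recommendation import ALS

# Interaction counts rather than explicit ratings; assumes an active sqlContext.
clicks = sqlContext.createDataFrame(
    [(0, 0, 3.0), (0, 1, 1.0), (1, 1, 4.0), (2, 2, 2.0)],
    ["user", "item", "rating"])
als = ALS(rank=5, maxIter=5, implicitPrefs=True, alpha=40.0)
model = als.fit(clicks)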
-
alpha
= Param(parent='undefined', name='alpha', doc='alpha for implicit preference')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getRegParam
()¶ Gets the value of regParam or its default value.
-
getSeed
()¶ Gets the value of seed or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
implicitPrefs
= Param(parent='undefined', name='implicitPrefs', doc='whether to use implicit preference')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
itemCol
= Param(parent='undefined', name='itemCol', doc='column name for item ids')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
nonnegative
= Param(parent='undefined', name='nonnegative', doc='whether to use nonnegative constraint for least squares')¶
-
numItemBlocks
= Param(parent='undefined', name='numItemBlocks', doc='number of item blocks')¶
-
numUserBlocks
= Param(parent='undefined', name='numUserBlocks', doc='number of user blocks')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
rank
= Param(parent='undefined', name='rank', doc='rank of the factorization')¶
-
ratingCol
= Param(parent='undefined', name='ratingCol', doc='column name for ratings')¶
-
regParam
= Param(parent='undefined', name='regParam', doc='regularization parameter (>= 0)')¶
-
seed
= Param(parent='undefined', name='seed', doc='random seed')¶
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setImplicitPrefs
(value)[source]¶ Sets the value of
implicitPrefs
.
-
setNonnegative
(value)[source]¶ Sets the value of
nonnegative
.
-
setNumBlocks
(value)[source]¶ Sets both
numUserBlocks
andnumItemBlocks
to the specific value.
-
setNumItemBlocks
(value)[source]¶ Sets the value of
numItemBlocks
.
-
setNumUserBlocks
(value)[source]¶ Sets the value of
numUserBlocks
.
-
setParams
(self, rank=10, maxIter=10, regParam=0.1, numUserBlocks=10, numItemBlocks=10, implicitPrefs=False, alpha=1.0, userCol="user", itemCol="item", seed=None, ratingCol="rating", nonnegative=False, checkpointInterval=10)[source]¶ Sets params for ALS.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
userCol
= Param(parent='undefined', name='userCol', doc='column name for user ids')¶
-
-
class
pyspark.ml.recommendation.
ALSModel
(java_model)[source]¶ Model fitted by ALS.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
itemFactors
¶ a DataFrame that stores item factors in two columns: id and features
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
rank
¶ rank of the matrix factorization model
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
userFactors
¶ a DataFrame that stores user factors in two columns: id and features
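A small illustration (assuming the model fitted in the ALS example above) of reading the factor DataFrames:

# Each row pairs an id with its latent feature vector of length model.rank.
model.userFactors.orderBy("id").show()
item_vectors = model.itemFactors.collect()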
-
pyspark.ml.regression module¶
-
class
pyspark.ml.regression.
DecisionTreeRegressor
(*args, **kwargs)[source]¶ Decision tree learning algorithm for regression (http://en.wikipedia.org/wiki/Decision_tree_learning). It supports both continuous and categorical features.
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> dt = DecisionTreeRegressor(maxDepth=2)
>>> model = dt.fit(df)
>>> model.depth
1
>>> model.numNodes
3
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
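As a short sketch (values are illustrative only) of constraining the tree via the params documented below:

from pyspark.ml.regression import DecisionTreeRegressor

# Shallower, more regularized tree than the defaults.
dt = DecisionTreeRegressor(maxDepth=4, maxBins=16,
                           minInstancesPerNode=2, minInfoGain=0.01)
model = dt.fit(df)  # df: the label/features DataFrame from the example above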
-
cacheNodeIds
= Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getCacheNodeIds
()¶ Gets the value of cacheNodeIds or its default value.
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxBins
()¶ Gets the value of maxBins or its default value.
-
getMaxDepth
()¶ Gets the value of maxDepth or its default value.
-
getMaxMemoryInMB
()¶ Gets the value of maxMemoryInMB or its default value.
-
getMinInfoGain
()¶ Gets the value of minInfoGain or its default value.
-
getMinInstancesPerNode
()¶ Gets the value of minInstancesPerNode or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
impurity
= Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance')¶ param for Criterion used for information gain calculation (case-insensitive).
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
maxBins
= Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')¶
-
maxDepth
= Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')¶
-
maxMemoryInMB
= Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.')¶
-
minInfoGain
= Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')¶
-
minInstancesPerNode
= Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
setCacheNodeIds
(value)¶ Sets the value of
cacheNodeIds
.
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setMaxMemoryInMB
(value)¶ Sets the value of
maxMemoryInMB
.
-
setMinInfoGain
(value)¶ Sets the value of
minInfoGain
.
-
setMinInstancesPerNode
(value)¶ Sets the value of
minInstancesPerNode
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="variance")[source]¶ Sets params for the DecisionTreeRegressor.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
-
class
pyspark.ml.regression.
DecisionTreeRegressionModel
(java_model)[source]¶ Model fitted by DecisionTreeRegressor.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
depth
¶ Return depth of the decision tree.
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
numNodes
¶ Return number of nodes of the decision tree.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
-
class
pyspark.ml.regression.
GBTRegressor
(*args, **kwargs)[source]¶ Gradient-Boosted Trees (GBTs) learning algorithm for regression (http://en.wikipedia.org/wiki/Gradient_boosting). It supports both continuous and categorical features.
>>> from numpy import allclose
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> gbt = GBTRegressor(maxIter=5, maxDepth=2)
>>> model = gbt.fit(df)
>>> allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1])
True
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
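As a hedged sketch of switching the loss documented below (parameter values are assumptions, not recommendations):

from pyspark.ml.regression import GBTRegressor

# Absolute-error loss with a smaller learning rate.
gbt = GBTRegressor(maxIter=10, maxDepth=2, lossType="absolute", stepSize=0.05)
model = gbt.fit(df)  # df: the label/features DataFrame from the example above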
-
cacheNodeIds
= Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getCacheNodeIds
()¶ Gets the value of cacheNodeIds or its default value.
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxBins
()¶ Gets the value of maxBins or its default value.
-
getMaxDepth
()¶ Gets the value of maxDepth or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getMaxMemoryInMB
()¶ Gets the value of maxMemoryInMB or its default value.
-
getMinInfoGain
()¶ Gets the value of minInfoGain or its default value.
-
getMinInstancesPerNode
()¶ Gets the value of minInstancesPerNode or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
lossType
= Param(parent='undefined', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: squared, absolute')¶ param for Loss function which GBT tries to minimize (case-insensitive).
-
maxBins
= Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')¶
-
maxDepth
= Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
maxMemoryInMB
= Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.')¶
-
minInfoGain
= Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')¶
-
minInstancesPerNode
= Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
setCacheNodeIds
(value)¶ Sets the value of
cacheNodeIds
.
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setMaxMemoryInMB
(value)¶ Sets the value of
maxMemoryInMB
.
-
setMinInfoGain
(value)¶ Sets the value of
minInfoGain
.
-
setMinInstancesPerNode
(value)¶ Sets the value of
minInstancesPerNode
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="squared", maxIter=20, stepSize=0.1)[source]¶ Sets params for Gradient Boosted Tree Regression.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setSubsamplingRate
(value)[source]¶ Sets the value of
subsamplingRate
.
-
stepSize
= Param(parent='undefined', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator')¶ Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.
-
subsamplingRate
= Param(parent='undefined', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].')¶ Fraction of the training data used for learning each decision tree, in range (0, 1].
-
-
class
pyspark.ml.regression.
GBTRegressionModel
(java_model)[source]¶ Model fitted by GBTRegressor.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
treeWeights
¶ Return the weights for each tree
-
-
class
pyspark.ml.regression.
LinearRegression
(*args, **kwargs)[source]¶ Linear regression.
The learning objective is to minimize the squared error, with regularization. The specific squared error loss function used is: L = 1/(2n) * ||A * weights - y||^2
- This supports multiple types of regularization (a configuration sketch follows the example below):
- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0)
>>> model = lr.fit(df)
>>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
-1.0
>>> model.weights
DenseVector([1.0])
>>> model.intercept
0.0
>>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> lr.setParams("vector")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
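A minimal sketch of the regularization variants listed above (parameter values are assumptions): regParam > 0 with elasticNetParam=0.0 gives ridge (L2), 1.0 gives the lasso (L1), and intermediate values mix the two (elastic net).

from pyspark.ml.regression import LinearRegression

# Elastic-net mix of L1 and L2 regularization.
lr = LinearRegression(maxIter=50, regParam=0.3, elasticNetParam=0.5)
model = lr.fit(df)  # df: the label/features DataFrame from the example above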
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
elasticNetParam
= Param(parent='undefined', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.')¶ param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getRegParam
()¶ Gets the value of regParam or its default value.
-
getTol
()¶ Gets the value of tol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0)')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
regParam
= Param(parent='undefined', name='regParam', doc='regularization parameter (>= 0)')¶
-
setElasticNetParam
(value)[source]¶ Sets the value of
elasticNetParam
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6)[source]¶ Sets params for linear regression.
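As the class doctest shows, positional arguments raise a TypeError; setParams is keyword-only. A short sketch of the intended usage, with illustrative values.
>>> lr = LinearRegression()
>>> lr.setParams(maxIter=50, regParam=0.01, labelCol="label")  # reconfigure the existing estimator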
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
tol
= Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms')¶
-
class
pyspark.ml.regression.
LinearRegressionModel
(java_model)[source]¶ Model fitted by LinearRegression.
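A minimal sketch of inspecting and applying a fitted model, reusing the sqlContext and df from the LinearRegression doctest above.
>>> model = LinearRegression(maxIter=5, regParam=0.0).fit(df)
>>> model.weights     # per-feature coefficients, as a DenseVector
>>> model.intercept   # scalar intercept
>>> model.transform(df).select("features", "prediction").show()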
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
intercept
¶ Model intercept.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
weights
¶ Model weights.
-
-
class
pyspark.ml.regression.
RandomForestRegressor
(*args, **kwargs)[source]¶ Random Forest learning algorithm for regression (see http://en.wikipedia.org/wiki/Random_forest). It supports both continuous and categorical features.
>>> from numpy import allclose >>> from pyspark.mllib.linalg import Vectors >>> df = sqlContext.createDataFrame([ ... (1.0, Vectors.dense(1.0)), ... (0.0, Vectors.sparse(1, [], []))], ["label", "features"]) >>> rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42) >>> model = rf.fit(df) >>> allclose(model.treeWeights, [1.0, 1.0]) True >>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"]) >>> model.transform(test0).head().prediction 0.0 >>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"]) >>> model.transform(test1).head().prediction 0.5
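The ensemble is shaped mainly by numTrees, maxDepth, featureSubsetStrategy and subsamplingRate. A hedged sketch with illustrative values, reusing the sqlContext and df from the doctest above.
>>> rf = RandomForestRegressor(numTrees=50, maxDepth=4, featureSubsetStrategy="onethird", seed=42)
>>> rf.setSubsamplingRate(0.8)  # use 80% of the training data for each tree
>>> model = rf.fit(df)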
-
cacheNodeIds
= Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees.')¶
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='checkpoint interval (>= 1)')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
featureSubsetStrategy
= Param(parent='undefined', name='featureSubsetStrategy', doc='The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2')¶ param for The number of features to consider for splits at each tree node
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getCacheNodeIds
()¶ Gets the value of cacheNodeIds or its default value.
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxBins
()¶ Gets the value of maxBins or its default value.
-
getMaxDepth
()¶ Gets the value of maxDepth or its default value.
-
getMaxMemoryInMB
()¶ Gets the value of maxMemoryInMB or its default value.
-
getMinInfoGain
()¶ Gets the value of minInfoGain or its default value.
-
getMinInstancesPerNode
()¶ Gets the value of minInstancesPerNode or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getSeed
()¶ Gets the value of seed or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
impurity
= Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance')¶ param for Criterion used for information gain calculation (case-insensitive).
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
maxBins
= Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')¶
-
maxDepth
= Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')¶
-
maxMemoryInMB
= Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation.')¶
-
minInfoGain
= Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')¶
-
minInstancesPerNode
= Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')¶
-
numTrees
= Param(parent='undefined', name='numTrees', doc='Number of trees to train (>= 1)')¶ param for Number of trees to train (>= 1)
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
seed
= Param(parent='undefined', name='seed', doc='random seed')¶
-
setCacheNodeIds
(value)¶ Sets the value of
cacheNodeIds
.
-
setCheckpointInterval
(value)¶ Sets the value of
checkpointInterval
.
-
setFeatureSubsetStrategy
(value)[source]¶ Sets the value of
featureSubsetStrategy
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setMaxMemoryInMB
(value)¶ Sets the value of
maxMemoryInMB
.
-
setMinInfoGain
(value)¶ Sets the value of
minInfoGain
.
-
setMinInstancesPerNode
(value)¶ Sets the value of
minInstancesPerNode
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, seed=None, impurity="variance", numTrees=20, featureSubsetStrategy="auto")[source]¶ Sets params for random forest regression.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setSubsamplingRate
(value)[source]¶ Sets the value of
subsamplingRate
.
-
subsamplingRate
= Param(parent='undefined', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].')¶ param for Fraction of the training data used for learning each decision tree, in range (0, 1].
-
-
class
pyspark.ml.regression.
RandomForestRegressionModel
(java_model)[source]¶ Model fitted by RandomForestRegressor.
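A minimal sketch of inspecting a fitted forest, reusing the rf estimator and df from the RandomForestRegressor doctest above.
>>> model = rf.fit(df)
>>> model.treeWeights  # one weight per tree; a random forest weights all trees equally
>>> model.transform(df).select("features", "prediction").show()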
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java model with extra params. So both the Python wrapper and the Java model get copied. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
treeWeights
¶ Return the weights for each tree.
-
pyspark.ml.tuning module¶
-
class
pyspark.ml.tuning.
ParamGridBuilder
[source]¶ Builder for a param grid used in grid search-based model selection.
>>> from pyspark.ml.classification import LogisticRegression >>> lr = LogisticRegression() >>> output = ParamGridBuilder() \ ... .baseOn({lr.labelCol: 'l'}) \ ... .baseOn([lr.predictionCol, 'p']) \ ... .addGrid(lr.regParam, [1.0, 2.0]) \ ... .addGrid(lr.maxIter, [1, 5]) \ ... .build() >>> expected = [ ... {lr.regParam: 1.0, lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, ... {lr.regParam: 2.0, lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, ... {lr.regParam: 1.0, lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, ... {lr.regParam: 2.0, lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}] >>> len(output) == len(expected) True >>> all([m in expected for m in output]) True
-
class
pyspark.ml.tuning.
CrossValidator
(*args, **kwargs)[source]¶ K-fold cross validation.
>>> from pyspark.ml.classification import LogisticRegression >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator >>> from pyspark.mllib.linalg import Vectors >>> dataset = sqlContext.createDataFrame( ... [(Vectors.dense([0.0]), 0.0), ... (Vectors.dense([0.4]), 1.0), ... (Vectors.dense([0.5]), 0.0), ... (Vectors.dense([0.6]), 1.0), ... (Vectors.dense([1.0]), 1.0)] * 10, ... ["features", "label"]) >>> lr = LogisticRegression() >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() >>> evaluator = BinaryClassificationEvaluator() >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) >>> cvModel = cv.fit(dataset) >>> evaluator.evaluate(cvModel.transform(dataset)) 0.8333...
-
estimator
= Param(parent='undefined', name='estimator', doc='estimator to be cross-validated')¶ param for estimator to be cross-validated
-
estimatorParamMaps
= Param(parent='undefined', name='estimatorParamMaps', doc='estimator param maps')¶ param for estimator param maps
-
evaluator
= Param(parent='undefined', name='evaluator', doc='evaluator used to select hyper-parameters that maximize the cross-validated metric')¶ param for the evaluator used to select hyper-parameters that maximize the cross-validated metric
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
numFolds
= Param(parent='undefined', name='numFolds', doc='number of folds for cross validation')¶ param for number of folds for cross validation
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
setEstimatorParamMaps
(value)[source]¶ Sets the value of
estimatorParamMaps
.
-
-
class
pyspark.ml.tuning.
CrossValidatorModel
(bestModel)[source]¶ Model from k-fold cross validation.
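A short sketch of retrieving and reusing the best model found by cross validation, reusing the cv and dataset objects from the CrossValidator doctest above. Note that transform on the CrossValidatorModel simply delegates to bestModel.
>>> cvModel = cv.fit(dataset)
>>> best = cvModel.bestModel  # the estimator fitted with the best-performing param map
>>> cvModel.transform(dataset).select("prediction").show()
>>> best.transform(dataset).select("prediction").show()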
-
bestModel
= None¶ best model from cross validation
-
copy
(extra=None)[source]¶ Creates a copy of this instance with a randomly generated uid and some extra params. This copies the underlying bestModel, creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
-
pyspark.ml.evaluation module¶
-
class
pyspark.ml.evaluation.
Evaluator
[source]¶ Base class for evaluators that compute metrics from predictions.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
evaluate
(dataset, params=None)[source]¶ Evaluates the output with optional parameters.
Parameters: - dataset – a dataset that contains labels/observations and predictions
- params – an optional param map that overrides embedded params
Returns: metric
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isLargerBetter
()[source]¶ Indicates whether the metric returned by
evaluate()
should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
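When comparing metric values by hand (for example, across several evaluate calls), isLargerBetter tells you which direction is better. A small hedged sketch; better_metric is a hypothetical helper, not part of the API.
>>> def better_metric(evaluator, a, b):
...     # Return whichever of the two metric values the evaluator prefers.
...     return max(a, b) if evaluator.isLargerBetter() else min(a, b)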
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
-
class
pyspark.ml.evaluation.
BinaryClassificationEvaluator
(*args, **kwargs)[source]¶ Evaluator for binary classification, which expects two input columns: rawPrediction and label.
>>> from pyspark.mllib.linalg import Vectors >>> scoreAndLabels = map(lambda x: (Vectors.dense([1.0 - x[0], x[0]]), x[1]), ... [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]) >>> dataset = sqlContext.createDataFrame(scoreAndLabels, ["raw", "label"]) ... >>> evaluator = BinaryClassificationEvaluator(rawPredictionCol="raw") >>> evaluator.evaluate(dataset) 0.70... >>> evaluator.evaluate(dataset, {evaluator.metricName: "areaUnderPR"}) 0.83...
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
evaluate
(dataset, params=None)¶ Evaluates the output with optional parameters.
Parameters: - dataset – a dataset that contains labels/observations and predictions
- params – an optional param map that overrides embedded params
Returns: metric
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isLargerBetter
()¶ Indicates whether the metric returned by
evaluate()
should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
metricName
= Param(parent='undefined', name='metricName', doc='metric name in evaluation (areaUnderROC|areaUnderPR)')¶ param for metric name in evaluation (areaUnderROC|areaUnderPR)
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')¶
-
setMetricName
(value)[source]¶ Sets the value of
metricName
.
-
setParams
(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")[source]¶ Sets params for binary classification evaluator.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
-
class
pyspark.ml.evaluation.
RegressionEvaluator
(*args, **kwargs)[source]¶ Evaluator for Regression, which expects two input columns: prediction and label.
>>> scoreAndLabels = [(-28.98343821, -27.0), (20.21491975, 21.5), ... (-25.98418959, -22.0), (30.69731842, 33.0), (74.69283752, 71.0)] >>> dataset = sqlContext.createDataFrame(scoreAndLabels, ["raw", "label"]) ... >>> evaluator = RegressionEvaluator(predictionCol="raw") >>> evaluator.evaluate(dataset) 2.842... >>> evaluator.evaluate(dataset, {evaluator.metricName: "r2"}) 0.993... >>> evaluator.evaluate(dataset, {evaluator.metricName: "mae"}) 2.649...
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
evaluate
(dataset, params=None)¶ Evaluates the output with optional parameters.
Parameters: - dataset – a dataset that contains labels/observations and predictions
- params – an optional param map that overrides embedded params
Returns: metric
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isLargerBetter
()¶ Indicates whether the metric returned by
evaluate()
should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
metricName
= Param(parent='undefined', name='metricName', doc='metric name in evaluation (mse|rmse|r2|mae)')¶ param for metric name in evaluation (mse|rmse|r2|mae)
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
setMetricName
(value)[source]¶ Sets the value of
metricName
.
-
setParams
(self, predictionCol="prediction", labelCol="label", metricName="rmse")[source]¶ Sets params for regression evaluator.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
-
class
pyspark.ml.evaluation.
MulticlassClassificationEvaluator
(*args, **kwargs)[source]¶ Evaluator for Multiclass Classification, which expects two input columns: prediction and label.
>>> scoreAndLabels = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0), ... (1.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)] >>> dataset = sqlContext.createDataFrame(scoreAndLabels, ["prediction", "label"]) ... >>> evaluator = MulticlassClassificationEvaluator(predictionCol="prediction") >>> evaluator.evaluate(dataset) 0.66... >>> evaluator.evaluate(dataset, {evaluator.metricName: "precision"}) 0.66... >>> evaluator.evaluate(dataset, {evaluator.metricName: "recall"}) 0.66...
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using
copy.copy()
, and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient. :param extra: Extra parameters to copy to the new instance :return: Copy of this instance
-
evaluate
(dataset, params=None)¶ Evaluates the output with optional parameters.
Parameters: - dataset – a dataset that contains labels/observations and predictions
- params – an optional param map that overrides embedded params
Returns: metric
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra. :param extra: extra param values :return: merged param map
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isLargerBetter
()¶ Indicates whether the metric returned by
evaluate()
should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name')¶
-
metricName
= Param(parent='undefined', name='metricName', doc='metric name in evaluation (f1|precision|recall|weightedPrecision|weightedRecall)')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name')¶
-
setMetricName
(value)[source]¶ Sets the value of
metricName
.
-
setParams
(self, predictionCol="prediction", labelCol="label", metricName="f1")[source]¶ Sets params for multiclass classification evaluator.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-