Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
1.6.0
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
1.6.0
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
1.6.0
Computes the given aggregation, returning a Dataset of tuples for each unique key and the result of computing this aggregation over all elements in the group.
Computes the given aggregation, returning a Dataset of tuples for each unique key and the result of computing this aggregation over all elements in the group.
1.6.0
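For illustration, a typed aggregation might look like the following sketch; the data and names here are hypothetical, and a SparkSession named spark with spark.implicits._ imported is assumed:
import org.apache.spark.sql.functions._
// Hypothetical (word, score) pairs.
val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()
// Dataset[(String, Long, Double)]: key, element count, and average score.
val agged = ds.groupByKey(_._1)
  .agg(count("*").as[Long], avg($"_2").as[Double])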
Internal helper function for building typed aggregations that return tuples.
Internal helper function for building typed aggregations that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user-facing interface.
(Java-specific) Applies the given function to each cogrouped data.
(Java-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and two iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type, which will be returned as a new Dataset.
1.6.0
(Scala-specific) Applies the given function to each cogrouped data.
(Scala-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and two iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type, which will be returned as a new Dataset.
1.6.0
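As an illustrative sketch, the following merges two hypothetical keyed Datasets (the case classes, Dataset names, and field names are assumptions; spark.implicits._ is in scope):
case class Click(user: String, url: String)
case class Purchase(user: String, amount: Double)
// clicks: Dataset[Click], purchases: Dataset[Purchase]
// For each user, emit one row combining that user's clicks and purchases.
val summary = clicks.groupByKey(_.user)
  .cogroup(purchases.groupByKey(_.user)) { (user, clickIt, buyIt) =>
    Iterator((user, clickIt.size, buyIt.map(_.amount).sum))
  }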
Returns a Dataset that contains a tuple with each key and the number of items present for that key.
Returns a Dataset that contains a tuple with each key and the number of items present for that key.
1.6.0
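For example, a word count reduces to a single call (hypothetical data; spark.implicits._ assumed):
val words = Seq("apple", "banana", "apple").toDS()
val counts = words.groupByKey(identity).count() // Dataset[(String, Long)]
// counts contains ("apple", 2) and ("banana", 1)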
(Java-specific) Applies the given function to each group of data.
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions.Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
1.6.0
(Scala-specific) Applies the given function to each group of data.
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions.Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
1.6.0
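A minimal sketch of the pattern described above, streaming through the iterator rather than materializing it (hypothetical data; spark.implicits._ assumed):
val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
// Zero or more output rows per group; the iterator is traversed only once.
val tagged = ds.groupByKey(_._1).flatMapGroups { (key, values) =>
  values.map { case (_, v) => s"$key=$v" }
}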
::Experimental:: (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
::Experimental:: (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S: The type of the user-defined state. Must be encodable to Spark SQL types.
U: The type of the output objects. Must be encodable to Spark SQL types.
func: Function to be called on every group.
outputMode: The output mode of the function.
stateEncoder: Encoder for the state type.
outputEncoder: Encoder for the output type.
timeoutConf: Timeout configuration for groups that do not receive data for a while.
See Encoder for more details on what types are encodable to Spark SQL.
2.2.0
::Experimental:: (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
::Experimental:: (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S: The type of the user-defined state. Must be encodable to Spark SQL types.
U: The type of the output objects. Must be encodable to Spark SQL types.
outputMode: The output mode of the function.
timeoutConf: Timeout configuration for groups that do not receive data for a while.
func: Function to be called on every group.
See Encoder for more details on what types are encodable to Spark SQL.
2.2.0
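A sketch of a running per-user event count over a hypothetical streaming Dataset[(String, String)] named events (the names and update logic are assumptions; spark.implicits._ in scope):
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
// Keep a running count per user and emit it on every trigger that sees data.
def countEvents(user: String, batch: Iterator[(String, String)],
    state: GroupState[Long]): Iterator[(String, Long)] = {
  val total = state.getOption.getOrElse(0L) + batch.size
  state.update(total)
  Iterator((user, total))
}
val counts = events.groupByKey(_._1)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(countEvents _)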
Returns a new KeyValueGroupedDataset where the type of the key has been mapped to the specified type.
Returns a new KeyValueGroupedDataset where the type of the key has been mapped to the specified type. The mapping of key columns to the type follows the same rules as the as operation on Dataset.
1.6.0
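For example, widening an Int key to Long (a sketch; it assumes the Int-to-Long upcast is permitted by the as rules, and spark.implicits._ is in scope):
val grouped = Seq((1, "a"), (2, "b")).toDS().groupByKey(_._1)
// View the Int key as Long; the values are untouched.
val widened = grouped.keyAs[Long] // KeyValueGroupedDataset[Long, (Int, String)]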
Returns a Dataset that contains each unique key.
Returns a Dataset that contains each unique key. This is equivalent to mapping over the Dataset to extract the keys and then running a distinct operation on those.
1.6.0
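For example (hypothetical data; spark.implicits._ assumed):
val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()
val ks = ds.groupByKey(_._1).keys // Dataset[String] containing "a" and "b"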
(Java-specific) Applies the given function to each group of data.
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions.Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
1.6.0
(Scala-specific) Applies the given function to each group of data.
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions.Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
1.6.0
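Unlike flatMapGroups, exactly one row is produced per group. A minimal sketch (hypothetical data; spark.implicits._ assumed):
val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
// One output row per group: the key and the size of its group.
val sizes = ds.groupByKey(_._1).mapGroups { (key, values) => (key, values.size) }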
::Experimental:: (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
::Experimental:: (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S: The type of the user-defined state. Must be encodable to Spark SQL types.
U: The type of the output objects. Must be encodable to Spark SQL types.
func: Function to be called on every group.
stateEncoder: Encoder for the state type.
outputEncoder: Encoder for the output type.
timeoutConf: Timeout configuration for groups that do not receive data for a while.
See Encoder for more details on what types are encodable to Spark SQL.
2.2.0
::Experimental:: (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
::Experimental:: (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S: The type of the user-defined state. Must be encodable to Spark SQL types.
U: The type of the output objects. Must be encodable to Spark SQL types.
func: Function to be called on every group.
stateEncoder: Encoder for the state type.
outputEncoder: Encoder for the output type.
See Encoder for more details on what types are encodable to Spark SQL.
2.2.0
::Experimental:: (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
::Experimental:: (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S: The type of the user-defined state. Must be encodable to Spark SQL types.
U: The type of the output objects. Must be encodable to Spark SQL types.
timeoutConf: Timeout configuration for groups that do not receive data for a while.
func: Function to be called on every group.
See Encoder for more details on what types are encodable to Spark SQL.
2.2.0
::Experimental:: (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
::Experimental:: (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S: The type of the user-defined state. Must be encodable to Spark SQL types.
U: The type of the output objects. Must be encodable to Spark SQL types.
func: Function to be called on every group.
See Encoder for more details on what types are encodable to Spark SQL.
2.2.0
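A sketch of a running maximum per key over a hypothetical Dataset[(String, Int)] named ds (the names and update logic are assumptions; spark.implicits._ in scope):
import org.apache.spark.sql.streaming.GroupState
// Track the largest value seen per key; one output row per group per invocation.
def runningMax(key: String, values: Iterator[(String, Int)],
    state: GroupState[Int]): (String, Int) = {
  val max = (values.map(_._2) ++ state.getOption.iterator).max
  state.update(max)
  (key, max)
}
val maxima = ds.groupByKey(_._1).mapGroupsWithState(runningMax _)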
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data.
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data. The grouping key is unchanged by this.
// Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>>
Dataset<Tuple2<String, Integer>> ds = ...;
KeyValueGroupedDataset<String, Integer> grouped =
  ds.groupByKey(t -> t._1, Encoders.STRING()).mapValues(t -> t._2, Encoders.INT());
2.1.0
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data.
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data. The grouping key is unchanged by this.
// Create values grouped by key from a Dataset[(K, V)]
ds.groupByKey(_._1).mapValues(_._2) // Scala
2.1.0
(Java-specific) Reduces the elements of each group of data using the specified binary function.
(Java-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.
1.6.0
(Scala-specific) Reduces the elements of each group of data using the specified binary function.
(Scala-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.
1.6.0
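For example, summing within each key; addition is commutative and associative, so the result is deterministic (hypothetical data; spark.implicits._ assumed):
val ds = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()
val sums = ds.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _)
// sums is a Dataset[(String, Int)] containing ("a", 3) and ("b", 5)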
:: Experimental :: A Dataset that has been logically grouped by a user-specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
2.0.0
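For instance (a sketch; assumes a SparkSession named spark with spark.implicits._ imported):
val ds = Seq(("a", 1), ("b", 2)).toDS()
// grouped is a KeyValueGroupedDataset[String, (String, Int)]
val grouped = ds.groupByKey(_._1)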