pyspark.sql.functions.tuple_sketch_agg_integer

pyspark.sql.functions.tuple_sketch_agg_integer(key, summary, lgNomEntries=None, mode=None)

Aggregate function: returns the compact binary representation of a DataSketches TupleSketch with integer summaries, built from the key and summary columns.

New in version 4.2.0.

Parameters
key : Column or column name

The column containing key values

summary : Column or column name

The column containing integer summary values

lgNomEntries : Column or int, optional

The log-base-2 of nominal entries (must be between 4 and 26, defaults to 12)

mode : Column or str, optional

The summary mode: “sum” (default), “min”, “max”, or “alwaysone”
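
The four summary modes differ only in how the integer summary stored for a key is updated when that key appears again. The following is a minimal pure-Python model of that behavior, not Spark's or DataSketches' implementation. The helpers `combine` and `summarize` are hypothetical names introduced for illustration, and the per-mode semantics are an assumption based on the mode names (add, keep the smallest, keep the largest, always store 1).

```python
# Hypothetical pure-Python model of the summary modes; not part of PySpark.
def combine(mode, current, incoming):
    """Fold one incoming integer summary into the summary stored for a key."""
    if mode == "sum":
        return current + incoming
    if mode == "min":
        return min(current, incoming)
    if mode == "max":
        return max(current, incoming)
    if mode == "alwaysone":
        return 1
    raise ValueError(f"unknown mode: {mode!r}")


def summarize(rows, mode="sum"):
    """Return {key: summary} for an iterable of (key, value) pairs."""
    summaries = {}
    for key, value in rows:
        if key in summaries:
            # Repeated key: merge the new value into the stored summary.
            summaries[key] = combine(mode, summaries[key], value)
        else:
            # First occurrence (assumed: "alwaysone" stores 1, others store the value).
            summaries[key] = 1 if mode == "alwaysone" else value
    return summaries


rows = [(1, 10), (2, 20), (2, 30)]
for mode in ("sum", "min", "max", "alwaysone"):
    print(mode, summarize(rows, mode))
```

Under these assumptions, key 2 ends up with a summary of 50 in “sum” mode, 20 in “min” mode, 30 in “max” mode, and 1 in “alwaysone” mode, while the number of distinct keys is the same in every mode.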

Returns
Column

The binary representation of the TupleSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 10), (2, 20), (2, 30)], ["key", "value"])
>>> df.agg(sf.tuple_sketch_estimate_integer(
...     sf.tuple_sketch_agg_integer("key", "value"))).show()
+----------------------------------------------------------------------------+
|tuple_sketch_estimate_integer(tuple_sketch_agg_integer(key, value, 12, sum))|
+----------------------------------------------------------------------------+
|                                                                         2.0|
+----------------------------------------------------------------------------+
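
The estimate of 2.0 in the example above matches the number of distinct keys in the input. The sketch below works through that arithmetic in plain Python. It assumes that the nominal number of entries is 2 raised to `lgNomEntries`, that the documented range of 4 to 26 is inclusive, and that a sketch which has seen fewer distinct keys than its nominal size retains all of them, so the estimate is exact. The helper `nominal_entries` is a hypothetical name and is not part of PySpark.

```python
# Hypothetical helper for illustration; not part of PySpark.
def nominal_entries(lg_nom_entries=12):
    """Convert lgNomEntries to the nominal number of entries (2 ** lgNomEntries)."""
    # Assumption: the documented bounds 4 and 26 are inclusive.
    if not 4 <= lg_nom_entries <= 26:
        raise ValueError("lgNomEntries must be between 4 and 26")
    return 1 << lg_nom_entries


rows = [(1, 10), (2, 20), (2, 30)]

# Key 2 appears twice, so there are only two distinct keys.
distinct_keys = len({key for key, _ in rows})

k = nominal_entries()  # default lgNomEntries of 12
print(k)                   # 4096
print(distinct_keys)       # 2
print(distinct_keys < k)   # True: far below the nominal size
```

With larger inputs, where the number of distinct keys exceeds the nominal size, the returned value is an approximation rather than an exact count.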