pyspark.sql.functions.min_by
- pyspark.sql.functions.min_by(col, ord, k=None)
Returns the value from col associated with the minimum value of ord. Combined with groupBy(), this is typically used to find, within each group, the col value on the row where ord is smallest.
When k is specified, returns an array of up to k col values associated with the k smallest values of ord.
New in version 3.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Changed in version 4.2.0: Added optional k parameter to return bottom-k values.
- Parameters
- col : Column or column name
    The column representing the values that will be returned. This can be a Column instance or a column name as a string.
- ord : Column or column name
    The column that needs to be minimized. This can be a Column instance or a column name as a string.
- k : int, optional
    If specified, returns an array of up to k values associated with the k smallest ordering values, sorted in ascending order by the ordering column. Must be a positive integer literal <= 100000.
- Returns
- Column
    Column object that represents the value from col associated with the minimum value from ord. If k is specified, the result is an array of values.
Notes
The function is non-deterministic: when multiple rows tie on the minimum value of ord, any one of the associated col values may be returned, and the result can differ between executions.
Examples
Example 1: Using min_by with groupBy:
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("Java", 2012, 20000), ("dotNET", 2012, 5000),
...     ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
...     schema=("course", "year", "earnings"))
>>> df.groupby("course").agg(sf.min_by("year", "earnings")).sort("course").show()
+------+----------------------+
|course|min_by(year, earnings)|
+------+----------------------+
|  Java|                  2012|
|dotNET|                  2012|
+------+----------------------+
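The same aggregation can also be written with the SQL expression form of min_by (available as a SQL aggregate since Spark 3.0) via expr. A minimal sketch, reusing the df from Example 1:
>>> import pyspark.sql.functions as sf
>>> # SQL-expression form of the aggregation from Example 1.
>>> df.groupby("course").agg(sf.expr("min_by(year, earnings)")).sort("course").show()
+------+----------------------+
|course|min_by(year, earnings)|
+------+----------------------+
|  Java|                  2012|
|dotNET|                  2012|
+------+----------------------+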
Example 2: Using min_by with different data types:
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("Marketing", "Anna", 4), ("IT", "Bob", 2),
...     ("IT", "Charlie", 3), ("Marketing", "David", 1)],
...     schema=("department", "name", "years_in_dept"))
>>> df.groupby("department").agg(
...     sf.min_by("name", "years_in_dept")
... ).sort("department").show()
+----------+---------------------------+
|department|min_by(name, years_in_dept)|
+----------+---------------------------+
|        IT|                        Bob|
| Marketing|                      David|
+----------+---------------------------+
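The ordering column is not limited to numbers; any orderable type works. A minimal sketch with an illustrative hire_date column (not part of the original examples), returning the name of each department's earliest hire:
>>> import datetime
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("IT", "Bob", datetime.date(2019, 3, 1)),
...     ("IT", "Alice", datetime.date(2017, 6, 15))],
...     schema=("department", "name", "hire_date"))
>>> # The row with the earliest hire_date supplies the returned name.
>>> df.groupby("department").agg(sf.min_by("name", "hire_date")).show()
+----------+-----------------------+
|department|min_by(name, hire_date)|
+----------+-----------------------+
|        IT|                  Alice|
+----------+-----------------------+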
Example 3: Using min_by where ord has multiple minimum values:
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("Consult", "Eva", 6), ("Finance", "Frank", 5),
...     ("Finance", "George", 5), ("Consult", "Henry", 7)],
...     schema=("department", "name", "years_in_dept"))
>>> df.groupby("department").agg(
...     sf.min_by("name", "years_in_dept")
... ).sort("department").show()
+----------+---------------------------+
|department|min_by(name, years_in_dept)|
+----------+---------------------------+
|   Consult|                        Eva|
|   Finance|                      Frank|
+----------+---------------------------+
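Because of the non-determinism described in the Notes section, Finance could return either Frank or George above. When a deterministic result is required on ties, one common workaround (a sketch, not part of this function's API) is to take the minimum of a struct, which breaks ties on the second field:
>>> import pyspark.sql.functions as sf
>>> # Structs compare field by field, so equal years_in_dept values
>>> # fall back to comparing name, making the result deterministic.
>>> df.groupby("department").agg(
...     sf.min(sf.struct("years_in_dept", "name"))["name"].alias("name")
... ).sort("department").show()
+----------+-----+
|department| name|
+----------+-----+
|   Consult|  Eva|
|   Finance|Frank|
+----------+-----+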
Example 4: Using min_by with k to get bottom-k values:
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("a", 10), ("b", 50), ("c", 20), ("d", 40)],
...     schema=("x", "y"))
>>> df.select(sf.min_by("x", "y", 2)).show()
+---------------+
|min_by(x, y, 2)|
+---------------+
|         [a, c]|
+---------------+
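When the k parameter is available (see the version note above), it composes with groupBy() in the same way. A sketch assuming the documented bottom-k semantics, with hypothetical sample data:
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("g1", "a", 10), ("g1", "b", 50), ("g1", "c", 20),
...     ("g2", "d", 40), ("g2", "e", 30)],
...     schema=("g", "x", "y"))
>>> # Bottom-2 x values per group, sorted ascending by y.
>>> df.groupby("g").agg(sf.min_by("x", "y", 2)).sort("g").show()
+--+---------------+
| g|min_by(x, y, 2)|
+--+---------------+
|g1|         [a, c]|
|g2|         [e, d]|
+--+---------------+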