pyspark.sql.functions.max_by#
- pyspark.sql.functions.max_by(col, ord, k=None)[source]#
Returns the value(s) from the col parameter that are associated with the maximum value(s) from the ord parameter. This function is often used to find the col parameter value corresponding to the maximum ord parameter value within each group when used with groupBy().
When k is specified, returns an array of up to k values associated with the top k maximum values from ord.
New in version 3.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Changed in version 4.2.0: Added optional k parameter to return top-k values.
- Parameters
- col
Columnor column name The column representing the values to be returned. This could be the column instance or the column name as string.
- ord
Columnor column name The column that needs to be maximized. This could be the column instance or the column name as string.
- kint, optional
If specified, returns an array of up to k values associated with the top k maximum ordering values, sorted in descending order by the ordering column. Must be a positive integer literal <= 100000.
- col
- Returns
ColumnA column object representing the value from col that is associated with the maximum value from ord. If k is specified, returns an array of values.
Notes
The function is non-deterministic so the output order can be different for those associated the same values of col.
Examples
Example 1: Using max_by with groupBy
>>> import pyspark.sql.functions as sf >>> df = spark.createDataFrame([ ... ("Java", 2012, 20000), ("dotNET", 2012, 5000), ... ("dotNET", 2013, 48000), ("Java", 2013, 30000)], ... schema=("course", "year", "earnings")) >>> df.groupby("course").agg(sf.max_by("year", "earnings")).sort("course").show() +------+----------------------+ |course|max_by(year, earnings)| +------+----------------------+ | Java| 2013| |dotNET| 2013| +------+----------------------+
Example 2: Using max_by with different data types
>>> import pyspark.sql.functions as sf >>> df = spark.createDataFrame([ ... ("Marketing", "Anna", 4), ("IT", "Bob", 2), ... ("IT", "Charlie", 3), ("Marketing", "David", 1)], ... schema=("department", "name", "years_in_dept")) >>> df.groupby("department").agg( ... sf.max_by("name", "years_in_dept") ... ).sort("department").show() +----------+---------------------------+ |department|max_by(name, years_in_dept)| +----------+---------------------------+ | IT| Charlie| | Marketing| Anna| +----------+---------------------------+
Example 3: Using max_by where ord has multiple maximum values
>>> import pyspark.sql.functions as sf >>> df = spark.createDataFrame([ ... ("Consult", "Eva", 6), ("Finance", "Frank", 5), ... ("Finance", "George", 9), ("Consult", "Henry", 7)], ... schema=("department", "name", "years_in_dept")) >>> df.groupby("department").agg( ... sf.max_by("name", "years_in_dept") ... ).sort("department").show() +----------+---------------------------+ |department|max_by(name, years_in_dept)| +----------+---------------------------+ | Consult| Henry| | Finance| George| +----------+---------------------------+
Example 4: Using max_by with k to get top-k values
>>> import pyspark.sql.functions as sf >>> df = spark.createDataFrame([ ... ("a", 10), ("b", 50), ("c", 20), ("d", 40)], ... schema=("x", "y")) >>> df.select(sf.max_by("x", "y", 2)).show() +---------------+ |max_by(x, y, 2)| +---------------+ | [b, d]| +---------------+