pyspark.sql.DataFrame.scalar
- DataFrame.scalar()
Return a Column object for a SCALAR subquery containing exactly one row and one column.
The scalar() method is useful for extracting a Column object that represents a scalar value from a DataFrame, typically one produced by an aggregation or other single-value computation. The returned Column can then be used directly in select clauses or as a filter predicate on the outer DataFrame, enabling filtering and calculations based on a dynamically computed scalar value.
New in version 4.0.0.
- Returns
Column
A Column object representing a SCALAR subquery.
Examples
Set up a sample DataFrame.
>>> data = [
...     (1, "Alice", 45000, 101), (2, "Bob", 54000, 101), (3, "Charlie", 29000, 102),
...     (4, "David", 61000, 102), (5, "Eve", 48000, 101),
... ]
>>> employees = spark.createDataFrame(data, ["id", "name", "salary", "department_id"])
Example 1 (non-correlated): Filter for employees with salary greater than the average salary.
>>> from pyspark.sql import functions as sf
>>> employees.where(
...     sf.col("salary") > employees.select(sf.avg("salary")).scalar()
... ).select("name", "salary", "department_id").orderBy("name").show()
+-----+------+-------------+
| name|salary|department_id|
+-----+------+-------------+
|  Bob| 54000|          101|
|David| 61000|          102|
|  Eve| 48000|          101|
+-----+------+-------------+
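In SQL terms this is a non-correlated scalar subquery. A rough equivalent, assuming the employees DataFrame has been registered as a temporary view (the view name below is only illustrative), would be:

>>> employees.createOrReplaceTempView("employees")  # hypothetical view name
>>> spark.sql("""
...     SELECT name, salary, department_id
...     FROM employees
...     WHERE salary > (SELECT avg(salary) FROM employees)
...     ORDER BY name
... """).show()

which should produce the same three rows as above.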
Example 2 (correlated): Filter for employees with salary greater than the average salary in their department.
>>> from pyspark.sql import functions as sf >>> employees.alias("e1").where( ... sf.col("salary") ... > employees.alias("e2").where( ... sf.col("e2.department_id") == sf.col("e1.department_id").outer() ... ).select(sf.avg("salary")).scalar() ... ).select("name", "salary", "department_id").orderBy("name").show() +-----+------+-------------+ | name|salary|department_id| +-----+------+-------------+ | Bob| 54000| 101| |David| 61000| 102| +-----+------+-------------+
Example 3 (in select): Select the name, salary, and the employee's share of the total salary in their department.
>>> from pyspark.sql import functions as sf >>> employees.alias("e1").select( ... "name", "salary", "department_id", ... sf.format_number( ... sf.lit(100) * sf.col("salary") / ... employees.alias("e2").where( ... sf.col("e2.department_id") == sf.col("e1.department_id").outer() ... ).select(sf.sum("salary")).scalar().alias("avg_salary"), ... 1 ... ).alias("salary_proportion_in_department") ... ).orderBy("name").show() +-------+------+-------------+-------------------------------+ | name|salary|department_id|salary_proportion_in_department| +-------+------+-------------+-------------------------------+ | Alice| 45000| 101| 30.6| | Bob| 54000| 101| 36.7| |Charlie| 29000| 102| 32.2| | David| 61000| 102| 67.8| | Eve| 48000| 101| 32.7| +-------+------+-------------+-------------------------------+