Prefer implicit column selection to direct access, except for disambiguation
Bear in mind that `df1['colA']` is just as difficult to write as `F.col('colA')`, and `F.col('colA')` will always reference a column designated `colA` in the dataframe being operated on, named `df` in this case. It does not require keeping track of other dataframes’ states at all, so the code becomes more local and less susceptible to “spooky interaction at a distance,” which is often challenging to debug.
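For illustration, a minimal sketch of the two styles (the column name `colA` and the derived column are made up, and `df` is assumed to be an existing dataframe):

```python
from pyspark.sql import functions as F

# bad: couples the expression to a specific dataframe variable
df = df.withColumn('colA_upper', F.upper(df['colA']))

# good: refers to `colA` of whatever dataframe is being operated on
df = df.withColumn('colA_upper', F.upper(F.col('colA')))
```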
In some contexts, however, you may need to access columns from more than one dataframe, and their names may overlap. A common example is a matching expression like `df.join(df2, on=(df.key == df2.key), how='left')`. In such cases it is fine to reference columns by their dataframe directly. You can also disambiguate joins using dataframe aliases (see more in the Joins section in this guide).

Refactor complex logical operations
Logical operations, which often reside inside `.filter()` or `F.when()`, need to be readable. We apply the same rule as with chaining functions, keeping logic expressions inside the same code block to three (3) expressions at most. If they grow longer, it is often a sign that the code can be simplified or extracted out. Extracting complex logical operations into variables makes the code easier to read and reason about, and also reduces bugs.
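As a sketch of what this looks like (the column names and conditions are made up, not the guide's original example):

```python
from pyspark.sql import functions as F

# each piece of logic gets a descriptive name
is_delivered = (F.col('prod_status') == 'Delivered')
delivery_date_passed = (F.datediff('deliveryDate_actual', 'current_date') < 0)
has_registration = F.col('currentRegistration').isNotNull()
has_operator = (F.col('currentOperator') != '')

df = df.withColumn(
    'service_status',
    F.when(
        is_delivered | (delivery_date_passed & (has_registration | has_operator)),
        'In Service',
    ),
)
```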
The `F.when` expression is now succinct and readable, and the desired behavior is clear to anyone reviewing this code. The reader only needs to visit the individual expressions if they suspect there is an error. It also makes each chunk of logic easy to test if you have unit tests in your code and want to abstract them as functions.

Use `select` statements to specify a schema contract
A `select` statement specifies the contract with both the reader and the code about the expected dataframe schema for inputs and outputs. Any select should be seen as a cleaning operation that is preparing the dataframe for consumption by the next step in the transform.

Keep select statements as simple as possible. Due to common SQL idioms, allow only one function from `pyspark.sql.functions` to be used per selected column, plus an optional `.alias()` to give it a meaningful name. Keep in mind that this should be used sparingly. If there are more than three such uses in the same select, refactor it into a separate function like `clean_<dataframe name>()` to encapsulate the operation.

Expressions involving more than one dataframe, or conditional operations like `.when()`, are discouraged in a select, unless required for performance reasons.

A `select()` statement redefines the schema of a dataframe, so it naturally supports the inclusion or exclusion of columns, old and new, as well as the redefinition of pre-existing ones. By centralising all such operations in a single statement, it becomes much easier to identify the final schema, which makes code more readable. It also makes code more concise.

Instead of calling `withColumnRenamed()`, use aliases:
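(A sketch of the pattern; the column names `comments` and `num_comments` are illustrative.)

```python
from pyspark.sql import functions as F

# bad
df = df.select('key', 'comments').withColumnRenamed('comments', 'num_comments')

# good
df = df.select('key', F.col('comments').alias('num_comments'))
```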
Instead of using `withColumn()` to redefine type, cast in the select:
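(Again a sketch; the column and target type are made up.)

```python
from pyspark.sql import functions as F

# bad
df = df.select('key', 'comments').withColumn('comments', F.col('comments').cast('double'))

# good
df = df.select('key', F.col('comments').cast('double'))
```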
Prefer an explicit `select` over `.drop()`, since it guarantees that schema mutations won’t cause unexpected columns to bloat your dataframe. However, dropping columns isn’t inherently discouraged in all cases; for instance, it is commonly appropriate to drop columns after joins, since joins frequently introduce redundant columns.

Finally, instead of adding new columns via the select statement, using `.withColumn()` is recommended for single columns. When adding or manipulating tens or hundreds of columns, use a single `.select()` for performance reasons.

Empty columns
Always use `F.lit(None)` for populating an empty column. Never use an empty string or some other string signalling an empty value (such as `NA`).

Beyond being semantically correct, one practical reason for using `F.lit(None)` is preserving the ability to use utilities like `isNull`, instead of having to verify empty strings, nulls, and `'NA'`, etc.
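A minimal sketch (the column name `signal` is made up):

```python
from pyspark.sql import functions as F

# bad: a sentinel string masquerades as missing data
df = df.withColumn('signal', F.lit('NA'))

# good: a real null keeps isNull() and related utilities usable downstream
df = df.withColumn('signal', F.lit(None).cast('string'))
```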
Using Comments Effectively
UDFs (user defined functions)
Joins
Always specify the `how` explicitly, even if you are using the default value (`inner`):
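(A sketch; the `flights`/`aircraft` dataframes and join key are illustrative.)

```python
# bad: relies on the reader remembering the default join type
flights = flights.join(aircraft, 'aircraft_id')

# good: the join type is stated, even though inner is the default
flights = flights.join(aircraft, 'aircraft_id', how='inner')
```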
Avoid `right` joins. If you are about to use a `right` join, switch the order of your dataframes and use a `left` join instead. It is more intuitive, since the dataframe you are doing the operation on is the one that you are centering your join around.
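For example (same illustrative dataframes as above):

```python
# bad: the result is centred on the other dataframe
flights = aircraft.join(flights, 'aircraft_id', how='right')

# good: swap the dataframe order and use a left join instead
flights = flights.join(aircraft, 'aircraft_id', how='left')
```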
Avoid using `.dropDuplicates()` or `.distinct()` as a crutch. If unexpected duplicate rows are observed, there’s almost always an underlying reason for why those duplicate rows appear. Adding `.dropDuplicates()` only masks this problem and adds overhead to the runtime.

Window Functions
Dealing with nulls
While nulls are ignored for aggregate functions (like `F.sum()` and `F.max()`), they will generally impact the result of analytic functions (like `F.first()` and `F.lead()`):
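(A sketch of the problem; the `key`/`num` columns and the window definition are made up.)

```python
from pyspark.sql import functions as F
from pyspark.sql import Window as W

w = W.partitionBy('key').orderBy('num')

# if the first row in the window frame has a null `num`,
# F.first() returns null rather than the first real value
df = df.withColumn('first_num', F.first('num').over(w))
```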
Make the handling of nulls explicit by using the `ignorenulls` flag:
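(Continuing the same made-up example.)

```python
from pyspark.sql import functions as F
from pyspark.sql import Window as W

w = W.partitionBy('key').orderBy('num')

# nulls are skipped explicitly, so the first non-null value is returned
df = df.withColumn('first_num', F.first('num', ignorenulls=True).over(w))
```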
Empty `partitionBy()`
A window can be applied over all rows by specifying zero columns in the partition-by expression (i.e. `W.partitionBy()`). Code like this should be avoided, however, as it forces Spark to combine all data into a single partition, which can be extremely harmful for performance. Prefer to use aggregations whenever possible:
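(A sketch; the `amount` column is made up, and the cross join is only one way to reattach a global total if it is needed per row.)

```python
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# bad: an empty partitionBy() forces every row into a single partition
w = W.partitionBy()
df = df.withColumn('total', F.sum('amount').over(w))

# better: compute the total with an aggregation,
# cross-joining it back only if a per-row total is really needed
totals = df.agg(F.sum('amount').alias('total'))
df = df.crossJoin(totals)
```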
Chaining of expressions
Having each group of expressions isolated into its own logical code block improves legibility and makes it easier to find the relevant logic; a reader will typically jump to the places where a dataframe is assigned, i.e. `df = df...`. Shorter, self-contained statements are also easy to select and re-run on their own in an IDE (e.g. `cmd + shift + up` in pycharm).
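A sketch of what breaking up a chain can look like (the columns, filter and join are illustrative, and `df2` is assumed to exist):

```python
from pyspark.sql import functions as F

# bad: one long chain mixing selection, filtering, column creation and joins
df = (
    df
    .select('a', 'b', 'c', 'key')
    .filter(F.col('a') == 'truthiness')
    .withColumn('boverc', F.col('b') / F.col('c'))
    .join(df2, 'key', how='inner')
)

# better: separate logical steps, each assigned back to df
df = df.select('a', 'b', 'c', 'key').filter(F.col('a') == 'truthiness')
df = df.withColumn('boverc', F.col('b') / F.col('c'))
df = df.join(df2, 'key', how='inner')
```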
Multi-line expressions
To split an expression over multiple lines in Python, either end each line with the escape character `\` or wrap the whole expression in parentheses; parentheses are generally preferred:
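(A sketch; the filter conditions are made up.)

```python
from pyspark.sql import functions as F

# works, but trailing backslashes are brittle to edit
df = df.filter(F.col('event') == 'click') \
       .filter(F.col('country') == 'DE')

# generally preferred: wrap the whole expression in parentheses
df = (
    df
    .filter(F.col('event') == 'click')
    .filter(F.col('country') == 'DE')
)
```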