Pyspark: Concat Function Generated Columns Into New Dataframe
I have a PySpark dataframe (df) with n columns. I would like to generate another dataframe of n columns, where each column records the percentage difference between consecutive rows in the correspo
Solution 1:
In this case, you can do a list comprehension inside of a call to select.
To make the code a little more compact, we can first get the columns we want to diff in a list:
diff_columns = [c for c in df.columns if c != 'index']
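Next, select the index and iterate over diff_columns to compute each new column, using .alias() to rename the result. Note that the expression takes the difference of logs between consecutive rows, which approximates the percentage change only when the change is small: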
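from pyspark.sql import Window, functions as func

# Assumed window: order rows by 'index' so lag() sees the previous row
w = Window.orderBy('index')
df_diff = df.select(
    'index',
    # log(current row) - log(previous row) for each column
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()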
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
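The values in the output above are log differences (roughly ln 2 and ln 1.5), not percentages. As a rough check of how the two relate, here is a minimal plain-Python sketch. The input values 1, 2, 3 are an assumption, chosen only because their ratios match the output; the original data is not shown.

```python
import math

# Hypothetical column values; ratios 2/1 and 3/2 match the log differences shown above.
values = [1.0, 2.0, 3.0]

# Pair each row with its predecessor, mimicking lag(); the first row has none.
pairs = list(zip(values, values[1:]))

# Log difference, as computed in the answer.
log_diffs = [None] + [math.log(cur) - math.log(prev) for prev, cur in pairs]

# Exact fractional change between consecutive rows.
pct_changes = [None] + [(cur - prev) / prev for prev, cur in pairs]

# expm1(d) = exp(d) - 1 turns a log difference back into a fractional change.
recovered = [None if d is None else math.expm1(d) for d in log_diffs]

for row in zip(values, log_diffs, pct_changes, recovered):
    print(row)
```

If an exact percentage change is what you want in Spark, one option is to apply func.expm1 to each _diff column, or to compute (func.col(c) - func.lag(func.col(c)).over(w)) / func.lag(func.col(c)).over(w) directly inside the same list comprehension.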