Recently I found an approach that performs really well, in both memory and speed:

1) .orderBy your DF by some id, dates, etc.;

2) .collect() your DF; the result is a list of Row objects (a Row is a tuple subclass), so you can iterate over the list, compare sequential rows, calculate values, etc., and decide whether new rows have to be added;

3) if a missing record has to be created, take the respective Row, convert it to a dictionary with .asDict(), and .copy() that dictionary into a new one; then change any value within the new dictionary (this may involve complex Python functions, and Pandas isn't required);

4) .append(new_row_dict) the new dictionary to a small list that starts out empty;

5) once the list is big enough (say, around 1 GB of buffered rows), transfer your list of dictionaries into a small Spark DF (the dictionaries already carry the structure of the Spark DF) via small_df = spark.createDataFrame(small_list, schema=big_DF.schema), and

6) add this small DF to the original DF with big_DF = big_DF.union(small_df); then create a new empty small_list and return to step 3). A runnable sketch of the whole loop follows below.
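
Put together, a minimal runnable sketch of steps 1)-6) might look like the following. The column names (id, day), the one-day gap test, and the batch threshold are assumptions for illustration only; tune them to your own data and memory budget.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy data: for id 1, day 3 is missing and should be filled in.
    big_DF = spark.createDataFrame([(1, 1), (1, 2), (1, 4), (2, 1)], ["id", "day"])

    rows = big_DF.orderBy("id", "day").collect()   # steps 1) and 2): ordered Rows

    small_list, BATCH = [], 100_000                # step 5) threshold (tune as needed)
    for prev, curr in zip(rows, rows[1:]):         # compare sequential rows
        if prev["id"] == curr["id"] and curr["day"] - prev["day"] > 1:
            new_row = prev.asDict().copy()         # step 3): Row -> dict -> fresh copy
            new_row["day"] = prev["day"] + 1       # change any value in the new dict
            small_list.append(new_row)             # step 4): buffer the new row
        if len(small_list) >= BATCH:               # step 5): flush the buffer
            big_DF = big_DF.union(spark.createDataFrame(small_list, schema=big_DF.schema))
            small_list = []                        # step 6): start a fresh buffer
    if small_list:                                 # flush whatever is left at the end
        big_DF = big_DF.union(spark.createDataFrame(small_list, schema=big_DF.schema))

    big_DF.orderBy("id", "day").show()

Note that this sketch fills only one missing day per gap; wider gaps would need an inner loop, and for very large tables the initial .collect() becomes the limiting factor.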

Hi, I have a question about how one would read a verticalized file in Scala or PySpark, that is, the records should be read in vertical blocks instead of left to right. An example would be a file with 5 rows, where the first row is a header indicating how many rows make up the record.
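
One way to approach this in PySpark might be the following sketch. The file name vertical.txt and the exact layout are assumptions: each record is taken to start with a header line holding the count of data lines that follow.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed layout of vertical.txt (5 physical rows, 1 logical record):
    #   4
    #   alpha
    #   beta
    #   gamma
    #   delta

    # zipWithIndex keeps each line's original position, so the record
    # boundaries survive Spark's partitioning of the file.
    indexed = spark.sparkContext.textFile("vertical.txt").zipWithIndex()

    # For a file that fits on the driver, the simplest correct approach is to
    # pull the ordered lines back and walk them sequentially, record by record.
    lines = [line for line, _ in sorted(indexed.collect(), key=lambda kv: kv[1])]

    records, i = [], 0
    while i < len(lines):
        n = int(lines[i])                         # header: lines in this record
        records.append(lines[i + 1 : i + 1 + n])  # the record's vertical block
        i += 1 + n

    df = spark.createDataFrame([(r,) for r in records], ["record"])
    df.show(truncate=False)

For files too large to collect, the same grouping would have to happen inside Spark, e.g. by tagging each line with a running record number and grouping on it.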