Memory efficient Django Queryset Iterator

While checking up on some cronjobs at YouTellMe we ran into large cronjobs that took far too much memory. Django normally loads all objects into memory when iterating over a queryset (even with .iterator(), although in that case it is not Django holding the results in memory but your database client), so I needed a solution that chunks the querysets, keeping only a small subset in memory at a time.

Example of how to use it:

my_queryset = queryset_iterator(MyItem.objects.all())
for item in my_queryset:
    item.do_something()

import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in
    its memory at the same time while Django normally would load all
    rows in its memory. Using the iterator() method only causes it to
    not preload all the classes.

    Note that the implementation of the iterator does not support
    ordered query sets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
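The keyset-pagination pattern the function uses can be sketched without a database at all. This is a minimal, self-contained illustration, not Django code: `ROWS`, `fetch_chunk`, and `chunked_iterator` are hypothetical names, and `fetch_chunk` stands in for `queryset.filter(pk__gt=pk)[:chunksize]`.

```python
# Fake "table": a dict keyed by primary key, pks 1..25.
ROWS = {pk: f"row-{pk}" for pk in range(1, 26)}

def fetch_chunk(after_pk, chunksize):
    """Return up to `chunksize` (pk, row) pairs with pk > after_pk, ascending.
    Stand-in for queryset.filter(pk__gt=after_pk)[:chunksize]."""
    pks = sorted(pk for pk in ROWS if pk > after_pk)[:chunksize]
    return [(pk, ROWS[pk]) for pk in pks]

def chunked_iterator(chunksize=10):
    """Yield every row while holding at most one chunk in memory."""
    pk = 0
    while True:
        chunk = fetch_chunk(pk, chunksize)
        if not chunk:
            break
        for pk, row in chunk:  # pk advances to the last pk seen
            yield row
```

Each pass asks only for rows past the last primary key seen, so memory use is bounded by the chunk size rather than the table size.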


Comments

Django does not load all rows into memory up front; it caches results while iterating over them, so at the end you have everything in memory (unless you use .iterator()). For most cases this is no problem.
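The caching behavior this commenter describes can be imitated in plain Python. The `CachingIterable` class below is a hypothetical stand-in, not Django code: like a QuerySet's internal result cache, it retains every row it has yielded, so after one full pass the whole result set sits in memory.

```python
class CachingIterable:
    """Imitates QuerySet caching: rows are kept as they are iterated,
    so after a complete pass the full result set is held in memory."""
    def __init__(self, source):
        self._source = iter(source)
        self._cache = []  # analogue of a QuerySet's result cache

    def __iter__(self):
        for row in self._source:
            self._cache.append(row)  # retain each row we hand out
            yield row

rows = CachingIterable(range(5))
for _ in rows:
    pass
# rows._cache now holds all five items, mirroring the "everything
# in memory at the end" behavior described above.
```

Using .iterator() is the streaming counterpart: rows are handed out without being retained, which is why it avoids this buildup on the Django side.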

I had memory problems when looping over huge querysets. I solved them with this:

Check that connection.queries is empty: with settings.DEBUG == True, Django stores every executed query there. (Or replace the list with a dummy object that does not store anything.)
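The "dummy object" trick can be sketched as a list subclass whose append() discards everything. This is a hypothetical sketch assuming an older Django where connection.queries was a plain list you could reassign; the class name is made up for illustration.

```python
class DummyQueryLog(list):
    """A list stand-in whose append() discards entries, so query
    logging under settings.DEBUG = True stops accumulating."""
    def append(self, item):
        pass  # silently drop the logged query

# Hypothetical usage in an old-style Django setup:
#   from django.db import connection
#   connection.queries = DummyQueryLog()

log = DummyQueryLog()
log.append({"sql": "SELECT 1", "time": "0.001"})
# The log stays empty no matter how many queries are "recorded".
```

In current Django the supported alternatives are leaving DEBUG off in production or calling django.db.reset_queries() periodically to clear the log.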