I'm using xml.etree.ElementTree to parse a large XML file, and memory usage keeps increasing steadily.
You can run the attached test script to reproduce it. In 'top' on Linux or 'Task Manager' on Windows, the memory usage of python does not decrease as expected when 'Done' is printed.
Tested with Python 2.5/3.1 in Windows 7, and Python 2.5 in CentOS 5.3.
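The attached script is not reproduced in this thread. As a rough sketch of the kind of test described, assuming a hypothetical `gen_xml` document layout and a Linux-only `rss_kb` helper that reads /proc (neither is taken from the attachment):

```python
import gc
from xml.etree.ElementTree import XML


def gen_xml(n_items=1000):
    """Build an XML document with n_items children (layout is an assumption)."""
    items = "".join('<item id="%d">value %d</item>' % (i, i)
                    for i in range(n_items))
    return "<root>%s</root>" % items


def rss_kb():
    """Return the resident set size in kB, or None where /proc is unavailable."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def run(n_items=100000):
    """Parse one large document, drop it, collect, and print RSS at each step."""
    print("before:", rss_kb())
    tree = XML(gen_xml(n_items))
    print("parsed:", rss_kb())
    del tree
    gc.collect()
    print("Done:", rss_kb())
```

Calling `run()` and comparing the "parsed" and "Done" figures shows whether the memory is handed back once the tree is gone.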

> To sum up, python is returning memory, but your libc is not.
> You can force it using malloc_trim, see the attached patch (I'm not at
> all suggesting its inclusion, it's just an illustration).
That's an interesting thing, perhaps you want to open a feature request as a separate issue?
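For illustration only, the same effect can be tried from Python without patching the interpreter. This is a sketch that assumes a glibc system; `trim_heap` is a hypothetical helper name, and it returns None wherever `malloc_trim` is not exported.

```python
import ctypes
import ctypes.util


def trim_heap(pad=0):
    """Ask the C library to return free heap memory to the OS.

    Returns True if memory was released, False if not, and None when
    malloc_trim is unavailable (for example on non-glibc platforms).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        malloc_trim = ctypes.CDLL(libc_name).malloc_trim
    except (OSError, AttributeError):
        return None
    # int malloc_trim(size_t pad)
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return bool(malloc_trim(pad))
```

Calling `trim_heap()` right after `gc.collect()` in the test script is the closest equivalent to what the illustrative patch does.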

Found a minor defect of Python 3.2 / 3.3: line 1676 of xml/etree/ElementTree.py
was:
del self.target, self._parser # get rid of circular references
should be:
del self.target, self._target, self.parser, self._parser # get rid of circular references
While it doesn't help this issue...

> kaifeng <cafeeee@gmail.com> added the comment:
>
> I added 'malloc_trim' to the test code and rerun the test with Python 2.5 / 3.2 on CentOS 5.3. The problem still exists.
>
Well, malloc_trim can fail, but how did you "add" it ? Did you use
patch to apply the diff ?
Also, could you post the output of a
ltrace -e malloc_trim python <test script>
For info, the sample outputs I posted above come from a RHEL6 box.
Anyway, I'm 99% sure this isn't a leak but a malloc issue (valgrind
--tool=memcheck could confirm this if you want to try, I could be
wrong, it wouldn't be the first time ;-) ).
By the way, look at what I just found:
http://mail.gnome.org/archives/xml/2008-February/msg00003.html
> Antoine Pitrou <pitrou@free.fr> added the comment:
> That's an interesting thing, perhaps you want to open a feature request as a separate issue?
Dunno.
Memory management is a domain which belongs to the operating
system/libc, and I think applications shouldn't mess with it (apart from
specific cases).
I don't have time to look at this precise problem in greater detail
right now, but AFAICT, this looks either like a glibc bug, or at least
a corner case with default malloc parameters (M_TRIM_THRESHOLD and
friends), affecting only RHEL and derived distributions.
malloc_trim should be called automatically by free if the amount of
memory that could be released is above M_TRIM_THRESHOLD.
Calling it systematically can have a non-negligible performance impact.

> BTW, after utilize lxml instead of ElementTree, such phenomenon of increasing memory usage disappeared.
If you looked at the link I posted, you'll see that lxml had some similar issues and solved it by calling malloc_trim systematically when freeing memory.
It could also be heap fragmentation, though.
To go further, it'd be nice if you could provide the output of
valgrind --tool=memcheck --leak-check=full --suppressions=Misc/valgrind-python.supp python <test script>
after uncommenting relevant lines in Misc/valgrind-python.supp (see http://svn.python.org/projects/python/trunk/Misc/README.valgrind ).
It will confirm either a memory leak or a malloc issue (I still favour the latter).
By the way, does
while True:
    XML(gen_xml())
lead to a constant memory usage increase ?

> By the way, I noticed that dictionnaries are never allocated through
> pymalloc, since a new dictionnary takes more than 256B...
On 64-bit builds indeed. pymalloc could be improved to handle allocations up to 512B. Want to try and write a patch?
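A quick way to see which objects fall on which side of the limit is to compare `sys.getsizeof` against the two thresholds. This is only an approximation: `sys.getsizeof` reports the object's own footprint, which is not necessarily the exact request size, and the numbers differ between Python versions and between 32-bit and 64-bit builds. The helper names are made up for this sketch.

```python
import sys

OLD_THRESHOLD = 256  # bytes: pymalloc's small-request limit discussed above
NEW_THRESHOLD = 512  # bytes: the proposed limit


def fits_pymalloc(obj, threshold):
    """Rough check: is the object's reported size within the threshold?"""
    return sys.getsizeof(obj) <= threshold


def report(objects):
    """Map type name -> (size, fits old limit, fits new limit)."""
    result = {}
    for obj in objects:
        size = sys.getsizeof(obj)
        result[type(obj).__name__] = (
            size, size <= OLD_THRESHOLD, size <= NEW_THRESHOLD)
    return result
```

For example, `report([{}, [], b""])` shows on a given build whether an empty dict would be served by pymalloc under either limit.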

Sorry for the late update.
Valgrind shows there is no memory leak (see attached valgrind.log).
The following code,
while True:
    XML(gen_xml())
shows increasing memory usage for the first 5~8 iterations, then fluctuates around a constant level.
So I guess some component (maybe libc, the Python interpreter, the ElementTree/pyexpat module, or something else) holds on to some memory until the process ends.
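To tell "grows for a few iterations, then levels off" apart from a steady leak, one option is to sample the process RSS once per iteration and check whether the tail of the series is flat. A minimal sketch; the helper name, window and tolerance are arbitrary assumptions:

```python
def has_plateaued(samples, window=3, tolerance=0.05):
    """Return True if the last `window` samples all lie within
    `tolerance` (as a fraction) of their own mean."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    mean = sum(recent) / float(len(recent))
    if mean == 0:
        return True
    return all(abs(s - mean) <= tolerance * mean for s in recent)
```

Feeding it the per-iteration RSS values from the loop above gives False while memory is still climbing and True once it only fluctuates around a constant level.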

> The MALLOC_MMAP_THRESHOLD improvement is less visible here:
>
Are you running on 64-bit ?
If yes, it could be that you're exhausting M_MMAP_MAX (malloc falls
back to brk when there are too many mmap mappings).
You could try with
MALLOC_MMAP_THRESHOLD_=1024 MALLOC_MMAP_MAX_=16777216 ../opt/python issue11849_test.py
By the way, never do that in real life, it's a CPU and memory hog ;-)
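If it is more convenient to drive the experiment from a script, the same two glibc environment variables can be set for a child interpreter only. A sketch, with hypothetical helper names; the variables have no effect on non-glibc allocators:

```python
import os
import subprocess
import sys


def malloc_env(mmap_threshold, mmap_max, base=None):
    """Return a copy of the environment with the glibc malloc tunables set."""
    env = dict(os.environ if base is None else base)
    env["MALLOC_MMAP_THRESHOLD_"] = str(mmap_threshold)
    env["MALLOC_MMAP_MAX_"] = str(mmap_max)
    return env


def run_with_tunables(script, mmap_threshold=1024, mmap_max=16777216):
    """Run `script` in a child interpreter with the tunables applied."""
    return subprocess.call([sys.executable, script],
                           env=malloc_env(mmap_threshold, mmap_max))
```

`run_with_tunables("issue11849_test.py")` is then equivalent to the command line above, and the parent process keeps its normal malloc settings.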
I think the root cause is that glibc's malloc coalescing of free
chunks is called far less often than in the original ptmalloc version,
but I still have to dig some more.
>> By the way, I noticed that dictionnaries are never allocated through
>> pymalloc, since a new dictionnary takes more than 256B...
>
> On 64-bit builds indeed. pymalloc could be improved to handle allocations up
> to 512B. Want to try and write a patch?
Sure.
I'll open another issue.

> > The MALLOC_MMAP_THRESHOLD improvement is less visible here:
> >
>
> Are you running on 64-bit ?
Yes.
> If yes, it could be that you're exhausting M_MMAP_MAX (malloc falls
> back to brk when there are too many mmap mappings).
> You could try with
> MALLOC_MMAP_THRESHOLD_=1024 MALLOC_MMAP_MAX_=16777216 ../opt/python issue11849_test.py
It isn't better.

> It isn't better.
Requests above 256B are directly handled by malloc, so MALLOC_MMAP_THRESHOLD_ should in fact be set to 256 (with 1024 I guess that on 64-bit every mid-sized dictionary gets allocated with brk).

I've had some time to look at this, and I've written a quick demo
patch that should - hopefully - fix this, and reduce memory
fragmentation.
A little bit of background first:
- a couple years ago (probably true when pymalloc was designed and
merged), glibc's malloc used brk for small and medium allocations, and
mmap for large allocations, to reduce memory fragmentation (also,
because of the processes' VM layout in older Linux 32-bit kernels, you
couldn't have a heap bigger than 1GB). The threshold for routing
requests to mmap was fixed, and had a default of 256KB (exactly the
size of a pymalloc arena). Thus, all arenas were allocated with mmap.
- in 2006, a patch was merged to make this mmap threshold dynamic,
see http://sources.redhat.com/ml/libc-alpha/2006-03/msg00033.html for
more details
- as a consequence, with modern glibc/eglibc versions, the first
arenas will be allocated through mmap, but as soon as one of them is
freed, subsequent arenas will be allocated from the heap
through brk, and not mmap
- imagine the following happens :
1) program creates many objects
2) to store those objects, many arenas are allocated from the heap
through brk
3) program destroys all the objects created, except 1 which is in
the last allocated arena
4) since the arena has at least one object in it, it's not
deallocated, and thus the heap doesn't shrink, and the memory usage
remains high (with a huge hole between the base of the heap and its
top)
Note that 3) can be a single leaked reference, or just a variable
that doesn't get deallocated immediately. As an example, here's a demo
program that should exhibit this behaviour:
"""
import sys
import gc

# allocate/de-allocate/re-allocate the array to make sure that arenas are
# allocated through brk
tab = []
for i in range(1000000):
    tab.append(i)
tab = []
for i in range(1000000):
    tab.append(i)
print('after allocation')
sys.stdin.read(1)

# allocate a dict at the top of the heap (actually it works even without this)
a = {}

# deallocate the big array
del tab
print('after deallocation')
sys.stdin.read(1)

# collect
gc.collect()
print('after collection')
sys.stdin.read(1)
"""
You should see that even after the big array has been deallocated and
collected, the memory usage doesn't decrease.
Also, there's another factor coming into play, the linked list of
arenas ("arenas" variable in Objects/obmalloc.c), which is expanded
when there are not enough arenas allocated: if this variable is
realloc()ed while the heap is really large and without a hole in it, it
will be allocated from the top of the heap, and since it's not resized
when the number of used arenas goes down, it will remain at the top of
the heap and will also prevent the heap from shrinking.
My demo patch (pymem.diff) thus does two things:
1) use mallopt to fix the mmap threshold so that arenas are allocated
through mmap
2) increase the maximum size of requests handled by pymalloc from
256B to 512B (as discussed above with Antoine). The reason is that if
a PyObject_Malloc request is not handled by pymalloc from an arena
(i.e. greater than 256B) and is less than the mmap threshold, then we
can't do anything if it's not freed and remains in the middle of the
heap. That's exactly what's happening in the OP's case: some
dictionaries aren't deallocated even after the collection (I couldn't
quite identify them, but there seem to be some UTF-8 codecs and other
stuff)
To sum up, this patch increases greatly the likelihood of Python's
objects being allocated from arenas which should reduce fragmentation
(and seems to speed up certain operations quite a bit), and ensures
that arenas are allocated from mmap so that a single dangling object
doesn't prevent the heap from being trimmed.
I've tested it on RHEL6 64-bit and Debian 32-bit, but it'd be great
if someone else could try it - and of course comment on the above
explanation/proposed solution.
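The patch makes the mallopt call in C at interpreter start-up. Purely to illustrate the call itself, here is a ctypes sketch; the helper name and the choice of the arena size as the threshold are assumptions, `M_MMAP_THRESHOLD = -3` is the value from glibc's `<malloc.h>`, and calling it from a running interpreter only affects allocations made afterwards.

```python
import ctypes
import ctypes.util

M_MMAP_THRESHOLD = -3    # constant from glibc's <malloc.h>
ARENA_SIZE = 256 * 1024  # pymalloc arena size mentioned above


def fix_mmap_threshold(threshold=ARENA_SIZE):
    """Pin the malloc mmap threshold so it stops adapting dynamically.

    Returns True on success, False if mallopt reports an error, and
    None if mallopt is not available on this platform.
    """
    name = ctypes.util.find_library("c")
    if name is None:
        return None
    try:
        mallopt = ctypes.CDLL(name).mallopt
    except (OSError, AttributeError):
        return None
    # int mallopt(int param, int value); returns 1 on success, 0 on error
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    return mallopt(M_MMAP_THRESHOLD, threshold) == 1
```

With the threshold pinned, requests of at least that size keep going through mmap, so freeing them no longer depends on what sits above them in the heap.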
Here's the result on Debian 32-bit:
Without patch:
*** Python 3.3.0 alpha
---   PID TTY   STAT TIME MAJFL  TRS    DRS    RSS %MEM COMMAND
0    1843 pts/1 S+   0:00     1 1795   9892   7528  0.5 ./python /home/cf/issue11849_test.py
1    1843 pts/1 S+   0:16     1 1795  63584  60928  4.7 ./python /home/cf/issue11849_test.py
2    1843 pts/1 S+   0:33     1 1795 112772 109064  8.4 ./python /home/cf/issue11849_test.py
3    1843 pts/1 S+   0:50     1 1795 162140 159424 12.3 ./python /home/cf/issue11849_test.py
4    1843 pts/1 S+   1:06     1 1795 211376 207608 16.0 ./python /home/cf/issue11849_test.py
END  1843 pts/1 S+   1:25     1 1795 260560 256888 19.8 ./python /home/cf/issue11849_test.py
GC   1843 pts/1 S+   1:26     1 1795 207276 204932 15.8 ./python /home/cf/issue11849_test.py
With patch:
*** Python 3.3.0 alpha
---   PID TTY   STAT TIME MAJFL  TRS    DRS    RSS %MEM COMMAND
0    1996 pts/1 S+   0:00     1 1795  10160   7616  0.5 ./python /home/cf/issue11849_test.py
1    1996 pts/1 S+   0:16     1 1795  64168  59836  4.6 ./python /home/cf/issue11849_test.py
2    1996 pts/1 S+   0:33     1 1795 114160 108908  8.4 ./python /home/cf/issue11849_test.py
3    1996 pts/1 S+   0:50     1 1795 163864 157944 12.2 ./python /home/cf/issue11849_test.py
4    1996 pts/1 S+   1:07     1 1795 213848 207008 15.9 ./python /home/cf/issue11849_test.py
END  1996 pts/1 S+   1:26     1 1795  68280  63776  4.9 ./python /home/cf/issue11849_test.py
GC   1996 pts/1 S+   1:26     1 1795  12112   9708  0.7 ./python /home/cf/issue11849_test.py
Antoine: since the increasing of the pymalloc threshold is part of the
solution to this problem, I'm attaching a standalone patch here
(pymalloc_threshold.diff). It's included in pymem.diff.
I'll try to post some pybench results tomorrow.

This is a very interesting patch, thank you.
I've tested it on Mandriva 64-bit and it indeed fixes the free() issue on the XML workload. I see no regression on pybench, stringbench or json/pickle benchmarks.
I guess the final patch will have to guard the mallopt() call with some #ifdef?
(also, I suppose a portable solution would have to call mmap() ourselves for allocation of arenas, but that would probably be a bit more involved)

> I guess the final patch will have to guard the mallopt() call with some #ifdef?
Yes. See attached patch pymalloc_frag.diff
It's the first time I'm playing with autotools, so please review this part really carefully ;-)
> (also, I suppose a portable solution would have to call mmap() ourselves
> for allocation of arenas, but that would probably be a bit more involved)
Yes. But since it probably only affects glibc/eglibc malloc versions, I guess that target implementations are likely to provide mallopt(M_MMAP_THRESHOLD).
Also, performing anonymous mappings varies even among Unices (the mmapmodule code is scary). I'm not talking about Windows, which I don't know at all.

For the record, this seems to make large allocations slower:
-> with patch:
$ ./python -m timeit "b'x'*200000"
10000 loops, best of 3: 27.2 usec per loop
-> without patch:
$ ./python -m timeit "b'x'*200000"
100000 loops, best of 3: 7.4 usec per loop
Not sure we should care, though. It's still very fast.
(noticed in http://mail.python.org/pipermail/python-dev/2011-November/114610.html )

More surprising is that, even ignoring the allocation cost, other operations on the memory area seem more expensive:
$ ./python -m timeit -s "b=bytearray(500000)" "b[:] = b"
-> python 3.3:
1000 loops, best of 3: 367 usec per loop
-> python 3.2:
10000 loops, best of 3: 185 usec per loop
(note how this is just a dumb memcpy)
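For anyone who wants to rerun both measurements on builds with and without the patch, the two command lines can be wrapped in a small script. A sketch using the standard `timeit` module; the helper names and loop counts are arbitrary choices, so absolute numbers will not match the figures quoted here.

```python
import timeit


def best_usec(stmt, setup="pass", number=100, repeat=3):
    """Best per-loop time in microseconds, similar to `python -m timeit`."""
    timer = timeit.Timer(stmt, setup=setup)
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e6


def run_benchmarks():
    """Time the large allocation and the bytearray self-copy."""
    return {
        "alloc 200000 bytes": best_usec("b'x'*200000"),
        "bytearray self-copy": best_usec("b[:] = b",
                                         setup="b=bytearray(500000)"),
    }
```

Running `run_benchmarks()` under each interpreter and comparing the two dictionaries gives the same with/without comparison as above.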

> For the record, this seems to make large allocations slower:
>
> -> with patch:
> $ ./python -m timeit "b'x'*200000"
> 10000 loops, best of 3: 27.2 usec per loop
>
> -> without patch:
> $ ./python -m timeit "b'x'*200000"
> 100000 loops, best of 3: 7.4 usec per loop
>
Yes, IIRC, I warned it could be a possible side effect: since we're
now using mmap() instead of brk() for large allocations (between 256B
and 32/64MB), it can be slower (that's the reason the adaptive mmap
threshold was introduced in the first place).
> More surprising is that, even ignoring the allocation cost, other operations on the memory area seem more expensive:
Hmm, this is strange.
I see you're comparing 3.2 and default: could you run the same
benchmark on default with and without the patch ?

> I see you're comparing 3.2 and default: could you run the same
> benchmark on default with and without the patch ?
Same results:
-> default branch:
1000 loops, best of 3: 364 usec per loop
-> default branch with patch reverted:
10000 loops, best of 3: 185 usec per loop
(with kernel 2.6.38.8-desktop-8.mga and glibc-2.12.1-11.2.mga1)
And I can reproduce on another machine:
-> default branch:
1000 loops, best of 3: 224 usec per loop
-> default branch with patch reverted:
10000 loops, best of 3: 88 usec per loop
(Debian stable with kernel 2.6.32-5-686 and glibc 2.11.2-10)

> Hmm, quite slow indeed, are you sure you're not running in debug mode?
>
Well, yes, but it's no faster with a non-debug build: my laptop is
really crawling :-)
> If the performance regression is limited to read(), I don't think it's
> really an issue, but using mmap/munmap explicitly would probably be nicer
> anyway (1° because it lets the glibc choose whatever heuristic is best,
> 2° because it would help release memory on more systems than just glibc
> systems). I think limiting ourselves to systems which have
> MAP_ANONYMOUS is good enough.
>
Agreed.
Here's a patch.
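The patch itself is C and is not reproduced in this thread. As a Python-level analogue of the idea only (one anonymous mapping per arena, unmapped when the arena is freed), the standard `mmap` module can create such a mapping by passing -1 as the file descriptor. The helper names are hypothetical.

```python
import mmap

ARENA_SIZE = 256 * 1024  # pymalloc arena size, per the discussion above


def alloc_arena(size=ARENA_SIZE):
    """Create an anonymous mapping of `size` bytes (not backed by a file)."""
    return mmap.mmap(-1, size)


def free_arena(arena):
    """Unmap the region; this does not depend on the state of the brk heap."""
    arena.close()
```

Because each mapping is released on its own, a single long-lived object elsewhere cannot pin it the way it can pin the top of the brk heap.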

I just found this issue from this article:
http://python.dzone.com/articles/diagnosing-memory-leaks-python
Great job! Using mmap() for arenas is the best solution for this issue. I did something similar on a completely different project (also using its own dedicated memory allocator) to work around fragmentation of the heap memory.

[@haypo]
> http://python.dzone.com/articles/diagnosing-memory-leaks-python
> Great job! Using mmap() for arenas is the best solution for this issue.
? I read the article, and they stopped when they found "there seemed to be a ton of tiny little objects around, like integers.". Ints aren't allocated from arenas to begin with - they have their own (immortal & unbounded) free list in Python2. No change to pymalloc could make any difference to that.

Extract of the "workaround" section:
"You could also run your Python jobs using Jython, which uses the Java JVM
and does not exhibit this behavior. Likewise, you could upgrade to Python
3.3 <http://bugs.python.org/issue11849>,"
Which contains a link to this issue.

Well, memory fragmentation can happen with any allocation scheme, and it's possible even Python 3 isn't immune to this. Backporting performance improvements is a strain on our resources and also constitutes a maintenance threat (what if the bug hides in the new code?). And Python 2.7 is getting closer to its end-of-life every day. So IMHO it's a no-no.

Thanks for your responses to my comments. I'm working as hard as I can to get my customer's systems migrated into the Python 3 world, and I appreciate the efforts of the community to provide incentives (such as the resolution for this failure) for developers to upgrade. However, it's a delicate balancing act sometimes, given that we have critical places in our system for which the same code runs more than twice as slowly on Python 3.6 as on Python 2.7.

> ... jemalloc can reduce memory usage ...
Thanks for the tip. I downloaded the source and successfully built the DLL, then went looking for a way to get it loaded. Unfortunately, DLL injection, which is needed to use this allocator in Python, seems to be much better supported on Linux than on Windows. Basically, Microsoft's documentation [1] for AppInit_DLL, the shim for DLL injection on Windows, says (in effect) "here's how to use this technique, but we don't recommend using it, so here's a link [2] for what we recommend you do instead." That link takes you to "Try searching for what you need. This page doesn’t exist."
[1] https://support.microsoft.com/en-us/help/197571/working-with-the-appinit-dlls-registry-value
[2] https://support.microsoft.com/en-us/help/134655