This is the implementation of codecs.Streamwriter.writelines for all
Python versions that I've checked:
def writelines(self, list):
""" Writes the concatenated list of strings to the stream
using .write().
"""
self.write(''.join(list))
This may be a problem if the 'list' parameter is a generator. The
generator may be returning millions of values, which the join will
concatenate in memory. It can surprise the programmer with large
memory use. I think join should only be used when you know the size of
your input, and this method does not know this. I think the safe
implementation of this method would be:
def writelines(self, list):
""" Writes the concatenated list of strings to the stream
using .write().
"""
write = self.write
for value in list:
write(value)
If a caller knows that it's input list would use a reasonable amount
of memory, it can get the same functionality as before by doing
stream.write(''.join(list)).

For the common case where list is in fact a sequence of strings, the
used implementation is a lot faster and more efficient than the one you
propose.
Note that the method doesn't pretend to support generators for the list
argument, so adding support for iterators would be a feature request,
not a bug report.
Furthermore, the StreamWriter method can easily be overridden by the
codec's own implementation (the version in codecs.py is the base method
defining the interface and providing a default implementation), so
adding support to the base method will not necessarily mean that all
codecs then support iterators as parameter to .writelines().
IMHO, it's better to use the .write() method in a for-loop of your
application if you want to support iterators that generate a lot of output.

In Python's file protocol, readlines and writelines is a protocol for
iterating over a file. In Python's file protocol, if one doesn't want
to iterate over the file, one calls read() with no argument in order
to read the whole file in, or one calls write() with the complete
contents you want to write.
If writelines is using join, then if one passes an iterator as the
parameter to writelines, it will not iteratively write to the file, it
will accumulate everything in memory until the iterator raises
StopIteration, and then write to the file. So, if one is tailing the
output file, one is not going to see anything in the file until the
end, instead of iteratively seeing content. So, it's breaking the
promise of the file protocol's writelines meaning iteratively write.
I think following the protocol is more important than performance. If
the application is having performance problems, it's up to the
application to buffer the data in memory and make a single write call.
However, here is an alternative implementation that is slightly more
complicated, but possibly has better performance for the passed-a-list
case. It covers three cases:
1. Passed an empty sequence; do not call self.write at all.
2. Passed a sequence with a length. That implies that all the data is
available immediately, so one can concantenate and write with one
self.write call.
3. Passed a sequence with no length. That implies that all the data
is not available immediately, so iteratively write it.
def writelines(self, sequence):
""" Writes the sequence of strings to the stream
using .write().
"""
try:
sequence_len = len(sequence)
except TypeError:
write = self.write
for value in sequence:
write(value)
return
if sequence_len:
self.write(''.join(sequence))
I'm not sure which is better. But one last point is that Python is
moving more in the direction of using iterators; e.g., in Py3K,
replacing dict's keys, values, and items with the implementation of
iterkeys, itervalues, and iteritems.

Let me give an example of why it's important that writelines
iteratively writes. For:
rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2]=='1')
out_file.writelines('\t'.join(row)+'\n' for row in filtered)
For a large input file, for a regular out_file object, this will work.
For a codecs.StreamWriter wrapped out_file object, this won't work,
because it's not following the file protocol that writelines should
iteratively write.

On 2009-03-10 16:36, Daniel Lescohier wrote:
> Daniel Lescohier <daniel.lescohier@cbs.com> added the comment:
>
> Let me give an example of why it's important that writelines
> iteratively writes. For:
>
> rows = (line[:-1].split('\t') for line in in_file)
> projected = (keep_fields(row, 0, 3, 7) for row in rows)
> filtered = (row for row in projected if row[2]=='1')
> out_file.writelines('\t'.join(row)+'\n' for row in filtered)
>
> For a large input file, for a regular out_file object, this will work.
> For a codecs.StreamWriter wrapped out_file object, this won't work,
> because it's not following the file protocol that writelines should
> iteratively write.
Of course, it's possible to have a generator producing lots of data,
but that does not warrant making most uses of that method slow.
If you'd like to see such support in .writelines(), please provide
an implementation that follows the approach of the file object
implementation in Python 2.x: it writes the lines in chunks of
1000 lines each, if it finds that the input object is not a sequence
(actually, it's stricter than that for some: it requires a Python
list).
The standard case of passing a list of strings to that method
should not get slower because of this.
BTW: I am not aware of any .writelines() file protocol. If there
were such a protocol, .readlines() would also have to return
an iterator (which it doesn't). The idea behind .writelines()
is to be able to write back a list generated with .readlines().
Thanks,
--
Marc-Andre Lemburg
eGenix.com
________________________________________________________________________
::: Try our new mxODBC.Connect Python Database Interface for free ! ::::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

OK, I think I see where I went wrong in my perceptions of the file
protocol. I thought that readlines() returned an iterator, not a
list, but I see in the library reference manual on File Objects that
it returns a list. I think I got confused because there is no
equivalent of __iter__ for writing to streams. For input, I'm always
using 'for line in file_object' (in other words,
file_object.__iter__), so I had assumed that writelines was the mirror
image of that, because I never use the readlines method. Then, in my
mind, readlines became the mirror image of writelines, which I had
assumed took an iterator, so I assumed that readlines returned an
iterator. I wonder if this perception problem is common or not.
So, the StreamWriter interface matches the file protocol; readlines()
and writelines() deal with lists. There shouldn't be any change to
it, because it follows the protocol.
Then, the example I wrote would be instead:
rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2]=='1')
formatted = (u'\t'.join(row)+'\n' for row in filtered)
write = out_file.write
for line in formatted:
write(line)
I think it's correct that the file object write C code only does
1000-line chunks for sequences that have a defined length: if it has a
defined length, then that implies that the data exists now, and can be
concatenated and written now. Something without a defined length may
be a generator with items arriving later.

I find this to be a confusing issue.
1. The arg for .join is any iterable of strings, not just a list. (I think Daniel got that.) Since codecs.py still has "self.write(''.join(list))":
a) calling the arg 'list' is doubly wrong.
b) there is no iterable feature to add, at least not to the base class.
2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)." For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree.
3. The iobase.writelines entry says "Write a list of lines to the stream." For StreamIO, iter(somelist) works just as well. If that applies to .writelines in general, the doc should say 'iterable of lines'. (Antoine, Benjamin?)
If io.writelines does the same (concatenate all together), it should say so, but one comment suggests it does not
4. If an iterable produces millions of lines, it does not matter if it is a list, generator, or any other iterable. So the title does not point to the real problem, if there is one.

On 25.03.2013 23:11, Terry J. Reedy wrote:
> 2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)." For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree.
The doc string of the base class merely describes what the
.writelines() method will be doing. It's not meant as a requirement.
In one of the previous messages, I've already outlined a workable
approach to solve the case that the original request was addressing.
There isn't much point in making the common case slower just
to address a theoretical corner case.
Since no patch has since been submitted, I'd suggest to close the
ticket as won't fix or out of date.