I want to ignore any lines that contain only whitespace. I know I can do this by by creating a loop and checking each line I read and then adding it to a list if it's not blank.
I was just wondering if there was a more Pythonic way of doing it?

"This will save you from having to call strip on the line twice." - yes; and it's certainly neater in that regard; but you still end up repeating yourself, and I have to wonder if it really makes up performance-wise for going through the overhead of chaining generators like that. Anyone want to do some tests?
–
Karl KnechtelJan 30 '11 at 9:18

2

@aaronsterling: Your first oneliner is so hard to read that neither you nor the merry band of up-voters and commenters noticed that it needs () after l.strip :-)
–
John MachinJan 30 '11 at 10:47

1

+1 for nonblank_lines function. This should be first. The rest is either code golf or a memory hog because it reads whole files into single lists where (sometimes) the whole file isn't needed.
–
S.LottJan 30 '11 at 13:00

When a treatment of text must be done to just extract data from it, I always think first to the regexes, because:

as far as I know, regexes have been invented for that

iterating over lines appears clumsy to me: it essentially consists to search the newlines then to search the data to extract in each line; that makes two searches instead of a direct unique one with a regex

way of bringing regexes into play is easy; only the writing of a regex string to be compiled into a regex object is sometimes hard, but in this case the treatment with an iteration over lines will be complicated too

For the problem discussed here, a regex solution is fast and easy to write:

import re
names = re.findall('\S+',open(filename).read())

I compared the speeds of several solutions:

import re
from time import clock
A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3 = [],[],[]
def nonblank_lines(f):
for l in f:
line = l.rstrip()
if line: yield line
def short_nonblank_lines(f):
for l in f:
line = l[0:-1]
if line: yield line
for essays in xrange(50):
te = clock()
with open('raa.txt') as f:
names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
A.append(clock()-te)
te = clock()
with open('raa.txt') as f:
names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
AA.append(clock()-te)
#-------------------------------------------------------
te = clock()
with open('raa.txt') as f_in:
namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
B1.append(clock()-te)
te = clock()
with open('raa.txt') as f_in:
namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
B2.append(clock()-te)
te = clock()
with open('raa.txt') as f_in:
namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
BS.append(clock()-te)
#-------------------------------------------------------
te = clock()
with open('raa.txt') as f:
xreg = re.findall('\S+',f.read()) # eyquem
reg.append(clock()-te)
#-------------------------------------------------------
te = clock()
with open('raa.txt') as f_in:
linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
C1.append(clock()-te)
te = clock()
with open('raa.txt') as f_in:
linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling with line[0:-1]
C2.append(clock()-te)
#-------------------------------------------------------
te = clock()
with open('raa.txt') as f_in:
yD = [ line for line in nonblank_lines(f_in) ] # aaronasterling update
D.append(clock()-te)
te = clock()
with open('raa.txt') as f_in:
yDsh = [ name for name in short_nonblank_lines(f_in) ] # nonblank_lines with line[0:-1]
Dsh.append(clock()-te)
#-------------------------------------------------------
te = clock()
with open('raa.txt') as f_in:
linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
F1.append(clock()-te)
te = clock()
with open('raa.txt') as f_in:
linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
F2.append(clock()-te)
te = clock()
with open('raa.txt') as f_in:
linesF3 = filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
F3.append(clock()-te)
print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n is ',\
names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n is ',\
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'
def displ((fr,it,what)): print fr + str( min(it) )[0:7] + ' ' + what
map(displ,(('* ', A, '[line.strip() for line in f if line.strip()] * Felix Kling\n'),
(' ', B1, ' [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()'),
('* ', C1, 'list(line for line in (l.strip() for l in f_in) if line) * aaronasterling\n'),
('* ', reg, 're.findall("\S+",f.read()) * eyquem\n'),
('* ', D, '[ line for line in nonblank_lines(f_in) ] * aaronasterling update'),
(' ', Dsh, '[ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]\n'),
('* ', F1 , 'filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2\n'),
(' ', B2, ' [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]'),
(' ', C2, 'list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]\n'),
(' ', AA, '[line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]\n'),
(' ', BS, '[name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()\n'),
(' ', F2 , 'filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]'),
(' ', F3 , 'filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()'))
)

Solution with regex is straightforward and neat. Though, it isn't among the fastest ones.
The solution of aaronasterling with filter() is surprisigly fast for me (I wasn't aware of this particular filter()'s speed) and times of optimized solutions go down until 27 % of the biggest time. I wonder what makes the miracle of the filter-splitlines association:

names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
is True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
is True
* 0.08266 [line.strip() for line in f if line.strip()] * Felix Kling
0.07535 [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()
* 0.06912 list(line for line in (l.strip() for l in f_in) if line) * aaronasterling
* 0.06612 re.findall("\S+",f.read()) * eyquem
* 0.06486 [ line for line in nonblank_lines(f_in) ] * aaronasterling update
0.05264 [ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]
* 0.05451 filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2
0.04689 [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]
0.04582 list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]
0.04171 [line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]
0.03265 [name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()
0.03638 filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]
0.02198 filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()

But this problem is particular, the most simple of all: only one name in each line. So the solutions are only games with lines, splitings and [0:-1] cuts.

On the contrary, regex doesn't matter with lines, it straightforwardly finds the desired data: I consider it is a more natural way of resolution, applying from the simplest to the more complex cases, and hence is often the way to be prefered in treatments of texts.

EDIT

I forgot to say that I use Python 2.7 and I measured the above times with a file containing 500 times the following chain

A couple of points. One would never write [line for line in generator()], one would just write list(generator()). Try to work with builtins whenever possible. They're written in C and everybody knows what they do. Also, I was calling str.rstrip not str.split. I don't know if there will be a performance gain. Finally, filter(None, ....) is so fast because it encapsulates all of the logic in C.
–
aaronasterlingJan 31 '11 at 19:12

@S.Lott No . If I test with print types(lines), I obtain <type 'generator'> . So lines isn't an object containing data, having a little or enormous size. And lines = (line.rstrip() for line in f_in) doesn't put anything alone in memory, it offers the possibility to be iterated. My above code doesn't put the entire file into memory, it just records an integer in the the_strange_sum object that lives in the memory. It seems we don't understand the words the same.
–
eyquemJan 31 '11 at 19:23

Agreed. I could not understand your point until you included <type 'generator'>.
–
S.LottJan 31 '11 at 19:25