I consistently see answers quoting this link stating definitively "Don't parse ls!" This bothers me for a couple of reasons:

It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.

It also seems as if the problems stated in that link have sparked no desire to find a solution.

From the first paragraph:

...when you ask [ls] for a list
of files, there's a huge problem: Unix allows almost any character in
a filename, including whitespace, newlines, commas, pipe symbols, and
pretty much anything else you'd ever try to use as a delimiter except
NUL. ... ls separates filenames with newlines. This is fine
until you have a file with a newline in its name. And since I don't
know of any implementation of ls that allows you to terminate
filenames with NUL characters instead of newlines, this leaves us
unable to get a list of filenames safely with ls.

Bummer, right? How ever can we handle a newline-terminated list of data that might itself contain newlines? Well, if the people answering questions on this website didn't do this kind of thing on a daily basis, I might think we were in some trouble.

The truth is, though, most ls implementations actually provide a very simple API for parsing their output, and we've all been doing it all along without even realizing it. Not only can you end a filename with NUL, you can begin one with NUL as well, or with any other arbitrary string you might desire. What's more, you can assign these arbitrary strings per file type. Please consider:

The problem is that from the output of ls, neither you nor the
computer can tell what parts of it constitute a filename. Is it each
word? No. Is it each line? No. There is no correct answer to this
question other than: you can't tell.

Also notice how ls sometimes garbles your filename data (in our
case, it turned the \n character in between the words "a" and
"newline" into a ?question mark...

...

If you just want to iterate over all the files in the current
directory, use a for loop and a glob:

for f in *; do
[[ -e $f ]] || continue
...
done

The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

-q - Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ( '?' ) character. Implementations
may provide this option by default if the output is to a terminal
device.

-1 - (The numeric digit one.) Force output to be one entry per line.

Globbing is not without its own problems - the ? matches any single character, so a list containing multiple matching ? patterns can match the same file more than once. That's easily handled.
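To make that concrete, here is a hedged sketch with invented filenames (bash syntax; none of the names contains a newline, so the line-based counting at the end is safe):

```shell
# Invented filenames: two of the three collapse to the same a?b line in the
# ls -q output, so expanding those lines as globs yields duplicate matches.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch a$'\t'b a$'\r'b axb

ls -1q                # prints: a?b, a?b, axb

# Unquoted expansion treats each ls line as a glob; each a?b matches all
# three files, so three files balloon into seven matches...
total=0
for f in $(ls -1q); do total=$((total + 1)); done

# ...and sort -u collapses the duplicates back down to three.
unique=$(for f in $(ls -1q); do printf '%s\n' "$f"; done | sort -u | wc -l)
echo "total=$total unique=$unique"    # total=7 unique=3
```

Note that sort -u works on lines, so this dedup step is only safe while no filename contains a newline.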

Though how to do this thing is not the point - it doesn't take much to do, after all, and is demonstrated below - I was interested in why not. As I consider it, the best answer to that question has been accepted. I would suggest you try to focus more often on telling people what they can do than on what they can't. You're a lot less likely, I think, to be proven wrong.

But why even try? Admittedly, my primary motivation was that others kept telling me I couldn't. I know very well that ls output is as regular and predictable as you could wish it so long as you know what to look for. Misinformation bothers me more than do most things.

The truth is, though, with the notable exception of both Patrick's and Wumpus Q. Wumbley's answers (despite the latter's awesome handle), I regard most of the information in the answers here as mostly correct - a shell glob is both simpler to use and generally more effective for searching the current directory than parsing ls is. They are not, however, at least in my regard, reason enough to justify propagating the misinformation quoted in the article above, nor are they acceptable justification to "never parse ls."

Please note that Patrick's answer's inconsistent results are mostly a result of him using zsh and then bash. zsh - by default - does not word-split $(command substituted) results in a portable manner. So when he asks where did the rest of the files go?, the answer to that question is: your shell ate them. This is why you need to set the SH_WORD_SPLIT option when using zsh and dealing with portable shell code. I regard his failure to note this in his answer as awfully misleading.

Wumpus's answer doesn't compute for me - in a list context the ? character is a shell glob. I don't know how else to say that.

In order to handle a multiple-results case you need to restrict the glob's greediness. The following will just create a test base of awful file names and display it for you:

OUTPUT

Now I'll safe every character that isn't a /slash, -dash, :colon, or alphanumeric character in a shell glob, then sort -u the list for unique results. This is safe because ls has already safed-away any non-printable characters for us. Watch:
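The original code isn't reproduced above, so here is my hedged reconstruction of the described technique (invented filenames; none contains a newline, so the sort -u step is safe here):

```shell
# Reconstruction sketch, not the original code: turn each ls -1q line into a
# glob by replacing every character that isn't alphanumeric, '/', ':', or '-'
# with '?', expand the globs, and collapse duplicate matches with sort -u.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch 'plain-file' 'sp ace' ta$'\t'b

result=$(
  ls -1q | sed 's|[^[:alnum:]/:-]|?|g' |
  while IFS= read -r pat; do
    # $pat is unquoted on purpose: the shell expands each pattern back into
    # the real filename(s), spaces and tabs intact.
    for f in $pat; do printf '%s\n' "$f"; done
  done | sort -u
)
printf '%s\n' "$result"
```

Here the space in 'sp ace' - which ls -q leaves alone - also gets safed into a ?, which is the point of widening the character class beyond what -q covers.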

Below I approach the problem again, but I use a different methodology. Remember that - besides the \0 NUL - the / ASCII character is the only byte forbidden in a pathname. I put globs aside here and instead combine the POSIX-specified -d option for ls with the also-POSIX-specified -exec $cmd {} + construct for find. Because find will only ever naturally emit one / in sequence, the following easily procures a recursive and reliably delimited file list, including all dentry information, for every entry. Just imagine what you might do with something like this:
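My hedged reconstruction of the delimiting idea (not necessarily the exact invocation intended above): start find at `.//`, so every record begins with `//`, a sequence find never produces anywhere else in a path it constructs:

```shell
# Reconstruction sketch: find is started at './/', so every emitted path
# begins with '//' - and find never builds '//' anywhere *inside* a path -
# which makes '^\.//' a reliable record marker even for names containing
# newlines.  Swapping -print for -exec ls -nd {} + would attach full dentry
# information to each record in the same framing.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch bad$'\n'name 'good name'

find .// -print
records=$(find .// -print | grep -c '^\.//')
echo "$records records"        # 3: the directory itself plus the two files
```

The file with the embedded newline spills across two output lines, but only one of those lines starts with .//, so the record count stays correct.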

In regards to your most recent update, please stop relying on visual output as determining that your code works. Pass your output to an actual program and have the program try and perform an operation on the file. This is why I was using stat in my answer, as it actually checks that each file exists. Your bit at the bottom with the sed thing does not work.
–
Patrick, May 12 '14 at 4:20


You can't be serious. How can jumping through all the hoops your question describes be easier or simpler or in any way better than simply not parsing ls in the first place? What you're describing is very hard. I'll need to deconstruct it to understand all of it and I'm a relatively competent user. You can't possibly expect your average Joe to be able to deal with something like this.
–
terdon♦, May 12 '14 at 4:40


-1 for using a question to pick an argument. All of the reasons parsing ls output is wrong were covered well in the original link (and in plenty of other places). This question would have been reasonable if OP were asking for help understanding it, but instead OP is simply trying to prove his incorrect usage is ok.
–
R.., May 12 '14 at 13:05


@mikeserv It's not just that parsing ls is bad. Doing for something in $(command) and relying on word-splitting to get accurate results is bad for the large majority of commands which don't have simple output.
–
BroSlow, May 13 '14 at 14:53

8 Answers

I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell is not a good language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated for shell and you should rewrite the entire thing in Perl or Python. Here's your last program in Python:

This has no issues whatsoever with unusual characters in filenames -- the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.
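The actual program isn't reproduced above, so here is a hedged sketch of the kind of thing being described (my own reconstruction; the helper name list_tree is invented):

```python
# Sketch, not zwol's actual code: recursively list every file with its inode
# and size, with no text parsing anywhere - os.walk hands us real filenames,
# newlines and all, and os.path.join(subdir, f) is usable directly.
import os
import tempfile

def list_tree(root):
    """Yield (inode, size, path) for every file under root."""
    for subdir, dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(subdir, name)
            st = os.lstat(path)
            yield st.st_ino, st.st_size, path

# Demo on a throwaway tree containing a filename with an embedded newline:
top = tempfile.mkdtemp()
open(os.path.join(top, "a\nb"), "w").close()
for ino, size, path in list_tree(top):
    print(ino, size, repr(path))   # the newline survives intact
```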

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:
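A hedged sketch of that modification (my own illustration, not the answer's actual code; visible_files is an invented name):

```python
# Sketch: skip dotfiles and '*~' editor backups, and process files in
# alphabetical order by basename - a small edit to the walk, not a rewrite.
import os
import tempfile

def visible_files(root):
    """Yield paths under root, skipping dotfiles and backups, sorted by basename."""
    for subdir, dirs, files in os.walk(root):
        # Prune hidden directories in place so os.walk never descends into them.
        dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        for name in sorted(files):
            if name.startswith(".") or name.endswith("~"):
                continue
            yield os.path.join(subdir, name)

# Demo:
top = tempfile.mkdtemp()
for name in ("b.txt", "a.txt", ".hidden", "a.txt~"):
    open(os.path.join(top, name), "w").close()
print([os.path.basename(p) for p in visible_files(top)])   # ['a.txt', 'b.txt']
```

The dirs[:] slice assignment is the documented os.walk idiom for pruning the traversal while it runs.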

No recursion, just nested for-loops. os.walk is doing some seriously heavy lifting behind the scenes, but you don't have to worry about it any more than you have to worry about how ls or find work internally.
–
zwol, May 12 '14 at 23:04


Technically, os.walk returns a generator object. Generators are Python's version of lazy lists. Every time the outer for-loop iterates, the generator is invoked and "yields" the contents of another subdirectory. Equivalent functionality in Perl is File::Find, if that helps.
–
zwol, May 12 '14 at 23:12


I should note that in various comments, mikeserv's primary reason for parsing ls was that he can do some additional preprocessing (like sorting or filtering with grep) before the traversal. This alternative does not currently do that.
–
Izkata, May 13 '14 at 18:36


This is very misleading. Shell isn't a good programming language, but only because it isn't a programming language. It's a scripting language. And it's a good scripting language.
–
Miles Rout, May 13 '14 at 21:38


@MilesRout Shell is not a good any kind of language. Even in situations that it should be good at (e.g. /etc/init.d scripts, which are what it was designed for, insomuch as it was designed at all) the obvious way to do ... everything ... will leave you with subtle bugs, the correct way to do everything is tedious, and you have to read character by fucking character, looking for things that aren't there, to notice the difference. And I don't even want to talk about how tiny the true portable subset is.
–
zwol, May 13 '14 at 23:33

That link is referenced a lot because the information is completely accurate, and it has been there for a very long time.

ls replaces non-printable characters with glob characters, yes, but those characters aren't in the actual filename. Why does this matter? Two reasons:

If you pass that filename to a program, that filename doesn't actually exist; the glob would have to be expanded to get the real file name.

The file glob might match more than one file.

For example:

$ touch a$'\t'b
$ touch a$'\n'b
$ ls -1
a?b
a?b

Notice how we have 2 files which look exactly the same. How are you going to distinguish them if they both are represented as a?b?

The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

There is a difference here. When you get a glob back, as shown, that glob might match more than one file. However when you iterate through the results matching a glob, you get back the exact file, not a glob.
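A small illustration of that difference, with invented filenames (bash syntax):

```shell
# The ls listing is ambiguous, but iterating the glob is not: each $f below
# is a real, existing filename, not a '?' stand-in.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch a$'\t'b a$'\n'b

ls -1q            # prints a?b twice - you cannot tell the two files apart

n=0
for f in a?b; do
    [ -e "$f" ] || continue      # guard against a non-matching literal glob
    n=$((n + 1))
done
echo "the glob handed us $n exact filenames"
```

Inside the loop, "$f" can be passed straight to stat, rm, or anything else, because it is the exact name, tab or newline included.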

@mikeserv actually his solution doesn't return a glob. I just updated my answer to clarify that point.
–
Patrick, May 12 '14 at 2:02


"Not the rest"? It's inconsistent behavior, and unexpected results, how is that not a reason?
–
Patrick, May 12 '14 at 2:32


@mikeserv You could avoid the duplicates with something like for f in $(ls -1q | tr " " "?" | sed 's/^/"/; s/$/"/') ; do echo "$f"; done. But why not just for f in *; do echo "$f"; done?
–
terdon♦, May 12 '14 at 2:32


@mikeserv Did you not see my comment on your question? Shell globbing is 2.5 times faster than ls. I also requested that you test your code as it does not work. What does zsh have to do with any of this?
–
Patrick, May 12 '14 at 4:29


@mikeserv No, it all still applies even to bash. Though I'm done with this question as you're not listening to what I'm saying.
–
Patrick, May 12 '14 at 5:37

See? That's already wrong right there. There are 3 files, but bash is reporting 4. This is because set is being given the globs generated by ls, which are expanded by the shell before being passed to set. Which is why you get:

I upvoted this. It's good to see your own code bite you. But just because I got it wrong doesn't mean it can't be done right. I showed you a very simple way to do it this morning with ls -1qRi | grep -o '^ *[0-9]*' - that's parsing ls output, man, and it's the fastest and best way of which I know to get a list of inode numbers.
–
mikeserv, May 12 '14 at 22:56


@mikeserv: It could be done right, if you have the time and patience. But the fact is, it is inherently error-prone. You yourself got it wrong while arguing about its merits! That's a huge strike against it, if even the one person fighting for it fails to do it correctly. And chances are, you'll probably spend still more time getting it wrong before you get it right. I dunno about you, but most people have better things to do with their time than fiddle around for ages with the same line of code.
–
cHao, May 13 '14 at 1:06

The answer is simple: the special cases of ls you would have to handle outweigh any possible benefit. These special cases can all be averted if you don't parse ls output.

The mantra here is never trust the user's filesystem (the equivalent of never trust user input). If there's a method that will always work, with 100% certainty, it should be the method you prefer even if ls does the same but with less certainty. I won't go into technical details since those were covered by terdon and Patrick extensively. Knowing the risks of using ls in an important (and maybe expensive) transaction where my job/prestige is on the line, I will prefer any solution that doesn't carry a degree of uncertainty if it can be averted.

The reason people say never do something isn't necessarily because it absolutely positively cannot be done correctly. We may be able to do so, but it may be more complicated, or less efficient in space or time. For example it would be perfectly fine to say "Never build a large e-commerce backend in x86 assembly".

So now to the issue at hand: As you've demonstrated, you can create a solution that parses ls and gives the right result - so correctness isn't an issue.

Is it more complicated? Yes, but we can hide that behind a helper function.

So now to efficiency:

Space-efficiency: Your solution relies on uniq to filter out duplicates, so we cannot generate the results lazily. Either it's O(1) vs. O(n), or both are O(n).

Time-efficiency: In the best case uniq uses a hash-map approach, so we still have an O(n) algorithm in the number of elements procured; more probably, though, it's O(n log n).

Now the real problem: While your algorithm is still not looking too bad, I was really careful to say elements procured and not elements for n, because that makes a big difference. Say you have a file \n\n: that will result in the glob ??, which matches every 2-character file in the listing. Funnily enough, if you have another file \n\r, it will also result in ?? and also return all 2-character files... see where this is going? Exponential instead of linear behavior certainly qualifies as "worse runtime behavior"; it's the difference between a practical algorithm and one you write papers about in theoretical CS journals.
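A hedged demonstration of that blow-up (bash syntax, invented filenames):

```shell
# Both two-character names made of control characters collapse to the same
# '??' line in ls -q output, and each '??' then re-matches *every*
# two-character name in the directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch $'\n\n' $'\n\r' ab cd

ls -1q            # ??, ??, ab, cd

n=0
for f in $(ls -1q); do n=$((n + 1)); done   # unquoted: each line re-globs
echo "4 files expanded into $n matches"
```

Each of the two ?? lines matches all four files, plus one self-match each for ab and cd: 4 + 4 + 1 + 1 = 10 matches from 4 files, and it only gets worse as the directory grows.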

Everybody loves examples, right? Here we go. Make a folder called "test" and use this Python script in the same directory where the folder is.

thing here to work on Linux Mint 16 (which I think speaks volumes for the usability of this method).

Anyhow, since the above pretty much only filters the result after it gets it, the earlier solution should be at least as quick as the latter (no inode tricks in that one, but those are unreliable, so you'd be giving up correctness).

OP's Stated Intention Addressed

Preface and original answer's rationale † (updated on 2015-05-18)

mikeserv (the OP) stated in the latest update to his question: "I do consider it a shame though that I first asked this question to point out a source of misinformation, and, unfortunately, the most upvoted answer here is in large part misleading."

Well, okay; I feel it was rather a shame that I spent so much time trying to figure out how to explain my meaning, only to realize this as I re-read the question. This question ended up "[generating] discussion rather than answers"‡ and ended up weighing in at ~18K of text (for the question alone, just to be clear), which would be long even for a blog post.

But StackExchange is not your soapbox, and it's not your blog. However, in effect, you have used it as at least a bit of both. People ended up spending a lot of time answering your "To-Point-Out" instead of answering people's actual questions. At this point I will be flagging the question as not a good fit for our format, given that the OP has stated explicitly that it wasn't even intended to be a question at all.

At this point I'm not sure whether my answer was to the point, or not; probably not, but it was directed at some of your questions, and maybe it can be a useful answer to someone else; beginners take heart, some of those "do not"s turn into "do sometimes" once you get more experienced. :)

As a General Rule...

please forgive remaining rough edges; i have spent far too much time on this already... rather than quote the OP directly (as originally intended) i will try to summarize and paraphrase.

[largely reworked from my original answer] upon consideration, i believe that i mis-read the emphasis that the OP was placing on the questions i answered; however, the points addressed were brought up, and i have left the answers largely intact as i believe them to be to-the-point and to address issues that i've seen brought up in other contexts as well regarding advice to beginners.

The original post asked, in several ways, why various articles gave advice such as «Don't parse ls output» or «You should never parse ls output», and so forth.

My suggested resolution to the issue is that instances of this kind of statement are simply examples of an idiom, phrased in slightly different ways, in which an absolute quantifier is paired with an imperative [e.g., «don't [ever] X», «[you should] always Y», «[one should] never Z»] to form statements intended to be used as general rules or guidelines, especially when given to those new to a subject, rather than being intended as absolute truths, the apparent form of those statements notwithstanding.

When you're beginning to learn new subject matter, and unless you have some good understanding of why you might need to do else-wise, it's a good idea to simply follow the accepted general rules without exception, unless under guidance from someone more experienced than yourself. With rising skill and experience you become further able to determine when and if a rule applies in any particular situation. Once you do reach a significant level of experience, you will likely understand the reasoning behind the general rule in the first place, and at that point you can begin to use your judgement as to whether and to what level the reasons behind the rule apply in that situation, and also as to whether there are perhaps overriding concerns.

And that's when an expert, perhaps, might choose to do things in violation of "The Rules". But that wouldn't make them any less "The Rules".

And, so, to the topic at hand: in my view, just because an expert might be able to violate this rule without getting completely smacked down, i don't see any way that you could justify telling a beginner that "sometimes" it's okay to parse ls output, because: it's not. Or, at least, certainly it's not right for a beginner to do so.

You always put your pawns in the center; in the opening one piece, one move; castle at the earliest opportunity; knights before bishops; a knight on the rim is grim; and always make sure you can see your calculation through to the end! (Whoops, sorry, getting tired, that's for the chess StackExchange.)

Rules, Meant to Be Broken?

When reading an article on a subject that is targeted at, or likely to be read by, beginners, often you will see things like this:

"You should not ever do X."

"Never do Q!"

"Don't do Z."

"One should always do Y!"

"C, no matter what."

While these statements certainly seem to be stating absolute and timeless rules, they are not; instead this is a way of stating general rules [a.k.a. "guidelines", "rules of thumb", "the basics", etc.] that is at least arguably one appropriate way to state them for the beginners that might be reading those articles. However, just because they are stated as absolutes, the rules certainly don't bind professionals and experts [who were likely the ones who summarized such rules in the first place, as a way to record and pass on knowledge gained as they dealt with recurring issues in their particular craft.]

Those rules certainly aren't going to reveal how an expert would deal with a complex or nuanced problem, in which, say, those rules conflict with each other; or in which the concerns that led to the rule in the first place simply don't apply. Experts are not afraid to (or should not be afraid to!) simply break rules that they happen to know don't make sense in a particular situation. Experts are constantly dealing with balancing various risks and concerns in their craft, and must frequently use their judgement to choose to break those kind of rules, having to balance various factors and not being able to just rely on a table of rules to follow. Take goto as an example: there's been a long, recurring debate on whether it is harmful. (Yeah, don't ever use gotos. ;D)

A Modal Proposition

An odd feature of general rules, at least in English and I imagine in many other languages, is that they are stated in the same form as a modal proposition, yet the experts in a field are willing to give a general rule for a situation all the while knowing that they will break the rule when appropriate. Clearly, therefore, these statements aren't meant to be equivalent to the same statements in modal logic.

This is why i say they must simply be idiomatic. Rather than truly being a "never" or an "always" situation, these rules usually serve to codify general guidelines that tend to be appropriate over a wide range of situations, and that, when beginners follow them blindly, are likely to produce far better results than the beginner choosing to go against them without good reason would. Sometimes the rules codify choices that merely lead to substandard results, rather than the outright failures that accompany incorrect choices made against the rules.

So, general rules are not the absolute modal propositions they appear to be on the surface, but instead are a shorthand way of giving the rule with a standard boilerplate implied, something like the following:

unless you have the ability to tell that this guideline is incorrect in a particular case, and prove to yourself that you are right, then ${RULE}

where, of course you could substitute "never parse ls output" in place of ${RULE}. :)

Oh Yeah! What About Parsing ls Output?

Well, so, given all that... i think it's pretty clear that this rule is a good one. First of all, the real rule has to be understood to be idiomatic, as explained above...

But furthermore, it's not just that you have to be very good with shell scripting to know whether the rule can be broken in some particular case; it also takes just as much skill to tell that you got it wrong when you are testing your attempt to break it! And, I say confidently that a very large majority of the likely audience of such articles (giving advice like «Don't parse the output of ls!») can't do those things, and those who do have such skill will likely figure it out on their own and ignore the rule anyway.

But... just look at this question, and how even people who probably do have the skill thought it was a bad call to do so; and how much effort the author of the question spent just getting to the current best example! I guarantee you that on a problem that hard, 99% of the people out there would get it wrong, and with potentially very bad results! Even if the method decided on turns out to be a good one, until it (or another) ls-parsing idea becomes adopted by IT/developer folk as a whole, withstands a lot of testing (especially the test of time), and finally manages to graduate to 'common technique' status, it's likely that a lot of people might try it and get it wrong... with disastrous consequences.

So, I will reiterate one last time: especially in this case, that is why "never parse ls output!" is decidedly the right way to phrase it.

[UPDATE 2014-05-18: clarified reasoning for answer (above) to respond to a comment from OP; the following addition is in response to the OP's additions to the question from yesterday]

[UPDATE 2014-11-10: added headers and reorganized/refactored content; and also: reformatting, rewording, clarifying, and um... "concise-ifying"... i intended this to simply be a clean-up, though it did turn into a bit of a rework. i had left it in a sorry state, so i mainly tried to give it some order. i did feel it was important to largely leave the first section intact; so only two minor changes there, redundant 'but' removed, and 'that' emphasized.]

† I originally intended this solely as a clarification on my original; but decided on other additions upon reflection

Never isn't idiomatic. This is not an answer to anything.
–
mikeserv, May 17 '14 at 17:52

Hmm. Well, I didn't know whether this answer would be satisfying but I absolutely didn't expect it to be controversial. And, I didn't (mean to) argue that 'never' was per se idiomatic; but that "Never do X!" is an idiomatic use. I see two general cases that can show that 'Never/don't parse ls!' is correct advice: 1. demonstrate (to your satisfaction) that every use-case where one might parse ls output has another available solution, superior in some way, without doing so. 2. show that, in the cited cases, the statement is not a literal one.
–
shelleybutterfly, May 18 '14 at 6:50

Looking at your question again, I see that you first mention "don't ..." rather than "never ..." which is well into your analysis, so I'll clarify on that point as well. At this point there's already a solution of the first type, which is apparently demonstrated/explained to your satisfaction, so I won't delve into there much. But I'll try and clarify my answer a bit: like I say, I wasn't trying to be controversial (or confrontational!) but to point out how those statements are generally intended.
–
shelleybutterfly, May 18 '14 at 6:53

I actually posted it since I didn't see it given as an explanation; and, since I have more than once had someone hung up on my choice of a solution because 'but "everyone" says never to do that!' when I made a (non-conventional) judgement that it was nonetheless appropriate given the circumstances. [example: "You can't/shouldn't use C++ in real-time safety-critical software!"] The more thoroughly you understand something, and the more experience you have with it, the more often you will see a superior solution that 'breaks the rules', but that doesn't mean a beginner should do so willy-nilly.
–
shelleybutterfly, May 18 '14 at 7:10


Well, i reversed my downvote because, at the very least, youre right about the flagging thing. Ill try to clean it up tonight or tomorrow. My thought is i'll move most of the code examples to an answer i guess. But it still doesnt, as far as im concerned, excuse the inaccuracies in that oft-cited blog post. I wish people would stop citing the bash manual altogether - at least not til after theyve cited the POSIX specs...
–
mikeserv, May 18 '14 at 17:44

Is it possible to parse the output of ls in certain cases? Sure. The idea of extracting a list of inode numbers from a directory is a good example - if you know that your implementation's ls supports -q, and therefore each file will produce exactly one line of output, and all you need are the inode numbers, parsing them out of ls -Rai1q output is certainly a possible solution. Of course, if the author hadn't seen advice like "Never parse the output of ls" before, he probably wouldn't think about filenames with newlines in them, and would probably leave off the 'q' as a result, and the code would be subtly broken in that edge case - so, even in cases where parsing ls's output is reasonable, this advice is still useful.

The broader point is that, when a newbie to shell scripting tries to have a script figure out (for instance) what's the biggest file in a directory, or what's the most recently modified file in a directory, his first instinct is to parse ls's output - understandable, because ls is one of the first commands a newbie learns.

Unfortunately, that instinct is wrong, and that approach is broken. Even more unfortunately, it's subtly broken - it will work most of the time, but fail in edge cases that could perhaps be exploited by someone with knowledge of the code.

The newbie might think of ls -s | sort -n | tail -n 1 | awk '{print $2}' as a way to get the biggest file in a directory. And it works, until you have a file with a space in the name.

OK, so how about ls -s | sort -n | tail -n 1 | sed 's/[^ ]* *[0-9]* *//'? Works fine until you have a file with a newline in the name.

Does adding -q to ls's arguments help when there's a newline in the filename? It might look like it does, until you have 2 different files that contain a non-printable character in the same spot in the filename, and then ls's output doesn't let you distinguish which of those was biggest. Worse, in order to expand the "?", he probably resorts to his shell's eval - which will cause problems if he hits a file named, for instance,

foo`/tmp/malicious_script`bar
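To see why that eval is dangerous, here is a hedged sketch with a harmless payload standing in for the malicious script (bash syntax; $( ) used instead of the backquotes above so the demo is self-contained):

```shell
# A harmless payload stands in for the malicious script: the moment eval
# touches the ls output, the command embedded in the filename runs.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch 'foo$(echo PWNED)bar'       # a perfectly legal filename

name=$(ls)                        # the literal string: foo$(echo PWNED)bar
got=$(eval "echo $name")          # eval executes the embedded command...
echo "$got"                       # fooPWNEDbar - attacker code ran
```

With `/tmp/malicious_script` in place of `echo PWNED`, that would have been arbitrary code execution triggered by nothing more than listing a directory.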

Does --quoting-style=shell help (if your ls even supports it)? Nope, still displays ? for nonprintable characters, so it's still ambiguous which of multiple matches was the biggest. --quoting-style=literal? Nope, same. --quoting-style=locale or --quoting-style=c might help if you just need to print the name of the biggest file unambiguously, but probably not if you need to do something with the file afterwards - it would be a bunch of code to undo the quoting and get back to the real filename so that you can pass it to, say, gzip.

And at the end of all that work, even if what he has is safe and correct for all possible filenames, it's unreadable and unmaintainable, and could have been done much more easily, safely, and readably in python or perl or ruby.

Or even using other shell tools - off the top of my head, I think this ought to do the trick:
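The command intended there isn't shown, but one hedged way to do it with other shell tools (this sketch assumes GNU find and coreutils, for -printf and the -z/NUL-delimiter options) is:

```shell
# Hedged sketch, not the answer's actual command: find the biggest file in
# the current directory using NUL-delimited records, so any filename -
# spaces, newlines, whatever - is handled safely.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'x'        > small
printf '%100s' '' > 'big file'       # 100 bytes; the name contains a space

biggest=$(find . -maxdepth 1 -type f -printf '%s %p\0' |
          sort -z -n | tail -z -n 1 | cut -z -d ' ' -f 2- | tr -d '\0')
echo "biggest: $biggest"             # biggest: ./big file
```

Each record is "size path" terminated by NUL, sorted numerically, and the last record's path is peeled off with cut; no stage ever has to guess where one filename ends and the next begins.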

Oh, true about the size - i probably could do that if i tried - should i? Im kinda tired of this whole thing - i like your answer because you dont say can't or dont or never but actually give examples of maybe why not and comparable how else - thank you.
–
mikeserv, May 16 '14 at 16:44

I think if you tried, you'd discover it's much harder than you think. So, yes, I'd recommend trying. I'll be happy to keep giving filenames that will break for you as long as I can think of them. :)
–
godlygeek, May 16 '14 at 16:50

Comments are not for extended discussion; this conversation has been moved to chat.
–
terdon♦, Aug 23 '14 at 10:24

@mikeserv and godlygeek, I have moved this comment thread to chat. Please don't have long discussions like this in the comments, that's what chat is for.
–
terdon♦, Aug 23 '14 at 10:28