June 23rd, 2024

Why *not* parse `ls` (and what to do instead)

Parsing 'ls' output on Unix systems is discouraged because filenames can contain special characters such as whitespace and newlines. Shell globs and for loops are recommended as more reliable ways to handle file lists.

The article discusses why parsing the output of the 'ls' command on Unix systems is discouraged and offers alternatives. Filenames can contain special characters like whitespace and newlines, which makes parsing 'ls' output unreliable. The author suggests shell globs, or iterating over files with a for loop, as more robust ways to handle file lists, emphasizing their simplicity and effectiveness for file operations. The article also addresses globbing's own limitations and pitfalls, such as patterns matching more than intended, and concludes by advocating for telling users about practical solutions rather than dwelling on the limitations of parsing 'ls' output.
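
For illustration, a minimal sketch of the pattern the article recommends, assuming Bash:

  shopt -s nullglob            # unmatched globs expand to nothing instead of themselves
  for f in ./*.txt; do
    printf 'processing %s\n' "$f"
  done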

29 comments

By @hawski - 7 months
I think that when someone uses ls instead of a glob, it most probably means they don't understand shell. I don't see any advantage to parsing ls output when globs are available. Shell is finicky enough not to invite more trouble. Same with word splitting: it's one of the reasons to use shell functions, because then you have "$@", which makes sense; any other way to do it is something I can't comprehend.

Maybe I also don't understand shell, but as it was said before: when in doubt switch to a better defined language. Thank heavens for awk.
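
Both points in one small sketch, assuming Bash (print_all is a made-up name):

  print_all() {
    printf '%s\n' "$@"   # "$@" keeps each argument as its own word
  }
  print_all ./*          # the glob expands to one argument per file, spaces and all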

By @Aerbil313 - 7 months
What to do instead: Use Nushell.

I finally started really using my shell after switching to it. I casually write multiple scripts and small functions per day to automate my stuff. I'm writing scripts in Nu that I'd otherwise write in Python, all because the data needs no parsing. I'm not even annotating my data with types, even though Nushell supports it, because it turns out structured data with inferred types is more than you need day-to-day. And I'm not even talking about all the other nice features other shells simply don't have. See this custom command definition:

  # A greeting command that can greet the caller
  def greet [
    name: string      # The name of the person to greet
    --age (-a): int   # The age of the person
  ] {
    [$name $age]
  }
Here's the auto-generated output when you run `help greet`:

  A greeting command that can greet the caller

  Usage:
    > greet <name> {flags}

  Parameters:
    <name> The name of the person to greet

  Flags:
    -h, --help: Display this help message
    -a, --age <integer>: The age of the person
It's one of those pieces of software that only empowers you, immediately, without a single downside. Except the time spent learning it, but that was about a week for me. Bash or fish is still there if I ever need to paste some shell commands.

By @noobermin - 7 months
Posts like these are like the main character threads on twitter where someone says, "men don't do x" or "women aren't like y." It just feels like people outside of you who have no understanding of your context seem intent on making up rules for how you should code things.

Perhaps it would help to translate this into something more like "what pitfalls do you run into if you parse `ls`?", but it's hard to get past the initial language.

By @probably_wrong - 7 months
I think there's a middle ground where you want to do something complex enough that a glob won't cut it but simple enough that switching languages isn't worth it.

I think the example of "exclude these two types of files" is a good case. I often have to write stuff like `ls P* | grep -Ev "wav|draft"` which doesn't solve a problem I don't have (such as filenames with newlines in them) but does solve the one I do (keeping a subset of files that would be tricky to glob properly).

In my experience 95% of those scripts are going to be discarded in a week, and bringing Python into it means I need to deal with `os.path` and `subprocess.run`. My rule of thumb: if it's not going to be version controlled then Bash is fine.
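
For what it's worth, that particular exclusion can also be done without parsing ls, either with find or with Bash's extglob; a rough sketch (not byte-for-byte equivalent to the grep version):

  find . -maxdepth 1 -name 'P*' ! -name '*wav*' ! -name '*draft*'

  # or, with Bash extended globs:
  shopt -s extglob
  printf '%s\n' P!(*wav*|*draft*)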

By @fellerts - 7 months
The title omits the final '?' which is important, because the rant and its replies didn't settle the matter.

Shellcheck's page on parsing ls links to the article the author is nitpicking on, but it also links to the answer to "what to do instead": use find(1), unless you really can't. https://mywiki.wooledge.org/BashFAQ/020
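
The loop that FAQ entry recommends, for reference:

  while IFS= read -r -d '' file; do
    printf 'found: %s\n' "$file"
  done < <(find . -type f -print0)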

By @badsectoracula - 7 months
I guess this is for shell scripts that need to work with "unsafe" filenames?

I've been using Linux since 1999 and I never came across a filename with newlines. On the other hand, pretty much all the "ls parsing" I've done was on the command line, piping it to other stuff over files I was 100.1% sure would be fine.
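
Such a name is easy to manufacture for testing, though, and it shows where the naive loop breaks (GNU ls quotes names when writing to a terminal, but emits them raw to a pipe):

  touch $'one\ntwo'    # a single file whose name contains a newline
  ls | while read -r f; do echo "got: $f"; done
  # got: one
  # got: two           <- one file, but the loop sees two names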

By @geophile - 7 months
I wrote a pipe-objects-instead-of-strings shell: https://marceltheshell.org.

Not piping strings avoids this issue completely. Marcel’s ls produces a stream of File objects, which can be processed without worrying about whitespace, EOL, etc.

In general, this approach avoids parsing the output of any command. You always get a stream of Python values.

By @jcalvinowens - 7 months
Not sure how portable it is, but GNU ls has a flag that solves this problem trivially:

  --zero    end each output line with NUL, not newline
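Combined with any NUL-aware consumer, for example (a sketch, assuming GNU coreutils 9.0 or newer for --zero, and GNU stat for -c):

  ls --zero | xargs -0 stat -c '%s %n'   # size and name per file, newline-proof
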
By @billpg - 7 months
Why do you want to put LF bytes into filenames?

Using magic, I've renamed any files you have to remove control characters in the name and made it impossible to make any new ones. (You can thank me later.)

What can't you do now?

By @7bit - 7 months
Or use PowerShell where LS returns a bunch of objects, and say goodbye to string parsing forever.

By @waffletower - 7 months
Borkdude has a wonderful Clojure/Babashka solution in this space: https://github.com/babashka/fs

By @g15jv2dp - 7 months
What to do instead: use pwsh to completely obviate all these issues.

By @teddyh - 7 months
Many people turn to globbing to save them, which is usually better but has some problems when there are no matches. For Bash, you can fix that with:

  shopt -s failglob
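The related nullglob option is worth knowing too: it makes an unmatched glob expand to nothing instead of erroring, which is often what a loop wants:

  shopt -s nullglob
  for f in ./*.rare; do
    printf '%s\n' "$f"   # never runs when nothing matches
  done
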
By @cess11 - 7 months
I don't know, this seems like a lot of words to avoid coming to the conclusion that there are many ways to skin a directory.

Most of the time it's fine to just suck in ls, split it on \n, and iterate away, which I do a lot because it's a nice and simple way forward when names are well-formed. Sometimes it's nicer to figure out a 'find at-place thing -exec do-the-stuff {} \;'. And sometimes one needs some other tool that scours the file system directly, doesn't choke on absolutely bizarre file names, and gives a representation that doesn't explode in the subsequent context, whatever that may be, but that's quite rare.

A more common issue than file names containing line breaks is unclean encodings: non-UTF-8 text that seeps in from lesser operating systems. Renaming makes the problem go away, so one should absolutely do that, and then crude techniques are likely very viable again.

By @tmtvl - 7 months
Today I learned how neat find is:

  find ~/Music -iname 'p*' -not -iname '*age*' -not -iname '*etto*'
  find ~/Music -iname 'p*' -not -iregex '.*\(age\|etto\).*'
  find ~/Music -regextype posix-extended -iname 'p*' -not -iregex '.*(age|etto).*'
Not that I'm likely to ever use any of that in anger, but it's good to know if ever I do wind up needing it.

By @zokier - 7 months
I wonder if anyone has implemented a kernel module or something to limit filenames to a sane set. Just ensuring that they are valid UTF-8 and contain no non-printables would be a huge improvement. Sure, some niche applications might break, so it's not something that can be made the default, but I still think it would help on systems I control.

By @Nimitz14 - 7 months
These sorts of pedantic exchanges are so pointless to me. We are programmers. We can control what characters are used in filenames. Then you can use the simplest tool for the job and move on with your life to focus on the stuff that actually matters. Fix the root cause instead of creating workarounds for the symptom.

By @amelius - 7 months
I feel like Unix utilities should provide a standardized way to generate machine-readable output, perhaps using JSON.
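
Some of this already exists as a third-party layer: the jc tool wraps many classic utilities and emits JSON (a sketch; field names come from jc's parsers and may vary by version):

  ls -l | jc --ls | jq '.[].filename'
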
By @midjji - 7 months
The bash code that creates a C file which gets the list of NUL-terminated filenames in a directory, compiles it, and runs it is easier to write and understand. Bash is a lousy language to do anything in; Python is almost always available, and if not, then cc is.
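
A sketch of the maneuver being described, with hypothetical file names; first the C program (note readdir includes "." and "..", which a real script might filter):

  /* lsz.c: print each directory entry NUL-terminated */
  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>

  int main(int argc, char **argv) {
      DIR *d = opendir(argc > 1 ? argv[1] : ".");
      if (!d) return 1;
      for (struct dirent *e; (e = readdir(d)) != NULL; )
          fwrite(e->d_name, 1, strlen(e->d_name) + 1, stdout);  /* keep the NUL */
      closedir(d);
      return 0;
  }
Then the shell side:

  cc -o lsz lsz.c && ./lsz . | xargs -0 -n1 printf 'got: %s\n'
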
By @Tempest1981 - 7 months
Recent discussion about the original "don't parse" page being referenced:

https://news.ycombinator.com/item?id=40692698 (10 days ago, 83 comments)

By @InsideOutSanta - 7 months
Files and directories, once a reference to them is obtained, should not be identified by their path. This causes all kinds of problems, like the reference breaking when the user moves or renames things, and issues like the ones described in the article, where some "edge case" (and I'm using that term very loosely, because it includes common situations like a space in a file name) causes problems down the line.

You might say that people don't move or rename things while files are open, but they absolutely do, and it absolutely breaks things. Even something as simple as starting to copy a directory in Explorer to a different drive, and then moving it while the copy is ongoing, doesn't work. That's pathetic! There is no technical reason this should not be possible.

And who can forget the case where an Apple installer deleted people's hard disk contents when they had two drives, one with a space character, and another one whose name was the string before the first drive's space character?

Files and directories need to have a unique ID, and references to files need to be that ID, not their path, in almost all cases. MFS got that right in 1984; it's insane that we have failed to properly replicate this simple concept ever since, and have actually gone backwards in systems like Mac OS X, which used to work correctly and now no longer consistently does.

By @lostmsu - 7 months
This is a problem I faced recently on Linux. You can use `ip addr` to see the list of your IPv6 addresses and their types (temporary or not, etc.), but doing it programmatically from a non-C codebase is way more involved.
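
For what it's worth, iproute2 can emit JSON itself via -j, which makes this scriptable with jq (a sketch; the keys for temporary-address flags may differ across versions):

  ip -j addr show | jq -r '.[].addr_info[] | select(.family == "inet6") | .local'
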
By @tremon - 7 months
Most of the time I avoid parsing ls, but I haven't found a reliable way to do this one:

  latest="$(ls -1 $pattern | sort --reverse --version-sort | head -1)"
Anyone got a better solution?
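
One NUL-safe variant, assuming GNU coreutils (both sort and head accept -z) and that $pattern is a glob suitable for find's -name (results carry a leading ./):

  latest=$(find . -maxdepth 1 -name "$pattern" -print0 \
           | sort -z --reverse --version-sort \
           | head -z -n 1 | tr -d '\0')
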
By @renewiltord - 7 months
I just solve this by not having files like that on my computer. No spaces. No null chars.

By @bandie91 - 7 months
I searched through the page and have not found `find ... -printf "%M %n %u %g %s ...\0"` mentioned. This way you get ls(1)-like output, yet machine-parseable.
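
Consuming that is then the usual NUL-delimited read loop; keeping the filename last lets read peel the fixed fields off the front (a sketch; a trailing space in a name would still be trimmed by read):

  find . -printf '%M %n %u %g %s %p\0' |
  while read -r -d '' mode links user group size path; do
    printf '%s is %s bytes\n' "$path" "$size"
  done
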
By @TacticalCoder - 7 months
Now of course, scripts and pre-commit hooks that enforce simple rules, so that filenames may only use a subset of Unicode, are a thing and do help (a sketch follows below).

Do you really think that, say, all music streaming services are storing their songs with names allowing Unicode HANGUL fillers and control characters that modify text direction?

Or... maybe, just maybe, Unicode characters belong in metadata, and a strict rule of "only visible ASCII chars are allowed and nothing else or you're fired" does make sense.

I'm not saying you always have control on every single filename you'll ever encounter. But when you've got power over that and can enforce saner rules, sometimes it's a good idea to use it.
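
A sketch of such a hook, assuming GNU grep (for -z and -P); the filename list is NUL-delimited so the check itself doesn't trip over odd names:

  # .git/hooks/pre-commit (hypothetical)
  if git diff --cached --name-only -z | grep -qzP '[^\x20-\x7E]'; then
    echo 'error: staged filename contains characters outside printable ASCII' >&2
    exit 1
  fi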

You'll thank me later.