Why *not* parse `ls` (and what to do instead)
Parsing 'ls' output on Unix systems is discouraged because filenames may contain special characters. Shell globs and for loops are recommended instead as more reliable ways to handle file lists.
The article discusses why parsing the output of the 'ls' command on Unix systems is discouraged and offers alternative solutions. It highlights the challenges posed by filenames containing special characters such as whitespace and newlines, which make parsing 'ls' output unreliable. The author suggests using shell globs, or iterating over files with a for loop, as simpler and more effective ways to handle file lists. The article also addresses the limitations and potential pitfalls of globbing, such as patterns matching more than intended, and concludes by advocating for showing users practical solutions rather than dwelling on the shortcomings of parsing 'ls' output.
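For reference, the glob-plus-loop alternative the article advocates looks roughly like this (a minimal sketch; the '*.txt' pattern is a placeholder):

    for f in ./*.txt; do
        [ -e "$f" ] || continue    # without nullglob, skip the unexpanded pattern
        printf 'found: %s\n' "$f"
    done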
Related
Avoiding Emacs Bankruptcy
Avoid "Emacs bankruptcy" by choosing efficient packages, deleting unnecessary configurations, and focusing on Emacs's core benefits. Prioritize power-to-weight ratio to prevent slowdowns and maintenance issues. Regularly reassess for a streamlined setup.
Ruby: A great language for shell scripts
Ruby is praised for shell scripting due to its features like calling external commands, handling status codes, using types, functional constructs, regex matching, threading, and file operations. It's recommended for complex scripts alongside Bash.
Elixir Gotchas
The article highlights common pitfalls in Elixir programming, including confusion between charlists and strings, differences in pattern matching, struct behavior, accessing struct fields, handling keyword lists, and unique data type comparisons.
Start all of your commands with a comma (2009)
The article discusses creating a ~/bin/ directory in Unix to store custom commands, avoiding name collisions with system commands by prefixing custom commands with a comma. This technique ensures unique, easily accessible commands.
Structured logs are the way to start
Structured logs are crucial for system insight, aiding search and aggregation. Despite storage challenges, prioritizing indexing and retention strategies is key. Valuable lessons can be gleaned from email for software processes.
Maybe I also don't understand shell, but as was said before: when in doubt, switch to a better-defined language. Thank heavens for awk.
I finally started really using my shell after switching to Nushell. I casually write multiple scripts and small functions per day to automate my stuff. I'm writing scripts in Nu that I'd otherwise write in Python, all because the data needs no parsing. I'm not even annotating my data with types, even though Nushell supports it, because it turns out structured data with inferred types is more than you need day-to-day. And I'm not even talking about all the other nice features other shells simply don't have. See this custom command definition:
    # A greeting command that can greet the caller
    def greet [
        name: string     # The name of the person to greet
        --age (-a): int  # The age of the person
    ] {
        [$name $age]
    }
Here's the auto-generated output when you run `help greet`:

    A greeting command that can greet the caller

    Usage:
      > greet <name> {flags}

    Parameters:
      <name> The name of the person to greet

    Flags:
      -h, --help: Display this help message
      -a, --age <integer>: The age of the person
It's one of those pieces of software that only empowers you, immediately, without a single downside, except the time spent learning it; that was about a week for me. Bash or fish is still there if I ever need to paste some shell commands.

Perhaps it would help to translate this into something more like "what pitfalls do you run into if you parse `ls`", but it's hard to get past the initial language.
I think the example of "exclude these two types of files" is a good case. I often have to write stuff like `ls P* | grep -Ev "wav|draft"` which doesn't solve a problem I don't have (such as filenames with newlines in them) but does solve the one I do (keeping a subset of files that would be tricky to glob properly).
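For what it's worth, that particular subset is globbable with bash's extglob, though whether it's clearer than the grep pipeline is debatable (a sketch; the P*, wav, and draft parts are taken from the example above):

    shopt -s extglob
    # Names starting with P that contain neither "wav" nor "draft".
    # (With no match, the pattern passes through literally unless
    # nullglob or failglob is also set.)
    printf '%s\n' P!(*wav*|*draft*)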
In my experience 95% of those scripts are going to be discarded in a week, and bringing Python into it means I need to deal with `os.path` and `subprocess.run`. My rule of thumb: if it's not going to be version controlled then Bash is fine.
Shellcheck's page on parsing ls links to the article the author is nitpicking on, but it also links to the answer to "what to do instead": use find(1), unless you really can't. https://mywiki.wooledge.org/BashFAQ/020
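The FAQ's family of answers boils down to handing names straight to find, or streaming them NUL-delimited into a shell loop (a sketch; the '*.bak' pattern and the actions are placeholders):

    # Act on matches directly:
    find . -type f -name '*.bak' -exec rm -- {} +

    # Or stream NUL-delimited names into a bash loop:
    while IFS= read -r -d '' f; do
        printf 'processing: %s\n' "$f"
    done < <(find . -type f -print0)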
I've been using Linux since 1999 and I never came across a filename with newlines. On the other hand, pretty much all "ls parsing" I've done was on the command line, piping it to other stuff, on files I was 100.1% sure would be fine.
Not piping strings avoids this issue completely. Marcel’s ls produces a stream of File objects, which can be processed without worrying about whitespace, EOL, etc.
In general, this approach avoids parsing the output of any command. You always get a stream of Python values.
    --zero      end each output line with NUL, not newline
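That's GNU coreutils 9.0+. One way to consume it safely (a sketch; mapfile -d needs bash 4.4+):

    # Read the NUL-delimited names into an array:
    mapfile -d '' -t files < <(ls --zero)
    printf '%q\n' "${files[@]}"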
Using magic, I've renamed any files you have to remove control characters in the name and made it impossible to make any new ones. (You can thank me later.)
What can't you do now?
shopt -s failglob
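For reference, failglob makes bash treat a glob with no matches as an error rather than passing the pattern through literally:

    shopt -s failglob
    for f in *.wav; do       # with no .wav files this errors out
        printf '%s\n' "$f"   # ("bash: no match: *.wav") instead of
    done                     # iterating over the literal pattern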
Most of the time it's fine to just suck in ls and split it on \n and iterate away, which I do a lot because it's just a nice and simple way forward when names are well-formed. Sometimes it's nicer to figure out a 'find at-place thing -exec do-the-stuff {} \;'. And sometimes one needs some other tool that scours the file system directly and doesn't choke on absolutely bizarre file names and gives a representation that doesn't explode in the subsequent context, whatever that may be, which is quite rare.
A more common issue than file names consisting of line breaks is unclean encodings, non-UTF-8 text that seeps in from lesser operating systems. Renaming makes the problem go away, so one should absolutely do that and then crude techniques are likely very viable again.
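One common tool for that renaming is convmv, which transcodes filenames in place; a sketch, assuming the stray names are Latin-1:

    # Preview the renames (convmv is a dry run by default), then apply:
    convmv -f iso-8859-1 -t utf-8 -r .
    convmv -f iso-8859-1 -t utf-8 -r --notest .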
find ~/Music -iname 'p*' -not -iname '*age*' -not -iname '*etto*'
find ~/Music -iname 'p*' -not -iregex '.*\(age\|etto\).*'
find ~/Music -regextype posix-extended -iname 'p*' -not -iregex '.*(age|etto).*'
Not that I'm likely to ever use any of that in anger, but it's good to know if ever I do wind up needing it.

https://news.ycombinator.com/item?id=40692698 (10 days ago, 83 comments)
You might say that people don't move or rename things while files are open, but they absolutely do, and it absolutely breaks things. Even something as simple as starting to copy a directory in Explorer to a different drive, and then moving it while the copy is ongoing, doesn't work. That's pathetic! There is no technical reason this should not be possible.
And who can forget the case where an Apple installer deleted people's hard disk contents when they had two drives, one with a space character, and another one whose name was the string before the first drive's space character?
Files and directories need to have a unique ID, and references to files need to be that ID, not their path, in almost all cases. MFS got that right in 1984, it's insane that we have failed to properly replicate this simple concept ever since, and actually gone backwards in systems like Mac OS X, which used to work correctly, and now no longer consistently do.
    latest="$(ls -1 $pattern | sort --reverse --version-sort | head -1)"

Anyone got a better solution?
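One possibility that keeps the version sort but makes the pipeline NUL-safe (a sketch; assumes GNU sort and head, with $pattern the same unquoted glob as above):

    latest=$(printf '%s\0' $pattern |
        sort --zero-terminated --reverse --version-sort |
        head --zero-terminated -n 1 | tr -d '\0')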
Do you really think that, say, all music streaming services are storing their songs under names allowing Unicode HANGUL fillers and control characters that flip text direction?
Or maybe, just maybe, Unicode characters belong in metadata, and a strict rule of "only visible ASCII characters are allowed, nothing else, or you're fired" does make sense.
I'm not saying you always have control on every single filename you'll ever encounter. But when you've got power over that and can enforce saner rules, sometimes it's a good idea to use it.
You'll thank me later.
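If you do get to enforce that, auditing for offenders is a one-liner (a sketch; the C locale makes the bracket range byte-wise):

    # List names containing anything outside printable ASCII:
    LC_ALL=C find . -name '*[! -~]*'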