
to-lines should support a configurable EOL char #1070

Closed
krader1961 opened this issue Jul 10, 2020 · 15 comments · Fixed by #1308

Comments

@krader1961
Contributor

krader1961 commented Jul 10, 2020

I was working on a change to the Elvish unit test framework to avoid deadlocks caused by the shell writing more output than can be buffered by a pipe. So I wrote a test that did, among other things, this:

repeat (* 128 1024) x | str:join '' | to-lines

What I want is exactly 128 KiB of the letter "x", without a trailing newline, written to stdout. What I get as a consequence of the to-lines is the 128 KiB of "x"s plus a newline. The solution was to replace the above with this:

repeat (* 128 1024) x | each [c]{ print $c }

I'm perfectly happy with the solution above. What I'm wondering is how many other people would have reached for the to-lines command like I did, perhaps without noticing they were getting an unwanted newline? Also, should there be an alternative, such as to-lines &n, that suppresses the newline and might be easier to discover and more intuitive for some people?
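The difference between the two pipelines above can be sketched in a few lines. This is an illustrative model in Python, not Elvish internals: `to_lines` and `print_each` here mimic the described behavior of `to-lines` and `each [c]{ print $c }` respectively.

```python
# Sketch of the two behaviors discussed above (not Elvish's implementation):
# to-lines appends a newline to every value it writes, while the
# `each [c]{ print $c }` workaround writes each value with no terminator.

def to_lines(values):
    """Mimic to-lines: every value becomes a newline-terminated line."""
    return "".join(v + "\n" for v in values)

def print_each(values):
    """Mimic `each [c]{ print $c }`: values are written verbatim."""
    return "".join(values)

payload = ["x"] * (128 * 1024)
joined = "".join(payload)          # what str:join '' collapses the stream to

assert to_lines([joined]) == joined + "\n"   # the unwanted trailing newline
assert print_each(payload) == joined         # exactly 128 KiB of "x"
```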

@krader1961
Contributor Author

I'm perfectly happy with the solution above.

That's not entirely true since it is considerably slower than the str:join '' | to-lines solution. As in ~47 times slower (halfway between one and two orders of magnitude):

> time { repeat (* 128 1024) x | str:join '' | to-lines >/dev/null }
61.622559ms
> time { repeat (* 128 1024) x | each [c]{ print $c } >/dev/null }
2.884555874s

It's hard to see how the each ... solution will ever be faster than the alternative that uses optimized builtins written in Go.

@krader1961
Contributor Author

krader1961 commented Jul 10, 2020

Note that if to-lines &n worked I could simply do repeat (* 128 1024) x | to-lines &n and eliminate the need for the str:join '' in the pipeline, which would be even more efficient. I'm not thrilled that the meaning of that pipeline is slightly confusing, since it isn't converting each value to a distinct line in the bytes output stream. But it does provide a potentially useful, and highly efficient, mechanism. In fact, it would probably be useful to add a &zero (and/or &0) option to terminate each value with a zero byte rather than a newline or nothing at all.

@krader1961
Contributor Author

On further reflection I think to-lines should have a &eol="\n" option rather than a &n (no newline) option. This would make it possible to convert values to "lines" without a terminating newline by doing to-lines &eol=''. More importantly, it would make it possible to convert values to NUL-terminated strings via to-lines &eol="\000". This would be an extremely useful feature when working with external commands that can be told to accept NUL, rather than newline, terminated lines.
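The proposed &eol option maps naturally onto a single terminator parameter. A minimal sketch in Python (the option name and its "\n" default come from the comment above; the Python signature is purely illustrative):

```python
# Hypothetical model of the proposed to-lines &eol option.
# Not Elvish's API; just the semantics being proposed.

def to_lines(values, eol="\n"):
    """Terminate each value with eol ("\n" by default, like to-lines)."""
    return "".join(v + eol for v in values)

vals = ["a", "b", "c"]
assert to_lines(vals) == "a\nb\nc\n"                    # default: lines
assert to_lines(vals, eol="") == "abc"                  # &eol='' : no terminator
assert to_lines(vals, eol="\x00") == "a\x00b\x00c\x00"  # &eol="\000" : NUL
```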

@krader1961 changed the title from "Need a way to convert value to byte output without adding a newline" to "to-lines should support a configurable EOL char" on Jul 21, 2020
@xiaq
Member

xiaq commented Jul 25, 2020

print (str:join '') works and is quite fast:

~> time { repeat (* 128 1024) x | print (str:join '') >/dev/null }
49.249075ms
~> time { repeat (* 128 1024) x | str:join '' | to-lines >/dev/null }
124.185692ms

However, one property that it doesn't have is streaming: print (str:join '') will join all the values before printing them, whereas to-lines will print the values as they come.
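The streaming distinction can be made concrete with a small sketch (Python used for illustration only; these functions model the two pipelines, they are not Elvish internals):

```python
# join-then-print must consume the whole value stream before writing
# anything, while per-value writing emits output as values arrive.
import io

def produce():
    """A small stand-in for `repeat N x` (a stream of values)."""
    for _ in range(4):
        yield "x"

def join_then_print(values, out):
    """Model of print (str:join ''): buffer everything, write once."""
    out.write("".join(values))

def stream_lines(values, out):
    """Model of to-lines: write each value incrementally."""
    for v in values:
        out.write(v + "\n")

buf1 = io.StringIO()
join_then_print(produce(), buf1)
assert buf1.getvalue() == "xxxx"

buf2 = io.StringIO()
stream_lines(produce(), buf2)
assert buf2.getvalue() == "x\nx\nx\nx\n"
```

For a bounded stream both produce the same bytes modulo terminators, but only the streaming form keeps memory usage independent of the stream length.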

The point of to-lines is to convert values into lines, so I'd rather not make the EOL character configurable; another command with a configurable terminator can do, but I am not sure about the name yet.

@krader1961
Contributor Author

The point of to-lines is to convert values into lines, so I'd rather not make the EOL character configurable; another command with a configurable terminator can do, but I am not sure about the name yet.

I don't understand why introducing a new command is preferable to adding a &eol="\n" option. We don't have two find commands; we have one which has options -print and -print0. The desire to add distinct commands is what has led to the existence of echo and print. That increases the cognitive cost of using the language. Is a nul terminated line fundamentally different from a newline terminated line, thus warranting a distinct command?

I'm also dismayed that the second example is ~1.5 times slower than the first. I would have expected the two examples to be very close in execution time, with the second possibly a smidge slower due to the second pipe, but not 1.5x slower. Having said that, those aren't particularly realistic examples. More realistic would be comparing 1024 instances of a 128-byte literal:

> time { repeat 1024 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | print (str:join '') >/dev/null }
1.368276ms
> time { repeat 1024 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | to-lines >/dev/null }
2.260397ms

The to-lines formulation is still 0.6 times slower, which is inexplicable on its face.

@xiaq
Member

xiaq commented Oct 28, 2020

Is a nul terminated line fundamentally different from a newline terminated line, thus warranting a distinct command?

You can of course bend the language and generalize the concept of "line" - but would a new user be able to guess that? When a new user wants to add NUL to each string in a stream, would they look for to-lines? IMO the answer is a firm no.

@krader1961
Contributor Author

... would a new user be able to guess that? When a new user wants to add NUL to each string in a stream, would they look for to-lines?

They might if something like an &eol option were used consistently in related contexts, such as the echo and from-lines commands. The obvious corollary is that if to-nul-terminated (or some such) is introduced, then symmetry would seem to require a from-nul-terminated. Too, while newbies might not think of a line being terminated by anything other than newline (or crnl if you're on Windows), experienced UNIX users are certainly familiar with the idea -- think of commands like find and xargs, which provide an option for handling nul terminated "lines" (or blocks of text).
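The find/xargs convention referenced above is worth spelling out: `find -print0` separates pathnames with NUL so that names containing newlines round-trip safely. A sketch of consuming such output, i.e. what a from-lines with a configurable terminator (or a from-nul-terminated) would do; the helper name is hypothetical:

```python
# Split a byte stream on an arbitrary terminator, dropping the empty
# trailing chunk that a final terminator produces (find -print0 style).

def split_terminated(data: bytes, terminator: bytes = b"\x00"):
    parts = data.split(terminator)
    if parts and parts[-1] == b"":
        parts.pop()  # trailing terminator, not an empty record
    return parts

# A pathname containing a newline survives intact with NUL separators:
out = b"a.txt\x00dir/with\nnewline.txt\x00"
assert split_terminated(out) == [b"a.txt", b"dir/with\nnewline.txt"]
```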

krader1961 added a commit to krader1961/elvish that referenced this issue May 16, 2021
This change, combined with other changes to `edit:command-history`,
makes feeding its output into `fzf` extremely fast. Which makes it a
practical alternative to using the builtin Ctrl-R binding for searching
command history. It's also just generally useful to have efficient ways
to process "lines" that are null terminated rather than newline (or
cr-nl on Windows).

Resolves elves#1070
Related elves#1053
krader1961 added a commit to krader1961/elvish that referenced this issue May 17, 2021
This change, combined with other changes to `edit:command-history`,
makes feeding its output into `fzf` extremely fast. Which makes it a
practical alternative to using the builtin Ctrl-R binding for searching
command history. It's also just generally useful to have efficient ways
to process "lines" that are null terminated rather than newline (or
cr-nl on Windows).

Resolves elves#1070
Related elves#1053
@xiaq
Member

xiaq commented May 30, 2021

My opinion remains that this functionality should be supported by a different pair of commands. {from to}-terminated would be reasonable, and they can take a positional argument for the terminating character (e.g. to-terminated "\x00").
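The proposed pair is symmetric: one command terminates, the other splits, and they should round-trip. A sketch of those semantics (the command names come from the comment above; the Python signatures are illustrative, not Elvish's API):

```python
# Model of the proposed {from,to}-terminated pair, with the terminator
# as a positional argument.

def to_terminated(values, terminator="\x00"):
    """Append terminator to every value (cf. to-terminated "\\x00")."""
    return "".join(v + terminator for v in values)

def from_terminated(text, terminator="\x00"):
    """Split on terminator, dropping the empty trailing chunk."""
    parts = text.split(terminator)
    return parts[:-1] if parts and parts[-1] == "" else parts

vals = ["foo", "bar"]
encoded = to_terminated(vals, "\x00")
assert encoded == "foo\x00bar\x00"
assert from_terminated(encoded, "\x00") == vals   # round-trips
```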

@krader1961
Contributor Author

krader1961 commented May 31, 2021

@xiaq, I think adding another pair of commands is a mistake for a couple of reasons. First, despite what I wrote earlier, I no longer think there is any point in supporting arbitrary "line" terminators; I can't think of a single instance (at least in the recent past) where I have needed to process lines that were not null or (cr)nl terminated. Second, I think you're too hung up on a narrow definition of "line". You wrote earlier:

You can of course bend the language and generalize the concept of "line" - but would a new user be able to guess that? When a new user wants to add NUL to each string in a stream, would they look for to-lines? IMO the answer is a firm no.

I disagree. And I would bet a survey of Elvish users would agree with me. Would a new user who wants to add, or split on, NUL when writing or reading a byte stream know to look for from-terminated and to-terminated? That seems even less likely.

@krader1961
Contributor Author

Also, if we add the *-terminated variants, should they allow multi-character terminators so that you could split on, for example, "\r\n"? If not, why not?

@krader1961
Contributor Author

@xiaq wrote on IRC that the goal is to avoid conflicts between builtins and "widely used programs"; therefore, there isn't much risk in new {from to}-terminated commands. I agree with the sentiment, but it seems like the criterion is really "widely used programs... on Linux", given the references in that discussion solely to Linux sources of potential command conflict like https://manpages.debian.org/buster/. What about people using Elvish on macOS, Windows, Plan 9, etcetera? Also, what does "widely used" mean? The fact that a man page exists does not imply "widely used", no matter how loosely that term is defined. Unless, of course, "widely used" means "documented".

It still seems to me that introducing new commands in the builtin namespace should be avoided if possible; for example, by adding new behavior to an existing command that already does mostly what we want, differing only in some trivial detail such as whether a "line" is terminated by nl (newline, UNIX), crnl (carriage-return newline, Windows), or nul (widely used to avoid problems when a pathname or other string might include a newline character). Alternatively, the new builtin could be put in a namespace that is not implicitly searched to resolve a command.

@xiaq
Member

xiaq commented Jun 6, 2021

Please see #1070 (comment); the problem remains unaddressed: how would new users discover that a command to parse null-terminated strings is called from-lines?

@xiaq
Member

xiaq commented Jun 6, 2021

They might if something like an &eol option were used consistently in related contexts, such as the echo and from-lines commands.

I don't think that's a good idea either - this is now twisting the definition of "end of lines".

Too, while newbies might not think of a line being terminated by anything other than newline (or crnl if you're on Windows), experienced UNIX users are certainly familiar with the idea -- think of commands like find and xargs, which provide an option for handling nul terminated "lines" (or blocks of text).

They do but would they call them "lines"?

I understand that it makes sense for you to generalize the concept of "lines" and it may even seem quite elegant. If Elvish were some kind of proprietary text-processing software that specializes in text parsing and generation, I would be convinced that it's better to say "lines" instead of saying "strings that are terminated by some character" every time.

But Elvish is not that, and this is really an unnecessary deviation from common sense. You can argue with me but you can't argue with common sense.

@krader1961
Contributor Author

...how new users would discover that a command to parse null-terminated strings is called from-lines.

How will new users discover that they should use from-terminated "\x00"? That is, I don't think either is particularly obvious for this situation but at least someone who has used Elvish for a non-trivial amount of time is likely to have used from-lines and thus wonder if it could be used for handling NUL terminated lines.

I understand that it makes sense for you to generalize the concept of "lines" and it may even seem quite elegant.

Because a "line" in this context is merely a sequence of bytes with a particular sentinel: "\n" on UNIX, "\r\n" on Windows, and sometimes NUL ("\x00") or even RS ("\x1e"), although the latter is archaic and probably doesn't warrant supporting.

I'm going to cave in since it seems I'm not going to be able to convince you of the correctness of my perspective. 😞 Getting this feature implemented is more important for the improved performance of common scenarios than the specific names of the commands that provide the feature.

@xiaq
Member

xiaq commented Jun 7, 2021

How will new users discover that they should use from-terminated "\x00"? That is, I don't think either is particularly obvious for this situation but at least someone who has used Elvish for a non-trivial amount of time is likely to have used from-lines and thus wonder if it could be used for handling NUL terminated lines.

If I look at a list of functions - for example, by pressing Tab at the beginning of a command - and I see to-terminated, I will think that this command can probably be used to generate null-terminated strings. If I see to-lines, I wouldn't think of that.

I'm going to cave in since it seems I'm not going to be able to convince you of the correctness of my perspective.

Whether or not to generalize the concept of "lines" to null-terminated strings is a design choice, and "correctness" is always relative to some extent when it comes to design choices. It depends on the designer's past experience of what worked and what didn't, and the designer's intended usage scenarios. I'm sure that this generalization makes a ton of sense to you, and I've acknowledged that your perspective is perfectly valid. It's just not my perspective, even after taking your arguments into account. It's certainly not great when people draw different conclusions when presented with the same argument. But that's part of human nature.

But there's also something in general I wish to talk about. I find it really counter-productive for you to frame design discussions as one person convincing the other that their perspective is "correct". Elvish is not the "correct" shell. It's never intended to be. I wouldn't even claim that Elvish is more "correct" than POSIX shell. It just makes much more sense to me. Obviously it is not entirely subjective - I listen to other people's arguments and try to understand them to the extent allowed by my experience. But at the end of the day I draw my conclusion and that's what ends up being committed into Elvish's codebase.

If you are on a quest to find the shell that feels most correct to you, the respectful thing to do is to start your own project, not to try to argue me into building Elvish in your preferred manner. The Elvish project is not some kind of Unix shell elysium, and I'm no gatekeeper to such elysium. The project is a small garden I take care of as a hobby. I open it up for other people to enjoy because that's part of the hobby, but ultimately I decide which flowers are grown in the garden.

If you do decide to continue contributing to Elvish, please understand that your "correct" shell is never going to be exactly the same as my "correct" shell, and please respect everyone else's individual agency.

krader1961 added a commit to krader1961/elvish that referenced this issue Jun 8, 2021
This change makes feeding output to commands which handle NUL terminated
"lines" (e.g., `fzf -read0` or `xargs -0`) extremely fast compared to
using an explicit Elvish loop that does `print $val"\x00"`. Similarly for
handling input from commands that produce NUL terminated "lines" (e.g.,
`find . -print0`) compared to an Elvish loop using `read-upto "\x00"`.

Resolves elves#1070
Related elves#1053
xiaq pushed a commit that referenced this issue Jun 10, 2021
This change makes feeding output to commands which handle NUL terminated
"lines" (e.g., `fzf -read0` or `xargs -0`) extremely fast compared to
using an explicit Elvish loop that does `print $val"\x00"`. Similarly for
handling input from commands that produce NUL terminated "lines" (e.g.,
`find . -print0`) compared to an Elvish loop using `read-upto "\x00"`.

Resolves #1070
Related #1053