Tips for golfing in sed

19

8

What general tips do you have for golfing in sed? I'm looking for ideas which can be applied to code-golf problems and which are also at least somewhat specific to sed (e.g. "remove comments" is not an answer).

Please post one tip per answer.

Toby Speight

Posted 2015-06-05T11:20:10.567

Reputation: 5 058

Not really a golfing tip (but still a tip for golfing): linefeeds consume just as many bytes as semicolons, so you can keep your code short and readable. – Dennis – 2015-07-22T17:48:31.577

Not a tip either, but a problem: I have GNU sed, yet the F command never worked. Does anyone know why? – seshoumara – 2016-08-30T16:15:44.630

@seshoumara F works on my GNU sed (Debian testing). It just prints - if reading from stdin, of course, but that's expected. What do you get from sed -e 'F;Q' /etc/hostname? – Toby Speight – 2016-08-30T16:28:37.640

@TobySpeight That gives this error: char 1: unknown command: F. I have to update sed maybe; what version do you have? The L command also doesn't work, but it's useless anyway since -l n exists. Everything else mentioned on GNU sed's site works. – seshoumara – 2016-08-30T16:44:20.453

@seshoumara, my results are on sed (GNU sed) 4.2.2. Just to check, you don't have POSIXLY_CORRECT set in your environment, do you? That would turn off most GNU extensions. – Toby Speight – 2016-08-30T16:49:29.130

@TobySpeight I have sed 4.2.1, so this could be why (updating). And no, I didn't have that set in my environment. Thanks for all the help today. – seshoumara – 2016-08-30T17:05:19.603

I opened the chat room bash, sed and dc for all who want to talk and ask about these languages. Let's make a community! – seshoumara – 2016-08-30T17:10:40.947

Answers

11

If you need to use labels then for sure you'll want your label names to be as short as possible. In fact taken to the extreme, you may even use the empty string as a label name:

:    # define label ""
p    # print pattern space
b    # infinite loop! - branch to label ""

Digital Trauma

Posted 2015-06-05T11:20:10.567

Reputation: 64 644

As of GNU sed 4.3, this behavior was removed. : now requires a label. – Kevin – 2017-02-17T20:06:49.487

Indeed, here is also the actual git commit link. I guess for PPCG this won't change much, since we are allowed to post answers for GNU sed 4.2.x, but it's good to know, though regrettably, that this trick won't officially work anymore.

– seshoumara – 2017-02-17T20:40:52.060

8

The GNU sed documentation describes the s command as "sed's Swiss Army Knife". But if all you want to do is replace all instances of one character with another, then the y command is what you need:

y/a/b/

is one char shorter than:

s/a/b/g
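
As a quick check, here is a small shell sketch comparing the two commands. It assumes a POSIX shell with sed on the PATH; the input strings and variable names are only illustrative. It also includes the in-place swap mentioned in the comments.

```shell
# y/// maps characters one-for-one, so no g flag is needed.
with_y=$(echo 'banana' | sed 'y/a/b/')
with_s=$(echo 'banana' | sed 's/a/b/g')
# y/// can also swap two characters in a single pass.
swapped=$(echo '1221' | sed 'y/12/21/')
echo "$with_y"    # bbnbnb
echo "$with_s"    # bbnbnb
echo "$swapped"   # 2112
```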

Digital Trauma

Posted 2015-06-05T11:20:10.567

Reputation: 64 644

It's also way faster, and can swap chars in place: y/12/21/ – mikeserv – 2015-12-23T18:12:12.260

6

When repeatedly replacing in a loop:

:loop
s/foo/bar/g
tloop

it's usually unnecessary to replace globally, as the loop will eventually replace all occurrences:

# GNU sed
:
s/foo/bar/
t

Note also the GNU-specific trick above: a label can have an empty name, saving more precious bytes (this works only in GNU sed versions before 4.3). In other implementations, a label cannot be empty, and branching without a label transfers flow to the end of the script (i.e. it ends the current cycle and starts the next one).
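
Here is a minimal sketch of the loop without the g flag. It uses a named label a so that it does not depend on the empty-label trick, and the input string and variable names are made up for illustration. Nested pairs show why the loop does the work: a single global pass cannot see matches that only appear after an earlier replacement.

```shell
# One global pass removes only the innermost pair.
one_pass=$(echo 'a((()))b' | sed 's/()//g')
# The loop repeats the non-global substitution until it fails.
looped=$(echo 'a((()))b' | sed ':a
s/()//
ta')
echo "$one_pass"   # a(())b
echo "$looped"     # ab
```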

Toby Speight

Posted 2015-06-05T11:20:10.567

Reputation: 5 058

The empty label name is GNU-specific; POSIX requires branches with no argument to jump to the end of the script (this seems to be the behavior in the BSDs and Busybox, and also in GNU sed if you don't add an empty :). – ninjalj – 2015-11-03T01:30:08.767

2

The nameless label was always a bug in GNU sed, not an extension, and in version 4.3 and higher this bug was, regrettably, fixed. See here.

– seshoumara – 2017-02-17T20:45:23.047

6

Consider using extended regex syntax (in GNU sed). The -r option costs one byte in scoring, but using it just once to eliminate the backslashes from a pair of \(...\) has already paid for itself.
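
A rough illustration of the saving, assuming GNU sed (the \+ in the basic version is itself a GNU extension, and -E is used here as the equivalent of -r). The example input and variable names are my own.

```shell
# Basic syntax: groups and + need backslashes.
basic=$(echo 'hello world' | sed 's/\([a-z]\+\) \([a-z]\+\)/\2 \1/')
# Extended syntax: six backslashes fewer in the pattern.
extended=$(echo 'hello world' | sed -E 's/([a-z]+) ([a-z]+)/\2 \1/')
echo "$basic"      # world hello
echo "$extended"   # world hello
```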

Toby Speight

Posted 2015-06-05T11:20:10.567

Reputation: 5 058

With the additional note that -r seems to be GNU sed specific. – manatwork – 2015-06-05T11:28:19.823

@manat - added (but it's a Community Wiki answer, so you could have edited yourself). – Toby Speight – 2015-06-05T11:51:57.517

Of course. I just didn't consider it part of the tip, only an additional note. – manatwork – 2015-06-05T11:55:17.033

And it keeps paying for itself when using +, ?, {} and | in regex matches, since no backslashes are needed either. – seshoumara – 2016-08-29T17:52:41.440

-E works as an alias to -r in many sed implementations if I remember correctly. – phk – 2019-02-28T15:08:39.557

5

There's no built-in arithmetic, but calculations can be done in unary or in unary-coded decimal (UCD). The following code converts decimal to UCD, with x as the unit and 0 as the digit separator:

s/[1-9]/0&/g
s/[5-9]/4&/g
y/8/4/
s/9/4&/g
s/4/22/g
s/[37]/2x/g
s/[26]/xx/g
s/[1-9]/x/g

and here's the conversion back to decimal:

s/0x/-x/g
s/xx/2/g
y/x/1/
s/22/4/g
s/44/8/g
s/81/9/g
s/42/6/g
s/21/3/g
s/61/7/g
s/41/5/g
s/-//g

These are both taken from an answer to "Multiply two numbers without using any numbers".
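
To see the two UCD scripts in action, here is a round-trip sketch in shell. The scripts are copied unchanged from above; storing them in the variables to_ucd and from_ucd is just my own convenience, and a POSIX shell with sed is assumed.

```shell
to_ucd='s/[1-9]/0&/g
s/[5-9]/4&/g
y/8/4/
s/9/4&/g
s/4/22/g
s/[37]/2x/g
s/[26]/xx/g
s/[1-9]/x/g'

from_ucd='s/0x/-x/g
s/xx/2/g
y/x/1/
s/22/4/g
s/44/8/g
s/81/9/g
s/42/6/g
s/21/3/g
s/61/7/g
s/41/5/g
s/-//g'

# Each decimal digit d becomes a 0 followed by d copies of x.
ucd=$(echo 12 | sed "$to_ucd")
back=$(echo "$ucd" | sed "$from_ucd")
echo "$ucd"    # 0x0xx
echo "$back"   # 12
```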

Plain old unary can be converted using this pair of loops from this answer to "{Curly Numbers};", where the unit is ;. I've used v and x to match Roman for 5 and 10; b comes from "bis".

# unary to decimal
:d
/;/{
s/;;;;;/v/g
s/vv/x/g
/[;v]/!s/x\+/&0/
s/;;/b/g
s/bb/4/
s/b;/3/
s/v;/6/
s/vb/7/
s/v3/8/
s/v4/9/
y/;bvx/125;/
td
}

# Decimal to unary
:u
s/\b9/;8/
s/\b8/;7/
s/\b7/;6/
s/\b6/;5/
s/\b5/;4/
s/\b4/;3/
s/\b3/;2/
s/\b2/;1/
s/\b1/;0/
s/\b0//
/[^;]/s/;/&&&&&&&&&&/g
tu

Toby Speight

Posted 2015-06-05T11:20:10.567

Reputation: 5 058

...and if you have to use either of these, you've almost certainly already lost the code golf, though you might still be competitive with Java answers ;-) Still fun to use though. – Digital Trauma – 2015-06-05T17:36:30.153

The conversion from plain unary to decimal gives wrong answers for unary input equivalent of decimal form X0X, for example 108. The line responsible for this is /[;v]/!s/\b/0/2, which needs to be changed to /[;v]/!s:x\+:&0: for it to work. See here.

– seshoumara – 2017-04-06T14:20:18.277

@seshoumara, your link seems to be an empty page. But it's entirely plausible that I made an error when extracting that code from the referenced answer, so I'll just apply your fix. – Toby Speight – 2017-04-06T14:48:52.963

The link loads correctly, but I was expecting something other than a grey page with "TIO" and something that looks like the Ubuntu logo - is that what's intended? And I was referring to the second of the answers I referenced (58007), as that's where the plain-unary sample originated.

– Toby Speight – 2017-04-06T15:30:33.950

The TIO link should have contained the corrected code, plus an example input, 108 in unary. On running the code you should have seen the correct result 108, and not 180, as previously generated by that now fixed line of code. Updating the referenced answer is entirely up to you. This is a community wiki. – seshoumara – 2017-04-06T15:52:49.420

4

Expanding upon this tip answer, regarding the conversions between decimal and plain unary number formats, I present the following alternative methods, with their advantages and disadvantages.

Decimal to plain unary: 102 + 1(r flag) = 103 bytes. I counted \t as a literal tab, as 1 byte.

h
:
s:\w::2g
y:9876543210:87654321\t :
/ /!s:$:@:
/\s/!t
x;s:-?.::;x
G;s:\s::g
/\w/{s:@:&&&&&&&&&&:g;t}

Try it online!

Advantage: it is 22 bytes shorter and, as a bonus, it works with negative integers as input.

Disadvantage: it overwrites the hold space. However, since it's more likely that you'd need to convert the input integer right at the start of the program, this limitation is rarely felt.

Plain unary to decimal: 102 + 1(r flag) = 103 bytes

s:-?:&0:
/@/{:
s:\b9+:0&:
s:.9*@:/&:
h;s:.*/::
y:0123456789:1234567890:
x;s:/.*::
G;s:\n::
s:@::
/@/t}

Try it online!

Advantage: it is 14 bytes shorter. This time both tip versions work for negative integers as input.

Disadvantage: it overwrites the hold space

For a complicated challenge, you'll have to adapt these snippets to work with other information that may exist in the pattern space or hold space, besides the number to convert. The code can be golfed more, if you know you only work with positive numbers or that zero alone is not going to be a valid input / output.

An example of an answer to such a challenge, where I created and used these snippets, is the Reciprocal of a number (1/x).

seshoumara

Posted 2015-06-05T11:20:10.567

Reputation: 2 878

For unary-to-decimal you can save two bytes by combining the last two substitutions: s:\n|@$::g. https://tio.run/##K05N@f@/2ErX3krNwIpL30G/2oqr2ComyVLbykANxNSz1HKw0gcyM6yBHC19KyuuSisDQyNjE1MzcwtLKzgLqL0CqERfTwuoxB3IismrcVCxskoHmVpS@/@/Awj8yy8oyczPK/6vWwQA

– Jordan – 2017-06-19T16:49:23.157

I had my own try at the decimal to unary converter. Here's 97 bytes :) Try it online! (also doesn't require -r, but with new consensus, flags do not count towards the bytecount anyways, and it doesn't mess up the hold space)

– user41805 – 2018-05-21T12:25:16.120

Actually if you change the last line from /\n/ta to /\n/t, you save 1 byte to get 96 – user41805 – 2018-05-22T10:22:57.090

@Cowsquack Thanks, 96 is great! Don't have time now, will look on it this weekend. – seshoumara – 2018-05-22T15:05:11.613

Sure, do send me a ping on chat then :) – user41805 – 2018-05-22T16:37:57.037

4

If not explicitly banned by the question, the consensus for this meta question is that numerical input may be in unary. This saves you the 86 bytes of decimal to unary as per this answer.

Digital Trauma

Posted 2015-06-05T11:20:10.567

Reputation: 64 644

Isn't that meta consensus for sed referring to plain old unary format? I have several answers where an input in UCD would help me, in case it's either way. – seshoumara – 2017-02-15T08:32:18.120

@seshoumara I meant unary, not UCD – Digital Trauma – 2017-02-15T16:07:10.600

Then the conversion from decimal to plain old unary saves you 126 bytes as per that answer you linked. The 86 bytes is for the conversion to UCD. – seshoumara – 2017-02-17T02:32:38.863

4

As mentioned in man sed (GNU), you can use any character as a delimiter for regular expressions by using the syntax

\%regexp%

where % is a placeholder for any character.

This is useful for commands like

/^http:\/\//

which are shorter as

\%^http://%

What is mentioned in the GNU sed manual but not in man sed is that you can change the delimiters of s/// and y/// as well.

For example, the command

ss/ssg

removes all slashes from the pattern space.
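
A small sketch of both delimiter tricks, assuming a POSIX shell with sed; the sample inputs and variable names are only for illustration.

```shell
# s with "s" as its delimiter: pattern "/", empty replacement, flag g.
no_slashes=$(echo '/usr/local/bin' | sed 'ss/ssg')
# Address with "%" as its delimiter: no need to escape the slashes.
matched=$(printf 'http://a\nftp://b\n' | sed -n '\%^http://%p')
echo "$no_slashes"   # usrlocalbin
echo "$matched"      # http://a
```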

Dennis

Posted 2015-06-05T11:20:10.567

Reputation: 196 637

3

Let's talk about the t and T commands. Although they are explained in the man page, it's easy to forget the details and accidentally introduce bugs, especially when the code gets complicated.

Man page statement for t:

If a s/// has done a successful substitution since the last input line was read and since the last t or T command, then branch to label.

Example showing what I mean: Let's say you have a list of numbers and you want to count how many negatives there are. Partial code below:

1{x;s/.*/0/;x}                   # initialize the counter to 0 in hold space
s/-/&/                           # check if number is negative
t increment_counter              # if so, jump to 'increment_counter' code block
b                                # else, do nothing (start a next cycle)

:increment_counter
#function code here

Looks OK, but it's not. If the first number is positive, that code will still think it was negative, because the jump done via t for the first line of input is performed regardless, since there was a successful s substitution when we initialized the counter! The correct version is: /-/b increment_counter.

If this seemed easy, you could still be fooled when doing multiple jumps back and forth to simulate functions. In our example the increment_counter block of code would surely use a lot of s commands. Returning with b main might cause another check in "main" to fall into the same trap. That is why I usually return from code blocks with s/.*/&/;t label. It's ugly, but useful.
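
The trap can be reproduced with a cut-down version of the example. This is only a sketch: the label is shortened to n, the "function body" is replaced by a hypothetical s command that tags the line, and GNU sed is assumed for the one-line 1{...} block.

```shell
# Buggy: the s in the initialisation already sets the flag tested by t.
buggy='1{x;s/.*/0/;x}
s/-/&/
tn
b
:n
s/$/ is negative/'

# Fixed: an address match does not depend on the substitution flag.
fixed='1{x;s/.*/0/;x}
/-/bn
b
:n
s/$/ is negative/'

buggy_out=$(printf '5\n-3\n' | sed "$buggy")
fixed_out=$(printf '5\n-3\n' | sed "$fixed")
echo "$buggy_out"   # "5 is negative" and "-3 is negative"
echo "$fixed_out"   # "5" and "-3 is negative"
```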

seshoumara

Posted 2015-06-05T11:20:10.567

Reputation: 2 878

2

I know this is an old thread, but I just found those clumsy decimal to UCD converters, with almost a hundred bytes, some even messing up the hold space or requiring special faulty sed versions.

For decimal to UCD I use (68 bytes; former best posted here 87 bytes)

s/$/\n9876543210/
:a
s/\([1-9]\)\(.*\n.*\)\1\(.\)/\3x\2\1\3/
ta
P;d

UCD to decimal is (also 68 bytes; former best posted here 96)

s/$/\n0123456789/
:a
s/\([0-8]\)x\(.*\n.*\)\1\(.\)/\3\2\1\3/
ta
P;d
  • \n in the replacement is not portable. You can use a different character instead and save two bytes, but you'll need more bytes to remove the appendix instead of P;d; see next remark. Or, if your hold space is empty, do G;s/$/9876543210/ without byte penalty.
  • If you need further processing, you'll need some more bytes for s/\n.*// instead of P;d.
  • You could save two bytes each on those buggy old GNU sed versions, by using the empty label.
  • No, you can't portably save those six backslashes: POSIX extended regular expressions don't define backreferences (GNU sed does accept them with -r, as an extension).
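
For reference, a round-trip sketch of the two lookup-table converters above, run from a shell. It assumes GNU sed (because of \n in the replacement); the variable names dec_to_ucd and ucd_to_dec are mine.

```shell
dec_to_ucd='s/$/\n9876543210/
:a
s/\([1-9]\)\(.*\n.*\)\1\(.\)/\3x\2\1\3/
ta
P;d'

ucd_to_dec='s/$/\n0123456789/
:a
s/\([0-8]\)x\(.*\n.*\)\1\(.\)/\3\2\1\3/
ta
P;d'

# Each loop pass decrements (or increments) one digit via the lookup table.
ucd=$(echo 12 | sed "$dec_to_ucd")
dec=$(echo "$ucd" | sed "$ucd_to_dec")
echo "$ucd"   # 0x0xx
echo "$dec"   # 12
```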

Philippos

Posted 2015-06-05T11:20:10.567

Reputation: 121

There are no decimal to UCD and back converters posted in this thread that mess the hold space or require faulty sed versions. – seshoumara – 2017-11-11T08:49:21.130

Your own answer from April 6th uses the hold space and will only run with old sed versions that violate the POSIX standard. – Philippos – 2017-11-11T09:08:17.433

I'm not doing decimal to UCD conversions! Read the thread again carefully. UCD means that 12 is converted to 0x0xx (what your answer calculates), while plain unary (what my answer calculates) means that 12 is converted to xxxxxxxxxxxx. I chose @ as the symbol, but you get the idea. And further, on PPCG one doesn't need to adhere to the POSIX standard. – seshoumara – 2017-11-11T09:15:11.327

If it pleases you, sheriff – Philippos – 2017-11-11T09:55:32.207

2

Read the whole input at once with -z

Often you need to operate on the whole input at once instead of one line at a time. The N command is useful for that:

:
$!{N;b}

...but usually you can skip it and use the -z flag instead.

The -z flag makes sed use NUL (\0) as its input line separator instead of \n, so if you know your input won’t contain \0, it will read all of the input at once as a single “line”:

$ echo 'foo
> bar
> baz' | sed -z '1y/ao/eu/'
fuu
ber
bez

Try it online!
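
A side-by-side sketch of the two approaches, assuming a GNU sed recent enough to support -z. The N loop is written with a named label and separate -e fragments purely for portability of the example; the variable names are mine. One difference worth knowing: with -z the final newline of the input is part of the pattern space.

```shell
# Classic loop: gather all lines with N, then join them.
joined_n=$(printf 'a\nb\nc\n' | sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\n/,/g')
# -z: the whole input is one "line", including its trailing newline.
joined_z=$(printf 'a\nb\nc\n' | sed -z 's/\n/,/g')
echo "$joined_n"   # a,b,c
echo "$joined_z"   # a,b,c,
```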

Jordan

Posted 2015-06-05T11:20:10.567

Reputation: 5 001

2

Append a newline in one byte

The G command appends a newline and the contents of the hold space to the pattern space, so if your hold space is empty, instead of this:

s/$/\n/

You can do this:

G

Prepend a newline in three bytes

The H command appends a newline and the contents of the pattern space to the hold space, and x swaps the two, so if your hold space is empty, instead of this:

s/^/\n/

You can do this:

H;x

This will pollute your hold space, so it only works once. For two more bytes, though, you could clear your pattern space before swapping, which is still a savings of two bytes:

H;z;x
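
A quick shell check that the short forms produce the same pattern space as the substitutions, assuming GNU sed (for \n in the replacement and for z). The trailing echo x is only there so that the command substitution does not strip the newlines being compared; the variable names are illustrative.

```shell
app_s=$(echo abc | sed 's/$/\n/'; echo x)
app_g=$(echo abc | sed 'G'; echo x)
pre_s=$(echo abc | sed 's/^/\n/'; echo x)
pre_h=$(echo abc | sed 'H;x'; echo x)
pre_z=$(echo abc | sed 'H;z;x'; echo x)
[ "$app_s" = "$app_g" ] && echo 'G appends a newline'
[ "$pre_s" = "$pre_h" ] && echo 'H;x prepends a newline'
[ "$pre_s" = "$pre_z" ] && echo 'H;z;x prepends a newline'
```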

Jordan

Posted 2015-06-05T11:20:10.567

Reputation: 5 001

2

Instead of clearing the pattern space with s/.*//, use the z command (lowercase) if you go with GNU sed. Besides the lower byte count, it has the advantage that, unlike the d command, it won't start the next cycle, which can be useful in certain situations.
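
A small comparison of the three commands, assuming a GNU sed that has the z command; the follow-up substitution and the variable names are just for illustration.

```shell
# s/.*// and z both empty the pattern space and carry on with the script.
cleared_s=$(echo 'abc' | sed 's/.*//;s/^/new/')
cleared_z=$(echo 'abc' | sed 'z;s/^/new/')
# d ends the cycle at once, so the second command never runs.
deleted=$(echo 'abc' | sed 'd;s/^/new/')
echo "$cleared_s"    # new
echo "$cleared_z"    # new
echo "[$deleted]"    # []
```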

seshoumara

Posted 2015-06-05T11:20:10.567

Reputation: 2 878

May also be of benefit if you have invalid multi-byte sequences (which aren't matched by .). – Toby Speight – 2016-08-30T17:08:10.393

1

Empty regexes are equivalent to the previously encountered regex

(thanks to Riley for discovering this from an anagol submission)

Here is an example where we are tasked with creating 100 @s in an empty buffer.

s/$/@@@@@@@@@@/;s/.*/&&&&&&&&&&/ # 32 bytes
s/.*/@@@@@@@@@@/;s//&&&&&&&&&&/  # 31 bytes

The second solution is 1 byte shorter and uses the fact that empty regexes are filled in with the last encountered regex. Here, for the second substitution, the last regex was .*, so the empty regex here will be filled with .*. This also works with regexes in /conditionals/.

Note that it is the previously encountered regex, so the following would also work.

s/.*/@@@@@@@@@@/;/@*/!s/$/@/;s//&&&&&&&&&&/

The empty regex gets filled with @* instead of $ because s/$/@/ is never reached.
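
Both variants can be checked from a shell by feeding sed a single empty line and measuring the result. This is only a sketch; the variable names are mine.

```shell
long=$(echo | sed 's/$/@@@@@@@@@@/;s/.*/&&&&&&&&&&/')
# The empty regex in the second s reuses .* from the first one.
short=$(echo | sed 's/.*/@@@@@@@@@@/;s//&&&&&&&&&&/')
echo "${#long} ${#short}"   # 100 100
```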

user41805

Posted 2015-06-05T11:20:10.567

Reputation: 16 320

Yes, good answer. I've even made regexes longer so that they can be re-matched like this (thus making the program shorter).

– Toby Speight – 2018-06-22T11:53:03.487

1

In sed, the closest thing to a function that you can have is a label. A function is useful because you can execute its code multiple times, thus saving a lot of bytes. In sed however you would need to specify the return label and as such you can't simply call this "function" multiple times throughout your code the way you would do it in other languages.

The workaround I use is to store a flag in one of the two buffers (pattern space or hold space), which is then used to select the return label. This works best when the function code only needs a single buffer (the other one).

Example showing what I mean: taken from a project of mine to write a small game in sed

# after applying the player's move, I overwrite the pattern space with the flag "P"
s/.*/P/
b check_game_status
:continue_turn_from_player
#code

b calculate_bot_move
:return_bot_move
# here I call the same function 'check_game_status', but with a different flag: "B"
s/.*/B/
b check_game_status
:continue_turn_from_bot
#code (like say 'b update_screen')

:check_game_status   # this needs just the hold space to run
#code
/^P$/b continue_turn_from_player
/^B$/b continue_turn_from_bot

The labels should of course be golfed to just one letter; I used full names for a better explanation.
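
A runnable toy version of the same pattern, as a sketch only: the hypothetical "function" f appends one ! to the hold space, and the flags P and B in the pattern space select the return labels p and q. None of these names come from the game project.

```shell
call_twice='s/.*/P/
bf
:p
s/.*/B/
bf
:q
x
b
:f
x
s/$/!/
x
/^P$/bp
/^B$/bq'

# f runs twice, so the hold space ends up holding two exclamation marks.
result=$(echo 'hi' | sed "$call_twice")
echo "$result"   # !!
```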

seshoumara

Posted 2015-06-05T11:20:10.567

Reputation: 2 878

0

Mostly useless step:

y|A-y|B-z|

This will only translate A to B and y to z (... and - to - ;), but nothing else, so

sed -e 'y|A-y|B-z|' <<<'Hello world!'

will just return:

Hello world!

You could ensure this will be useless, for example, by using it on lower-case hexadecimal values (containing only 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e or f).
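
To make the point concrete, here is a short sketch contrasting sed's y with tr, which does expand ranges. It assumes a POSIX shell with sed and tr; the variable names are illustrative.

```shell
# y treats "A-y" as the three literal characters A, - and y.
unchanged=$(echo 'Hello world!' | sed 'y|A-y|B-z|')
literal=$(echo 'A-y' | sed 'y|A-y|B-z|')
# tr expands the ranges, shifting every character from A to y by one.
ranged=$(echo 'Hello' | tr 'A-y' 'B-z')
echo "$unchanged"   # Hello world!
echo "$literal"     # B-z
echo "$ranged"      # Ifmmp
```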

F. Hauri

Posted 2015-06-05T11:20:10.567

Reputation: 2 654

Is this something you found out the hard way?! ;-) – Toby Speight – 2015-09-15T19:06:22.290

I like useless scripts: sed '; ;/s/b;y|A-y|B-z|;s ;s/ //; ; ;' <<<'Hello world' (Why does this not suppress the space?) – F. Hauri – 2015-09-15T19:19:00.167