Sed Script to Capitalize "I"s in a Text File

2

I am trying to create a sed command that capitalizes the pronoun I in a text file. For example "i like dogs." should be "I like dogs." So far I have:

sed 's/ i / I /g'

This fails in a number of scenarios, for example when there is punctuation around the i.

Here's a list of scenarios that I have thought of that the command should be able to handle:

  • There are multiple 'i's on one line of text. I think this can be addressed just by having the g flag at the end.
  • The 'i' has punctuation around it. For example a comma or period after it, or a quote or parenthesis before or after it.
  • The 'i' is the first or last character on the line, so you can't just check for whitespace or punctuation around it.
  • Any regular 'i's in a word are left alone. For example "firefighter" shouldn't be turned into "fIrefIghter".
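To make the scenarios above concrete, here is a small sketch that runs the original command over a few sample lines. The function name and the sample sentences are made up for illustration; the sed expression is the one from the question.

```shell
# The asker's original command, wrapped in a function (hypothetical name).
naive_capitalize() {
  sed 's/ i / I /g'
}

printf '%s\n' 'you and i like dogs.' | naive_capitalize  # you and I like dogs.
printf '%s\n' 'i like dogs.'         | naive_capitalize  # unchanged: no space before the i
printf '%s\n' '(i think so)'         | naive_capitalize  # unchanged: "(" before the i
printf '%s\n' 'so do i.'             | naive_capitalize  # unchanged: "." after the i
printf '%s\n' 'firefighter'          | naive_capitalize  # unchanged, as desired
```

Only the first line is changed, because it is the only one where the i has a literal space on both sides.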

nickeb96

Posted 2018-10-05T18:47:53.690

Reputation: 123

Answers

5

Assuming you are using GNU sed, one way is

sed 's/\([[:space:]]\|[[:punct:]]\)i\([[:space:]]\|[[:punct:]]\)/\1I\2/g'

or something like that. This still misses a line starting with 'i like dogs', because there is no space before the pronoun. One way to fix this is

sed 's/\(^\|[[:space:]]\|[[:punct:]]\)i\([[:space:]]\|[[:punct:]]\)/\1I\2/g'

This still misses consecutive 'i's, as in "i i": the space between them is consumed by the first match, so it cannot also serve as the leading delimiter of the second. I can't think of any reason why this would occur in English text, except when someone mistakenly writes 'i i sir' when the correct phrase is 'aye aye sir'.

There are also rough edges if you use lowercase roman numerals. The sed script won't be able to tell whether 'i' is the pronoun or the roman numeral, and there is really no good solution to that one.
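A quick way to check the second command against the scenarios from the question is to pipe a few sample lines through it. This is only a sketch, assuming GNU sed; the function name and the sample sentences are made up for illustration.

```shell
# Hypothetical wrapper around the second command above (GNU sed, for \| in a basic regex).
delimited_capitalize() {
  sed 's/\(^\|[[:space:]]\|[[:punct:]]\)i\([[:space:]]\|[[:punct:]]\)/\1I\2/g'
}

printf '%s\n' 'i like dogs.'          | delimited_capitalize  # I like dogs.
printf '%s\n' '(i think so), i said.' | delimited_capitalize  # (I think so), I said.
printf '%s\n' 'firefighter'           | delimited_capitalize  # firefighter
printf '%s\n' 'i i sir'               | delimited_capitalize  # I i sir
printf '%s\n' 'so do i'               | delimited_capitalize  # so do i
```

The last line shows one more gap: the trailing group needs an actual character after the i, so an i that is the very last character on the line is left alone.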

user10354138

Posted 2018-10-05T18:47:53.690

Reputation: 166

A workaround to the i i case is to apply the transformation twice. This can be achieved by one command: sed -e 's…' -e 's…'. – Kamil Maciorowski – 2018-10-05T19:31:16.277

I was trying to avoid doing things twice but I suppose if push comes to shove that is the only way. – user10354138 – 2018-10-05T19:34:40.207
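The two-pass workaround from the comment could look roughly like this. It is a sketch, assuming GNU sed and reusing the second expression from the answer; the variable and function names are invented for the example.

```shell
# The answer's second expression, stored once so it can be applied twice.
i_expr='s/\(^\|[[:space:]]\|[[:punct:]]\)i\([[:space:]]\|[[:punct:]]\)/\1I\2/g'

# Hypothetical helper: run the same substitution twice in one sed invocation.
capitalize_twice() {
  sed -e "$i_expr" -e "$i_expr"
}

printf '%s\n' 'i i sir' | capitalize_twice  # I I sir
```

The first pass turns the line into "I i sir"; on the second pass the space after the new "I" is free to act as the leading delimiter, so the remaining i is matched.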

2

A simple solution (with GNU sed):

sed 's/\bi\b/I/g'

This is basically the same concept as the other answer: replace “i” with “I” when it’s not part of a larger word. \b does not seem to be mentioned in the sed man page, but it is explained in the GNU sed Manual:

\b

    Matches a word boundary; that is it matches if the character to the left is a “word” character and the character to the right is a “non-word” character, or vice-versa.
$ echo "abc %-= def." | sed 's/\b/X/g'
XabcX %-= XdefX.

The manual doesn’t say so explicitly, but the example shows that \b also matches at the beginning of a line that starts with a “word” character (and, by the same rule, at the end of a line that ends with one; note that there is no X after the final “.”). It doesn’t match any characters; it matches the null string between a “word” character and a “non-word” character (in either order), with the beginning and end of the line counting as “non-word” positions (so there it behaves like ^ and $). So we don’t have to capture (with \(\)) the characters on either side of the “i” and put them back with \1 and \2. And, since \b doesn’t consume any characters, this command works on i i (changing it to I I).
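As a rough check, the \b version can be run against the scenarios listed in the question. This sketch assumes GNU sed; the function name and sample sentences are made up for illustration.

```shell
# Hypothetical wrapper around the \b command (GNU sed extension).
capitalize_i() {
  sed 's/\bi\b/I/g'
}

printf '%s\n' 'i like dogs.'         | capitalize_i  # I like dogs.
printf '%s\n' '(i think so), said i' | capitalize_i  # (I think so), said I
printf '%s\n' 'i i sir'              | capitalize_i  # I I sir
printf '%s\n' "i'm sure i can"       | capitalize_i  # I'm sure I can
printf '%s\n' 'firefighter'          | capitalize_i  # firefighter
```

To process a file rather than a pipe, the same expression can be given a file argument, for example sed 's/\bi\b/I/g' input.txt > output.txt, where input.txt and output.txt are placeholder names.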

G-Man Says 'Reinstate Monica'

Posted 2018-10-05T18:47:53.690

Reputation: 6 509