How can I tag text based on its indentation?

Statement of the problem:

An input file has text, indented by zero or more tab characters. Specifically, each line in the input is one of these:

Blank, or
Zero or more tabs (up to a limit; see below), followed by a character that is neither a space nor a tab (followed by zero or more of any character).

There are no lines that

Begin with zero or more tabs, followed by a space. (This implies that there are no lines that begin with a space.)
or
Consist entirely of one or more tabs (and nothing else).
or
Begin with more than a specified number of tabs.

The input shall be logically decomposed into groups of lines that are all either

Blank, or
Indented with the same number of tabs.

Blank lines shall be passed through to the output, unmodified.

A list of tags shall be specified; e.g., x, y, and zz. A group of (non-blank) lines that are indented with zero tabs (i.e., not indented) shall be bracketed by <x> and </x>. A group of lines that are indented with one tab shall be bracketed by <y> and </y>. A group of lines that are indented with two tabs shall be bracketed by <zz> and </zz>. (Lines will not be indented with more than two tabs.)

The first line of a group (of non-blank line(s)) shall have the beginning tag inserted between the tabs and the text. The last line of a group shall have the end tag appended at the end of the text. A group may consist of a single line, so the first line may also be the last line. All lines of a group other than the first shall be additionally indented (with spaces inserted between the tabs and the text) by the width of the beginning tag.

For example (using  ―→  to represent a tab), this INPUT:

aaa
 ―→ Once upon a midnight dreary,
 ―→ while I pondered, weak and weary,

Quoth the Raven, “Nevermore.”

 ―→  ―→ The quick brown fox
 ―→  ―→ jumps over the lazy dog.
 ―→ It was a dark and stormy night.
 ―→ Suddenly a shot rang out.

shall be translated into this OUTPUT:

<x>aaa</x>
 ―→ <y>Once upon a midnight dreary,
 ―→    while I pondered, weak and weary,</y>

<x>Quoth the Raven, “Nevermore.”</x>

 ―→  ―→ <zz>The quick brown fox
 ―→  ―→     jumps over the lazy dog.</zz>
 ―→ <y>It was a dark and stormy night.
 ―→    Suddenly a shot rang out.</y>

Solution:

We can’t decide how to finish a line of input until we’ve read the next line, because only then do we know whether it is the last line of its group and needs an end tag. This problem is typically addressed by saving the content of one line to be processed after the next one has been read.
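The one-line delay can be seen in isolation in a much smaller program. The following is only a sketch, not part of the solution: the function name mark_run_ends and the " <END>" marker are made up for illustration, and "group" here simply means a run of identical lines.

```shell
# Hypothetical helper: append " <END>" to the last line of every run of
# identical lines, using a one-line delay buffer.
mark_run_ends() {
    awk '
        NR > 1 {
            # The previous line can only be finished now that we know
            # whether the current line continues its run.
            if ($0 != saved) print saved " <END>"
            else             print saved
        }
        { saved = $0 }                       # hold the current line back
        END { if (NR > 0) print saved " <END>" }
    '
}

printf 'a\na\nb\n' | mark_run_ends
# a
# a <END>
# b <END>
```

The END rule is needed for the same reason as in the full program: the last line of the input is still sitting in the buffer when the input runs out.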

So, here it is:

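awk '
    BEGIN {
        # Tag names by indent level: tags[1] is for zero tabs, tags[2] for one tab, ...
        num_tags = split("x y zz", tags)
        # tag_pad[i] is a run of spaces as wide as "<" tags[i] ">".
        for (i=1; i<=num_tags; i++)
            {
                len = length(tags[i]) + 2
                tag_pad[i] = ""
                for (j=1; j<=len; j++) tag_pad[i] = tag_pad[i] " "
            }
    }
    {
        # Parse the line: indent_num is the number of leading tabs plus one
        # (0 for a blank line).
        if (NF == 0)
                indent_num = 0
        else
            {
                indent_num = index($0, $1)
                indent_str = substr($0, 1, indent_num-1)
                restOfLine = substr($0, indent_num)
            }
        # The indent changed, so the saved line was the last of its group.
        if (indent_num != saved_indent_num  &&  saved != "")
            {
                print saved "</" tags[saved_indent_num] ">"
                saved = ""
            }
        if (NF == 0)
                print
        else if (indent_num > num_tags)
            {
                errmsg = "Error: line %d has an indent level of %d.\n"
                printf errmsg, NR, indent_num > "/dev/stderr"
                exit 1
            }
        else if (indent_num == saved_indent_num)
            {
                # Continuation line: flush the previous line, pad this one.
                print saved
                saved = indent_str   tag_pad[indent_num]    restOfLine
            }
        else
                # First line of a new group: insert the beginning tag.
                saved = indent_str "<" tags[indent_num] ">" restOfLine
        saved_indent_num = indent_num
    }
    END {
        # Close the group that was still open at end of input.
        if (saved != "")
                print saved "</" tags[saved_indent_num] ">"
    }
    '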

The BEGIN block initializes the tags (x, y, and zz) by splitting a space-separated string. The tag_pad array contains enough spaces to match the width of the tags (including the < and >): tag_pad[1] and tag_pad[2] are three spaces; tag_pad[3] is four spaces.
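As a quick check of those widths, here is a small sketch that prints each pad between brackets together with its length. It is not the code from the answer: the function name show_tag_pads is hypothetical, and it builds the pad with sprintf and a computed field width instead of the inner loop, which should give the same result.

```shell
# Hypothetical helper: show the padding that goes with each tag.
show_tag_pads() {
    awk -v list="$1" '
        BEGIN {
            n = split(list, tags, " ")
            for (i = 1; i <= n; i++) {
                len = length(tags[i]) + 2          # room for "<" and ">"
                pad = sprintf("%" len "s", "")     # an empty string padded to len spaces
                printf "%d [%s] %d\n", i, pad, length(pad)
            }
        }
    '
}

show_tag_pads "x y zz"
# 1 [   ] 3
# 2 [   ] 3
# 3 [    ] 4
```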

Upon reading a line of input, we parse it. If it has no fields (NF == 0), it must be blank (since we have specified that no line consists entirely of spaces and tabs), so set indent_num to 0. Otherwise, measure the indent by finding the location of $1 (the first word) in $0 (the entire line). index returns a value starting at 1, so this is actually one more than the number of whitespace characters before the first non-whitespace character (and, remember, we are assuming that these are all tabs). This is lucky, because now indent_num corresponds to entries in the tags and tag_pad arrays. Then we break the line apart into an indent_str (the whitespace) and restOfLine (everything after the indent).
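The indent measurement can be tried on its own. This is a minimal sketch under the same assumptions as the problem statement (indentation is tabs only, default field splitting); the function name indent_levels is made up for illustration.

```shell
# Hypothetical helper: report how each input line would be parsed.
indent_levels() {
    awk '
        NF == 0 { print "blank"; next }
        {
            # Position of the first word; one more than the number of tabs.
            indent_num = index($0, $1)
            printf "%d tab(s), indent_num=%d, text=%s\n",
                   indent_num - 1, indent_num, substr($0, indent_num)
        }
    '
}

printf 'aaa\n\tOnce upon\n\n\t\tThe quick\n' | indent_levels
# 0 tab(s), indent_num=1, text=aaa
# 1 tab(s), indent_num=2, text=Once upon
# blank
# 2 tab(s), indent_num=3, text=The quick
```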

Now we rely on saved information. If this line has a different indentation from the previous one, we’re starting a new group. If there is a saved line, write it out, with the appropriate ending tag at the end of the line.

If the current line is blank, just print it. Otherwise, check whether the current indentation level is too high, and exit with an error if it is. If the current indentation is the same as the previous one, this is a continuation line of an already-started group, so print the saved (previous) line, and build a new saved string that is the current line with tag_pad (as many spaces as the beginning tag is wide) inserted between the indent and the text. Otherwise, we’re starting a new group, so build a saved string that is the current line with the beginning tag itself inserted between the indent and the text.

When we get to the end of the input, end the current group as we did before.
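To experiment with other tag lists, the same logic can be wrapped in a shell function. This is a sketch rather than a replacement for the program above: the function name tag_by_indent and the variable tag_list are assumptions introduced here, and the padding is built with sprintf instead of a loop.

```shell
# Hypothetical wrapper; usage: tag_by_indent "x y zz" < input
tag_by_indent() {
    awk -v tag_list="$1" '
        BEGIN {
            num_tags = split(tag_list, tags, " ")
            for (i = 1; i <= num_tags; i++)
                tag_pad[i] = sprintf("%" (length(tags[i]) + 2) "s", "")
        }
        {
            if (NF == 0)
                indent_num = 0
            else {
                indent_num = index($0, $1)
                indent_str = substr($0, 1, indent_num - 1)
                rest = substr($0, indent_num)
            }
            # Indent changed: the saved line closes its group.
            if (indent_num != saved_indent_num && saved != "") {
                print saved "</" tags[saved_indent_num] ">"
                saved = ""
            }
            if (NF == 0)
                print
            else if (indent_num > num_tags) {
                printf "Error: line %d has an indent level of %d.\n", NR, indent_num > "/dev/stderr"
                exit 1
            }
            else if (indent_num == saved_indent_num) {
                print saved
                saved = indent_str tag_pad[indent_num] rest
            }
            else
                saved = indent_str "<" tags[indent_num] ">" rest
            saved_indent_num = indent_num
        }
        END {
            if (saved != "")
                print saved "</" tags[saved_indent_num] ">"
        }
    '
}

printf 'aaa\n\tOnce\n\twhile\n\nQuoth\n' | tag_by_indent "x y zz"
```

With that input, the first and last lines should come out as single-line <x> groups, and the two tab-indented lines as one <y> group with the second line padded by three spaces.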

Scott

Posted 2017-04-29T07:08:38.870

If you want blank lines to be added (as you have shown), you should say so. And your example is inconsistent — compare the two <b> stanzas. – Scott – 2017-04-29T08:39:21.490

I have edited the question based on your opinion. – Ramaprakasha – 2017-04-29T09:37:55.740

Thanks for fixing the inconsistent indentation of the tags vs. the data, but you missed my point about blank lines.   Your input dataset has ten lines: line 1 is “111”, lines 2 and 3 are both “    222”, line 4 is blank, etc.   Your output is twelve lines long, not only preserving the pre-existing blank lines, but adding one at 1½ (between “222” and “111”) and another at 8½ (between “333” and “222”).   Do you want for that to happen?   (Or is it a typo in your question?) – Scott – 2017-04-29T18:15:59.307

Oh Sorry. Yes it was a typo. I have corrected it now. – Ramaprakasha – 2017-04-30T03:01:05.283

Since I wanted tagging of text, I changed the numerals to a, b, c. – Ramaprakasha – 2017-05-01T08:14:55.353

So you want to remove all <> characters and all characters in between them while retaining line feeds and carriage returns and all other characters and line positions minus the characters removed only affecting the line positions... this helps clarify what you were asking I think. Probably a simple regex with sed or grep or something perhaps for a good starting point. – Pimp Juice IT – 2017-05-06T05:28:15.007

No it is opposite of it. I have now clarified by adding input and output headings. – Ramaprakasha – 2017-05-06T06:04:46.280

You might be able to write a script but it's going to be quite complex. You would need to remember what the last tag was you opened and read the file line by line. – Seth – 2017-05-06T09:21:14.533

Have you seen my answer to your question? Does it work for you? (See What should I do when someone answers my question?) – Scott – 2017-05-12T02:38:48.983

Sorry I had a crash and had to install my OS anew so couldn't answer you. I have accepted your answer. It works perfectly – Ramaprakasha – 2017-05-12T03:32:37.847
