Statement of the problem:
An input file has text, indented by zero or more tab characters.
Specifically, each line in the input is one of these:
- Blank, or
- Zero or more tabs (up to a limit; see below),
followed by a character that is neither a space nor a tab
(followed by zero or more of any character).
There are no lines that
- Begin with zero or more tabs, followed by a space.
(This implies that there are no lines that begin with a space.)
or
- Consist entirely of one or more tabs (and nothing else).
or
- Begin with more than a specified number of tabs.
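These assumptions can be checked before any tagging is attempted. The following is only a sketch: the function name check_input, the wording of its messages, and the default limit of two tabs are invented here for illustration and are not part of the solution further down.

```shell
# Hypothetical validator: prints one message per offending line and
# returns a non-zero status if any line breaks the stated assumptions.
check_input() {
    awk -v max_tabs="${1:-2}" '
        $0 == "" { next }                              # blank lines are allowed
        {
            tabs = 0                                   # count the leading tabs
            while (substr($0, tabs + 1, 1) == "\t") tabs++
            first = substr($0, tabs + 1, 1)            # first character after the tabs
            if (first == "")           problem = "consists only of tabs"
            else if (first == " ")     problem = "has a space where text should start"
            else if (tabs > max_tabs)  problem = "begins with " tabs " tabs"
            else next                                  # the line is acceptable
            printf "line %d %s\n", NR, problem
            bad = 1
        }
        END { exit bad + 0 }
    '
}

printf 'aaa\n\t bbb\n' | check_input 2 || echo "input rejected"
```

With the sample input shown, the second line is a tab followed by a space, so the sketch reports line 2 and then prints "input rejected".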
The input shall be logically decomposed into groups of lines
that are all either
- Blank, or
- Indented with the same number of tabs.
Blank lines shall be passed through to the output, unmodified.
A list of tags shall be specified; e.g., x, y, and zz.
A group of (non-blank) lines that are indented with zero tabs
(i.e., not indented) shall be bracketed by <x> and </x>.
A group of lines that are indented with one tab
shall be bracketed by <y> and </y>.
A group of lines that are indented with two tabs
shall be bracketed by <zz> and </zz>.
(Lines will not be indented with more than two tabs.)
The first line of a group (of non-blank line(s))
shall have the beginning tag inserted between the tabs and the text.
The last line of a group shall have the end tag
appended at the end of the text.
A group may consist of a single line,
so the first line may also be the last line.
All lines of a group other than the first shall be additionally indented
(with spaces inserted between the tabs and the text)
by the width of the beginning tag.
For example (using ―→ to represent a tab), this INPUT:
aaa
―→ Once upon a midnight dreary,
―→ while I pondered, weak and weary,
Quoth the Raven, “Nevermore.”
―→ ―→ The quick brown fox
―→ ―→ jumps over the lazy dog.
―→ It was a dark and stormy night.
―→ Suddenly a shot rang out.
shall be translated into this OUTPUT:
<x>aaa</x>
―→ <y>Once upon a midnight dreary,
―→ while I pondered, weak and weary,</y>
<x>Quoth the Raven, “Nevermore.”</x>
―→ ―→ <zz>The quick brown fox
―→ ―→ jumps over the lazy dog.</zz>
―→ <y>It was a dark and stormy night.
―→ Suddenly a shot rang out.</y>
Solution:
Obviously, we don’t quite know what to do with a line of input
until we’ve read the next line.
This problem is typically addressed by saving the content of one line
to be processed after the next one has been read.
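Stripped of the tagging logic, the one-line delay might look like the sketch below. The function name mark_last and the " <- last" marker are made up for illustration; the only point is that each line is printed one step late, so the final line can be treated differently once the input runs out.

```shell
# Hypothetical helper: print every line unchanged except the last,
# which gets a marker appended.
mark_last() {
    awk '
        NR > 1 { print saved }      # the previous line is now known not to be the last
               { saved = $0 }       # hold the current line until we see what follows
        END    { if (NR > 0) print saved " <- last" }
    '
}

printf 'one\ntwo\nthree\n' | mark_last
```

This prints "one" and "two" as they are, and "three <- last" for the final line. The full program applies the same delay to the last line of each group rather than to the last line of the file.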
So, here it is:
awk '
    BEGIN {
        num_tags = split("x y zz", tags)
        for (i=1; i<=num_tags; i++)
        {
            len = length(tags[i]) + 2
            tag_pad[i] = ""
            for (j=1; j<=len; j++) tag_pad[i] = tag_pad[i] " "
        }
    }
    {
        if (NF == 0)
            indent_num = 0
        else
        {
            indent_num = index($0, $1)
            indent_str = substr($0, 1, indent_num-1)
            restOfLine = substr($0, indent_num)
        }
        if (indent_num != saved_indent_num && saved != "")
        {
            print saved "</" tags[saved_indent_num] ">"
            saved = ""
        }
        if (NF == 0)
            print
        else if (indent_num > num_tags)
        {
            errmsg = "Error: line %d has an indent level of %d.\n"
            printf errmsg, NR, indent_num > "/dev/stderr"
            exit 1
        }
        else if (indent_num == saved_indent_num)
        {
            print saved
            saved = indent_str tag_pad[indent_num] restOfLine
        }
        else
            saved = indent_str "<" tags[indent_num] ">" restOfLine
        saved_indent_num = indent_num
    }
    END {
        if (saved != "")
            print saved "</" tags[saved_indent_num] ">"
    }
'
The BEGIN block initializes the tags (x, y, and zz)
by splitting a space-separated string.
The tag_pad array contains enough spaces
to match the width of the tags (including the < and >):
tag_pad[1] and tag_pad[2] are three spaces; tag_pad[3] is four spaces.
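To confirm those widths, the padding loop can be run on its own. This is just a sketch: the wrapper function show_pads and the bracketed output format are additions for illustration, while the loop body is the one from the BEGIN block.

```shell
# Hypothetical wrapper: build the padding strings and report their widths.
show_pads() {
    awk 'BEGIN {
        num_tags = split("x y zz", tags)
        for (i = 1; i <= num_tags; i++) {
            len = length(tags[i]) + 2                  # +2 for the "<" and ">"
            tag_pad[i] = ""
            for (j = 1; j <= len; j++) tag_pad[i] = tag_pad[i] " "
            printf "%s: [%s] (%d spaces)\n", tags[i], tag_pad[i], length(tag_pad[i])
        }
    }'
}

show_pads
```

It reports 3 spaces for x and y, and 4 spaces for zz.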
Upon reading a line of input, we parse it.
If it has no fields (NF == 0), it must be blank
(since we have specified that no line consists entirely of spaces and tabs),
so set indent_num to 0.
Otherwise, measure the indent
by finding the location of $1 (the first word) in $0 (the entire line).
index returns a value starting at 1,
so this is actually one more than the number of whitespace characters
before the first non-whitespace character
(and, remember, we are assuming that these are all tabs).
This is lucky, because now indent_num
corresponds to entries in the tags and tag_pad arrays.
Then we break the line apart into an indent_str (the whitespace)
and restOfLine (everything after the indent).
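The parsing step can be tried in isolation. In this sketch the function name show_indent and the report format are invented for illustration; the index and substr calls are the same ones the solution uses.

```shell
# Hypothetical helper: report the measured indent and the text
# of each non-blank line.
show_indent() {
    awk '
        NF > 0 {
            indent_num = index($0, $1)               # 1-based position of the first word
            restOfLine = substr($0, indent_num)      # the text without its leading tabs
            printf "%d tab(s), indent_num=%d, text=%s\n", indent_num - 1, indent_num, restOfLine
        }
    '
}

printf 'aaa\n\tbbb ccc\n\t\tddd\n' | show_indent
```

For lines with zero, one, and two tabs, indent_num comes out as 1, 2, and 3, which are exactly the subscripts of x, y, and zz in the tags array.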
Now we rely on saved information.
If this line has a different indentation from the previous one,
we’re starting a new group.
If there is a saved line,
write it out, with the appropriate ending tag at the end of the line.
If the current line is blank, just print it.
Check whether the current indentation level is too high, and bail if it is.
If the current indentation is the same as the previous one,
this is a continuation line of an already-started group,
so just print the saved (previous) line,
and build a new saved string that is the current line
with padding as wide as the current tag inserted between the indent and the text.
Otherwise, we’re starting a new group,
so build a saved string that is the current line
with the beginning tag (itself) inserted between the indent and the text.
When we get to the end of the input, end the current group as we did before.
If you want blank lines to be added (as you have shown), you should say so. And your example is inconsistent: compare the two <b> stanzas. – Scott – 2017-04-29T08:39:21.490
I have edited the question based on your opinion. – Ramaprakasha – 2017-04-29T09:37:55.740
Thanks for fixing the inconsistent indentation of the tags vs. the data, but you missed my point about blank lines. Your input dataset has ten lines: line 1 is “111”, lines 2 and 3 are both “ 222”, line 4 is blank, etc. Your output is twelve lines long, not only preserving the pre-existing blank lines, but adding one at 1½ (between “222” and “111”) and another at 8½ (between “333” and “222”). Do you want for that to happen? (Or is it a typo in your question?) – Scott – 2017-04-29T18:15:59.307
Oh Sorry. Yes it was a typo. I have corrected it now. – Ramaprakasha – 2017-04-30T03:01:05.283
Since I wanted tagging of text I changed numerals to a, b, c. – Ramaprakasha – 2017-05-01T08:14:55.353
So you want to remove all <> characters and all characters in between them while retaining line feeds and carriage returns and all other characters and line positions minus the characters removed only affecting the line positions... this helps clarify what you were asking I think. Probably a simple regex with sed or grep or something perhaps for a good starting point. – Pimp Juice IT – 2017-05-06T05:28:15.007
No it is opposite of it. I have now clarified by adding input and output headings. – Ramaprakasha – 2017-05-06T06:04:46.280
You might be able to write a script but it's going to be quite complex. You would need to remember what the last tag was you opened and read the file line by line. – Seth – 2017-05-06T09:21:14.533
Have you seen my answer to your question? Does it work for you? (See What should I do when someone answers my question?) – Scott – 2017-05-12T02:38:48.983
Sorry I had a crash and had to install my OS anew so couldn't answer you. I have accepted your answer. It works perfectly – Ramaprakasha – 2017-05-12T03:32:37.847