arithmetic operation on a nested column in a tab delimited file

awk '$1=="ID" {print $0 "\tINFOextra"; next}; NF { info=$6; gsub(/.*;DP4=|;MQ=.*/, "", info); split(info, a, /,/); print $0 "\t" (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4])}' inputfile.txt > outputfile.txt
# then check the content outputfile.txt and rename it if important

As you can see, the solution is very similar to the answer given to your previous question, and not longer at all. It's because awk is extremely fine-tuned to serve in problems like yours. I do suggest having a look at its manual page (http://linux.die.net/man/1/awk) to see how simple it is compared to other (and more general-purpose) languages.

If multiple input files are to be processed into multiple output files with the appropriate names, the options are the following:

create a loop in shell and starting an awk process for each file, one by one
make awk write the output in files whose names are dependant on the name of the current input file which information is stored in the FILENAME variable of awk that is automatically set during the process. Inside the awk code redirection can be used with the same syntax and similar outcome as the shell uses:

awk '$1=="ID" {print $0 "\tINFOextra" > FILENAME ".out"; next}; NF { info=$6; gsub(/.*;DP4=|;MQ=.*/, "", info); split(info, a, /,/); print $0 "\t" (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4]) > FILENAME ".out"}' inputfile1.txt inputfile2.txt

Here each instance of inputfileN.txt will have a corresponding inputfileN.txt.out file. FILENAME is a simple string so any kind of manipulation of the output files are valid.

When the specification gets complicated such that the additional field must appear at an inner position (not at the beginning or the end), creating a subroutine (called function in awk) should be considered to create the output lines. That function here iterates through all the fields, print them as usual, but where the additional field should appeare, it writes that after the n-1-th and before the n-th field, making it the n-th one. Now, it's worth to put the awk code in its own file:

$ cat bio.awk


function myprint( str) {
   for (i=1; i<=NF; ++i) {
      printf "%s", $i  > FILENAME ".out"
      if (i==44)
         printf "\t%s", str  >> FILENAME ".out"
      if (i!=NF)
         printf "\t"  >> FILENAME ".out"
   }
   print ""  >> FILENAME ".out"
}

$1=="ID" {
   myprint( "INFOextra")
   next
}

NF {
   info=$6
   gsub(/.*;DP4=|;MQ=.*/, "", info)
   split(info, a, /,/)
   myprint( (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4]) )
}

Calling it then results in a shorter and clean command line:

awk -f bio.awk inputfile1.txt inputfile2.txt

Gombai Sándor

Posted 2016-03-12T21:05:27.690

Reputation: 3 325

Thanks a lot. Awk is awesome. I will explore it further. Seems to be of very useful for my tasks. How to print the column name 'INFOextra' instead of '0'? Secondly, if I want to run this over multiple input files in batch, {input}_1.txt to {input}_100.txt, how could I do it? – panbar – 2016-03-12T22:35:28.540

It should (must) write "INFOextra" literally in the first line if you did not forget to put the quotes as written, that is like this "\tINFOextra". If you wish to process several input files to one output, then just write each input files (using wild cards at your will) BEFORE the "> outputfile.txt" part. If you want N output files for N inputs, then I will have to modify the answer. Should I? – Gombai Sándor – 2016-03-12T22:52:29.647

Yes, I want N output files for N inputs separately. I tried several times with print $0 "\tINFOextra". However, it did not print the column name 'INFOextra' instead of '0' in the outfile. Otherwise it performs the arithmetics correctly and print the values in the last column. – panbar – 2016-03-12T23:07:37.750

Answer modified. I tried the code with all the 3 kinds awk I know to exist on linux but could not catch your 0-problem. If you can copy some sample in text (not image) somewhere, I may be able show you a picture of befores and afters. – Gombai Sándor – 2016-03-12T23:34:37.520

Thanks, the single input file single out file script is working fine and giving correct output for (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4]) including the column header. However, for the multiple input files and multiple output files script the (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4]) calculated values are wrong. I used same 3 input files for the single script and multiple script. – panbar – 2016-03-14T10:10:52.467

Without real samples, I really can't figure it out. – Gombai Sándor – 2016-03-14T10:19:27.227

Thanks, I found the issue. It was in the pattern gsub(/.;DP4=|;MQ=./, "", info). So, the script for multiple input and multiple out works with the following script awk '$1=="ID" {print $0 "\tINFOextra" > FILENAME ".out"; next}; NF { info=$6; gsub(/.;DP4=|;MQ=./, "", info); split(info, a, /,/); print $0 "\t" (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4]) > FILENAME ".out"}' inputfile1.txt inputfile2.txt – panbar – 2016-03-14T11:46:21.670

Really, I did not catch it while pasting that the formatting engine interpreted the asterisks instead if believing them to be the part of the code. But let's look at the bright side of life: since you could fix that, you started to taste the flavour of scripts like this. :) Nevertheless, I corrected it to show how what I want it to show. – Gombai Sándor – 2016-03-14T11:51:36.287

Gombai Sándor, Thanks a lot ! For getting me into the exciting world of AWK. I will explore more. – panbar – 2016-03-14T12:05:26.407

May be I can ask another point, if I want to paste the newly calculated field 'INFOextra' on column number 45, instead of the last column, how can I specify it? – panbar – 2016-03-14T12:35:24.983

arithmetic operation on a nested column in a tab delimited file

Answers