awk '$1=="ID" {print $0 "\tINFOextra"; next}; NF { info=$6; gsub(/.*;DP4=|;MQ=.*/, "", info); split(info, a, /,/); print $0 "\t" (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4])}' inputfile.txt > outputfile.txt
# then check the contents of outputfile.txt and rename it if needed
As you can see, the solution is very similar to the answer given to your previous question, and no longer. That is because awk is designed for exactly this kind of line- and field-oriented text processing. I do suggest having a look at its manual page (http://linux.die.net/man/1/awk) to see how simple it is compared to other, more general-purpose languages.
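To see what the gsub and split steps do in isolation, here is a small sketch that runs them on a single made-up INFO value (the sample string and the variable names info, s and ratio are assumptions for illustration, not taken from your data):

```shell
# Made-up sample of what field 6 might look like.
info='DP=20;DP4=3,1,2,2;MQ=60'

ratio=$(printf '%s\n' "$info" | awk '{
    s = $0
    gsub(/.*;DP4=|;MQ=.*/, "", s)    # strip everything around the DP4 list -> "3,1,2,2"
    split(s, a, /,/)                 # a[1]..a[4] now hold the four counts
    r = (a[3] + a[4]) / (a[1] + a[2] + a[3] + a[4])
    print r
}')

echo "$ratio"    # (2+2)/(3+1+2+2)
```

Note that the pattern relies on a `;` directly before `DP4=` and on `;MQ=` following the list, just as the one-liner above does.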
If multiple input files are to be processed into multiple output files with the appropriate names, the options are the following:
- create a loop in the shell and start an awk process for each file, one by one
- make awk write its output to files whose names depend on the name of the current input file; awk keeps that name in the FILENAME variable, which it sets automatically as it reads each file. Inside the awk code, redirection can be used with the same syntax as in the shell, and with a similar outcome:
awk '$1=="ID" {print $0 "\tINFOextra" > (FILENAME ".out"); next}; NF { info=$6; gsub(/.*;DP4=|;MQ=.*/, "", info); split(info, a, /,/); print $0 "\t" (a[3]+a[4])/(a[1]+a[2]+a[3]+a[4]) > (FILENAME ".out")}' inputfile1.txt inputfile2.txt
Here each inputfileN.txt gets a corresponding inputfileN.txt.out file. FILENAME is an ordinary string, so the output file name can be built from it in any way you like.
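As an illustration of building the output name from FILENAME, the following sketch replaces a trailing .txt with .out.txt instead of appending .out. The wrapper function add_ratio_renamed and the variable out are hypothetical names introduced only for this example:

```shell
# Hypothetical wrapper function; pass the input files as arguments.
add_ratio_renamed() {
    awk '
    FNR == 1 {                          # a new input file starts here
        if (out != "") close(out)       # finished with the previous output file
        out = FILENAME
        sub(/\.txt$/, "", out)          # drop a trailing .txt, if present
        out = out ".out.txt"            # inputfile1.txt -> inputfile1.out.txt
    }
    $1 == "ID" { print $0 "\tINFOextra" > out; next }
    NF {
        info = $6
        gsub(/.*;DP4=|;MQ=.*/, "", info)
        split(info, a, /,/)
        print $0 "\t" (a[3] + a[4]) / (a[1] + a[2] + a[3] + a[4]) > out
    }' "$@"
}

# Usage: add_ratio_renamed inputfile1.txt inputfile2.txt
```

Computing the name once per file (when FNR is 1) keeps the two print statements short, and closing the previous output avoids running out of open files when many inputs are given.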
When the specification gets more complicated, such that the additional field must appear at an inner position (not at the beginning or the end), consider writing a function that builds the output lines. The function here iterates through all the fields and prints them as usual, but where the additional field should appear, it writes that field after the (n-1)-th and before the n-th original field, making it the new n-th one. At this point it is worth putting the awk code in its own file:
$ cat bio.awk
# Print the current record to FILENAME.out, with str inserted after field 44.
function myprint(str,    i, out) {   # i and out are local variables
    out = FILENAME ".out"
    for (i = 1; i <= NF; ++i) {
        printf "%s", $i > out
        if (i == 44)
            printf "\t%s", str > out
        if (i != NF)
            printf "\t" > out
    }
    print "" > out
}

$1 == "ID" {
    myprint("INFOextra")
    next
}

NF {
    info = $6
    gsub(/.*;DP4=|;MQ=.*/, "", info)
    split(info, a, /,/)
    myprint((a[3] + a[4]) / (a[1] + a[2] + a[3] + a[4]))
}
Calling it then results in a shorter and cleaner command line:
awk -f bio.awk inputfile1.txt inputfile2.txt
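As a variation on bio.awk, here is a sketch that takes the insert position from a variable and prints to standard output, so that a shell loop (the first option listed earlier) decides the output file names. The file name insert_field.awk and the variable pos are made up for this example, and the fallback of appending the field when pos is larger than NF is my own assumption about what would be wanted:

```shell
# insert_field.awk is a hypothetical file name; pos must be set with -v.
cat > insert_field.awk <<'EOF'
# Print the current record with str inserted as field number pos.
function myprint(str,    i, line) {
    line = ""
    for (i = 1; i <= NF; ++i) {
        if (i == pos)
            line = line str "\t"
        line = line $i (i < NF ? "\t" : "")
    }
    if (pos > NF)                 # assumed fallback: append at the end
        line = line "\t" str
    print line
}

$1 == "ID" { myprint("INFOextra"); next }

NF {
    info = $6
    gsub(/.*;DP4=|;MQ=.*/, "", info)
    split(info, a, /,/)
    myprint((a[3] + a[4]) / (a[1] + a[2] + a[3] + a[4]))
}
EOF

# One awk process per input file; the shell chooses each output name.
for f in inputfile*.txt; do
    [ -e "$f" ] || continue       # skip the literal pattern if nothing matches
    awk -v pos=3 -f insert_field.awk "$f" > "$f.out"
done
```

Writing to standard output keeps the awk file free of any naming policy, at the cost of starting one awk process per input file.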
Please add your sample input to the question. I'm not going to waste my time retyping it to test my code. – glenn jackman – 2016-03-12T22:27:28.087