How to select and copy columns that are not space-separated to a new file?

1

I have seen your (jcbermu) script on how to select columns and copy them to a new file. It works perfectly if all the columns are space-delimited (any number of spaces works fine). I would like to improve your script so that it works on a file with this special format:

  • First row: the headers for each column (animalid snp1 snp2 etc.), space-separated.
  • Second row onward: the first column has the animalid, then a space, then all the SNP genotypes (one character per SNP) run together as a single block (the genotypes are not space-separated).

So, how do I tell the script that after row 1 and after column 1, the column data are not separated (each character is one field)?

Example of my file:

animalid snp1 snp2 snp3 snp4 snp5 snp6 snp7 ....... snp8000
fish1 2020121 ....... 2
fish2 0202102 ....... 1
fish3 1201201 ....... 2
.
.
Fish1500 2010211 ...... 1

Roger Vallejo

Posted 2017-10-26T17:46:36.853

Reputation: 11

lines 2 and beyond have only 2 space-separated words, right? – glenn jackman – 2017-10-26T18:17:18.097

Hi Glenn: yes, after line 1 and after column 1 each field has only one character and they are not separated. – Roger Vallejo – 2017-10-26T19:32:56.433

Answers

0

With awk, insert a space after every character of the second field on every line after the header:

awk 'NR >= 2 {gsub(/./, "& ", $2)} 1' <<DATA
foo bar baz
abc 1234
def 5678
DATA

outputs

foo bar baz
abc 1 2 3 4
def 5 6 7 8
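
To go one step further and copy only some of the SNP columns to a new file, the genotype block does not have to be split first: `substr` can pick out single characters by position. The following is a minimal sketch, not a tested solution for the real 8000-SNP file. The function name `select_snps`, the comma-separated list of 1-based SNP positions, and the file names are all assumptions made for illustration, and the sample values are made up.

```shell
# select_snps is a hypothetical helper: $1 = comma-separated 1-based SNP
# positions (an assumed way of choosing columns), $2 = input file.
select_snps() {
    awk -v cols="$1" '
        BEGIN { n = split(cols, want, ",") }
        {
            out = $1
            for (k = 1; k <= n; k++) {
                # Header row: SNP k is field k+1.
                # Data rows: SNP k is character k of field 2.
                out = out " " (NR == 1 ? $(want[k] + 1) : substr($2, want[k], 1))
            }
            print out
        }
    ' "$2"
}

# Made-up sample data in the layout described in the question
cat > sample-genotypes.txt <<'DATA'
animalid snp1 snp2 snp3 snp4
fish1 2020
fish2 0212
DATA

# Keep animalid plus SNPs 2 and 4, and write them to a new file
select_snps 2,4 sample-genotypes.txt > selected.txt
```

The new file then holds space-separated columns, so any tool that expects space-delimited input can read it.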

glenn jackman

Posted 2017-10-26T17:46:36.853

Reputation: 18 546

0

If I understood correctly, the input file looks like this:

cat fish-genotypes
animalid snp1 snp2 snp3 snp4 snp5 snp6 snp7 snp8000
fish1 20201212
fish2 02021021
fish3 12012012
Fish1500 20102111

and the output should look like this:

animalid snp1 snp2 snp3 snp4 snp5 snp6 snp7 snp8000
fish1    2    0    2    0    1    2    1    2       
fish2    0    2    0    2    1    0    2    1       
fish3    1    2    0    1    2    0    1    2       
Fish1500 2    0    1    0    2    1    1    1

Bash script with CLI tools:

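#!/bin/bash
# Width of the first column: length of the longest animal ID (wc -L is a GNU extension).
vert=$(cut -d' ' -f1 fish-genotypes | wc -L)
while read -r -a line; do
    printf "%-$((vert+1))s" "${line[0]}"
    # Header row: its second field contains letters (snp1), unlike a genotype block.
    if [[ "${line[1]}" =~ [[:alpha:]] ]]; then
        header=("${line[@]:1}")
        echo "${header[*]}"
        continue
    fi
    # Data row: print one genotype character under each header, padded to the header's width.
    for ((i = 0; i < ${#header[@]}; i++)); do
        printf "%-$((${#header[i]}+1))s" "${line[1]:i:1}"
    done
    echo
done < fish-genotypes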
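
The same aligned layout can also be produced with awk alone, which avoids a shell loop over every character and may be noticeably faster on a file with thousands of SNPs. This is only a sketch under the same assumptions about the input format. The function name `align_genotypes` and the sample file are hypothetical, and unlike the Bash script it strips trailing spaces from each line.

```shell
# align_genotypes is a hypothetical helper: $1 = input file (read twice).
align_genotypes() {
    awk '
        # First pass (NR == FNR): find the longest value in column 1.
        NR == FNR { if (length($1) > idw) idw = length($1); next }
        # Second pass, header row: remember each header width.
        FNR == 1 {
            line = sprintf("%-" idw "s", $1)
            for (i = 2; i <= NF; i++) { w[i - 1] = length($i); line = line " " $i }
            nsnp = NF - 1
            print line
            next
        }
        # Second pass, data rows: one genotype character per header column,
        # padded to that header width.
        {
            line = sprintf("%-" idw "s", $1)
            for (i = 1; i <= nsnp; i++)
                line = line " " sprintf("%-" w[i] "s", substr($2, i, 1))
            sub(/ +$/, "", line)   # drop trailing padding
            print line
        }
    ' "$1" "$1"
}

# Made-up sample data in the layout described in the question
cat > sample-align.txt <<'DATA'
animalid snp1 snp2 snp10
fish1 201
Fish1500 012
DATA

align_genotypes sample-align.txt > aligned.txt
```

Passing the file name twice lets awk measure the ID column on the first read and print on the second, so no external `cut` or `wc` call is needed.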

Paulo

Posted 2017-10-26T17:46:36.853

Reputation: 606