Insert a string or blank line after specific search criteria, in a loop

0

I'm wondering if someone could help me with a specific coding question. I have a DNA sequencing file that reads something like this (as an example):

Plate1A1_R1_AGTAGTACGACTAGCATCAGCATACGATCAGCATCAGCATCAG
Plate1A1_R1_GTAGATCGATGCATGCATGCTAGCTAGCTAGCTAGCTAGCTAA
Plate1A1_R1_AGCTAGCATCGATCGATGCTAGCATGCATCGATCGATGCATGC
Plate1A1_R2_AGCATCGATGCAGCATGCTAGCTAGCTAGCTAGCAGCTAGTCT
Plate1A1_R2_AGCATGCATCGATCGTAGCTAGCAGCGAGCGGCATCGATCGAT
Plate1A2_R1_CAGCTAGATGCATCGATCGATCGATCGATCGATGCTAGCTTAC
Plate1A2_R1_CAGTAGCATGCATGCATGCATGCATGCATCGATGCTAGCTAGC
Plate1A2_R1_ACAACGTAGCTAGCTAGCTACTACTAGTCATCATCGATGCTAG
Plate1A2_R1_CAGCTAGCTAGCTAGCTAGGCTACATCGATCGTAGCTAGTCGA
Plate1A2_R1_CAGTCAGCATGCTATCGATCGTAGCTAGTCATCGATGTAGTGA
....etc.

You can see that groups of consecutive lines share the same starting pattern (here: Plate1A1_R1, Plate1A1_R2, Plate1A2_R1). I'd like to place a blank line after each group, e.g.:

Plate1A1_R1_AGTAGTACGACTAGCATCAGCATACGATCAGCATCAGCATCAG
Plate1A1_R1_GTAGATCGATGCATGCATGCTAGCTAGCTAGCTAGCTAGCTAA
Plate1A1_R1_AGCTAGCATCGATCGATGCTAGCATGCATCGATCGATGCATGC

Plate1A1_R2_AGCATCGATGCAGCATGCTAGCTAGCTAGCTAGCAGCTAGTCT
Plate1A1_R2_AGCATGCATCGATCGTAGCTAGCAGCGAGCGGCATCGATCGAT

Plate1A2_R1_CAGCTAGATGCATCGATCGATCGATCGATCGATGCTAGCTTAC
Plate1A2_R1_CAGTAGCATGCATGCATGCATGCATGCATCGATGCTAGCTAGC
Plate1A2_R1_ACAACGTAGCTAGCTAGCTACTACTAGTCATCATCGATGCTAG
Plate1A2_R1_CAGCTAGCTAGCTAGCTAGGCTACATCGATCGTAGCTAGTCGA
Plate1A2_R1_CAGTCAGCATGCTATCGATCGTAGCTAGTCATCGATGTAGTGA

....etc.

This means I need to grab the first 11 characters of each line, compare them with the first 11 characters of the line below, and insert a blank line wherever the two differ.

I've tried sed, awk and 'while read line' loops, but I can't find a way to hold the first 11 characters in a variable that persists across consecutive lines of the file; the variable seems to be 'stuck' inside the processing of each individual line.

I'm hoping someone can help with a solution that reads the file through a redirect (<). The file has hundreds of lines of DNA sequence data in this format and a couple of hundred distinct 'plate names', which only become known as the script moves through the file line by line. For example:

while read line ; do echo "${line:0:11}" ; done < filename.txt

kehmsen

Posted 2016-03-25T01:17:27.683

Reputation: 1

Answers

1

I managed this using only bash commands:

p=; while IFS= read -r l; do [ -n "$p" ] && [ "${l:0:11}" != "${p:0:11}" ] && echo; printf '%s\n' "$l"; p="$l"; done < FileName

Here l is the current line and p is the previous one. The [ -n "$p" ] test prevents an initial blank line, and the && chain is a more compact way to write an if. IFS= and -r make read keep leading whitespace and backslashes intact.
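
The ${l:0:11} substring expansion is not part of POSIX sh. If the script might run under a plain sh (dash, for instance), the same idea can probably be sketched with prefix/suffix stripping instead. The function name split_groups and its variable names are made up for illustration, and lines shorter than 11 characters are assumed not to occur (they would all get an empty key here).

```shell
# Hypothetical helper: copy stdin to stdout, printing a blank line whenever
# the first 11 characters differ from those of the previous line.
split_groups() {
    prev_key=
    seen=0
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        rest=${line#???????????}   # the line minus its first 11 characters
        key=${line%"$rest"}        # hence: the first 11 characters
        if [ "$seen" -eq 1 ] && [ "$key" != "$prev_key" ]; then
            echo
        fi
        printf '%s\n' "$line"
        prev_key=$key
        seen=1
    done
}
```

It would be called with the same kind of redirect, e.g. split_groups < filename.txt > grouped.txt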

AFH

Posted 2016-03-25T01:17:27.683

Reputation: 15 470

0

An awk solution (similar to AFH's):

awk 'NR == 1 { prev = substr($0, 1, 11) }
     NR >  1 { pref = substr($0, 1, 11); if (prev != pref) printf "\n"; prev = pref }
     { print }' file

where

  • prev and pref stand for "previous" and "prefix"
  • NR is the number of records read so far (that is, the line number when there is only one input file)
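
A fixed width of 11 only works while every plate/read prefix has the same length; a hypothetical name such as Plate1A10_R1 would be 12 characters long. If that can happen, one possible variation is to key on the first two underscore-separated fields instead. This is only a sketch: it assumes every line has the form <plate>_<read>_<sequence>, and the wrapper name group_by_fields is invented for illustration.

```shell
# Assumption: every line looks like <plate>_<read>_<sequence>.
group_by_fields() {
    awk -F'_' '
        { key = $1 FS $2 }                   # e.g. "Plate1A1_R1"
        NR > 1 && key != prev { print "" }   # group changed: emit a blank line
        { print; prev = key }                # print the line, remember its key
    ' "$@"
}
```

It reads the files given as arguments, or standard input when there are none, e.g. group_by_fields file > grouped.txt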

Archemar

Posted 2016-03-25T01:17:27.683

Reputation: 1 547