awk, sed, or other text processing suggestions, please

1

I have the following repeating pattern of text that needs to be reformatted.

Normally this would be easy, even with a standard text editor, but in this case I need to expand the range given in the parentheses and enumerate each item in it.

Best I give an example:

"Gene Code (1A - 1F) D2 fragment, D74F"

I need to be able to have the final product look like this:

Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F

The snag is that the string inside the parentheses could be anything like 1A-1F, or 3D-3H, etc. That is the only part that varies. The number is the same on both sides of the range; only the letters need expanding, each paired with that number.

So some way of correlating the alphabet with the numbers is needed.

This looks like a mind-bender to me. Any help much appreciated. New to this, by the way.

jeffschips

Posted 2018-12-30T21:49:40.860

Reputation: 21

Is this performance-sensitive? An easy solution with a for loop would not be very fast. – Eugen Rieck – 2018-12-30T21:53:04.223

Answers

2

This bash script

#!/bin/bash

PART1=$(echo "$1" | sed 's/\(.*\)\s(.*/\1/')
PART3=$(echo "$1" | sed 's/.*)\(.*\)/\1/')
PART2=$(echo "$1" | sed 's/.*(\s*\(.*\)).*/\1/')

START=$(echo "$PART2" | sed 's/\s*-.*//')
END=$(echo "$PART2" | sed 's/.*-\s*//')

STARTNUM=$(echo "$START" | sed 's/^\(.\).*/\1/')
ENDNUM=$(echo "$END" | sed 's/^\(.\).*/\1/')
if test "$STARTNUM" '!=' "$ENDNUM"; then
    echo "Error: Numeral is different"
    exit 1
fi

STARTLETTER=$(echo "$START" | sed 's/^.\(.\).*/\1/')
ENDLETTER=$(echo "$END" | sed 's/^.\(.\).*/\1/')

OUTPUT=''
for LETTER in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ; do
    test "$LETTER" '==' "$STARTLETTER" && OUTPUT='yes'
    test -n "$OUTPUT" && echo "$PART1, $STARTNUM$LETTER,$PART3"
    test "$LETTER" '==' "$ENDLETTER" && OUTPUT=''
done

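will do what you need when called with the original text as $1, albeit not in a very performant way. Note that it takes the numeral and the letter to be the first and second character of each bound, so it assumes a single-digit numeral.

EDIT

As requested, a few words about the sed expressions: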

  • I isolate PART1 by taking everything before whitespace and an opening (
  • I isolate PART3 by taking everything after the closing )
  • I isolate PART2 by taking what is between ( and ), ignoring whitespace
  • START and END are isolated by the dash, again ignoring whitespace
  • Number and Letter are isolated by being first and second character

Eugen Rieck

Posted 2018-12-30T21:49:40.860

Reputation: 15 128

A breakdown of the sed expressions would be fantastic, looks like some sub-expressions, and a \s that does...? – Xen2050 – 2018-12-30T22:48:01.617

@Xen2050 The \s is just for robustness: Ignore or correctly process whitespace around the relevant parts. Everything else should be quite self-explaining. – Eugen Rieck – 2018-12-30T22:52:05.163

I wouldn't count on it being self-explaining to someone looking for "awk, sed, or basically anything," every hint helps +1 – Xen2050 – 2018-12-30T22:58:04.797

1

If GNU sed is available

sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/e' <<<'Gene Code (1A - 1F) D2 fragment, D74F'
Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F

If not, drop the e flag and pipe the generated command to a shell

sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/' <<<'Gene Code (1A - 1F) D2 fragment, D74F'|bash
Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F

(With sh and ksh the output is the same, provided the shell supports brace expansion such as {A..F}; dash, for example, does not.)
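To make it clearer what the shell is asked to run: for the sample line, the substitution rewrites the text into a single printf command, where \x27 becomes a single quote and {\3..\4} becomes a brace range. The command below is that generated text, written out by hand for illustration; bash is assumed because of the brace expansion.

```shell
#!/bin/bash
# The command text the sed substitution builds from
# 'Gene Code (1A - 1F) D2 fragment, D74F'.
# printf reuses its format once for each argument produced by {A..F}.
printf 'Gene Code, 1%s, D2 fragment, D74F\n' {A..F}
```

Running the sed command without the e flag and without the pipe prints this command instead of executing it, which is a handy way to check the pattern before trusting it with real data.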

Paulo

Posted 2018-12-30T21:49:40.860

Reputation: 606

1

A Perl way:

#!/usr/bin/perl
use feature 'say';

my $str = '"Gene Code (3D - 3H) D2 fragment, D74F"';
# get begin number, begin letter, end number, end letter
my ($bn,$bl,$en,$el) = $str =~ /\((.)(.) - (.)(.)\)/;
# loop from begin letter to end letter
for my $i ($bl .. $el) {
    # do the substitution and print
    ($_ = $str) =~ s/ \(.. - ..\)/, $bn$i,/ && say;
}

Output:

"Gene Code, 3D, D2 fragment, D74F"
"Gene Code, 3E, D2 fragment, D74F"
"Gene Code, 3F, D2 fragment, D74F"
"Gene Code, 3G, D2 fragment, D74F"
"Gene Code, 3H, D2 fragment, D74F"

Toto

Posted 2018-12-30T21:49:40.860

Reputation: 7 722

Thank you everyone for providing these great solutions. I'm really awed by the generosity and professionalism. It works! I didn't know sed was so powerful. Now I need to figure out how to pass over the entries that don't match this specific pattern. Thank you all and have a great New Year!! – jeffschips – 2018-12-31T22:33:10.520

@jeffschips: You're welcome. Feel free to mark one of the answers as accepted, see: https://superuser.com/help/someone-answers – Toto – 2019-01-01T11:00:34.597
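On the follow-up about passing over entries that don't match the pattern, here is one possible sketch in plain bash. It is not taken from any of the answers: expand_ranges is a hypothetical helper name, and it assumes one entry per line on standard input, uppercase letters, and the same number on both sides of the range. Lines that don't fit are printed unchanged.

```shell
#!/bin/bash
# expand_ranges: hypothetical helper that reads lines from stdin.
expand_ranges() {
    # Groups: 1 = text before, 2/4 = numbers, 3/5 = letters, 6 = text after.
    local re='^(.*) \(([0-9]+)([[:upper:]]) *- *([0-9]+)([[:upper:]])\)(.*)$'
    local line first last code letter
    while IFS= read -r line; do
        if [[ $line =~ $re && ${BASH_REMATCH[2]} == "${BASH_REMATCH[4]}" ]]; then
            # A leading quote makes printf yield the character code.
            printf -v first '%d' "'${BASH_REMATCH[3]}"
            printf -v last '%d' "'${BASH_REMATCH[5]}"
            # Note: a descending range (first > last) prints nothing.
            for ((code = first; code <= last; code++)); do
                # Turn the code back into a letter via an octal escape.
                printf -v letter "\\$(printf '%03o' "$code")"
                printf '%s, %s%s,%s\n' "${BASH_REMATCH[1]}" \
                    "${BASH_REMATCH[2]}" "$letter" "${BASH_REMATCH[6]}"
            done
        else
            # No usable range on this line: pass it through as-is.
            printf '%s\n' "$line"
        fi
    done
}
```

It would be called as, for example, expand_ranges < input.txt, where input.txt is a placeholder file name.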

0

A version that doesn't require looping and uses only four calls to sed. It relies on jot, a BSD utility that may need to be installed separately on Linux. Granted, my version doesn't check that the two numerics are equal; in fact, the second one is ignored and can even be omitted, as in "Gene Code (91K - Q) D2 fragment, D74F". The two bounds can also appear in either order: if the first bound is greater than the second, the output sequence is reversed.

$ cat foo
#!/usr/bin/env bash

# Script to expand $1 passed as:

# "Gene Code (91K - 91Q) D2 fragment, D74F"
# 
# into the output:
# 
# Gene Code, 91K, D2 fragment, D74F
# Gene Code, 91L, D2 fragment, D74F
# Gene Code, 91M, D2 fragment, D74F
# Gene Code, 91N, D2 fragment, D74F
# Gene Code, 91O, D2 fragment, D74F
# Gene Code, 91P, D2 fragment, D74F
# Gene Code, 91Q, D2 fragment, D74F


# Copy $1 into FMT_STRING, replacing the " (91K - 91Q)" bit with a ', %s,' 
# printf directive, such as 'Gene Code, %s, D2 fragment, D74F':

FMT_STRING="$(sed -e 's/ (.* - .*)/, %s,/' <<< "$1")"

# Parse the beginning and ending bounds and format them with just a 
# space between, such as '91K 91Q':

BOUNDS="$(sed -e 's/^[^(]*(\(.*\) - \(.*\)) .*/\1 \2/' <<< "$1")"

# Extract the (first) static numeric part from BOUNDS, e.g. '91'

NUMERIC="$(sed -e 's/[^0-9].*//' <<< "$BOUNDS")"

# remove all digits [0-9] from BOUNDS, e.g. 'K Q'
BOUNDS="$(sed -e 's/[0-9]//g' <<< "$BOUNDS")"

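# Replace the %s in FMT_STRING with the numeric part followed by a %c
# directive, e.g. 'Gene Code, 91%c, D2 fragment, D74F'.
# (This assumes $1 itself contains no % characters.)
FMT_STRING="$(printf "$FMT_STRING" "${NUMERIC}%c")"

# Let jot step from the first bound to the second, printing each
# character through the format. $BOUNDS is left unquoted on purpose so
# that it splits into two arguments.
jot -w "$FMT_STRING" - $BOUNDS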

Sample output:

$ ./foo "Gene Code (737L - 737X) D2 fragment, D74F"
Gene Code, 737L, D2 fragment, D74F
Gene Code, 737M, D2 fragment, D74F
Gene Code, 737N, D2 fragment, D74F
Gene Code, 737O, D2 fragment, D74F
Gene Code, 737P, D2 fragment, D74F
Gene Code, 737Q, D2 fragment, D74F
Gene Code, 737R, D2 fragment, D74F
Gene Code, 737S, D2 fragment, D74F
Gene Code, 737T, D2 fragment, D74F
Gene Code, 737U, D2 fragment, D74F
Gene Code, 737V, D2 fragment, D74F
Gene Code, 737W, D2 fragment, D74F
Gene Code, 737X, D2 fragment, D74F

Reversing the bounds reverses the output:

$ ./foo "Gene Code (737X - 737L) D2 fragment, D74F"
Gene Code, 737X, D2 fragment, D74F
Gene Code, 737W, D2 fragment, D74F
Gene Code, 737V, D2 fragment, D74F
Gene Code, 737U, D2 fragment, D74F
Gene Code, 737T, D2 fragment, D74F
Gene Code, 737S, D2 fragment, D74F
Gene Code, 737R, D2 fragment, D74F
Gene Code, 737Q, D2 fragment, D74F
Gene Code, 737P, D2 fragment, D74F
Gene Code, 737O, D2 fragment, D74F
Gene Code, 737N, D2 fragment, D74F
Gene Code, 737M, D2 fragment, D74F
Gene Code, 737L, D2 fragment, D74F

Jim L.

Posted 2018-12-30T21:49:40.860

Reputation: 669