remove words containing non-alpha characters

4

Given a text file where each line holds space-separated strings followed by a tab-separated integer, I'd like to get rid of all words that contain non-alpha characters, keeping only the words made up entirely of alpha characters, plus the tab and the integer after it.

My attempts, like the ones below, didn't work. What I was trying to express is something like: "replace anything within word boundaries that starts and ends with 0 or more whatever and there is at least one :digits: or :punct: in between".

sed 's/\b.*[:digits::punct:]+.*\b//g'
sed 's/\b.*[^:alpha:]+.*\b//g'

What am I missing? See sample input data below.

Thank you!

Input:

asdf 754m   563  
a2a 754mm   291  
754n    463  
754 ppp 1409  
754pin  4652  
pin pin 462  
754pins 652  
754 ppp </D>    1409  
<D> 754pin  4652  
pi$n pin    462  
754/p ins   652  
754 pp+p    1409  
754 p=in    4652  

Desired output:

asdf    563  
    291  
    463  
ppp 1409  
    4652  
pin pin 462  
    652  
 ppp    1409  
    4652  
 pin    462  
 ins    652  
    1409  
    4652  

dnkb

Posted 2010-05-05T00:57:40.093

Reputation: 347

Answers

0

Basically this becomes a long list of things to delete:

sed -r 's/(^[[:digit:]]+\b|\b[[:digit:]]+[[:punct:]]*[[:alpha:]]+\b|\b[[:alpha:]]+[[:digit:]]+[[:alpha:]]+\b|\b[[:alpha:]]+[[:punct:]]+[[:alpha:]]+\b|[[:punct:]]+.*[[:punct:]]+)//g' file

Delete these:

  • digits at the beginning of the line
  • words that start with digits, may include punctuation, and end in alpha characters
  • words that consist of alpha chars, followed by digits, followed by alpha
  • words that consist of alpha, punct, alpha
  • sequences that begin and end with punct chars
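An alternative to enumerating the patterns to delete is to split each line into words and keep only the all-alpha ones. The following is a minimal sketch rather than a drop-in replacement for the sed command above. It assumes the words sit before the first tab and the integer is the last tab-separated field. The function name `clean_words` is made up for illustration. Unlike the desired output in the question, it collapses the spacing between surviving words to a single space.

```shell
# Hypothetical helper: keep only purely alphabetic words before the tab,
# then print a tab and the last tab-separated field (the integer).
clean_words() {
  awk -F'\t' '{
    # Split the first field on runs of whitespace.
    n = split($1, w, " ")
    out = ""
    for (i = 1; i <= n; i++)
      # Keep a word only if every character is an ASCII letter.
      if (w[i] ~ /^[A-Za-z]+$/)
        out = out (out == "" ? "" : " ") w[i]
    print out "\t" $NF
  }' "$@"
}
```

Usage would be `clean_words file`, or reading from a pipe, for example `printf 'pi$n pin\t462\n' | clean_words`.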

Paused until further notice.

Posted 2010-05-05T00:57:40.093

Reputation: 86 075

0

Wouldn't this best be solved with regular expressions?

([A-Z]+tab[0-9]+ ) or something like that

Daisetsu

Posted 2010-05-05T00:57:40.093

Reputation: 5 195

Not exactly because there may be multiple space separated strings, some of which I'd like to keep, while others need to go away. – dnkb – 2010-05-05T02:07:02.663

0

So if I understand correctly, you want to keep words that consist of either all letters or all digits, and nothing else. If so, something like this should work:

(^|\s+)([A-Za-z]+|\d+)((?=\s)|(?=$))

(Use with the multiline flag)

When run over your example input, it will match every word that is either all digits or all letters. This is easier than finding every word that doesn't match; however, it means you extract the valid data rather than replace the invalid data.
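The regex above uses lookahead, which sed and POSIX grep do not support, so here is a rough sketch of the same extraction idea in awk. It treats each whitespace-separated token as a word and keeps it if it is all letters or all digits. The function name `extract_tokens` is hypothetical. Like the regex, it also keeps all-digit words such as a leading 754, and it joins the kept tokens with single spaces.

```shell
# Hypothetical helper: print only the tokens that are entirely letters
# or entirely digits, mirroring ([A-Za-z]+|\d+) bounded by whitespace.
extract_tokens() {
  awk '{
    out = ""
    # Default field splitting treats spaces and tabs alike.
    for (i = 1; i <= NF; i++)
      if ($i ~ /^([A-Za-z]+|[0-9]+)$/)
        out = out (out == "" ? "" : " ") $i
    print out
  }' "$@"
}
```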

VoDurden

Posted 2010-05-05T00:57:40.093


Thank you, but it's not exactly what I was looking for. I only want to keep the number after the tab at the end of the lines. – dnkb – 2010-05-05T14:10:12.570