Bash script with grep -w

2

I am using the following command: x.txt | grep -w 'in' and I am getting matches like: in, into, and so on.

I only want the answer: in

How should I modify the command?

NewBo

Posted 2012-04-08T12:18:36.213

Reputation: 21

You should rephrase your original question to specify that the problem has to do with UTF-8 specific characters. Also, look at "EDIT4" in my answer below where I use sed to work around the problem. – Daniel Andersson – 2012-04-08T14:11:00.253

Answers

2

First, the command should be

grep -w in x.txt

Your current pipe doesn't work, and it is unnecessary to cat the file just to pipe it. grep can read files directly.

Second, the -w does exactly what you want. From the man page:

-w, --word-regexp

Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
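
To make that definition concrete, here is a small sketch. The sample lines are made up for illustration: a hyphen or a period ends a word, while an underscore or a digit does not.

```shell
# Hypothetical sample lines, fed to grep on standard input.
printf '%s\n' 'in-file' 'in_file' 'in2' 'log in.' | grep -w in
# Only 'in-file' and 'log in.' are printed: '_' and '2' are word
# constituents, so 'in_file' and 'in2' do not contain 'in' as a whole word.
```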

Note that grep returns the complete lines in which the word occurs; that is what grep does. I mention it so that the output does not confuse you.

If you just want to return the word, as you say, you can do

grep -ow in x.txt

since -o returns only the matching part, but that does not seem very useful. What are you really trying to do?
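
A quick sketch of the difference between the two invocations, assuming a throwaway file (sample.txt is a hypothetical name, not the asker's file):

```shell
# Build a small sample file; sample.txt is a hypothetical name.
printf '%s\n' 'word in word' 'within word' 'word word' > sample.txt

# -w alone prints the whole line that contains the word.
grep -w in sample.txt

# Adding -o prints only the matched text itself.
grep -ow in sample.txt
```

The first command should print the line "word in word", the second just "in".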


EDIT: An explicit example:

$ cat test
word in word
within word
word word
$ grep -w in test
word in word

"within" is not matched.

EDIT2: Another example:

$ grep '\<in\>' test 
word in word

EDIT3: It turned out that the problem involves Swedish characters. I can reproduce this, even with the environment variable LANG set to sv_SE.UTF-8. https://stackoverflow.com/questions/9260293/egrep-accented-characters-not-recognised-as-part-of-a-word suggests using Perl for UTF-8 specific tasks as the easiest solution.


EDIT4: It seems I can use sed to get this working with Swedish characters:

$ cat test
word den word
avträden word
word word
$ sed -n '/\bden\b/p' test
word den word
$ sed -n '/\<den\>/p' test
word den word

It is a pragmatic solution, but hopefully it works for this task.
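
If the real file is the semicolon-separated word list quoted in the comments (den;it and avträden;privies), another option is to skip word boundaries and name the delimiters explicitly. A sketch under that assumption; wordlist.txt is a hypothetical file name:

```shell
# Recreate the two lines from the comments; \303\244 is UTF-8 'ä'.
printf 'den;it\navtr\303\244den;privies\n' > wordlist.txt

# Match 'den' only when it fills a whole field: preceded by the start of
# the line or ';', and followed by ';' or the end of the line.
grep -E '(^|;)den(;|$)' wordlist.txt
```

This should print only den;it, because the pattern never depends on whether grep considers ä a word character.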

Daniel Andersson

Posted 2012-04-08T12:18:36.213

Reputation: 20 465

What I meant was that I only want the line where the complete word is present, i.e. a line where "in" is present but not a line where "within" is present. – NewBo – 2012-04-08T12:41:41.770

Yes, but that is exactly what -w does. Does it not work? Give an explicit example where it does not work. I did a test case just now, and it works just as you want, from all I can tell. – Daniel Andersson – 2012-04-08T12:44:30.937

grep -w "den" ./sv_enb.txt gives the result den;it avträden;privies (I am using OSX) – NewBo – 2012-04-08T13:30:38.040

OSX could very well be relevant. Have you looked at the manual for grep on your system? Is -w described? Otherwise, you can replace -w in with '\<in\>', as the now-deleted answer tried to describe. It could also be some strange Unicode error on OSX, but try the above first. – Daniel Andersson – 2012-04-08T13:34:28.693

grep "<den>" ./sv_enb.txt gives den;it avträden;privies. When I read the man pages they are the same as other man pages for grep, no special information about the command in OSX (or OpenBSD). – NewBo – 2012-04-08T13:36:06.287

@NewBo: Post a minimal text file in your original question which exhibits the behavior you say. – Daniel Andersson – 2012-04-08T13:38:37.533

I am new to this forum and can't seem to find how to add a file. However, it's a UTF-8 txt file with the following two lines (to give you an example): den;it avträden;privies and the command is the same as in the samples I gave above. Could it be the Swedish characters? – NewBo – 2012-04-08T13:44:48.633

@NewBo: Yes, it sounds like it treats the ä as a word delimiter. It is most likely a UTF-8 issue. The easiest fix is probably to abandon grep for Perl in this case. – Daniel Andersson – 2012-04-08T13:49:49.093