Find repeated words in a text

4

4

One of the most common typos is to repeat the same word twice, as as here. I need an automatic procedure to remove all the repeated words in a text file. This should not be an unusual feature for a modern editor or spell-checker; for example, I remember that MS Word introduced it several years ago! Apparently, the default spell-checker on my OS (hunspell) can't do this, as it only finds words that are not in the dictionary.

It would be OK to have a solution valid for a specific text editor for Linux (pluma/gedit2 or Sublime Text), or a solution based on a bash script.

altroware

Posted 2014-11-22T22:46:59.257

Reputation: 176

Is Perl an acceptable alternative to bash? Because that'd be my first port of call. – Sobrique – 2014-11-22T23:01:31.927

@Sobrique Please, feel free to add it! I would favor bash-based answers though – altroware – 2014-11-23T01:13:07.370

You asked for a script to remove repeated words but you accepted an answer that just prints them and only recognizes even repetitions (it'd fail on abc foo foo foo def for example). If you still need to know how to do what you originally asked for then please do post a new question and tag it with awk. – Ed Morton – 2020-02-09T14:53:01.443

@EdMorton I’m actually happy with the solution, I still use it to find words repeated twice in a line. – altroware – 2020-02-09T15:09:48.053

sounds good, if you ever need more than that, just ask again and tag with awk. – Ed Morton – 2020-02-09T15:29:05.763

Answers

12

With GNU grep:

echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' |  grep -Eo '(\b.+) \1\b'

Output:

twice twice
as as
here here
123 123

Options:

-E: Interpret (\b.+) \1\b as an extended regular expression.

-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Regex:

\b: Is a zero-width word boundary.

.+: Matches one or more characters.

\1: The parentheses () mark a capturing group, and \1 is a back-reference that matches the same text the first capturing group matched.
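
To run the same pattern against a file instead of an echoed string, a minimal sketch might look like the following. The function name find_repeats and the sample text are made up for illustration; -n (prefix each match with its line number) is a standard grep option. It assumes GNU grep, since \b and back-references inside -E patterns are GNU extensions.

```shell
# Hypothetical helper (not part of the answer): list repeats found on a single line.
# Assumes GNU grep: \b and back-references in -E patterns are GNU extensions.
find_repeats() {
    grep -Eno '(\b.+) \1\b' "$@"
}

# Demo on a throwaway sample file.
tmp=$(mktemp)
printf '%s\n' 'same word twice twice here' 'the thesis is fine' 'as as before' > "$tmp"
find_repeats "$tmp"
rm -f "$tmp"
```

On this sample the demo should report "twice twice" on line 1 and "as as" on line 3, and nothing for "the thesis", because the trailing \b rejects a match that ends in the middle of a word.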


Reference: The Stack Overflow Regular Expressions FAQ

Cyrus

Posted 2014-11-22T22:46:59.257

Reputation: 4 356

Your grep command fails for the following type of example:

echo "the thesis" | grep -Eo '(\b.+\b) \1'

outputs: the the. grep -Eo '(\b.+) \1\b' seems to work though. Any idea why? – el_tenedor – 2015-03-24T14:38:28.077

@el_tenedor: Thank you for this hint. I've added a \b after second string to fix this substring bug. – Cyrus – 2015-03-24T18:21:42.393

Still, the second \b inside the parentheses seems to be redundant. Why do we need it? – el_tenedor – 2015-03-27T16:16:12.720

Yes, it's redundant and may be omitted. – Cyrus – 2015-03-27T17:42:51.547

I've updated my answer. – Cyrus – 2015-05-09T07:05:46.290

I was wondering whether this answer could be improved to cover the case where the repeated words are not on the same line but on consecutive lines, as in: same word twice\n twice – altroware – 2016-08-29T11:56:41.677

@Cyrus: thank you for this answer. I was wondering whether you could explain the parameters of grep to the uninitiated. – shuhalo – 2017-02-24T08:39:26.090

@shuhalo: I‘ve updated my answer. – Cyrus – 2017-02-24T18:01:16.590

@altroware Did you find a solution for when the repeated words are not on the same line? – om-ha – 2020-02-09T09:38:56.310

@om-ha No, I haven’t found one! – altroware – 2020-02-09T10:02:48.050

@altroware Done! You can see the solution here. I've edited an already-existing answer so you'll see the changes when they're approved. – om-ha – 2020-02-09T12:30:07.923

@om-ha Thanks! But the solution in the link does not seem suited to finding typos in a text: I wanted to spot words that are repeated consecutively. Thanks anyway! – altroware – 2020-02-09T13:01:29.700

@altroware: I suggest starting a new question. – Cyrus – 2020-02-09T13:30:21.027

Sounds plausible. Here it is: Find Duplicate/Repeated or Unique words spanning across multiple lines in a file. Feel free to edit or answer as well. – om-ha – 2020-02-09T13:58:46.007

@altroware Does my answer in my previous comment solve your issue? – om-ha – 2020-02-09T14:11:46.293

1

It sounds like something like this is what you want (using any awk in any shell on every UNIX box):

$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
    head = prev = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+/) ) {
        word = substr(tail,RSTART,RLENGTH)
        head = head substr(tail,1,RSTART-1) (word == prev ? "" : word)
        tail = substr(tail,RSTART+RLENGTH)
        prev = word
    }
    print head tail
}

$ cat file
the quick quick brown
fox jumped jumped
jumped over the lazy
lazy dogs back

$ awk -f tst.awk file
the quick  brown
fox jumped
 over the lazy
 dogs back

but please ask a new question with more truly representative sample input and expected output, in which punctuation, differences in capitalization, multiple paragraphs, duplicated words at the start/end of sentences and various other non-trivial cases are shown.
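
If the goal is only to report repeats that may span a line break, rather than remove them, a much simpler sketch is possible. The function name report_repeats is made up for illustration, and the approach is a deliberate simplification, not the script above: it compares whitespace-separated fields, so punctuation attached to a word counts as part of that word, and the comparison is case-sensitive.

```shell
# Hypothetical helper (not part of the answer): print "LINE: word" whenever a
# whitespace-separated field equals the field before it, even across a newline.
# Simplification: punctuation attached to a word is treated as part of the word.
report_repeats() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == prev) print NR ": " $i   # NR is the line holding the second copy
            prev = $i
        }
    }' "$@"
}

printf '%s\n' 'same word twice' 'twice as as here' | report_repeats
```

Because prev is carried over from one input line to the next, a word that ends one line and starts the following one is reported as well.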

Ed Morton

Posted 2014-11-22T22:46:59.257

Reputation: 131

1

Perlishly, I'd be thinking:

use strict;
use warnings;

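# Undefine the input record separator so <DATA> is read in one go.
local $/;

my $slurp = <DATA>;
# Collapse a word followed by one or more copies of itself, each preceded by a
# single non-word character. The trailing \b stops "the thesis" from matching.
$slurp =~ s/\b(\w+)(?:\W\1\b)+/$1/g;
print $slurp;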

__DATA__
Hi! Hi, same same? word twice twice, as as here here! ! ,123 123 need
need as here 

Bear in mind though - a lot of pattern matching is line oriented, so you've got to be careful if you cross line boundaries. If you can exclude that case, then you've got an easier job because you can parse one line at a time. I'm not doing that, so you'll end up reading the whole file into memory.

Sobrique

Posted 2014-11-22T22:46:59.257

Reputation: 416

That's great. I would have preferred a bash-based answer, but this is OK as well. – altroware – 2014-12-15T20:32:03.963

Perl is in nearly as many places as bash, and is more fully featured as a programming language. – Sobrique – 2014-12-15T20:47:15.033