Merging two files in unix, with one common column that is redundant

0

I have two files with one common column that is redundant.File 1 has chromosomal locations and TF's, file 2 has chromosomal locations and Refseq numbers.

File 1:
chr1:66997824-67000456      ZNF333
chr1:66997824-67000456      EGR1
chr1:66997824-67000456      MZF-1
chr22:51221989-51222166      Zic2
chr22:51221989-51222166      ZF5

File 2:
chr1:66997824-67000456      Refseq#1
chr22:51221989-51222166      Refseq#22

I would like to merge these two files, and create a new file with three columns,
chr1:66997824-67000456     ZNF333      Refseq#1
chr1:66997824-67000456      EGR1      Refseq#1
chr1:66997824-67000456      MZF-1      Refseq#1
chr22:51221989-51222166      Zic2      Refseq#22
chr22:51221989-51222166      ZF5      Refseq#22

Since the chromosomal locations are redundant, I could not merge them using join in Unix - Is there a way to merge using sed or awk?

ABB

Posted 2015-10-31T14:41:06.023

Reputation: 1

Please take a look at editing-help.

– Cyrus – 2015-10-31T16:08:14.447

Try join command. – 2991ambusher – 2015-10-31T16:17:03.830

Answers

1

join file1 file2

Output:

chr1:66997824-67000456 ZNF333 Refseq#1
chr1:66997824-67000456 EGR1 Refseq#1
chr1:66997824-67000456 MZF-1 Refseq#1
chr22:51221989-51222166 Zic2 Refseq#22
chr22:51221989-51222166 ZF5 Refseq#22

or

join file1 file2 | awk '{OFS="     ";print $1,$2,$3}'

Output:

chr1:66997824-67000456     ZNF333     Refseq#1
chr1:66997824-67000456     EGR1     Refseq#1
chr1:66997824-67000456     MZF-1     Refseq#1
chr22:51221989-51222166     Zic2     Refseq#22
chr22:51221989-51222166     ZF5     Refseq#22

Cyrus

Posted 2015-10-31T14:41:06.023

Reputation: 4 356

It didn't work, it exited after printing the first refseq#. My first file has 2000+records with chromosomal locations chr1:66997824-67000456 - the join command printed Refseq#1 for those 2000+ records and then exited. My input file has almost 20000+ lines whereas the output has only 2000+ lines. – ABB – 2015-11-02T15:27:37.887

@ABB: It should work though, only IF THE FILES ARE SORTED though. Use sort if they are not. Else, are you sure there is no mismatch between the first field of the two files? To test, try first with a subset, e.g using uniq -w 24 on file 1 and try to track your bug. – Joce – 2015-11-03T15:49:39.957

@ABB: Try this: join <(sort file1) <(sort file2) – Cyrus – 2015-11-05T06:38:32.217