How to extract bytes from the middle of a file?

1

We are parsing some large EDI files that do not contain CR/LF. However, they do have ~ (tilde) as a segment delimiter.

I am trying to extract the control record for the file and the last bytes of my 120 MB file look something like this:

~REF*1L*0711882~SE*62300*39093~GE*1*500001242~IEA*1*500001241~

There is only one control record in the file and it always starts with ~SE.

So, is there an easy way using standard Unix cut, awk, grep, etc. tools to cut this file to get the SE*62300*39093 segment, other than converting the ~ to CRLF and tailing the last three lines of the file?

Disclaimer:
I am not a Unix guru, so the answer may be obvious to an experienced user. Also, I have no control over the file format.

Noah

Posted 2013-01-31T20:01:47.573

Reputation: 2 337

What's wrong with converting the ~ to newlines and tailing the last 3 lines of the file? If the file is known not to contain newlines already, this introduces no ambiguity into the format, and frankly it's the best way to massage the file into a form that all those line-based tools can work with easily. – Celada – 2013-01-31T20:11:04.483

@Celada: I'm not a unix person, but converting hundreds of megabytes to extract the last 100 or so characters just seems like overkill. Some of these files can be very large, and I'm looking for the easiest way to do this. – Noah – 2013-01-31T20:15:49.023

You can filter down to the last few lines of a file using tail; there is no need to parse it all. Something like tail edi_file | grep ~SE | cut -d'~' -f 3 (where edi_file is the name of your large file). Disclaimer: the example only works if the required field is field #3 (fields being delimited by ~, as set by -d'~'), so that might need adjusting. Can we get a larger example of the input file? – Hennes – 2013-01-31T20:16:57.143

120MB is not so big. Nobody ever worried about squeezing every last bit of performance out of a shell script. If you want that, use C :-) So Michael Kohne's answer is pretty much what I would do. Or if the file really is too big for you to want to read the whole thing, pre-filter it with something like tail --bytes=5000 ding... and then you hope that the last 5000 bytes are enough to encompass the 3 lines that you need. – Celada – 2013-01-31T20:33:19.277

For a one-off thing, I agree: let it run. For something used daily I prefer to parse only the tail, both because it is not wasteful and because needless waste just feels wrong. (Not that spending 20 minutes trying to come up with an answer is not wasteful. No --bytes option in BSD find though.) – Hennes – 2013-01-31T20:36:59.457

@Hennes: You may have missed it in the question, there is only 1 line in the file. – Noah – 2013-01-31T20:40:43.887

Aye. Missed that 5 minutes into solving it. Tried to come up with a way to only read the last few KB of a file, but no option for that in BSD's find. Ended with tr "~" "\n" < edi_file | tail -20 | grep ^SE which already matches the answer from Michael. – Hennes – 2013-01-31T20:42:59.990

Answers

3

You can do this with:
tr "~" "\n" < edi_file | tail -20 | grep ^SE

The tr translates all tildes to newlines. (Those are represented by a \n).

The output is then fed to tail, which discards all but the last 20 lines.

You can probably fine-tune that number, depending on what you want to search for. Without tail, the whole file would be fed to grep, which is probably a lot more resource intensive than tail. If your version of tail can select the end of a file by bytes rather than by lines, you could even apply it one step sooner, before tr.

I did not choose that option because your post is tagged generic unix rather than modern linux with up to date GNU tools and GNU specific extensions.

Finally, grep filters the remaining lines down to those containing SE, and the caret (^) makes sure the match is at the beginning of a line. (This prevents input like ~fooooSEfoobarquz~SEwewantthispartonly~boobar~ from showing two lines.)
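The byte-based shortcut mentioned above could be sketched roughly as follows. This is only an illustration: the function name extract_se and the default window of 5000 bytes are assumptions, and although tail -c is specified by POSIX, it is worth checking the man page on your system.

```shell
# Sketch: print the SE segment while reading only the end of the file.
# extract_se and the 5000-byte default are illustrative, not from the question.
extract_se() {
    # $1 = path to the EDI file
    # $2 = number of trailing bytes to read (defaults to 5000)
    tail -c "${2:-5000}" "$1" |   # keep only the last N bytes
        tr '~' '\n' |             # one segment per line
        grep '^SE'                # keep lines that start with SE
}
```

For example, extract_se edi_file 2000 would read only the last 2000 bytes. The window has to be large enough to contain the whole SE segment, and because it can start in the middle of a segment, the first line after tr may be a fragment.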

Hennes

Posted 2013-01-31T20:01:47.573

Reputation: 60 739

4

While I can see not wanting to modify the original file, you can do the translation in a pipe. That way, you're not modifying the data, but you still get the benefit (in Unix utility terms) of turning ~ into end-of-line.

This should do the trick:

cat ding | tr "~" "\n" | tail -3

It is not the most efficient thing in the universe, but even on a 120 MB file it shouldn't be a big deal to run.

Note the quotes on the two sets are not optional - both ~ and \n will get interpreted by the shell if you drop the quotes.
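A quick way to see what the shell does to the two arguments, as a small sketch (the variable names are just for illustration):

```shell
# Unquoted, the shell rewrites both arguments before the command sees them:
# ~ becomes the home directory and \n becomes a plain n.
unquoted=$(printf '%s|%s' ~ \n)

# Quoted, both arguments arrive unchanged.
quoted=$(printf '%s|%s' "~" "\n")

printf '%s\n' "$unquoted"
printf '%s\n' "$quoted"
```

The first line printed shows the home directory followed by |n, the second shows the literal ~ and \n that tr needs.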

Michael Kohne

Posted 2013-01-31T20:01:47.573

Reputation: 3 808

tr "~" "\n" < edi_file | tail -20 | grep ^SE ? (No need to use cat when input can be redirected. The grep shows only the segments starting with SE.) – Hennes – 2013-01-31T20:41:12.240

@Hennes: This is a simpler answer. Can you add it so I can accept it? What I ended up using was tr "~" "\n" < edi_file | tail -3 | head -n 1. However, this only works because I know that SE is always the third-to-last segment. – Noah – 2013-02-01T17:13:23.947

Done. Knowing your specific data format helps. I added some more explanations to the post below and to the reason why I used that. – Hennes – 2013-02-01T17:24:52.450

2

It will be inefficient on large files to tr first, because you actually want data from the end, and tr will process data that will be discarded.

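Use tac to read the file in reverse, splitting on ~ instead of on newlines. GNU tac keeps the ~ separators in its output rather than turning them into newlines, so translate them with tr after tac. Then take the first 20 lines (of the reversed stream, so actually the last 20 segments), reverse again to restore the original order, and grep:

tac -s '~' edi_file | tr '~' '\n' | head -n 20 | tac | grep '^SE'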

Remember that you can't seek() a pipe!
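That is why the file name is handed to tac directly instead of being piped in. The sketch below assumes GNU tac, and last_se is a hypothetical helper name. It first shows what tac -s emits on a tiny input, then wraps the reverse-reading pipeline in a function.

```shell
# What tac -s does with the separator (GNU coreutils):
printf 'a~b~c~' | tac -s '~'    # prints c~b~a~ (records reversed, ~ kept)
printf '\n'

# Sketch: find the SE segment by reading the file from the end.
# last_se is an illustrative name, not part of any tool.
last_se() {
    # $1 = path to the EDI file
    tac -s '~' "$1" |     # emit ~-terminated records, last one first
        tr '~' '\n' |     # turn the kept separators into newlines
        head -n 20 |      # stop after the last 20 segments
        tac |             # restore the original order
        grep '^SE'        # keep the segment that starts with SE
}
```

Because head exits after 20 lines, the earlier stages should be stopped well before tac has walked back through the whole file.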

Janus Troelsen

Posted 2013-01-31T20:01:47.573

Reputation: 1 958

You'll want to quote the ~ characters - depending on shell, a lone ~ may get expanded into something. – Michael Kohne – 2013-02-01T19:15:12.430

@MichaelKohne: Yes. But it seems that tac will convert to newlines itself, so tr shouldn't be needed – Janus Troelsen – 2013-02-01T19:17:02.077

@ysangkok: You may have missed the point that there is only 1 line in the file. – Noah – 2013-02-04T18:17:01.830

@Noah: That's why I use the -s flag for tac – Janus Troelsen – 2013-02-04T18:47:44.133

@ysangkok: I didn't tag the question solaris because I did not think it would matter. But it appears that tac is not supported under solaris. I upvoted your answer because I learned something new and it looks like it would have worked on other *nx systems – Noah – 2013-02-05T01:46:34.950