Sort all files in a folder independently, with an output file for each

0

I have several folders that each contain numerous text files, ranging from tens to hundreds. These text files are simple databases containing millions of lines, with each line containing a single record. However, the records in them are unsorted and contain many duplicates. I'd like to sort and de-duplicate them all individually (i.e. independently of each other), but to my understanding, sort can only produce a concatenated output of all input files: even if given multiple files, it will only produce one output file containing the combined results of all those files.

How can I sort all files in the current folder to produce an individually sorted output file for each one? I'd also like the output files to be written to a subfolder within the current directory. A for loop is the obvious solution to me, but I'm asking here in case there's some simpler way to do this with sort that I've missed. My bash knowledge is also very lacking, so if a for loop is the simplest solution, I'd appreciate someone showing the best way to go about it rather than me spending many days hacking something together that would still fall short of what I want to do.

Hashim

Posted 2018-09-12T23:17:31.103

Reputation: 6 967

sort -u provides a single record list. – matzeri – 2018-09-13T09:26:17.147

@matzeri Are you trolling? – Hashim – 2018-09-13T16:29:20.487

Answers

1

Yes, you can do this with for. Even if there is "some simpler way to do this with sort" (but I don't think so), this is also quite simple:

# cd to the directory you want to process

mkdir sorted
for file in *; do
   printf 'Processing %s\n' "$file"
   [ -f "$file" ] && sort -u "$file" > "./sorted/$file"
done

Notes:

  • for file in * doesn't process files in subdirectories.
  • printf is only to report progress. In fact it should be placed after [ ... ] (see below) but I don't want to overcomplicate the code. You can just remove the printf line, if you want the whole thing to be silent.
  • [ -f "$file" ] checks if $file is a regular file. With the most general pattern (i.e. *) we need this condition at least to avoid running sort with the sorted directory as an argument (this would throw an error, harmless but inelegant). Most likely this test won't be needed if you use a more specific glob like *.txt or *.db instead of * (a specific glob also skips stray files that shouldn't be processed, e.g. desktop.ini). In this case you can omit [ ... ] && and start the line with sort (leaving the line intact shouldn't hurt though).
  • sort supports various options and you may want to use some of them, depending on how you need to sort.

  • sort -u de-duplicates entries as part of sorting, so when you are already using sort there is no need to pipe its output through a separate uniq command.
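For illustration, here is a sketch of the same loop with the printf moved behind the file test, as the second note suggests. It runs in a scratch directory created with mktemp so it can be tried safely; the file names one.txt and two.txt are made up for the demo and are not from the question.

```shell
#!/bin/sh
# Demo setup (hypothetical data): a scratch directory with two unsorted files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\nb\n' > one.txt
printf 'z\ny\nz\ny\n' > two.txt

mkdir sorted
for file in *; do
    # Only regular files are reported and sorted; the sorted directory
    # itself matches * but is skipped by the test.
    if [ -f "$file" ]; then
        printf 'Processing %s\n' "$file"
        sort -u "$file" > "./sorted/$file"
    fi
done
```

With an if block, progress is reported only for files that are actually sorted, at the cost of two extra lines.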

If you needed to pick files according to conditions more complex than a simple glob, find might be better to start with. For your current task for should be fine.
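As a rough sketch of what the find variant could look like: the -name '*.txt' condition is only a placeholder for whatever selection you need, and the demo files are made up. It assumes a find that supports -maxdepth (GNU and BSD find do; it is not in POSIX).

```shell
#!/bin/sh
# Demo setup (hypothetical data) in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\nb\n' > one.txt
printf 'not a database\n' > notes.md
mkdir sorted

# -maxdepth 1 keeps find out of subdirectories (including sorted/),
# -type f selects regular files, -name is the placeholder condition.
# The matched paths are handed to a small inline sh script in batches.
find . -maxdepth 1 -type f -name '*.txt' -exec sh -c '
    for path in "$@"; do
        name=${path#./}              # strip the leading ./ that find adds
        sort -u "$path" > "sorted/$name"
    done
' sh {} +
```

Here find replaces both the glob and the [ -f ... ] test, which only pays off once the selection criteria get more involved (size, age, depth and so on).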

Kamil Maciorowski

Posted 2018-09-12T23:17:31.103

Reputation: 38 429

This looks perfect, thank you so much. There is a part of the code that I'm a little confused on - where does the [ -f "$file" ] come from? That is, what command does the -f parameter belong to, and what exactly do the square brackets mean? I haven't seen that sort of syntax used in a bash file before, at least not in this context. – Hashim – 2018-09-13T19:09:05.153

Okay, I found my answer here that it's synonymous with the test command: https://stackoverflow.com/questions/28036052/what-does-f-mea, but for clarification, what exactly does this mean? That [ ... ] is shorthand for the test command?

– Hashim – 2018-09-13T19:22:29.123

@Hashim See help [ in Bash. – Kamil Maciorowski – 2018-09-13T19:41:52.613
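To make the relationship concrete: [ is an ordinary command (a shell builtin in Bash) that behaves like test, except that it requires a closing ] as its last argument. A small sketch, using temporary paths created only for the demo:

```shell
#!/bin/sh
# Temporary file and directory, created only for this demo.
tmpfile=$(mktemp)
tmpdir=$(mktemp -d)

# The same check written with test and with [ ... ].
if test -f "$tmpfile"; then result_test=yes; else result_test=no; fi
if [ -f "$tmpfile" ]; then result_bracket=yes; else result_bracket=no; fi

# -f is false for a directory, which is why the loop skips the sorted folder.
if [ -f "$tmpdir" ]; then result_dir=yes; else result_dir=no; fi

printf 'test: %s, [ ]: %s, directory: %s\n' \
    "$result_test" "$result_bracket" "$result_dir"

rm -f "$tmpfile"
rmdir "$tmpdir"
```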

Unfortunately that doesn't give any information other than saying it's a synonym, so I suppose that's the only answer. There is one more thing - I've become accustomed to doing LC_ALL=C sort to make what would usually be a ridiculously slow sort command a lot faster by limiting it to the plain C (POSIX) locale, which compares raw bytes. Do you know of any way to incorporate that into your code above? Simply adding it before sort as normal results in a syntax error near unexpected token &&. It's not a major problem if it can't be done, but it would be nice to find out whether it can be. – Hashim – 2018-09-13T21:20:06.400

Also, I hope you don't mind but I made some minor edits to the code that I feel make it slightly less redundant, feel free to revert them if there was a reason for the way you coded it originally. – Hashim – 2018-09-13T21:21:59.407

@Hashim [ -f "$file" ] && LC_ALL=C sort … works for me on Debian. – Kamil Maciorowski – 2018-09-13T21:29:23.203
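A minimal sketch of that line in action, with a made-up input file. The LC_ALL=C prefix applies only to the sort command it precedes, and in the C locale lines are compared byte by byte, so uppercase letters sort before lowercase ones:

```shell
#!/bin/sh
# Demo setup (hypothetical data) in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\nA\na\nB\nA\n' > mixed.txt
file=mixed.txt

# The variable assignment sits directly in front of sort, after the &&.
[ -f "$file" ] && LC_ALL=C sort -u "$file" > sorted_c.txt

cat sorted_c.txt
```

The result is A, B, a, b on separate lines. A locale-aware sort may order the same input differently, so it is worth checking that byte order is acceptable for your records before relying on the speed-up.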

Never mind, I was being slightly dense. I had decided to take out the file test command but had forgotten what you said about starting the line with sort. That gave me the following: https://pastebin.com/hxs9xNBm. I suppose I assumed that the && would chain on from the printf command just like it did with the file test command. Is there a reason this isn't the case?

– Hashim – 2018-09-13T21:46:41.460

@Hashim Yes. When the shell parses printf, it has no reason to look forward to the next line. When the shell encounters your &&, the line with printf has already been parsed and executed. – Kamil Maciorowski – 2018-09-13T21:51:52.283
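The difference can be checked without running anything, using the shell's syntax-check mode (sh -n). The two script files below are made up for the demo: one starts a line with &&, the other ends the previous line with it, which tells the shell that the command list continues.

```shell
#!/bin/sh
demo=$(mktemp -d)

# Broken: a line that starts with && has no command on its left.
cat > "$demo/broken.sh" <<'EOF'
printf 'Processing\n'
&& sort -u data.txt
EOF

# Working: a trailing && makes the shell read on to the next line.
cat > "$demo/working.sh" <<'EOF'
printf 'Processing\n' &&
sort -u data.txt
EOF

# sh -n parses the script and reports syntax errors without executing it.
if sh -n "$demo/broken.sh" 2>/dev/null; then
    broken_status=ok
else
    broken_status=syntax-error
fi
if sh -n "$demo/working.sh" 2>/dev/null; then
    working_status=ok
else
    working_status=syntax-error
fi

printf 'broken.sh: %s\nworking.sh: %s\n' "$broken_status" "$working_status"
```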

Ah I see, that makes sense. Sorry to keep bothering you with this but I promise this is the last question I'll ask for this piece of code. :P I've been trying to figure out how to modify the code in question to output the sorted file in the same directory, but with a _sorted suffix at the end of it. Is that a lot harder to do? I can't seem to find any help for it elsewhere hence why I'm still bothering you. So far I have LC_ALL=C sort -u "$i" >> "./$i_sorted", but for some reason that stops with Is a directory errors even though it looks to me like a file and there is no such directory. – Hashim – 2018-09-13T22:07:30.320

Let us continue this discussion in chat.

– Kamil Maciorowski – 2018-09-13T22:12:19.783
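For readers who hit the same "Is a directory" error: the likely cause is that the shell reads "$i_sorted" as a variable named i_sorted, because underscores are valid in variable names. That variable is unset, so "./$i_sorted" expands to "./", which is a directory. Braces mark where the name ends. A sketch with a made-up file name:

```shell
#!/bin/sh
# Demo setup (hypothetical data) in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\nb\n' > data.txt
i=data.txt

# Without braces the shell looks up a variable called i_sorted (unset here).
broken_target="./$i_sorted"
# With braces the variable name ends at i and _sorted is literal text.
fixed_target="./${i}_sorted"

printf 'without braces: %s\nwith braces:    %s\n' "$broken_target" "$fixed_target"

# > overwrites the target; >> would append to it on every run.
LC_ALL=C sort -u "$i" > "$fixed_target"
```

Note that when this runs inside a for file in * loop, a second run would also pick up the *_sorted files from the first run, so a more specific glob or a separate output folder may still be the safer choice.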