This shell script is mostly the work of other people. It has gone through several iterations, and I have tweaked it slightly while also trying to fully understand how it works. I think I understand it now, but I don't have confidence to significantly alter it on my own and risk losing data when I run the altered version. So I would appreciate some expert guidance on how to improve this script.
The changes I am seeking are:
- make it even more robust to any strange file names, if possible. It currently handles spaces in file names, but not newlines. I can live with that (because I try to find any file names with newlines and get rid of them).
- make it more intelligent about which file gets retained as the actual inode content and which file(s) become symlinks. I would like to be able to choose to retain the file that a) has the shortest path, b) has the longest path or c) has the filename with the most alpha characters (which will probably be the most descriptive name).
- allow it to read the directories to process either from parameters passed in or from a file.
- optionally, write a log of all changes and/or all files not processed.
Of all of these, #2 is the most important for me right now. I need to process some files with it and I need to improve the way it chooses which files to turn into symlinks. (I tried using things like the find option -depth without success.)
Here's the current script:
#!/bin/bash
# clean up known problematic files first.
## find /home -type f -wholename '*Icon*
## *' -exec rm '{}' \;
# Configure script environment
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
set -o nounset
dir='/SOME/PATH/HERE/'
# For each path which has multiple links
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# (except ones containing newline)
last_inode=
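while IFS= read -r path_info
do
    #echo "DEBUG: path_info: '$path_info'"
    inode=${path_info%%:*}
    path=${path_info#*:}
    if [[ $last_inode != "$inode" ]]; then
        # First path seen for this inode: keep it as the real file
        last_inode=$inode
        path_to_keep=$path
    else
        # Any further path to the same inode becomes a symlink to the kept path.
        # Paths are passed as printf arguments so a '%' in a name is printed literally.
        printf "ln -s\t'%s'\t'%s'\n" "$path_to_keep" "$path"
        rm "$path"
        ln -s "$path_to_keep" "$path"
    fi
done < <( find "$dir" -type f -links +1 ! -wholename '*
*' -printf '%i:%p\n' | sort --field-separator=: )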
# Warn about any excluded files
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
buf=$( find "$dir" -type f -links +1 -path '*
*' )
if [[ $buf != '' ]]; then
echo 'Some files not processed because their paths contained newline(s):'$'\n'"$buf"
fi
exit 0
Thank you. I have a couple questions. I'll focus on item 2 first. For finding the most descriptive names, I want to use something like
printf -- "$(basename "$path")" | tr -cd [:alpha:] | wc -m
in the line done <<(find ...
and sort first on inode number, then on alpha char count. I just can't glue it all together. Can you help? – MountainX – 2012-03-29T16:48:07.020
Regarding path depth, find has the printf % directive
%d File's depth in the directory tree
which I could add to the existing find command in done <<(find ... . Then I just need to be able to choose whether to sort on alpha char count or path depth. Not sure how to do that. – MountainX – 2012-03-29T16:52:05.470
To sort on a different key, e.g. the second, use 'sort -t: -k2'. By setting the key number according to alnum or path depth preference you can choose sorting order. You can also sort on multiple keys using multiple -k arguments. The syntax is a bit weird, but 'sort -k 1,1 -k 3,3' sorts on field 1, then field 3 (the ,1 is needed to tell sort to "stop looking" after field 1). This will enable you to sort on inode and path depth as in your first comment, and on a chosen field as in your second comment. – Daniel Andersson – 2012-03-29T17:25:02.373
This code
basename "$myfilename" | tr -cd [:alpha:] | wc -m
gives me the alpha char count of the filename, but I can't figure out how to get that into the find output so I can sort on it. In other words, I need a find command that will printf %i:%d:%p\n PLUS print this alpha char count. Then I'll sort. I can't figure out how to integrate those find/printf commands. This was the essence of my first comment/question above. Thanks. – MountainX – 2012-03-29T18:07:38.527
I think you are really nearing the limit of what one can cram into a single pipe in that way. If you really want to keep the structure of the script, you probably need to throw in awk in the find pipe, where you can certainly do the tr -cd | wc -m dance and more. You are not really gaining much in performance from the pipe any longer since sort needs all input before acting, so the "stream" is held up. I don't really have time for more writing for the evening, though, but good look in your quest. – Daniel Andersson – 2012-03-29T18:17:29.887
I turned that last comment into a question here: http://superuser.com/questions/406275/how-to-find-files-print-some-standard-info-about-those-files-plus-print-the-al
– MountainX – 2012-03-29T18:21:13.177
Thanks. I guess awk is the way to go. I have not used awk much. I'm not against starting from scratch with a better solution either. – MountainX – 2012-03-29T18:30:35.253
("look"->"luck" in the last line of my last comment). I saw the awk thread, where, as in my prophecy, it could be pushed into the command to solve the character count. A comment on the command/pipes/redirections: <() is a Bash construct which executes the command, stores output in a temp file and returns the filename. If one simply did this step manually, one would gain direct control over the output in a different way. To compress the commands as they are currently done does not gain any real performance, is less portable and, as noticed, difficult to modify. – Daniel Andersson – 2012-03-30T07:52:51.627
Added note: rm "$path"; ln -s "$path_to_keep" "$path" could be replaced by just ln -sf "$path_to_keep" "$path". – Daniel Andersson – 2012-03-30T07:54:22.903
And instead of shell substring removal such as in inode=${path_info%%:*}, you should probably start using cut now that there are more than two output fields (unless you only need the first and last arguments, but cut is a more "semantically correct" solution). – Daniel Andersson – 2012-03-30T08:00:22.567
Thank you for all the tips and suggestions. Those are very helpful. – MountainX – 2012-03-30T13:31:39.653
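For anyone trying to glue the pieces from this thread together, here is a minimal sketch of one way it might be done. It is not the script from the question and it changes nothing on disk: it only prints one line per multiply-linked file as inode:depth:alpha_count:path, sorted so that the preferred path comes first within each inode. The function names alpha_count and list_candidates and the mode names shortest, longest and alpha are made up for illustration. It assumes GNU find (for -printf), treats "shortest/longest path" as directory depth (%d) rather than string length, and skips paths containing newlines just as the original script does.

```shell
#!/bin/bash
# Sketch only: lists candidates, never removes or links anything.
# Function and mode names are hypothetical; GNU find is assumed for -printf.

# Print the number of alphabetic characters in the last component of a path,
# using the tr/wc approach from the comments above.
alpha_count() {
    local name count
    name=${1##*/}                     # strip the directory part, like basename
    count=$(printf '%s' "$name" | tr -cd '[:alpha:]' | wc -m)
    echo "$((count))"                 # arithmetic drops any padding wc may add
}

# Print inode:depth:alpha_count:path for every multiply-linked file under $1.
# $2 picks what comes first within each inode: shortest | longest | alpha.
list_candidates() {
    local dir mode key nl line inode rest depth path
    dir=$1
    mode=$2
    nl='
'
    case $mode in
        shortest) key=-k2,2n  ;;      # smallest depth first
        longest)  key=-k2,2nr ;;      # greatest depth first
        alpha)    key=-k3,3nr ;;      # most alphabetic characters first
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    find "$dir" -type f -links +1 ! -path "*${nl}*" -printf '%i:%d:%p\n' |
    while IFS= read -r line; do
        # Peel off the first two fields; the path may itself contain colons.
        inode=${line%%:*}; rest=${line#*:}
        depth=${rest%%:*}; path=${rest#*:}
        printf '%s:%s:%s:%s\n' "$inode" "$depth" "$(alpha_count "$path")" "$path"
    done |
    sort -t: -k1,1n "$key"            # group by inode, then apply the chosen key
}
```

Calling list_candidates /SOME/PATH/HERE/ alpha would then feed the existing while loop, which would need to strip three leading fields instead of one before comparing inodes. Running tr and wc once per file is slow on large trees; the awk route suggested above would likely be faster.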