How to improve this bash shell script for turning hardlinks into symlinks?

This shell script is mostly the work of other people. It has gone through several iterations, and I have tweaked it slightly while also trying to fully understand how it works. I think I understand it now, but I am not confident enough to alter it significantly on my own and risk losing data when I run the altered version. So I would appreciate some expert guidance on how to improve this script.

The changes I am seeking are:

  1. make it even more robust to any strange file names, if possible. It currently handles spaces in file names, but not newlines. I can live with that (because I try to find any file names with newlines and get rid of them).
  2. make it more intelligent about which file gets retained as the actual inode content and which file(s) become symlinks. I would like to be able to choose to retain the file that either a) has the shortest path, b) has the longest path or c) has the filename with the most alpha characters (which will probably be the most descriptive name).
  3. allow it to read the directories to process either from parameters passed in or from a file.
  4. optionally, write a log of all changes and/or all files not processed.

Of all of these, #2 is the most important for me right now. I need to process some files with it and I need to improve the way it chooses which files to turn into symlinks. (I tried using things like the find option -depth without success.)

Here's the current script:

#!/bin/bash

# clean up known problematic files first.
## find /home -type f -wholename '*Icon*
## *' -exec rm '{}' \;

# Configure script environment
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
set -o nounset
dir='/SOME/PATH/HERE/'

# For each path which has multiple links
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# (except ones containing newline)
last_inode=
while IFS= read -r path_info
do
   #echo "DEBUG: path_info: '$path_info'"
   inode=${path_info%%:*}
   path=${path_info#*:}
   if [[ $last_inode != $inode ]]; then
       last_inode=$inode
       path_to_keep=$path
   else
       printf "ln -s\t'$path_to_keep'\t'$path'\n"
       rm "$path"
       ln -s "$path_to_keep" "$path"
   fi
done < <( find "$dir" -type f -links +1 ! -wholename '*
*' -printf '%i:%p\n' | sort --field-separator=: )

# Warn about any excluded files
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
buf=$( find "$dir" -type f -links +1 -path '*
*' )
if [[ $buf != '' ]]; then
    echo 'Some files not processed because their paths contained newline(s):'$'\n'"$buf"
fi

exit 0

MountainX

Posted 2012-03-29T05:09:08.887

Reputation: 1 735

Answers

2

1.

One simple change to avoid dying on file names that start with - is to add -- (which means "all options have been given, only positional arguments are left") before the file name arguments start, e.g.

rm -- "$path"
ln -s -- "$path_to_keep" "$path"

and so on.


2.

To count alpha ("alphanumeric" is probably what you really want) characters in a file name you could do

numberofalnum=$(printf '%s' "$path" | tr -cd '[:alnum:]' | wc -m)

To count path depth, you could simply count the occurrences of '/' in the path. A caveat is that /home///daniel is equivalent to /home/daniel, but find won't output unnecessary multiple slashes, so it will be all right.

depth=$(printf '%s' "$path" | tr -cd / | wc -m)

One could also collapse multiple slashes by running tr -s / after printf. Combining -s, -c and -d in this way in a single invocation is not really possible, it seems.

In this case, since find is already used in the script, adding a :-separated field with %d to the -printf output will print the depth directly, as noted in the comments below.
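
As a rough sketch of the %d approach, assuming GNU find (needed for -printf) and a three-field "inode:depth:path" layout, the listing could look like this. The function name list_links is only illustrative and not part of the original script:

```shell
#!/bin/bash
# Sketch only: list_links is a hypothetical helper; requires GNU find for -printf.
list_links() {
    # %i = inode number, %d = depth below the starting directory, %p = path
    find "$1" -type f -links +1 -printf '%i:%d:%p\n' |
        # inode ascending, then depth ascending: shallowest path first per inode
        sort -t: -k1,1n -k2,2n
}
```

With this ordering, the first line seen for each inode is the shallowest path, which the existing loop would then treat as the file to keep. Using -k2,2nr instead should put the deepest path first.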


3a.

To read directories as arguments from the command line, see this minimal snippet:

#!/bin/sh
i=0
while [ $# -ne 0 ]; do
    printf -- 'Argument %d: %s\n' "${i}" "${1}"
    i=$((i+1))
    shift
done

($i is just a counter to show you what is happening)

If you wrap your logic in such a while loop, you can access the first argument as ${1}, then use shift which pops the first item off the argument list, and then iterate again and now ${1} is the originally second argument. Do this while the argument count $# is not 0.


3b.

To read the arguments from a file, wrap it instead like

#!/bin/sh
i=1
while IFS= read -r line; do
    printf -- 'Argument %d: %s\n' "${i}" "${line}"
    i=$((i+1))
done < "${1}"

Tip: instead of just increasing indent and wrapping the whole file logic that way, create functions of the current logic and call them at the end of the script. This will easily enable you to choose between either giving directories as arguments or reading them from a file without duplicating code in your script.
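
A minimal sketch of that function-based layout might look like the following. All names here (process_dir, process_list_file, main) and the -f switch are assumptions made for illustration; process_dir only stands in for the real per-directory logic:

```shell
#!/bin/bash
# Hypothetical skeleton: process_dir is a placeholder for the existing logic.
process_dir() {
    printf 'Processing: %s\n' "$1"
}

# Read one directory per line from the file named in $1.
process_list_file() {
    local dir
    while IFS= read -r dir; do
        if [ -n "$dir" ]; then      # skip empty lines
            process_dir "$dir"
        fi
    done < "$1"
}

# "-f FILE" reads directories from FILE; otherwise each argument is a directory.
main() {
    local dir
    if [ "${1-}" = "-f" ]; then
        process_list_file "$2"
    else
        for dir in "$@"; do
            process_dir "$dir"
        done
    fi
}

main "$@"
```

Called as `script.sh /dir1 /dir2` it would walk the arguments, and as `script.sh -f dirs.txt` it would read the list from the file, with the per-directory code written only once.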


4.

Add

printf 'My descriptive log message for path %s\n' "${path}" >> "${logfile}"

in the logic blocks where you have decided whether or not to take action. Set $logfile earlier in the script to the desired log path.
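
One possible way to keep the log lines uniform is a small helper. This is only a sketch: log_action is a made-up name and the default log path is an assumption.

```shell
#!/bin/bash
# Assumed default location; override by setting logfile before calling log_action.
logfile=${logfile:-/tmp/hardlink-to-symlink.log}

# Hypothetical helper: $1 = action (e.g. "linked" or "skipped"), $2 = path.
log_action() {
    printf '%s\t%s\n' "$1" "$2" >> "$logfile"
}
```

A call such as `log_action skipped "$path"` would then go in each branch of the script.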

Daniel Andersson

Posted 2012-03-29T05:09:08.887

Reputation: 20 465

Thank you. I have a couple of questions. I'll focus on item 2 first. For finding the most descriptive names, I want to use something like printf -- "$(basename "$path")" | tr -cd [:alpha:] | wc -m in the line done < <(find ... and sort first on inode number, then on alpha char count. I just can't glue it all together. Can you help? – MountainX – 2012-03-29T16:48:07.020

Regarding path depth, find has the -printf directive %d (file's depth in the directory tree), which I could add to the existing find command in done < <(find .... Then I just need to be able to choose whether to sort on alpha char count or path depth. Not sure how to do that. – MountainX – 2012-03-29T16:52:05.470

To sort on a different key, e.g. the second, use 'sort -t: -k2'. By setting the key number according to alnum or path depth preference you can choose sorting order. You can also sort on multiple keys using multiple -k arguments. The syntax is a bit weird, but 'sort -k 1,1 -k 3,3' sorts on field 1, then field 3 (the ,1 is needed to tell sort to "stop looking" after field 1). This will enable you to sort on inode and path depth as in your first comment, and on a chosen field as in your second comment. – Daniel Andersson – 2012-03-29T17:25:02.373
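
As an illustration of the multi-key sort, here is a sketch that adds an alpha-character count to "inode:path" lines in plain Bash and then sorts on inode and count. The helper names annotate_alpha and sort_by_alpha are hypothetical, and the input layout is an assumption:

```shell
#!/bin/bash
# Sketch: turn "inode:path" lines into "inode:alphacount:path" lines.
annotate_alpha() {
    local inode path name letters
    # The last variable (path) receives the rest of the line, colons included.
    while IFS=: read -r inode path; do
        name=${path##*/}                  # file name without directories
        letters=${name//[^[:alpha:]]/}    # keep only alphabetic characters
        printf '%s:%s:%s\n' "$inode" "${#letters}" "$path"
    done
}

# Field 1 (inode) ascending, then field 2 (alpha count) descending,
# so the name with the most letters comes first within each inode.
sort_by_alpha() {
    sort -t: -k1,1n -k2,2nr
}
```

Piping the find output through `annotate_alpha | sort_by_alpha` should put the most descriptive name first for each inode. Changing the second key to a depth field would switch the preference.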

This code basename "$myfilename" | tr -cd [:alpha:] | wc -m gives me the alpha char count of the filename, but I can't figure out how to get that into the find output so I can sort on it. In other words, I need a find command that will printf %i:%d:%p\n PLUS print this alpha char count. Then I'll sort. I can't figure out how to integrate those find/printf commands. This was the essence of my first comment/question above. Thanks. – MountainX – 2012-03-29T18:07:38.527

I think you are really nearing the limit of what one can cram into a single pipe in that way. If you really want to keep the structure of the script, you probably need to throw in awk in the find pipe, where you can certainly do the tr -cd | wc -m dance and more. You are not really gaining much in performance from the pipe any longer since sort needs all input before acting, so the "stream" is held up. I don't really have time for more writing for the evening, though, but good look in your quest. – Daniel Andersson – 2012-03-29T18:17:29.887

I turned that last comment into a question here: http://superuser.com/questions/406275/how-to-find-files-print-some-standard-info-about-those-files-plus-print-the-al – MountainX – 2012-03-29T18:21:13.177

thanks. I guess awk is the way to go. I have not used awk much. I'm not against starting from scratch with a better solution either. – MountainX – 2012-03-29T18:30:35.253

("look"->"luck" in the last line of my previous comment.) I saw the awk thread where, as I predicted, the character count could be pushed into the command. A comment on the command/pipes/redirections: <() is a Bash construct (process substitution) which runs the command and substitutes a filename, typically a /dev/fd entry or a named pipe, from which its output can be read. If one instead wrote the output to a temporary file manually, one would gain more direct control over it. Compressing the commands as is currently done does not gain any real performance, is less portable and, as noticed, is difficult to modify. – Daniel Andersson – 2012-03-30T07:52:51.627

Added note: rm "$path"; ln -s "$path_to_keep" "$path" could be replaced by just ln -sf "$path_to_keep" "$path". – Daniel Andersson – 2012-03-30T07:54:22.903

And instead of shell substring removal such as in inode=${path_info%%:*}, you should probably start using cut now that there are more than two output fields (unless you only need the first and last arguments, but cut is a more "semantically correct" solution). – Daniel Andersson – 2012-03-30T08:00:22.567
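
A small sketch of the cut approach, assuming a three-field "inode:depth:path" record (the sample value below is invented for illustration):

```shell
#!/bin/bash
# Example record in the assumed "inode:depth:path" layout.
path_info='1234:2:/home/user/a:b.txt'

inode=$(printf '%s\n' "$path_info" | cut -d: -f1)
depth=$(printf '%s\n' "$path_info" | cut -d: -f2)
# -f3- keeps everything from field 3 onward, so colons inside the path survive.
path=$(printf '%s\n' "$path_info" | cut -d: -f3-)
```

Each cut call starts a separate process, so for large trees splitting with `IFS=: read -r inode depth path` in the loop itself may be the cheaper option.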

thank you for all the tips and suggestions. Those are very helpful. – MountainX – 2012-03-30T13:31:39.653