How to incrementally rename files after processing in bash?

0

I have a set of files that need processing, so I tend to do this programmatically in bash on macOS and Linux. Since I like to keep the originals in case something goes wrong, I want the processed files to come out under new, incrementally numbered names, but I don't know the proper bash construction to accomplish this.

Here's an example. I have a set of .pdf files:

bulletinlois00.pdf
bulletinlois01.pdf
bulletinlois02.pdf
...
bulletinlois33.pdf

The PDFs have not been OCRed yet, so I want to iterate through them with tesseract or ocrmypdf, but instead of writing the output as bulletinlois01.pdf, I want it to be 01.pdf. Here is another example using the same file set: I want to iterate through the files with pdftotext, but instead of bulletinlois01.pdf becoming bulletinlois01.txt, I want it to become 01.txt.

I could do a cp + mv process, or use grep to replace the unwanted parts of the names, but this seems like overkill, and I get confused about whether I should be using wait or an && construction.

Is there a simple way to script this using bash, and could you please explain what exactly the construction is doing so that I can learn how to adapt it to other, more complex processing I need to do? For instance, maybe I could use the construction to output the names using

`date "+%H.%M.%S"`

Here's the rudimentary script:

for f in *.pdf ; do
    tesseract -l fra "$f" "$f"_done.pdf
done

grad student

Posted 2019-09-08T21:26:43.973

Reputation: 203

You should post your actual Bash script instead of explaining what it does. Please edit your question to add it. – JakeGould – 2019-09-08T21:31:47.117

@JakeGould if it wasn't clear enough, I do not know how to do this – grad student – 2019-09-08T21:34:03.027

Are the new filenames (e.g. 01.pdf) named that way because the incoming filename has 01 in it, or because it's (e.g.) the first file being processed? If 01.pdf already exists, what should happen? It's confusing that your example code indicates a new filename of "_done" instead of a sequence number. – Jeff Schaller – 2019-09-10T00:34:08.290

Good point. Ideally, it would be 01.pdf because the incoming file has 01 in it, which would let me compare the output quality to the original. I added the _done so that the next command would be something like mv "$f"_done.pdf ... to something like 01.pdf, but I realized that sort of mv construction would simply overwrite each file. I suspect I need some sort of array expansion, but I'm not sure how to implement it. – grad student – 2019-09-11T04:02:10.923

Answers

1

You can have more control over the resulting filenames by stripping prefix and suffix from the matched filenames.

This is one possible way to achieve that:

for matched_filename in bulletinlois*.pdf ; do

    # strip "bulletinlois" prefix from the filename
    tmp=${matched_filename#bulletinlois}

    # then strip ".pdf" suffix
    number=${tmp%.pdf}

    tesseract -l fra "$matched_filename" "$number"_done.pdf
done

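The stripping in this example is done with bash shell parameter expansion: ${var#pattern} removes the shortest match of pattern from the beginning of the value, and ${var%pattern} removes it from the end.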
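
To see what each step yields, here is a minimal sketch that runs the two expansions on a single sample name from the question, without calling any OCR tool. The hard-coded filename is only an illustration.

```shell
#!/usr/bin/env bash
# Trace the two expansions on one sample name (no files are touched).
matched_filename="bulletinlois01.pdf"

# '#' strips the shortest match of the pattern from the START of the value.
tmp=${matched_filename#bulletinlois}    # tmp is now: 01.pdf

# '%' strips the shortest match of the pattern from the END of the value.
number=${tmp%.pdf}                      # number is now: 01

echo "$matched_filename -> $number.pdf"
```

Neither expansion modifies matched_filename itself, so the original name is still available to pass to the processing command.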

To find out more about shell parameter expansion visit this blog post or the official bash documentation.
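
The same idea might be adapted to the pdftotext case from the question roughly as follows. This is a sketch under assumptions, not a tested recipe: short_name is a hypothetical helper introduced here, and the "skip if the output already exists" rule is one possible answer to the overwrite concern raised in the comments.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the answer above).
# $1 = filename, $2 = prefix to drop, $3 = extension to drop (without the dot).
short_name() {
    local name=${1#"$2"}    # drop the leading prefix
    name=${name%".$3"}      # drop the trailing .extension
    printf '%s\n' "$name"
}

for f in bulletinlois*.pdf; do
    [ -e "$f" ] || continue    # glob matched nothing: skip the literal pattern

    out="$(short_name "$f" bulletinlois pdf).txt"    # e.g. 01.txt

    # Assumed policy: never overwrite an existing output file.
    if [ -e "$out" ]; then
        echo "skipping $f: $out already exists" >&2
        continue
    fi

    pdftotext "$f" "$out"
done
```

To name outputs by time instead, out could be set to "$(date "+%H.%M.%S").txt". Two files processed within the same second would then get the same name, so with the guard above the second one would be skipped rather than overwritten.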

curusarn

Posted 2019-09-08T21:26:43.973

Reputation: 294