Wednesday, May 25, 2022

[SOLVED] How can I run recursive find and replace operations on multiple files in parallel?

Issue

I am attempting to replace text data in a git repository using the git filter-branch functionality.

I wrote a simple script to search for various terms and replace them, but it ran extremely slowly. The script contained multiple lines of Bash code to customize the search and replacement operations, and I know it was not very efficient. So I tried running just the first line on its own, which should be reasonably efficient, yet it is still taking forever to walk through the code base.

Is it possible to use Bash or another simple approach to search through my files and execute find-and-replace operations in parallel to speed things up?

If not, are there any other suggestions for handling this better?

Here's the Git command I'm executing:

git filter-branch --tree-filter "sh /home/kurtis/.bin/redact.sh || true" \
    -- --all

Here's the code my command is essentially executing:

find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
    -type f -exec sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g' {} \;

Solution

git filter-branch cannot process commits in parallel, because it needs to know the hash (ID) of the parent commit in order to calculate the current commit's hash.

But you can speed up processing of each commit:

Your code executes a separate sed process for every single file, which is very slow. Use this instead:

find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
       -type f -print0 \
  | xargs -0 sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g'

This version does exactly the same thing as yours, but each sed process is invoked with as many files (arguments) as possible. find's "-print0" and xargs's "-0" mean "separate filenames with a zero byte", so there is no trouble when a filename contains spaces, newlines, binary garbage, etc.
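
If that is still too slow, the sed runs can themselves be parallelized within each commit's tree, since every file is rewritten independently. A minimal sketch, assuming GNU xargs (for the -P flag) and coreutils' nproc are available; the batch size of 100 files per process is an arbitrary choice:

# Run up to one sed process per CPU core, each handling 100 files at a time
find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
       -type f -print0 \
  | xargs -0 -P "$(nproc)" -n 100 \
      sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g'

Because xargs hands each file to exactly one sed process, the parallel workers never touch the same file.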

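Also, since the original redact.sh apparently ran several search-and-replace commands one after another, they can be folded into a single sed invocation with multiple -e expressions, so each file is read and rewritten only once per commit. A sketch; the second and third patterns are hypothetical placeholders for the other terms being redacted:

# One pass over each file, applying all substitutions in order
find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
       -type f -print0 \
  | xargs -0 sed -i \
      -e 's/01dPassw0rd\!/HIDDENPASSWORD/g' \
      -e 's/secret-term-2/REDACTED/g' \
      -e 's/secret-term-3/REDACTED/g'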

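Once filter-branch finishes, it may be worth confirming the string is really gone from the rewritten history, for example with git's pickaxe search (no output means no commit ever added or removed the string):

# List any commit in any ref that added or removed the password string
git log -S'01dPassw0rd!' --all --oneline

Note that filter-branch keeps backups of the original refs under refs/original/, so those will still match until they are deleted.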

Answered By - Josef Kufner
Answer Checked By - Mary Flores (WPSolving Volunteer)