Tuesday, July 26, 2022

[SOLVED] Shell or PHP: remove an HTML comment if its words match in the same paragraph

Issue

I need to check whether the words inside an HTML comment also appear elsewhere on the same line. If they do, the comment should be deleted. Otherwise, it should be kept.

At the same time, the script needs to ignore pronouns, adverbs, and articles. I already have a list of more than 100 such words, like this:

"the", "this", "I", "me", "you", "she", "her", "he", "him", "it", "they", "them", "that", "which", etc...

This is an example of one line:

text <!-- They are human # life --> text text <!-- the rights --> text the human text

After running the script:

text text text <!-- the rights --> text the human text

Summary:

  1. The same line can contain many comments, not only one.
  2. The script needs to ignore my list of pronouns, adverbs, etc.
  3. The script needs to ignore the words in other comments.
  4. Matching is not case-sensitive.
  5. The files have over one thousand lines.
  6. The comments usually contain the character # (I hope this is not a problem).
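
If the ignore list is kept in the quoted, comma-separated form shown earlier, it can be turned into one word per line, which is usually easier for shell tools to consume. This is only a small sketch: the function name to_word_lines is made up for illustration, and it assumes every word is wrapped in straight double quotes.

```shell
#!/bin/bash
# to_word_lines (hypothetical name): read a quoted, comma-separated
# list on stdin and print one bare word per line.
to_word_lines() {
  # Pull out each "quoted" item, then strip the quote characters.
  grep -oE '"[^"]+"' | tr -d '"'
}

# Example with the first few words from the list above.
printf '%s\n' '"the", "this", "I", "me", "you"' | to_word_lines
```

The output could be redirected into a file so the list does not have to be maintained inside the script itself.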

Solution

As others have mentioned, you should show some research, tell what you've tried and why it didn't work.

That being said, I found this to be a fun little challenge, so I decided to give it a go.

I assumed there are two files, "file.html" which we want to modify, and "words.txt" which lists the words to ignore separated by newlines (\n). This script should do the trick:

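#!/bin/bash

FILE="file.html"
WORDS="words.txt"

#Set array delimiter to '\n':
IFS=$'\n'

#Words to ignore, lowercased so the comparison is not case-sensitive:
words="$(tr '[:upper:]' '[:lower:]' < "$WORDS")"

#Find all distinct comments within the file:
comments="$(grep -oP '<!--[^<]+-->' "$FILE" | sort -u)"

for comment in $comments; do

  #Words In Comment. Gets all distinct words in the comment, lowercased.
  #Only alphanumeric runs count, so "<!--", "-->" and "#" are skipped.
  wic="$(echo "$comment" | grep -oE '[[:alnum:]]+' | tr '[:upper:]' '[:lower:]' | sort -u)"

  #Filtered Words. It's $wic without any of the words in words.txt
  #(left unquoted on purpose: echo joins the lists with spaces)
  fw="$(echo $wic $words $words | tr ' ' '\n' | sort | uniq -u)"

  #if any remain
  if [ -n "$fw" ]
  then

    for word in $fw; do
      #Gets all lines with both the comment and the word outside the comment.
      #\Q...\E makes grep treat the comment as literal text.
      lines="$(grep -iP "\Q$comment\E.+\b$word\b|\b$word\b.+\Q$comment\E" "$FILE")"

      #If it finds any
      if [ -n "$lines" ]
      then
        for line in $lines; do

          #Generate the replacement line (literal removal of the comment)
          replace=${line//"$comment"/}

          #Escape both strings so sed treats them as literal text
          line_re="$(printf '%s\n' "$line" | sed 's/[]\/$*.^[]/\\&/g')"
          replace_re="$(printf '%s\n' "$replace" | sed 's/[\/&]/\\&/g')"

          #Replace the line with the replacement in the file
          sed -i "s/$line_re/$replace_re/g" "$FILE"

        done
      fi
    done
  fi
done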
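
The step that computes fw leans on a small trick that may be worth seeing in isolation: the ignore list is added twice, so every ignored word occurs more than once and "uniq -u" (print only lines that occur exactly once) drops it. The sketch below wraps that idea in a function with the made-up name filter_words. It deduplicates the candidates first, since "uniq -u" would otherwise also drop a word that is merely repeated inside the comment.

```shell
#!/bin/bash
# filter_words CANDIDATES IGNORE  (hypothetical helper)
# Both arguments are newline-separated word lists.
# Prints the candidates that are not in the ignore list.
filter_words() {
  {
    # Each candidate exactly once.
    printf '%s\n' "$1" | sort -u
    # Ignore list twice, so its words can never be unique.
    printf '%s\n%s\n' "$2" "$2"
  } | sort | uniq -u | sed '/^$/d'   # drop empty lines from empty inputs
}

filter_words "$(printf 'they\nare\nhuman\nlife')" "$(printf 'the\nthey\nare')"
```

With the sample input, only "human" and "life" remain, because "they" and "are" also appear in the ignore list.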

It's not perfect but gets the job done. Tested it on a file with the following contents:

text <!-- foo # --> foo
text <!-- bar # --> foo
text <!-- bar # --> bar
text <!-- bar # --> text <!-- something # --> something bar
text <!-- foo # --> text <!-- bar # --> text foo bar

Using the following words.txt:

foo

And got the expected result:

text <!-- foo # --> foo
text <!-- bar # --> foo
text  bar
text  text  something bar
text <!-- foo # --> text  text foo bar
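
For comparison, here is a sketch of a different approach that handles one line at a time and only looks at the text outside all comments, so words inside other comments are ignored. It is not the method used in the script above. The function name clean_line is invented for this example, and it assumes the ignore list is a file with one word per line.

```shell
#!/bin/bash
# clean_line LINE WORDS_FILE  (hypothetical helper)
# Prints LINE without the comments that share at least one
# non-ignored word with the text outside of all comments.
clean_line() {
  local line=$1 words_file=$2
  local result=$line outside comment candidates comment_re

  # Words outside every comment, lowercased, one per line.
  outside=$(printf '%s\n' "$line" | sed -E 's/<!--[^<]+-->/ /g' |
    tr '[:upper:]' '[:lower:]' | grep -oE '[[:alnum:]]+')

  # Loop over the comments found in this line.
  while IFS= read -r comment; do
    [ -n "$comment" ] || continue

    # Lowercased words of the comment, minus the ignore list.
    candidates=$(printf '%s\n' "$comment" | tr '[:upper:]' '[:lower:]' |
      grep -oE '[[:alnum:]]+' | grep -vxiFf "$words_file")
    [ -n "$candidates" ] || continue

    # Does any remaining word occur outside the comments?
    if printf '%s\n' "$outside" | grep -qxF -e "$candidates"; then
      # Escape the comment so sed removes it as literal text.
      comment_re=$(printf '%s\n' "$comment" | sed 's/[]\/$*.^[]/\\&/g')
      result=$(printf '%s\n' "$result" | sed "s/$comment_re//g")
    fi
  done <<EOF
$(printf '%s\n' "$line" | grep -oE '<!--[^<]+-->')
EOF

  printf '%s\n' "$result"
}
```

A whole file could then be processed with a loop such as:

    while IFS= read -r l; do clean_line "$l" words.txt; done < file.html > cleaned.html

This starts several processes per line, so it trades speed for simplicity. For files of a few thousand lines that is likely acceptable, but it has not been benchmarked here.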


Answered By - 3snoW
Answer Checked By - Mildred Charles (WPSolving Admin)