Issue
How do I remove all script tags in an HTML file using sed?
I tried the command below, but it does not remove any script tag from test1.html:
$ sed -e 's/<script[.]+<\/script>//g' test1.html > test1_output.html
My goal is to get from test1.html to test1_output.html.
test1.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<h1>My Website</h1>
<div class="row">
some text
</div>
<script type="text/javascript"> utmx( 'url', 'A/B' );</script>
<script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>
<script type="text/javascript">
window.exp_version = 'control';
</script>
</body>
</html>
test1_output.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<h1>My Website</h1>
<div class="row">
some text
</div>
</body>
</html>
Solution
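If I understood your question correctly and you want to delete everything inside <script></script>, I think you have to split the sed script into parts (you can still write it as a one-liner by separating the parts with ;). The attempt in the question matches nothing because [.] matches only a literal dot, and an unescaped + is an ordinary character in sed's default (basic) regex syntax.
Using: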
sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'
The first piece (s/<script>.*<\/script>//g) handles scripts that open and close on the same line;
The second piece (/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}}) is almost a quote of @akingokay's answer, except that I excluded the opening and closing lines from deletion (just in case they have something before or after the tag). There is a great explanation of that technique in "Using sed to delete all lines between two matching patterns";
The last two (s/<script>.*//g and s/.*<\/script>//g) finally take care of the lines where a script opens but does not close, or closes without having opened.
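Note that the command above only matches a bare <script> tag, while the tags in the question carry attributes such as type and src. Here is a minimal sketch of a variant that also accepts attributes, assuming no attribute value contains a > character. The function name strip_scripts is made up for illustration, and the ; before each } is only there because some non-GNU seds require it:

```shell
# strip_scripts is a hypothetical helper: HTML on stdin, filtered HTML on stdout.
strip_scripts() {
  # 1) script opened and closed on the same line
  # 2) inside a multi-line script: delete every line except the two boundary lines
  # 3) boundary line that opens a script: drop from the tag to the end of the line
  # 4) boundary line that closes a script: drop from the start of the line through the tag
  sed -e 's/<script[^>]*>.*<\/script>//g' \
      -e '/<script[^>]*>/,/<\/script>/{/<script[^>]*>/!{/<\/script>/!d;};}' \
      -e 's/<script[^>]*>.*//g' \
      -e 's/.*<\/script>//g'
}

# Part of test1.html from the question, inlined so the sketch runs on its own.
strip_scripts <<'EOF'
<div class="row">
some text
</div>
<script type="text/javascript"> utmx( 'url', 'A/B' );</script>
<script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>
<script type="text/javascript">
window.exp_version = 'control';
</script>
</body>
EOF
```

This should print the three div lines, four empty lines where the scripts used to be, and </body>. As in the original command, .* is greedy, so if two scripts sit on the same line, the text between them is removed as well.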
Now if you have an index.html that has:
<html>
<body>
foo
<script> console.log("bar") </script>
<div id="something"></div>
<script>
// Multiple Lines script
// Blah blah
</script>
foo <script> //Some
console.log("script")</script> bar
</body>
</html>
and you run this sed command, you will get:
cat index.html | sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'
<html>
<body>
foo
<div id="something"></div>
foo
bar
</body>
</html>
Finally, you will be left with a lot of blank lines (omitted from the output above) and stray spaces where the scripts used to be, but the code should work as expected. Of course, you could easily remove them with sed as well.
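As a rough sketch of that cleanup step, the filter below deletes every line that is empty or contains only whitespace. The name squeeze_blank is made up for illustration, and the filter also drops blank lines that were already in the original file, which may or may not be what you want:

```shell
# squeeze_blank is a hypothetical helper: drops empty and whitespace-only lines.
squeeze_blank() {
  sed '/^[[:space:]]*$/d'
}

# Two blank-ish lines between foo and bar are removed.
printf 'foo\n\n  \nbar\n' | squeeze_blank
# prints:
# foo
# bar
```

It can be chained after the script-removing command with a pipe, for example: sed '...' index.html | sed '/^[[:space:]]*$/d'.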
Hope it helps.
PS: I think that @l0b0 is right, and sed is not the correct tool for this.
Answered By - Jorge Valentini
Answer Checked By - Cary Denson (WPSolving Admin)