Issue
I would like to recursively scan a given directory for all .zip files, extract the text from each one using Apache Tika (in my case via the /opt/solr/bin/post script) into a single text file, and put that text file in the same directory as the original zip file.
To find all zip files recursively and extract all the content I use:
find . -name "*.zip" -exec sh -c '/opt/solr/bin/post "$1" \
-params="...params..." > "$1.txt"' _ {} \;
The content of the extracted file is:
java -classpath /opt/solr/dist/solr-core-8.7.0.jar -Dauto=yes -Dout=yes
-Dparams=literal.search_area=test&extractOnly=true
&extractFormat=text&defaultField=text -Dc=mycoll
-Ddata=files org.apache.solr.util.SimplePostTool zip.zip
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/mycoll/update?
literal.search_area=test&extractOnly=true&extractFormat=text
&defaultField=text...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,
odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file zip.zip (application/octet-stream) to [base]/extract
{
"responseHeader":{
"status":0,
"QTime":1614},
"":"**EXTRACTED TEXT**",
"null_metadata":[
"stream_size",["79855"],
"X-Parsed-By",["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pkg.PackageParser"],
"stream_content_type",["application/octet-stream"],
"resourceName",["/mnt/remote/users/zhilov/!tmp/zip.zip"],
"Content-Type",["application/zip"]]}
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/mycoll/update?
literal.search_area=test&extractOnly=true&
extractFormat=text&defaultField=text...
Time spent: 0:00:03.495
From that output I would like to cut out the beginning and the end, leaving only EXTRACTED TEXT in the generated file for further indexing.
Is it possible to do all of this in a single bash command line, or at least with a bash script?
Solution
Try this:
sed -n '/QTime/{N;s/.*\n.*:.//;s/.,$//p;}'
This question addresses the UTF-8 problem.
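To show how the sed filter could be combined with the find loop from the question, here is a minimal sketch, not a tested production script. The function names extract_text and extract_all and the POST_PARAMS variable are made up for illustration; POST_PARAMS stands in for the elided -params value. It assumes bash, a find that supports -print0, and that the post script writes its report to standard output, as the redirect in the question suggests.

```shell
#!/usr/bin/env bash

# Filter: find the "QTime" line, pull in the next line (the one holding the
# extracted text), then strip everything up to and including the opening
# quote, and finally the closing quote plus comma.
extract_text() {
  sed -n '/QTime/{N;s/.*\n.*:.//;s/.,$//p;}'
}

# Hypothetical driver: post every .zip under a directory and write the
# filtered text next to it as <name>.zip.txt.
# POST_PARAMS is an assumed variable; set it to your real -params value.
extract_all() {
  local dir=${1:-.}
  find "$dir" -type f -name '*.zip' -print0 |
    while IFS= read -r -d '' zip; do
      # </dev/null keeps the post script from reading the file list on stdin.
      /opt/solr/bin/post "$zip" -params="$POST_PARAMS" </dev/null |
        extract_text > "$zip.txt"
    done
}
```

After sourcing the file and setting POST_PARAMS, calling extract_all /some/dir would process the whole tree. One caveat: the .*: in the first substitution is greedy, so if the extracted text itself contains a colon, everything up to the last colon on that line is removed. Replacing .*: with [^:]*: in the part after \n should stop at the first colon instead.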
Answered By - Beta