{"id":32,"date":"2022-02-10T12:32:26","date_gmt":"2022-02-10T12:32:26","guid":{"rendered":"https:\/\/www.wientzek.ch\/?p=32"},"modified":"2022-02-10T12:39:53","modified_gmt":"2022-02-10T12:39:53","slug":"the-first-input-channel-with-a-shell-script","status":"publish","type":"post","link":"https:\/\/www.wientzek.ch\/?p=32","title":{"rendered":"The first input channel&#8230; with a Shell Script"},"content":{"rendered":"\n<p>A PDF arrives. A cron job that checks for new files once a minute finds it and starts a small processing chain.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>### Process the file ###\n\n          # OCR the document\n          ocrmypdf -l deu \"$entry\" \"$OUTPUT_DIR${entry##*\/}\"\n          # extract all text of the pdf to a text file\n          pdf2txt -o \"$OUTPUT_DIR${entry##*\/}.txt\" \"$OUTPUT_DIR${entry##*\/}\"\n          # save thumbnails of the pages of the pdf\n          convert \"$entry\" -quality 30 \"$OUTPUT_DIR${entry##*\/}.jpg\"<\/code><\/pre>\n\n\n\n<p>I use OCRmyPDF to OCR the incoming file; it writes a new output file with a text layer under the scanned image.<\/p>\n\n\n\n<p>pdf2txt then exports the text. All blanks and empty lines are removed so that each file yields a single string of text.<\/p>\n\n\n\n<p>Using ImageMagick, a thumbnail of each PDF page is generated at low but still readable quality (JPEG quality 30).<\/p>\n\n\n\n<p>The result is stored in an archive directory structure, together with a JSON file containing the most important information. This little JSON file is created directly in the shell script. 
Low tech indeed.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em># Assemble the thumbnail sub-JSON<br><\/em>SEARCH=\"$OUTPUT_DIR${entry##*\/}\"<br>for JPGFILE in \"$SEARCH\"*.jpg; do<br>  THUMBNAILS=\"$THUMBNAILS{\\\"imgname\\\" : \\\"${JPGFILE##*\/}\\\",\\\"imdirectory\\\" : \\\"\/$OUTPUT_DIR\\\"},\"<br>done<br><br><em>## strip the last \",\"<br><\/em>THUMBNAILS=${THUMBNAILS%,}<br><br>JSON=\"{<br>          \\\"document\\\" : {<br>            \\\"name\\\" : \\\"$NAME\\\",<br>            \\\"directoy\\\" : \\\"$OUTPUT_DIR\\\",<br>            \\\"text\\\" : \\\"$PDFTXT\\\",<br>            \\\"timestamp\\\" : \\\"$YEAR-$MONTH-$DAY-$HOUR-$MINUTE\\\",<br>            \\\"origin\\\" : \\\"SCAN\\\",<br>            \\\"thumbnails\\\" : [<br>              $THUMBNAILS<br>            ],<br>            \\\"tags\\\" : [<br>              {<br>                \\\"tagname\\\" : \\\"SCANNED\\\"<br>              }<br>            ]<br>          }<br>        }\"<br>echo \"$JSON\" &gt; \"$OUTPUT_DIR${entry##*\/}.json\"<\/pre>\n\n\n\n<p>A .1 file is written to the import directory to mark a PDF as processed. I plan to use the JSON file later as a document in Elasticsearch.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A PDF arrives. A cron job that checks for new files once a minute finds it and starts a small processing chain. I use OCRmyPDF to OCR the incoming file; it writes a new output file with a text layer under the scanned image. 
With pdf2txt the text is exported and all blanks &hellip;<\/p>\n<p class=\"read-more\"> <a class=\"\" href=\"https:\/\/www.wientzek.ch\/?p=32\"> <span class=\"screen-reader-text\">The first input channel&#8230; with a Shell Script<\/span> Read More &raquo;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"default","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","footnotes":""},"categories":[4],"tags":[],"class_list":["post-32","post","type-post","status-publish","format-standard","hentry","category-dochauser"],"_links":{"self":[{"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=\/wp\/v2\/posts\/32","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=32"}],"version-history":[{"count":2,"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=\/wp\/v2\/posts\/32\/revisions"}],"predecessor-version":[{"id":34,"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=\/wp\/v2\/posts\/32\/revisions\/34"}],"wp:attachment":[{"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=32"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wient
zek.ch\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=32"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wientzek.ch\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=32"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}