The first input channel… with a Shell Script

A PDF arrives in the import directory, a cron job polls once a minute for new files, finds the PDF, and kicks off a little processing chain.
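
For illustration, the polling can be wired up with a single crontab entry. The script path here is a placeholder of mine, not from the original setup:

    # crontab entry: run the import watcher once a minute
    # (the script path below is illustrative)
    * * * * * /opt/docproc/process_incoming.sh >/dev/null 2>&1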

### Process the file ###

          # OCR the document
          ocrmypdf -l deu "$entry" "$OUTPUT_DIR${entry##*/}"
          # extract all text of the pdf to a text file
          pdf2txt -o "$OUTPUT_DIR${entry##*/}.txt" "$OUTPUT_DIR${entry##*/}"
          # save thumbnails of the pages of the pdf
          convert "$entry" -quality 30 "$OUTPUT_DIR${entry##*/}.jpg"

I am using OCRmyPDF to OCR the incoming file; it writes a new output PDF with a searchable text layer underneath the scanned image. The -l deu flag selects German as the OCR language.

With pdf2txt the text is extracted, and all blanks and empty lines are collapsed, so that each file ends up as one continuous string of text.
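
The flattening step itself is not shown in the snippet above, so here is a minimal sketch of how it could look, assuming the text file written by pdf2txt and the $PDFTXT variable that shows up in the JSON below:

    # squeeze all whitespace (blanks, tabs, newlines) into single
    # spaces, leaving one continuous string of text
    PDFTXT=$(tr -s '[:space:]' ' ' < "$OUTPUT_DIR${entry##*/}.txt")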

Using ImageMagick, a thumbnail of each PDF page is prepared, at a low but still readable quality.
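
One detail worth knowing: for a multi-page PDF, ImageMagick appends a page index to the output file name, which is exactly what the thumbnail glob below relies on:

    # a three-page scan.pdf yields numbered thumbnails:
    #   scan.pdf-0.jpg  scan.pdf-1.jpg  scan.pdf-2.jpg
    # a single-page PDF yields just scan.pdf.jpg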

These results are stored in an archive directory structure, together with a JSON file containing the most important information. This little JSON file is assembled directly in the shell script. Low tech indeed.

    # Assemble the thumbnail sub-JSON
    SEARCH="$OUTPUT_DIR${entry##*/}"
    for JPGFILE in "$SEARCH"*.jpg; do
        THUMBNAILS="$THUMBNAILS{\"imgname\" : \"${JPGFILE##*/}\",\"imgdirectory\" : \"/$OUTPUT_DIR\"},"
    done

    # strip the trailing ","
    THUMBNAILS=${THUMBNAILS%,}

JSON="{
\"document\" : {
\"name\" : \"$NAME\",
\"directoy\" : \"$OUTPUT_DIR\",
\"text\" : \"$PDFTXT\",
\"timestamp\" : \"$YEAR-$MONTH-$DAY-$HOUR-$MINUTE\",
\"origin\" : \"SCAN\",
\"thumbnails\" : [
$THUMBNAILS
],
\"tags\" : [
{
\"tagname\" : \"SCANNED\"
}
]
}
}"
echo "$JSON" > "$OUTPUT_DIR${entry##*/}.json"

A .1 file is written in the import directory to mark a PDF as processed. I am planning to use the JSON file later as a document in Elasticsearch.
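
As a sketch of that guard, assuming the loop iterates over $entry as in the snippets above ($IMPORT_DIR is my own placeholder):

    # skip PDFs that already carry a ".1" marker, process the rest,
    # then drop the marker so the next cron run ignores them
    for entry in "$IMPORT_DIR"/*.pdf; do
        [ -e "$entry.1" ] && continue
        # ... OCR, text extraction, thumbnails, JSON as shown above ...
        touch "$entry.1"
    done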
