java - Converting a PDF to text using Tesseract OCR


Aim: convert a PDF to Base64, where the PDF can be a general (text) PDF or a scanned one.

I am using Tesseract OCR to convert scanned PDFs into text files. Since I am working in Java, I am using the Tess4J library for this.

The flow of the program that I have thought of is as follows:

Get the PDF file ---> convert each page to an image using Ghost4J ---> pass each image to Tess4J for OCR ---> convert the whole text to Base64.
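The last step of this flow, joining the per-page OCR text and encoding it as Base64, can be sketched independently of Ghost4J and Tess4J. This is a minimal sketch, assuming Java 8 or later (for java.util.Base64) and that the OCR text of each page is already available as a string. The class and method names (OcrTextEncoder, encodePages, decode) are hypothetical and not part of either library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper, not part of Tess4J or Ghost4J.
class OcrTextEncoder {

    // Joins the text of all pages with a newline and returns it as a Base64 string.
    static String encodePages(List<String> pageTexts) {
        String wholeText = String.join("\n", pageTexts);
        byte[] utf8 = wholeText.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Reverses encodePages; useful for checking the round trip.
    static String decode(String base64) {
        byte[] utf8 = Base64.getDecoder().decode(base64);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder page texts standing in for real OCR output.
        List<String> pages = Arrays.asList("page one", "page two");
        String encoded = encodePages(pages);
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

Encoding the text as UTF-8 before the Base64 step is a design choice made here so that non-ASCII characters recognised by the OCR survive the round trip.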

I have been able to convert the PDF file to images using the following code:

package helpers;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import java.awt.Image;
import java.awt.image.RenderedImage;
import java.util.List;
import javax.imageio.ImageIO;

import org.ghost4j.document.DocumentException;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.analyzer.FontAnalyzer;
import org.ghost4j.renderer.RendererException;
import org.ghost4j.renderer.SimpleRenderer;
import net.sourceforge.tess4j.*;

class Encoder {
    public static byte[] createByteArray(File pCurrentFolder, String pNameOfBinaryFile) {
        String pathToBinaryData = pCurrentFolder.getAbsolutePath()+"/"+pNameOfBinaryFile;

        File file = new File(pathToBinaryData);
        if (!file.exists()) {
            System.out.println(pNameOfBinaryFile+" not found in folder "+pCurrentFolder.getName());
            return null;
        }

        FileInputStream fin = null;
        try {
            fin = new FileInputStream(file);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        byte fileContent[] = new byte[(int) file.length()];
        try {
            if (fin != null)
                fin.read(fileContent);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return fileContent;
    }

    public void covertToImage(File pdfDoc) {
        PDFDocument document = new PDFDocument();
        try {
            document.load(pdfDoc);
        } catch (IOException e) {
            e.printStackTrace();
        }
        SimpleRenderer renderer = new SimpleRenderer();
        renderer.setResolution(300);
        List<Image> images = null;
        try {
            images = renderer.render(document);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (RendererException e) {
            e.printStackTrace();
        } catch (DocumentException e) {
            e.printStackTrace();
        }

        try {
            if (images != null) {
                // testing with one page
                ImageIO.write((RenderedImage) images.get(10), "png", new File("/home/cloudera/Downloads/1.png"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public class EncodeFile {
    // doOCR declares the checked TesseractException, so main has to declare it too
    public static void main(String[] args) throws TesseractException {
        /* This part is for pure PDF files, i.e. not scanned */
        //byte[] arr = Encoder.createByteArray(new File("/home/cloudera/Downloads/"), "test.pdf");
        //String result = javax.xml.bind.DatatypeConverter.printBase64Binary(arr);
        //System.out.println(result);

        /* This part creates an image of a page of the scanned PDF file */
        new Encoder().covertToImage(new File("/home/cloudera/Downloads/isl99201.pdf")); // results in 1.png

        /* This part does the OCR */
        Tesseract instance = new Tesseract();
        String res = instance.doOCR(new File("/home/cloudera/Downloads/1.png"));
        System.out.println(res);
    }
}
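The code above writes only a single page (index 10) to disk for testing. For the full flow, every rendered page has to be saved or passed on. The following is a minimal sketch of writing all pages, assuming the renderer returns a list of java.awt.Image objects that are also RenderedImage instances, as the cast in the code above implies. The class name PageImageWriter and the page-N.png naming scheme are hypothetical.

```java
import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper; the file naming scheme is an assumption.
class PageImageWriter {

    // Writes each image as page-<n>.png (1-based) into outputDir and returns the written files.
    static List<File> writePages(List<Image> images, File outputDir) throws IOException {
        List<File> written = new ArrayList<>();
        for (int i = 0; i < images.size(); i++) {
            Image image = images.get(i);
            if (!(image instanceof RenderedImage)) {
                throw new IllegalArgumentException("Page " + (i + 1) + " is not a RenderedImage");
            }
            File target = new File(outputDir, "page-" + (i + 1) + ".png");
            // ImageIO.write returns false when no suitable writer is registered
            if (!ImageIO.write((RenderedImage) image, "png", target)) {
                throw new IOException("No PNG writer available for page " + (i + 1));
            }
            written.add(target);
        }
        return written;
    }
}
```

Each returned file could then be handed to the OCR step in turn, instead of the single hard-coded 1.png.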

Running this produces the following errors:

This occurs when I try to create an image from the PDF. I have seen that if I remove tess4j from build.sbt, the image is created without errors, but I have to use it.

Connected to the target VM, address: '127.0.0.1:46698', transport: 'socket'
Exception in thread "main" java.lang.AbstractMethodError: com.sun.jna.Structure.getFieldOrder()Ljava/util/List;
    at com.sun.jna.Structure.fieldOrder(Structure.java:884)
    at com.sun.jna.Structure.getFields(Structure.java:910)
    at com.sun.jna.Structure.deriveLayout(Structure.java:1058)
    at com.sun.jna.Structure.calculateSize(Structure.java:982)
    at com.sun.jna.Structure.calculateSize(Structure.java:949)
    at com.sun.jna.Structure.allocateMemory(Structure.java:375)
    at com.sun.jna.Structure.<init>(Structure.java:184)
    at com.sun.jna.Structure.<init>(Structure.java:172)
    at com.sun.jna.Structure.<init>(Structure.java:159)
    at com.sun.jna.Structure.<init>(Structure.java:151)
    at org.ghost4j.GhostscriptLibrary$display_callback_s.<init>(GhostscriptLibrary.java:63)
    at org.ghost4j.Ghostscript.buildNativeDisplayCallback(Ghostscript.java:381)
    at org.ghost4j.Ghostscript.initialize(Ghostscript.java:336)
    at org.ghost4j.renderer.SimpleRenderer.run(SimpleRenderer.java:105)
    at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:86)
    at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:70)
    at helpers.Encoder.covertToImage(EncodeFile.java:62)
    at helpers.EncodeFile.main(EncodeFile.java:86)
Disconnected from the target VM, address: '127.0.0.1:46698', transport: 'socket'

Process finished with exit code 1

This error occurs while passing the image to Tess4J:

Connected to the target VM, address: '127.0.0.1:46133', transport: 'socket'
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path (....)
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:271)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
    at com.sun.jna.Library$Handler.<init>(Library.java:147)
    at com.sun.jna.Native.loadLibrary(Native.java:412)
    at com.sun.jna.Native.loadLibrary(Native.java:391)
    at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:78)
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:40)
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:360)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:273)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:205)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:189)
    at helpers.EncodeFile.main(EncodeFile.java:89)
Disconnected from the target VM, address: '127.0.0.1:46133', transport: 'socket'

Process finished with exit code 1
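The UnsatisfiedLinkError above says that JNA could not find libtesseract.so on its search path. One quick way to check where (or whether) the native library is installed is to scan a few directories for it. This is only a minimal sketch: the class name NativeLibraryFinder and the candidate directories are assumptions and are not confirmed by the setup described here. jna.library.path is the system property JNA consults for additional search directories.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper, not part of JNA or Tess4J.
class NativeLibraryFinder {

    // Returns every existing file named fileName located directly inside one of the given directories.
    static List<File> find(String fileName, List<String> directories) {
        List<File> hits = new ArrayList<>();
        for (String dir : directories) {
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed typical locations on a 64-bit Linux system; adjust as needed.
        List<String> dirs = new ArrayList<>(Arrays.asList("/usr/lib64", "/usr/lib", "/usr/local/lib"));
        String jnaPath = System.getProperty("jna.library.path");
        if (jnaPath != null) {
            dirs.addAll(Arrays.asList(jnaPath.split(File.pathSeparator)));
        }
        List<File> hits = find("libtesseract.so", dirs);
        if (hits.isEmpty()) {
            System.out.println("libtesseract.so not found in " + dirs);
        } else {
            for (File hit : hits) {
                System.out.println("Found " + hit.getAbsolutePath());
            }
        }
    }
}
```

If the library turns up in a directory that JNA does not search, starting the JVM with -Djna.library.path set to that directory is one way to point JNA at it.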

I am working in IntelliJ using SBT on 64-bit CentOS 6.6. From searching the internet I have been able to understand the issues above, but I am facing two constraints:

  • The JNA library being used by default is the latest version, i.e. 4.1.0. I read on the internet that incompatibilities between JNA and other libraries can occur, as in my case. So I tried to specify the older version 3.4.0, but build.sbt keeps rejecting that.

  • I am on a 64-bit system and Tesseract works on a 32-bit system. How should I integrate it into the project?
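Regarding the first constraint, one way to see which JNA version actually ends up on the classpath is to print the location a class was loaded from. The following is a minimal sketch using only the JDK. The class name ClassOrigin is hypothetical, and com.sun.jna.Structure is looked up by name so that the snippet also compiles and runs when JNA is absent.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of JNA.
class ClassOrigin {

    // Returns the location (usually a jar or directory) a class was loaded from,
    // or a placeholder when the JVM does not expose one (e.g. for JDK core classes).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        try {
            Class<?> structure = Class.forName("com.sun.jna.Structure");
            System.out.println("JNA loaded from: " + locationOf(structure));
        } catch (ClassNotFoundException e) {
            System.out.println("JNA is not on the classpath");
        }
    }
}
```

With dependencies managed by SBT, the jar file name normally includes the version number, so the printed path should show whether 3.4.0 or 4.1.0 was picked up.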

The following is the part of build.sbt that handles the required libraries:

"org.ghost4j" % "ghost4j" % "0.5.1",
"org.bouncycastle" % "bctsp-jdk14" % "1.46",
"net.sourceforge.tess4j" % "tess4j" % "2.0.0",
"com.github.jai-imageio" % "jai-imageio-core" % "1.3.0",
"net.java.dev.jna" % "jna" % "3.4.0", // this does not make a difference; 4.1.0 is still installed

Please help me out with this problem.

Update: I added "net.java.dev.jna" % "jna" % "3.4.0" force() to build.sbt and that solved my first problem.

The solution to the issue lies in the tesseract-api project found on GitHub. I forked it to my GitHub account, added a test with a scanned image, and did some code refactoring. This way the library started to function properly. The scanned doc used for testing is here.

I built it on Travis and it is working fine on both 32-bit and 64-bit systems.

