java - Converting a PDF to text using Tesseract OCR
Aim: convert a PDF to Base64, where the PDF can be a general PDF or a scanned one.
I am using Tesseract OCR for converting scanned PDFs to text files. Since I am working in Java, I am using the Tess4J library for this.
The flow of the program that I have thought of is as follows:
Get PDF file ---> convert each page to an image using Ghost4J ---> pass each image to Tess4J for OCR ---> convert the whole text to Base64.
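The last step of that flow, turning the recognized text into Base64, can be sketched on its own. This is only a minimal sketch, assuming Java 8 or later (for `java.util.Base64`) and UTF-8 text; the class name `TextBase64` and its methods are hypothetical and not part of the program in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for the final "text -> Base64" step of the flow.
final class TextBase64 {

    // Encodes text as Base64 over its UTF-8 bytes (UTF-8 is an assumption).
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode(): Base64 -> bytes -> UTF-8 text.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In the real flow this string would be the concatenated OCR output.
        String ocrText = "Hello, OCR";
        String encoded = encode(ocrText);
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

For a PDF that is not scanned, the same encoder could be applied directly to the file's bytes instead of to OCR text.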
I have been able to convert the PDF file to images using the following code:
    package helpers;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.awt.Image;
    import java.awt.image.RenderedImage;
    import java.util.List;
    import javax.imageio.ImageIO;
    import org.ghost4j.document.DocumentException;
    import org.ghost4j.document.PDFDocument;
    import org.ghost4j.analyzer.FontAnalyzer;
    import org.ghost4j.renderer.RendererException;
    import org.ghost4j.renderer.SimpleRenderer;
    import net.sourceforge.tess4j.*;

    class Encoder {

        // Reads the whole file into a byte array; returns null if the file does not exist.
        public static byte[] createByteArray(File pCurrentFolder, String pNameOfBinaryFile) {
            String pathToBinaryData = pCurrentFolder.getAbsolutePath() + "/" + pNameOfBinaryFile;
            File file = new File(pathToBinaryData);
            if (!file.exists()) {
                System.out.println(pNameOfBinaryFile + " not found in folder " + pCurrentFolder.getName());
                return null;
            }
            FileInputStream fin = null;
            try {
                fin = new FileInputStream(file);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
            byte fileContent[] = new byte[(int) file.length()];
            try {
                if (fin != null) fin.read(fileContent);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return fileContent;
        }

        // Renders the PDF pages to images with Ghost4J and writes one page as a PNG.
        public void covertToImage(File pdfDoc) {
            PDFDocument document = new PDFDocument();
            try {
                document.load(pdfDoc);
            } catch (IOException e) {
                e.printStackTrace();
            }
            SimpleRenderer renderer = new SimpleRenderer();
            renderer.setResolution(300);
            List<Image> images = null;
            try {
                images = renderer.render(document);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (RendererException e) {
                e.printStackTrace();
            } catch (DocumentException e) {
                e.printStackTrace();
            }
            try {
                if (images != null) {
                    // for testing, only 1 page
                    ImageIO.write((RenderedImage) images.get(10), "png", new File("/home/cloudera/downloads/1.png"));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public class EncodeFile {
        // doOCR declares the checked TesseractException, so main has to declare it too.
        public static void main(String[] args) throws TesseractException {
            /* This part is for pure PDF files, i.e. not scanned */
            //byte[] arr = Encoder.createByteArray(new File("/home/cloudera/downloads/"), "test.pdf");
            //String result = javax.xml.bind.DatatypeConverter.printBase64Binary(arr);
            //System.out.println(result);

            /* This part creates the image for a page of the scanned PDF file */
            new Encoder().covertToImage(new File("/home/cloudera/downloads/isl99201.pdf")); // results in 1.png

            /* This part is for OCR */
            Tesseract instance = new Tesseract();
            String res = instance.doOCR(new File("/home/cloudera/downloads/1.png"));
            System.out.println(res);
        }
    }
Running this produces the following errors.
The first one occurs when I try to create the image from the PDF. I have seen that if I remove tess4j from build.sbt, the image is created without errors, but I have to use it.
    Connected to the target VM, address: '127.0.0.1:46698', transport: 'socket'
    Exception in thread "main" java.lang.AbstractMethodError: com.sun.jna.Structure.getFieldOrder()Ljava/util/List;
        at com.sun.jna.Structure.fieldOrder(Structure.java:884)
        at com.sun.jna.Structure.getFields(Structure.java:910)
        at com.sun.jna.Structure.deriveLayout(Structure.java:1058)
        at com.sun.jna.Structure.calculateSize(Structure.java:982)
        at com.sun.jna.Structure.calculateSize(Structure.java:949)
        at com.sun.jna.Structure.allocateMemory(Structure.java:375)
        at com.sun.jna.Structure.<init>(Structure.java:184)
        at com.sun.jna.Structure.<init>(Structure.java:172)
        at com.sun.jna.Structure.<init>(Structure.java:159)
        at com.sun.jna.Structure.<init>(Structure.java:151)
        at org.ghost4j.GhostscriptLibrary$display_callback_s.<init>(GhostscriptLibrary.java:63)
        at org.ghost4j.Ghostscript.buildNativeDisplayCallback(Ghostscript.java:381)
        at org.ghost4j.Ghostscript.initialize(Ghostscript.java:336)
        at org.ghost4j.renderer.SimpleRenderer.run(SimpleRenderer.java:105)
        at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:86)
        at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:70)
        at helpers.Encoder.covertToImage(EncodeFile.java:62)
        at helpers.EncodeFile.main(EncodeFile.java:86)
    Disconnected from the target VM, address: '127.0.0.1:46698', transport: 'socket'
    Process finished with exit code 1
This error occurs while passing the image to Tess4J:
    Connected to the target VM, address: '127.0.0.1:46133', transport: 'socket'
    Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path (....)
        at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:271)
        at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
        at com.sun.jna.Library$Handler.<init>(Library.java:147)
        at com.sun.jna.Native.loadLibrary(Native.java:412)
        at com.sun.jna.Native.loadLibrary(Native.java:391)
        at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:78)
        at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:40)
        at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:360)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:273)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:205)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:189)
        at helpers.EncodeFile.main(EncodeFile.java:89)
    Disconnected from the target VM, address: '127.0.0.1:46133', transport: 'socket'
    Process finished with exit code 1
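The message above says that JNA could not find `libtesseract.so`. Before changing the build, it may help to check whether that file exists in any directory JNA could be pointed at. The following is a rough diagnostic sketch, not a fix: the class `NativeLibraryLocator` is hypothetical, the two hard-coded directories are assumptions about a typical Linux layout, and it only looks for the exact file name returned by `System.mapLibraryName`. `jna.library.path` is the system property JNA consults for extra search directories.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical diagnostic helper: looks for a native library file in candidate directories.
final class NativeLibraryLocator {

    // Returns the first existing file named like the platform's library file
    // (for example "libtesseract.so" on Linux) inside the given directories.
    static Optional<Path> find(String libName, List<Path> dirs) {
        String fileName = System.mapLibraryName(libName);
        for (Path dir : dirs) {
            Path candidate = dir.resolve(fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    // Splits a path-list system property (such as jna.library.path) into directories.
    static List<Path> pathsFromProperty(String property) {
        List<Path> dirs = new ArrayList<>();
        String value = System.getProperty(property);
        if (value != null) {
            for (String entry : value.split(File.pathSeparator)) {
                if (!entry.isEmpty()) {
                    dirs.add(Paths.get(entry));
                }
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        List<Path> dirs = pathsFromProperty("jna.library.path");
        dirs.addAll(pathsFromProperty("java.library.path"));
        dirs.add(Paths.get("/usr/lib64"));     // assumed location, adjust for your system
        dirs.add(Paths.get("/usr/local/lib")); // assumed location, adjust for your system
        System.out.println(find("tesseract", dirs)
                .map(Path::toString)
                .orElse("tesseract library not found in " + dirs));
    }
}
```

If the library turns out to live in a directory that is not searched, starting the JVM with `-Djna.library.path=` set to that directory is one option to try.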
I am working in IntelliJ using SBT on 64-bit CentOS 6.6. From searching the internet I have been able to understand the issues above, but I am facing two constraints:
1. The JNA library being used by default is the latest version, i.e. 4.1.0. I read on the internet that incompatibilities between JNA and other libraries can occur, so I tried to specify the older version 3.4.0, but build.sbt keeps rejecting that.
2. I am on a 64-bit system, and Tesseract works with 32-bit systems. How should I integrate it into the project?
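To see which JNA jar the JVM actually loads (and so whether the 3.4.0 setting took effect), one option is to print where the relevant classes come from. This is a minimal sketch using only standard reflection; the class `ClassOrigin` is hypothetical, and the three class names in `main` are taken from the stack traces and code above.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports the jar or directory a class was loaded from.
final class ClassOrigin {

    static String describe(String className) {
        try {
            // initialize=false: load the class without running its static initializers.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return className + " <- (bootstrap or unknown location)";
            }
            return className + " <- " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " <- not on the classpath";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("com.sun.jna.Structure"));
        System.out.println(describe("org.ghost4j.Ghostscript"));
        System.out.println(describe("net.sourceforge.tess4j.Tesseract"));
    }
}
```

Run with the same classpath as the application, the first line of output should show the path of the JNA jar in use, which normally includes its version number.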
The following is the part of my build.sbt that handles the required libraries:

    "org.ghost4j" % "ghost4j" % "0.5.1",
    "org.bouncycastle" % "bctsp-jdk14" % "1.46",
    "net.sourceforge.tess4j" % "tess4j" % "2.0.0",
    "com.github.jai-imageio" % "jai-imageio-core" % "1.3.0",
    "net.java.dev.jna" % "jna" % "3.4.0", // does not make a difference, 4.1.0 is still installed
Please help me out with this problem.
Update: I added "net.java.dev.jna" % "jna" % "3.4.0" force() to build.sbt, and that solved my first problem.
The solution to this issue lies in the tesseract-api project found on GitHub. I forked it to my GitHub account, added a test for a scanned image, and did some code refactoring. That way the library started to function properly. The scanned document used for testing is here.
I built it on Travis, and it is working fine on both 32-bit and 64-bit systems.