Extract all text with string positions from a PDF

This may seem an old question, but I didn't find an exhaustive answer after spending half an hour searching all over SO.
I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html) but with the kind of pdf I am using (E-Tickets) the program fails to recognize strings, printing each character separately. The output is a list of strings (each representing a TextPosition object) like this:
String[414.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.0] s
String[418.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] a
String[423.38696,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=1.776001] l
String[425.16296,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] e
What I would like instead is for the program to recognize the string "sale" as a single TextPosition and give me its position.
I also tried the setSpacingTolerance() and setAverageCharacterTolerance() methods of PDFTextStripper, setting values above and below the defaults (which, FYI, are 0.5 and 0.3 respectively), but the output didn't change at all. Where am I going wrong? Thanks in advance.

As Joey mentioned, a PDF is just a collection of instructions telling the renderer where each character should be printed.
In order to extract words or lines, you will have to perform some data segmentation: studying the bounding boxes of the characters should let you recognize which ones are on the same line, and then which ones form words.
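A minimal sketch of that segmentation, assuming each character has already been reduced to its x, y, width and text (for example from the getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getUnicode() values printed in the question). The Glyph and Word classes, the gapTolerance parameter and the 0.5 baseline threshold are hypothetical choices for illustration, not PDFBox API:

```java
import java.util.ArrayList;
import java.util.List;

class WordJoiner {
    /** Hypothetical stand-in for one extracted character and its box. */
    static final class Glyph {
        final float x, y, width;
        final String text;
        Glyph(float x, float y, float width, String text) {
            this.x = x; this.y = y; this.width = width; this.text = text;
        }
    }

    /** A word plus the position of its first character. */
    static final class Word {
        final float x, y;
        final String text;
        Word(float x, float y, String text) {
            this.x = x; this.y = y; this.text = text;
        }
    }

    /**
     * Joins glyphs (assumed to be sorted in reading order) into words.
     * A word ends at a whitespace glyph, at a baseline change, or when the
     * horizontal gap to the previous glyph exceeds gapTolerance.
     */
    static List<Word> join(List<Glyph> glyphs, float gapTolerance) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0, startY = 0, prevEnd = 0, prevY = 0;
        for (Glyph g : glyphs) {
            boolean blank = g.text.trim().isEmpty();
            boolean sameLine = Math.abs(g.y - prevY) < 0.5f;
            boolean adjacent = sameLine && (g.x - prevEnd) <= gapTolerance;
            if (current.length() > 0 && (blank || !adjacent)) {
                // Close the word collected so far.
                words.add(new Word(startX, startY, current.toString()));
                current.setLength(0);
            }
            if (!blank) {
                if (current.length() == 0) {
                    // Remember where the new word starts.
                    startX = g.x;
                    startY = g.y;
                }
                current.append(g.text);
            }
            prevEnd = g.x + g.width;
            prevY = g.y;
        }
        if (current.length() > 0) {
            words.add(new Word(startX, startY, current.toString()));
        }
        return words;
    }
}
```

Fed the four characters from the question with a gapTolerance of 1.0, WordJoiner.join should return a single word "sale" positioned at 414.93896,637.2442. A sensible tolerance depends on the font size, so treat the value as something to tune per document.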

Here is a solution:
1. Read the file.
2. Process each page with PDFParserTextStripper.
3. Print the position of each character.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
class PDFParserTextStripper extends PDFTextStripper {
public PDFParserTextStripper(PDDocument pdd) throws IOException {
super();
document = pdd;
}
public void stripPage(int pageNr) throws IOException {
this.setStartPage(pageNr + 1);
this.setEndPage(pageNr + 1);
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
}
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
for (TextPosition text : textPositions) {
System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj() + " fs=" + text.getFontSizeInPt()
+ " xscale=" + text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width=" + text.getWidthDirAdj() + " ] " + text.getUnicode());
}
}
public static void extractText(InputStream inputStream) {
PDDocument pdd = null;
try {
pdd = PDDocument.load(inputStream);
PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
stripper.setSortByPosition(true);
for (int i = 0; i < pdd.getNumberOfPages(); i++) {
stripper.stripPage(i);
}
} catch (IOException e) {
e.printStackTrace(); // report the failure instead of silently swallowing it
} finally {
if (pdd != null) {
try {
pdd.close();
} catch (IOException e) {
}
}
}
}
public static void main(String[] args) throws IOException {
File f = new File("C://PDFLOCATION//target.pdf");
FileInputStream fis = null;
try {
fis = new FileInputStream(f);
extractText(fis);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (fis != null)
fis.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}

Related

How to write placeholder in pdf using itext

I am using iText to convert HTML to PDF, with this code:
import java.io.FileOutputStream;
import java.io.StringReader;
import javax.sql.rowset.spi.XmlWriter;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;
public class HtmlToPDF2 {
// itextpdf-5.4.1.jar http://sourceforge.net/projects/itext/files/iText/
public static void main(String ... args ) {
try {
Document document = new Document(PageSize.LETTER);
PdfWriter.getInstance(document, new FileOutputStream("testpdf1.pdf"));
document.open();
HTMLWorker htmlWorker = new HTMLWorker(document);
String firstName = "<name>" ;
String sign = "<sign>";
String str = "<html> " +
"<body>" +
"<form>" +
"<div><strong>Dear</strong> "+firstName +",</div><br/>"+
"<div>"+
"<P> It is informed that you are selected in your interview<br/>"+
" and please report on the <b>20 may</b> with your all original <br/>"+
" document on our head office at jaipur.>"+
" </P>"+
" </div><br/>"+
" <div>"+
" <p>Yours sincierly </p><br/>"+sign+"</div>"+
" </form>"+
"<body>"+
"<html>";
htmlWorker.parse(new StringReader(str));
document.close();
System.out.println("Done");
}
catch (Exception e) {
e.printStackTrace();
}
}
}
But the output this produces is not what I want: the <name> and <sign> placeholders do not appear in the PDF.
Is this the correct way to create a placeholder, or do I need to do something else? If so, please suggest how.

The < and > signs are treated as HTML tag delimiters, which is why the placeholders don't show up in your PDF.
You can define firstName and sign with escaped entities, as below:
public class HtmlToPDF2 {
    public static void main(String ... args ) {
        ....
        ....
        String firstName = "&lt;name&gt;" ;
        String sign = "&lt;sign&gt;";
        ....
        ....
    }
}
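If the placeholder text comes from a variable rather than a literal, the escaping can be done by a small helper. This is only a sketch: HtmlEscaper and escapeHtml are made-up names, and it assumes the HTML parser you feed the string to decodes the standard entities.

```java
class HtmlEscaper {
    /** Escapes the characters an HTML parser would otherwise treat as markup. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this in place, the placeholder could be built as HtmlEscaper.escapeHtml("<name>") before it is concatenated into the HTML string.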

Parse zip or Jar project

I need to return all the packages, classes ... that a java project (zip/jar) contains. I guess QDox can do that. I found that class : http://www.jarvana.com/jarvana/view/com/ning/metrics.serialization-all/2.0.0-pre5/metrics.serialization-all-2.0.0-pre5-jar-with-dependencies-sources.jar!/com/thoughtworks/qdox/tools/QDoxTester.java?format=ok
package com.thoughtworks.qdox.tools;
import com.thoughtworks.qdox.JavaDocBuilder;
import com.thoughtworks.qdox.directorywalker.DirectoryScanner;
import com.thoughtworks.qdox.directorywalker.FileVisitor;
import com.thoughtworks.qdox.directorywalker.SuffixFilter;
import com.thoughtworks.qdox.parser.ParseException;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
/**
* Tool for testing that QDox can parse Java source code.
*
* @author Joe Walnes
*/
public class QDoxTester {
public static interface Reporter {
void success(String id);
void parseFailure(String id, int line, int column, String reason);
void error(String id, Throwable throwable);
}
private final Reporter reporter;
public QDoxTester(Reporter reporter) {
this.reporter = reporter;
}
public void checkZipOrJarFile(File file) throws IOException {
ZipFile zipFile = new ZipFile(file);
Enumeration entries = zipFile.entries();
while (entries.hasMoreElements()) {
ZipEntry zipEntry = (ZipEntry) entries.nextElement();
InputStream inputStream = zipFile.getInputStream(zipEntry);
try {
verify(file.getName() + "!" + zipEntry.getName(), inputStream);
} finally {
inputStream.close();
}
}
}
public void checkDirectory(File dir) throws IOException {
DirectoryScanner directoryScanner = new DirectoryScanner(dir);
directoryScanner.addFilter(new SuffixFilter(".java"));
directoryScanner.scan(new FileVisitor() {
public void visitFile(File file) {
try {
checkJavaFile(file);
} catch (IOException e) {
// ?
}
}
});
}
public void checkJavaFile(File file) throws IOException {
InputStream inputStream = new FileInputStream(file);
try {
verify(file.getName(), inputStream);
} finally {
inputStream.close();
}
}
private void verify(String id, InputStream inputStream) {
try {
JavaDocBuilder javaDocBuilder = new JavaDocBuilder();
javaDocBuilder.addSource(new BufferedReader(new InputStreamReader(inputStream)));
reporter.success(id);
} catch (ParseException parseException) {
reporter.parseFailure(id, parseException.getLine(), parseException.getColumn(), parseException.getMessage());
} catch (Exception otherException) {
reporter.error(id, otherException);
}
}
public static void main(String[] args) throws IOException {
if (args.length == 0) {
System.err.println("Tool that verifies that QDox can parse some Java source.");
System.err.println();
System.err.println("Usage: java " + QDoxTester.class.getName() + " src1 [src2] [src3]...");
System.err.println();
System.err.println("Each src can be a single .java file, or a directory/zip/jar containing multiple source files");
System.exit(-1);
}
ConsoleReporter reporter = new ConsoleReporter(System.out);
QDoxTester qDoxTester = new QDoxTester(reporter);
for (int i = 0; i < args.length; i++) {
File file = new File(args[i]);
if (file.isDirectory()) {
qDoxTester.checkDirectory(file);
} else if (file.getName().endsWith(".java")) {
qDoxTester.checkJavaFile(file);
} else if (file.getName().endsWith(".jar") || file.getName().endsWith(".zip")) {
qDoxTester.checkZipOrJarFile(file);
} else {
System.err.println("Unknown input <" + file.getName() + ">. Should be zip, jar, java or directory");
}
}
reporter.writeSummary();
}
private static class ConsoleReporter implements Reporter {
private final PrintStream out;
private int success;
private int failure;
private int error;
private int dotsWrittenThisLine;
public ConsoleReporter(PrintStream out) {
this.out = out;
}
public void success(String id) {
success++;
if (++dotsWrittenThisLine > 80) {
newLine();
}
out.print('.');
}
private void newLine() {
dotsWrittenThisLine = 0;
out.println();
out.flush();
}
public void parseFailure(String id, int line, int column, String reason) {
newLine();
out.println("* " + id);
out.println(" [" + line + ":" + column + "] " + reason);
failure++;
}
public void error(String id, Throwable throwable) {
newLine();
out.println("* " + id);
throwable.printStackTrace(out);
error++;
}
public void writeSummary() {
newLine();
out.println("-- Summary --------------");
out.println("Success: " + success);
out.println("Failure: " + failure);
out.println("Error : " + error);
out.println("Total : " + (success + failure + error));
out.println("-------------------------");
}
}
}
It contains a method called checkZipOrJarFile(File file); maybe it could help me, but I can't find any examples or tutorials on how to use it.
Any help is welcomed.

QDox cannot do that for you, unfortunately (I came here looking for QDox support for jars). The QDox source code that you list above only tests whether the classes in the given jar can be parsed successfully by QDox. That code does, however, give you a clue on how to use standard Java APIs to do what you want: enumerate the classes in a jar.
Here's some code I'm using (which I cribbed from another SO answer: analyze jar file programmatically)
// Your jar file
JarFile jar = new JarFile(jarFile);
// Get the entries in the jar
Enumeration<? extends JarEntry> enumeration = jar.entries();
// Iterate over the entries in the jar file
while (enumeration.hasMoreElements()) {
    ZipEntry zipEntry = enumeration.nextElement();
    // Is this a class?
    if (zipEntry.getName().endsWith(".class")) {
        // Relative path of the file inside the jar
        String className = zipEntry.getName();
        // Fully qualified class name
        className = className.replace(".class", "").replace("/", ".");
        // Load the class definition - you may not need this, but I want to introspect the class
        try {
            Class<?> clazz = getClass().getClassLoader().loadClass(className);
            // ... I then go on to do some introspection
        } catch (ClassNotFoundException e) {
            // the class is not visible to this class loader
        }
    }
}
If you don't actually want to introspect the class as I do, you can stop at getName(). Also, if you specifically want to find packages, you could use zipEntry.isDirectory() on your zip entries.
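For the case where you only need names and no class loading, here is a self-contained sketch using just java.util.jar. JarLister and its two methods are hypothetical names, and deriving packages from the class names (rather than from directory entries) is my own choice, since not every archive contains explicit directory entries.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class JarLister {
    /** Returns the fully qualified names of all .class entries in a jar or zip. */
    static Set<String> listClasses(File archive) throws IOException {
        Set<String> classes = new TreeSet<>();
        try (JarFile jar = new JarFile(archive)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".class")) {
                    // "com/example/Foo.class" -> "com.example.Foo"
                    String stripped = name.substring(0, name.length() - ".class".length());
                    classes.add(stripped.replace('/', '.'));
                }
            }
        }
        return classes;
    }

    /** Derives the set of package names from fully qualified class names. */
    static Set<String> listPackages(Set<String> classNames) {
        Set<String> packages = new TreeSet<>();
        for (String cls : classNames) {
            int dot = cls.lastIndexOf('.');
            if (dot > 0) {
                packages.add(cls.substring(0, dot));
            }
        }
        return packages;
    }
}
```

Calling JarLister.listPackages(JarLister.listClasses(new File("project.jar"))) would then give the packages; the file name here is only a placeholder.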

BufferedWriter writing weird foreign characters?

I'm writing a little program that just takes a file, and trims the last 4 characters after a space and writes those to a new file. When I tell it to do this and then print them to console it works fine. They show up fine and everything works. But when I use the BufferedWriter to write it to a new file it gives me a weird string of characters in that file when I check it. Here is my code:
package trimmer;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
public class trimmer {
private File file;
private File newfile;
private Scanner in;
public void Create() {
String temp, temp1;
try {
setScanner(new Scanner(file));
} catch (FileNotFoundException e) {
System.out.println("file not found!!");
}
if (!newfile.exists()) {
try {
newfile.createNewFile();
FileWriter fw = new FileWriter(newfile.getAbsoluteFile());
BufferedWriter bw = new BufferedWriter(fw);
while (in.hasNextLine()) {
temp1 = in.nextLine();
temp = temp1.substring(temp1.lastIndexOf(' ') + 1);
System.out.println(temp);
bw.write(temp);
}
bw.close();
System.out.println("done!");
} catch (IOException e) {
System.out.println("Could not make new file: " + newfile + " Error code: " + e.getMessage());
}
}
}
public Scanner getScanner() {
return in;
}
public void setScanner(Scanner in) {
this.in = in;
}
public File getFile() {
return file;
}
public void setFile(File file) {
this.file = file;
}
public File getNewfile() {
return newfile;
}
public void setNewfile(File newfile) {
this.newfile = newfile;
}
}
and when I check the file it looks like this:
䐳噔吳商吳啍唳噎吳剄唳剄䘳剄唳噎吳商䠳卉䌳䕎䜳䱁䠳卉䴳㉕倳乓䐳䍐䐳啐吳䍖吳乓吳啍䔳䥘䌳噔匳剕唳乓唳䅍䌳䕎䜳䱁䴳㉕倳乓䐳䍐䐳啐吳䍖䠳卉吳乓吳啍䔳䥘䌳噔匳剕唳乓唳䅍
Can anyone tell me why this would be happening?

FileWriter uses the platform default character encoding. If this is not the encoding that you want, then you need to use an OutputStreamWriter with the appropriately chosen character encoding.
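A small sketch of what that could look like, assuming UTF-8 is the encoding you want; Utf8LineWriter and writeLines are hypothetical names. It also writes a line separator after each token, which the code in the question does not do (bw.write(temp) alone runs all tokens together on one line).

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.List;

class Utf8LineWriter {
    /** Writes each string on its own line, explicitly encoded as UTF-8. */
    static void writeLines(File target, List<String> lines) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                bw.write(line);
                bw.newLine(); // keep the tokens on separate lines
            }
        } // try-with-resources flushes and closes the writer
    }
}
```

Whatever you use to view the file afterwards must also read it as UTF-8; an editor that guesses a different encoding can still show garbage.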

Using PDFbox to determine the coordinates of words in a document

I'm using PDFBox to extract the coordinates of words/strings in a PDF document, and have so far had success determining the position of individual characters. This is the code thus far, from the PDFBox docs:
package printtextlocations;
import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import java.io.IOException;
import java.util.List;
public class PrintTextLocations extends PDFTextStripper {
public PrintTextLocations() throws IOException {
super.setSortByPosition(true);
}
public static void main(String[] args) throws Exception {
PDDocument document = null;
try {
File input = new File("C:\\path\\to\\PDF.pdf");
document = PDDocument.load(input);
if (document.isEncrypted()) {
try {
document.decrypt("");
} catch (InvalidPasswordException e) {
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
}
PrintTextLocations printer = new PrintTextLocations();
List allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
System.out.println("Processing page: " + i);
PDStream contents = page.getContents();
if (contents != null) {
printer.processStream(page, page.findResources(), page.getContents().getStream());
}
}
} finally {
if (document != null) {
document.close();
}
}
}
/**
* @param text The text to be processed
*/
@Override /* this is questionable, not sure if needed... */
protected void processTextPosition(TextPosition text) {
System.out.println("String[" + text.getXDirAdj() + ","
+ text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
+ text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width="
+ text.getWidthDirAdj() + "]" + text.getCharacter());
}
}
This produces a series of lines containing the position of each character, including spaces, that looks like this:
String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P
Where 'P' is the character. I have not been able to find a function in PDFBox to find words, and I am not familiar enough with Java to accurately concatenate these characters back into words to search through, even though the spaces are also included. Has anyone else been in a similar situation, and if so, how did you approach it? I really only need the coordinates of the first character in the word, so that part is simplified, but how to match a string against that kind of output is beyond me.

There is no function in PDFBox that allows you to extract words automatically. I'm currently working on extracting data to gather it into blocks and here is my process:
I extract all the characters of the document (called glyphs) and store them in a list.
I analyze the coordinates of each glyph, looping over the list. If a glyph overlaps the preceding one vertically (the top of the current glyph lies between the top and bottom of the preceding one, or the bottom of the current glyph lies between the top and bottom of the preceding one), I add it to the same line.
At this point, I have extracted the different lines of the document (be careful: if your document is multi-column, "lines" means all the glyphs that overlap vertically, i.e. the text of all the columns that have the same vertical coordinates).
Then, you can compare the left coordinate of the current glyph to the right coordinate of the preceding one to determine whether they belong to the same word. The PDFTextStripper class provides a getSpacingTolerance() method that gives you, based on trial and error, the value of a "normal" space. If the difference between the left coordinate and the preceding right coordinate is lower than this value, both glyphs belong to the same word.
I applied this method to my work and it works well.
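The process above can be sketched roughly as follows. This is not my production code: Box is a hypothetical holder for a glyph's bounding box (with y growing downwards, so top < bottom), spaceWidth stands in for whatever "normal space" value you settle on, and the overlap test is a general interval-overlap check, which is slightly broader than the two conditions described above because it also accepts a glyph that fully contains the preceding one.

```java
import java.util.ArrayList;
import java.util.List;

class LineGrouper {
    /** Hypothetical glyph bounding box; y grows downwards, so top < bottom. */
    static final class Box {
        final float left, right, top, bottom;
        final String text;
        Box(float left, float right, float top, float bottom, String text) {
            this.left = left; this.right = right;
            this.top = top; this.bottom = bottom;
            this.text = text;
        }
    }

    /** True if the vertical extents of the two boxes intersect. */
    static boolean overlapsVertically(Box a, Box b) {
        return a.top < b.bottom && b.top < a.bottom;
    }

    /** Starts a new line whenever a glyph does not overlap the preceding one. */
    static List<List<Box>> groupIntoLines(List<Box> glyphs) {
        List<List<Box>> lines = new ArrayList<>();
        Box previous = null;
        for (Box g : glyphs) {
            if (previous == null || !overlapsVertically(previous, g)) {
                lines.add(new ArrayList<>());
            }
            lines.get(lines.size() - 1).add(g);
            previous = g;
        }
        return lines;
    }

    /** Rebuilds the text of one line, inserting a space where the gap is wide enough. */
    static String lineToText(List<Box> line, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Box previous = null;
        for (Box g : line) {
            if (previous != null && g.left - previous.right >= spaceWidth) {
                sb.append(' ');
            }
            sb.append(g.text);
            previous = g;
        }
        return sb.toString();
    }
}
```

The glyph list is assumed to be in reading order already (for instance by enabling setSortByPosition(true) on the stripper before collecting the glyphs).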

Based on the original idea here is a version of the text search for PDFBox 2. The code itself is rough, but simple. It should get you started fairly quickly.
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Set;
import lu.abac.pdfclient.data.PDFTextLocation;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
public class PrintTextLocator extends PDFTextStripper {
private final Set<PDFTextLocation> locations;
public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
super.setSortByPosition(true);
this.document = document;
this.locations = locations;
this.output = new Writer() {
@Override
public void write(char[] cbuf, int off, int len) throws IOException {
}
@Override
public void flush() throws IOException {
}
@Override
public void close() throws IOException {
}
};
}
public Set<PDFTextLocation> doSearch() throws IOException {
processPages(document.getDocumentCatalog().getPages());
return locations;
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
super.writeString(text);
String searchText = text.toLowerCase();
for (PDFTextLocation textLoc:locations) {
int start = searchText.indexOf(textLoc.getText().toLowerCase());
if (start!=-1) {
// found
TextPosition pos = textPositions.get(start);
textLoc.setFound(true);
textLoc.setPage(getCurrentPageNo());
textLoc.setX(pos.getXDirAdj());
textLoc.setY(pos.getYDirAdj());
}
}
}
}

Take a look at this; I think it's what you need.
https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-java/
Here is the code:
import java.io.File;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
public class PrintTextLocations extends PDFTextStripper {
public static StringBuilder tWord = new StringBuilder();
public static String seek;
public static String[] seekA;
public static List wordList = new ArrayList();
public static boolean is1stChar = true;
public static boolean lineMatch;
public static int pageNo = 1;
public static double lastYVal;
public PrintTextLocations()
throws IOException {
super.setSortByPosition(true);
}
public static void main(String[] args)
throws Exception {
PDDocument document = null;
seekA = args[1].split(",");
seek = args[1];
try {
File input = new File(args[0]);
document = PDDocument.load(input);
if (document.isEncrypted()) {
try {
document.decrypt("");
} catch (InvalidPasswordException e) {
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
}
PrintTextLocations printer = new PrintTextLocations();
List allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
PDStream contents = page.getContents();
if (contents != null) {
printer.processStream(page, page.findResources(), page.getContents().getStream());
}
pageNo += 1;
}
} finally {
if (document != null) {
System.out.println(wordList);
document.close();
}
}
}
@Override
protected void processTextPosition(TextPosition text) {
String tChar = text.getCharacter();
System.out.println("String[" + text.getXDirAdj() + ","
+ text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
+ text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width="
+ text.getWidthDirAdj() + "]" + text.getCharacter());
String REGEX = "[,.\\[\\](:;!?)/]";
char c = tChar.charAt(0);
lineMatch = matchCharLine(text);
if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
if ((!is1stChar) && (lineMatch == true)) {
appendChar(tChar);
} else if (is1stChar == true) {
setWordCoord(text, tChar);
}
} else {
endWord();
}
}
protected void appendChar(String tChar) {
tWord.append(tChar);
is1stChar = false;
}
protected void setWordCoord(TextPosition text, String tChar) {
tWord.append("(").append(pageNo).append(")[").append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ").append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ").append(tChar);
is1stChar = false;
}
protected void endWord() {
String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
if (!"".equals(sWord)) {
if (Arrays.asList(seekA).contains(sWord)) {
wordList.add(newWord);
} else if ("SHOWMETHEMONEY".equals(seek)) {
wordList.add(newWord);
}
}
tWord.delete(0, tWord.length());
is1stChar = true;
}
protected boolean matchCharLine(TextPosition text) {
Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
if (yVal.doubleValue() == lastYVal) {
return true;
}
lastYVal = yVal.doubleValue();
endWord();
return false;
}
protected Double roundVal(Float yVal) {
DecimalFormat rounded = new DecimalFormat("0.0'0'");
Double yValDub = new Double(rounded.format(yVal));
return yValDub;
}
}
Dependencies:
PDFBox,
FontBox,
Apache Commons Logging.
You can run it by typing on command line:
javac PrintTextLocations.java
sudo java PrintTextLocations file.pdf WORD1,WORD2,....
the output is similar to:
[(1)[190.3 : 286.8] WORD1, (1)[283.3 : 286.8] WORD2, ...]

For those who still need assistance, this is what I used in my code and should be useful to start with. It uses PDFBox 2.0.16
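import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
public class PDFTextLocator extends PDFTextStripper {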
private static String key_string;
private static float x;
private static float y;
public PDFTextLocator() throws IOException {
x = -1;
y = -1;
}
/**
* Takes in a PDF document, a phrase to find, and a page to search, and returns the x,y coordinates in a float array
* @param document
* @param phrase
* @param page
* @return
* @throws IOException
*/
public static float[] getCoordiantes(PDDocument document, String phrase, int page) throws IOException {
key_string = phrase;
PDFTextStripper stripper = new PDFTextLocator();
stripper.setSortByPosition(true);
stripper.setStartPage(page);
stripper.setEndPage(page);
stripper.writeText(document, new OutputStreamWriter(new ByteArrayOutputStream()));
y = document.getPage(page - 1).getMediaBox().getHeight() - y; // getPage() is 0-based, while setStartPage()/setEndPage() are 1-based
return new float[]{x,y};
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*/
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
if(string.contains(key_string)) {
TextPosition text = textPositions.get(0);
if(x == -1) {
x = text.getXDirAdj();
y = text.getYDirAdj();
}
}
}
}
Below are the Maven dependency details:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.16</version>
</dependency>

I got this working using the IKVM conversion PDFBox.NET 1.8.9. in C# and .NET.
I finally figured out the character (glyph) coordinates are private to the .NET assembly, but can be accessed using System.Reflection.
I posted a full example of getting the coordinates of WORDS and drawing them back on images of PDFs using SVG and HTML here: https://github.com/tsamop/PDF_Interpreter
For the example below you need PDFbox.NET: http://www.squarepdf.net/pdfbox-in-net, and include references to it in your project.
It took me quite a while to figure it out, so I really hope it saves someone else time!!
If you just need to know where to look for the characters & coordinates, a very abridged version would be:
using System;
using System.Reflection;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
// to test run pdfTest.RunTest(@"C:\temp\test_2.pdf");
class pdfTest
{
//simple example for getting character (glyph) coordinates out of a pdf doc.
// a more complete example is here: https://github.com/tsamop/PDF_Interpreter
public static void RunTest(string sFilename)
{
//probably a better way to get page count, but I cut this out of a bigger project.
PDDocument oDoc = PDDocument.load(sFilename);
object[] oPages = oDoc.getDocumentCatalog().getAllPages().toArray();
int iPageNo = 0; //1's based!!
foreach (object oPage in oPages)
{
iPageNo++;
//feed the stripper a page.
PDFTextStripper tStripper = new PDFTextStripper();
tStripper.setStartPage(iPageNo);
tStripper.setEndPage(iPageNo);
tStripper.getText(oDoc);
//This gets the "charactersByArticle" private object in PDF Box.
FieldInfo charactersByArticleInfo = typeof(PDFTextStripper).GetField("charactersByArticle", BindingFlags.NonPublic | BindingFlags.Instance);
object charactersByArticle = charactersByArticleInfo.GetValue(tStripper);
object[] aoArticles = (object[])charactersByArticle.GetField("elementData");
foreach (object oArticle in aoArticles)
{
if (oArticle != null)
{
//THE CHARACTERS within the article
object[] aoCharacters = (object[])oArticle.GetField("elementData");
foreach (object oChar in aoCharacters)
{
/*properties I caught using reflection:
* endX, endY, font, fontSize, fontSizePt, maxTextHeight, pageHeight, pageWidth, rot, str, textPos, unicodCP, widthOfSpace, widths, wordSpacing, x, y
*
*/
if (oChar != null)
{
//this is a really quick test.
// for a more complete solution that pulls the characters into words and displays the word positions on the page, try this: https://github.com/tsamop/PDF_Interpreter
//the Y's appear to be the bottom of the char?
double mfMaxTextHeight = Convert.ToDouble(oChar.GetField("maxTextHeight")); //I think this is the height of the character/word
char mcThisChar = oChar.GetField("str").ToString().ToCharArray()[0];
double mfX = Convert.ToDouble(oChar.GetField("x"));
double mfY = Convert.ToDouble(oChar.GetField("y")) - mfMaxTextHeight;
//CALCULATE THE OTHER SIDE OF THE GLYPH
double mfWidth0 = ((Single[])oChar.GetField("widths"))[0];
double mfXend = mfX + mfWidth0; // Convert.ToDouble(oChar.GetField("endX"));
//CALCULATE THE BOTTOM OF THE GLYPH.
double mfYend = mfY + mfMaxTextHeight; // Convert.ToDouble(oChar.GetField("endY"));
double mfPageHeight = Convert.ToDouble(oChar.GetField("pageHeight"));
double mfPageWidth = Convert.ToDouble(oChar.GetField("pageWidth"));
System.Diagnostics.Debug.Print(@"add some stuff to test {0}, {1}, {2}", mcThisChar, mfX, mfY);
}
}
}
}
}
}
}
using System.Reflection;
/// <summary>
/// To deal with the Java interface hiding necessary properties! ~mwr
/// </summary>
public static class GetField_Extension
{
public static object GetField(this object randomPDFboxObject, string sFieldName)
{
FieldInfo itemInfo = randomPDFboxObject.GetType().GetField(sFieldName, BindingFlags.NonPublic | BindingFlags.Instance);
return itemInfo.GetValue(randomPDFboxObject);
}
}

How to read MP3 file tags

I want to write a program that reads metadata from an MP3 file. My program should also be able to edit this metadata. What can I do?
I searched for open source code and found some, but it gives me code rather than a simple idea of how the job is done.
When I read further, I found that the metadata is stored in the MP3 file itself. But I am not yet able to form a complete idea for my little program.
Any help will be appreciated, with a program or just an idea (like an algorithm). :)

The last 128 bytes of an MP3 file contain metadata about the file. You can write a program to read the last 128 bytes...
UPDATE:
ID3v1 Implementation
The information is stored in the last 128 bytes of an MP3. The tag has the following fields; the offsets given here run from 0 to 127.
Field     Length  Offsets
Tag       3       0-2
Songname  30      3-32
Artist    30      33-62
Album     30      63-92
Year      4       93-96
Comment   30      97-126
Genre     1       127
WARNING: This is just an ugly way of getting metadata, and the tag might not actually be there because the world has moved to ID3v2. ID3v1 is effectively obsolete. ID3v2 is more complex than this, so ideally you should use an existing library to read ID3v2 data from MP3s. Just putting this out there.
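If you still want to try the ID3v1 route, the field layout above translates fairly directly into code. A rough sketch: Id3v1Reader and its Tag class are names I made up, and decoding the text as ISO-8859-1 is an assumption, since files in the wild use various encodings.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class Id3v1Reader {
    /** Simple holder for the ID3v1 fields listed above. */
    static final class Tag {
        final String title, artist, album, year, comment;
        final int genre;
        Tag(String title, String artist, String album, String year, String comment, int genre) {
            this.title = title; this.artist = artist; this.album = album;
            this.year = year; this.comment = comment; this.genre = genre;
        }
    }

    /** Returns the ID3v1 tag of the file, or null if it has none. */
    static Tag read(File mp3) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(mp3, "r")) {
            if (raf.length() < 128) {
                return null;
            }
            byte[] tag = new byte[128];
            raf.seek(raf.length() - 128);
            raf.readFully(tag);
            // Offsets 0-2 must hold the marker "TAG".
            if (tag[0] != 'T' || tag[1] != 'A' || tag[2] != 'G') {
                return null;
            }
            return new Tag(field(tag, 3, 30), field(tag, 33, 30), field(tag, 63, 30),
                    field(tag, 93, 4), field(tag, 97, 30), tag[127] & 0xFF);
        }
    }

    /** Decodes one fixed-width field, stopping at the first zero byte and trimming padding. */
    private static String field(byte[] tag, int offset, int length) {
        int end = offset;
        while (end < offset + length && tag[end] != 0) {
            end++;
        }
        return new String(tag, offset, end - offset, StandardCharsets.ISO_8859_1).trim();
    }
}
```

Editing the tag would be the reverse: seek to the same position and overwrite the 128 bytes, padding each field to its fixed length.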

You can use the Apache Tika Java API to parse metadata from an MP3, such as title, album, genre, duration, composer, artist, etc. The required jars are tika-parsers-1.4 and tika-core-1.4.
Sample Program:
package com.parse.mp3;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.mp3.Mp3Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class AudioParser {
/**
* #param args
*/
public static void main(String[] args) {
String fileLocation = "G:/asas/album/song.mp3";
try {
InputStream input = new FileInputStream(new File(fileLocation));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
// List all metadata
String[] metadataNames = metadata.names();
for(String name : metadataNames){
System.out.println(name + ": " + metadata.get(name));
}
// Retrieve the necessary info from metadata
// Names - title, xmpDM:artist etc. - mentioned below may differ based
System.out.println("----------------------------------------------");
System.out.println("Title: " + metadata.get("title"));
System.out.println("Artists: " + metadata.get("xmpDM:artist"));
System.out.println("Composer : "+metadata.get("xmpDM:composer"));
System.out.println("Genre : "+metadata.get("xmpDM:genre"));
System.out.println("Album : "+metadata.get("xmpDM:album"));
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
}
}

For J2ME (which is what I was struggling with), here's the code that worked for me:
import java.io.InputStream;
import javax.microedition.io.Connector;
import javax.microedition.io.file.FileConnection;
import javax.microedition.lcdui.*;
import javax.microedition.media.Manager;
import javax.microedition.media.Player;
import javax.microedition.media.control.MetaDataControl;
import javax.microedition.midlet.MIDlet;
public class MetaDataControlMIDlet extends MIDlet implements CommandListener {
private Display display = null;
private List list = new List("Message", List.IMPLICIT);
private Command exitCommand = new Command("Exit", Command.EXIT, 1);
private Alert alert = new Alert("Message");
private Player player = null;
public MetaDataControlMIDlet() {
display = Display.getDisplay(this);
alert.addCommand(exitCommand);
alert.setCommandListener(this);
list.addCommand(exitCommand);
list.setCommandListener(this);
//display.setCurrent(list);
}
public void startApp() {
try {
FileConnection connection = (FileConnection) Connector.open("file:///e:/breathe.mp3");
InputStream is = null;
is = connection.openInputStream();
player = Manager.createPlayer(is, "audio/mp3");
player.prefetch();
player.realize();
} catch (Exception e) {
alert.setString(e.getMessage());
display.setCurrent(alert);
e.printStackTrace();
}
if (player != null) {
MetaDataControl mControl = (MetaDataControl) player.getControl("javax.microedition.media.control.MetaDataControl");
if (mControl == null) {
alert.setString("No Meta Information");
display.setCurrent(alert);
} else {
String[] keys = mControl.getKeys();
for (int i = 0; i < keys.length; i++) {
list.append(keys[i] + " -- " + mControl.getKeyValue(keys[i]), null);
}
display.setCurrent(list);
}
}
}
public void commandAction(Command cmd, Displayable disp) {
if (cmd == exitCommand) {
notifyDestroyed();
}
}
public void pauseApp() {
}
public void destroyApp(boolean unconditional) {
}
}
