iText PdfTextExtractor Missing Ligatures in Resulting Text - java

I am attempting to take a pdf file and grab the text from it.
I found iText and have been using it with decent success. The one remaining problem is ligatures.
At first I noticed that I was simply missing characters. After doing some searches I came across this:
http://support.itextpdf.com/node/25
Once I knew that it was ligatures I was missing, I began to search for ways to solve the problem and haven't been able to come up with a solution yet.
Here is my code:
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import java.io.File;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Formatter;
import java.lang.StringBuilder;
public class ReadPdf {
private static String INPUTFILE = "F:/Users/jmack/Webwork/Redglue_PDF/live/ADP/APR/ADP_41.pdf";
public static void writeTextFile(String fileName, String s) {
// s = s.replaceAll("\u0063\u006B", "just a test");
s = s.replaceAll("\uFB00", "ff");
s = s.replaceAll("\uFB01", "fi");
s = s.replaceAll("\uFB02", "fl");
s = s.replaceAll("\uFB03", "ffi");
s = s.replaceAll("\uFB04", "ffl");
s = s.replaceAll("\uFB05", "ft");
s = s.replaceAll("\uFB06", "st");
s = s.replaceAll("\u0132", "IJ");
s = s.replaceAll("\u0133", "ij");
// try-with-resources closes the writer even if write() throws
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8"))) {
writer.write(s);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
try {
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
String str = PdfTextExtractor.getTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
writeTextFile("F:/Users/jmack/Webwork/Redglue_PDF/live/itext/read_test.txt", str);
}
catch (Exception e) {
System.out.println(e);
}
}
}
In the PDF referenced above one line reads:
part of its design difference is a roofline
But when I run the Java class above the text output contains:
part of its design diference is a roofine
Notice that difference became diference and roofline became roofine.
It is interesting to note that when I copy and paste from the PDF into Stack Overflow's text field, it also looks like the second sentence, with the two ligatures "ff" and "fl" each reduced to a single "f".
I am hoping that someone here can help me figure out how to catch the ligatures and perhaps replace them with the characters they represent, as in the ligature "fl" being replaced with an actual "f" and an "l".
I ran some tests on the output from PdfTextExtractor and attempted to replace the ligature Unicode characters with the actual characters, but discovered that the Unicode characters for those ligatures do not exist in the value it returns.
It seems that it must be something in iText itself that is not reading those ligatures correctly. I am hopeful that someone knows how to work around that.
Thank you for any help you can give!
TLDR: Converting PDF to text with iText, had missing characters, discovered they were ligatures, now I need to capture those ligatures, not sure how to go about doing that.
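One way to check what the extractor actually returned is to list the non-ASCII code points in the extracted string. The sketch below is only an illustration: `LigatureCheck`, `nonAsciiCodePoints` and `expandLigatures` are hypothetical names, not part of iText. It uses `java.text.Normalizer` with NFKC, which expands the Latin ligature characters (such as U+FB00 and U+FB02) when they are present. If the dump shows no such code points, as described above, the ligature glyphs were probably never mapped to Unicode during extraction, and no replacement on the string can bring them back.

```java
import java.text.Normalizer;

// Hypothetical helper for inspecting extracted text; not part of iText.
final class LigatureCheck {

    // Lists every non-ASCII code point as U+XXXX so you can see what was
    // really returned at the position where a ligature was expected.
    static String nonAsciiCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // NFKC compatibility normalization replaces ligature characters
    // with the plain letters they stand for.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Sample input that does contain ligature code points.
        String sample = "di\uFB00erence, roo\uFB02ine";
        System.out.println(nonAsciiCodePoints(sample));
        System.out.println(expandLigatures(sample));
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example non-breaking spaces and superscript digits), so it changes more than just ligatures.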

Related

Strange behavior with Regex in Java

I want to filter a text, leaving only letters (a-z and A-Z). It seemed to be easy, following something like this How to filter a Java String to get only alphabet characters?
String cleanedText = text.toString().toLowerCase().replaceAll("[^a-zA-Z]", "");
System.out.println(cleanedText);
The problem is that the output of this is empty, unless I change the regex by adding another character, e.g. ":" --> [^:a-zA-Z]
I already tried to check whether it works with the plain regex classes (not using the replaceAll method of Java's String object), but I had exactly the same problem.
Any idea what could be the source of this strange behavior?
I had a txt file which I read using a BufferedReader. I add each line to one long string and apply the code I posted before to this. The whole code is as follows:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.StringBuffer;
import java.util.regex.*;
public class Loader {
public static void main(String[] args) {
BufferedReader file = null;
StringBuffer text = new StringBuffer();
String str;
try {
file = new BufferedReader(new FileReader("text.txt"));
} catch (FileNotFoundException ex) {
}
try
{
while ((str = file.readLine()) != null) {
text.append(str);
}
String cleanedText = text.toString().toLowerCase().replaceAll("[^:a-z]", "");
System.out.println(cleanedText);
} catch (IOException ex) {
}
}
}
The text file is a normal article where I want to delete everything (including whitespaces) that is not a letter. An extract is as follows "[16]The Free Software Foundation (FSF), started in 1985, intended the word "free" to mean freedom to distribute"
As I wrote in a comment, please specify more precisely what's wrong...
What I tried
public class Regexp45348303 {
public static void main(String[] args) {
String[] tests = { "abc01", "01DEF34", "abc 01 def.", "a0101\n0202\n0303x" };
for (String text : tests) {
String cleanedText = text.toLowerCase().replaceAll("[^a-z]", ""); // A-Z removed too
System.out.println(text + " -> " + cleanedText);
}
}
}
and the output is:
abc01 -> abc
01DEF34 -> def
abc 01 def. -> abcdef
a0101
0202
0303x -> ax
which is correct based on my understanding...
In the end the problem was neither the regex nor the program itself. Eclipse simply does not show the output in the console if it exceeds a certain length (but you can still work with it). To solve this, check the "Fixed width console" option in Window -> Preferences -> Run/Debug -> Console,
as described in http://code2care.org/2015/how-to-word-wrap-eclipse-console-logs-width/
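A quick way to tell an empty result from a display problem is to print the length of the cleaned string and to print it in short chunks. This is only a sketch: `LetterFilter` and `lettersOnly` are illustrative names, and the chunk width of 80 is an arbitrary choice.

```java
import java.util.Locale;

// Illustrative helper; the class and method names are assumptions.
final class LetterFilter {

    // Lower-cases the text and drops everything that is not a-z.
    static String lettersOnly(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        String cleaned = lettersOnly(
                "[16]The Free Software Foundation (FSF), started in 1985");
        // The length shows whether the string is really empty,
        // regardless of how the console renders one very long line.
        System.out.println("length = " + cleaned.length());
        // Printing in chunks keeps every console line short.
        for (int i = 0; i < cleaned.length(); i += 80) {
            System.out.println(
                    cleaned.substring(i, Math.min(i + 80, cleaned.length())));
        }
    }
}
```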

File I/O Practice.

I'm trying to read names from a file called boysNames.txt, which contains 1000 boy names. For example, every line in the file looks like this:
1. Devan
2. Chris
3. Tom
The goal is just to read in the name, but I couldn't find a method in the java.util package that lets me grab only the name from the file boysName.txt .
For example, I just want to grab Devan, then Chris, then Tom,
NOT "1. Devan" and "2. Chris."
The problem is that reading with hasNextLine()/nextLine() grabs the whole line. I don't want the "1." part.
So I just want Devan, Chris, Tom to be read or stored in a variable of type String. Does anyone know how to do that? I've tried hasNext(), but that didn't work.
Here is the code so you can get a visual:
import java.io.PrintWriter;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.io.FileInputStream;
public class PracticeWithTxtFiles {
public static void main(String[] args){
PrintWriter outputStream = null;
Scanner inputStream = null;
try{
outputStream = new PrintWriter(new FileOutputStream("boys.txt")); //opens up the file boys.txt
inputStream = new Scanner(new FileInputStream("boyNames.txt"));//opens up the file boyNames.txt
}catch(FileNotFoundException e){
System.out.println("Problem opening/creating files");
System.exit(0);
}
String names = null;
int ssnumbers= 0;
while(inputStream.hasNextLine())//problem is right here need it to just
// grab String representation of String, not an int
{
names = inputStream.nextLine();
ssnumbers++;
outputStream.println(names + " " + ssnumbers);
}
inputStream.close();
outputStream.close();
}
}
In case you are not aware of it, check the replace method in the String API:
String - Library
Just do a replace; it's as simple as that.
names = names.replace(".","").replaceAll("[0-9]+","").trim();
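The replace chain above removes every dot and digit anywhere in the line. If only the leading number should go, an anchored pattern is a possible alternative. The sketch below assumes each line has the form "1. Devan"; `NameLines` and `stripNumber` are illustrative names.

```java
// Illustrative helper; assumes lines of the form "<number>. <name>".
final class NameLines {

    // Removes an optional leading "<digits>." prefix and surrounding
    // whitespace; the rest of the line is left untouched.
    static String stripNumber(String line) {
        return line.replaceFirst("^\\s*\\d+\\.\\s*", "").trim();
    }

    public static void main(String[] args) {
        String[] lines = { "1. Devan", "2. Chris", "1000. Tom" };
        for (String line : lines) {
            System.out.println(stripNumber(line));
        }
    }
}
```

Inside the while loop of the question, this would be applied to the value returned by `inputStream.nextLine()` before it is written out.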

use Java to convert ANY file to hex and back again

I have some files I would like to convert to hex, alter, and then convert back again, but I have a problem doing this with jars, zips, and rars. It seems to only work on files containing normally readable text. I have looked all around but can't find anything that handles jars or bats correctly. Does anyone have an answer that does both, converting to hex and then back again, not just to hex?
You can convert any file to hex. It's just a matter of obtaining a byte stream, and mapping every byte to two hexadecimal digits.
Here's a utility class that lets you convert from a binary stream to a hex stream and back:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
public class Hex {
public static void binaryToHex(InputStream is, OutputStream os) {
Writer writer = new BufferedWriter(new OutputStreamWriter(os));
try {
int value;
while ((value = is.read()) != -1) {
writer.write(String.format("%02X", value));
}
writer.flush();
} catch (IOException e) {
System.err.println("An error occurred");
}
}
public static void hexToBinary(InputStream is, OutputStream os) {
Reader reader = new BufferedReader(new InputStreamReader(is));
try {
char buffer[] = new char[2];
while (reader.read(buffer) != -1) {
os.write((Character.digit(buffer[0], 16) << 4)
+ Character.digit(buffer[1], 16));
}
} catch (IOException e) {
System.err.println("An error occurred");
}
}
}
Partly inspired by this sample from Mkyong and this answer.
Don't use a Reader to read String / char / char[], use an InputStream to read byte / byte[].
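To illustrate that the byte-to-hex mapping is lossless for binary data, here is a small in-memory round trip. It is a sketch with its own helper names (`HexRoundTrip`, `toHex`, `fromHex`), separate from the stream-based class above, and it assumes the data fits in memory.

```java
import java.util.Arrays;

// Illustrative in-memory variant; names are assumptions.
final class HexRoundTrip {

    // Maps every byte to exactly two upper-case hex digits.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Maps every pair of hex digits back to one byte.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("invalid hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // Non-text bytes, including 0x00 and 0xFF.
        byte[] original = { 0x50, 0x4B, 0x03, 0x04, (byte) 0xFF, 0x00 };
        String hex = toHex(original);
        System.out.println(hex);
        System.out.println(Arrays.equals(original, fromHex(hex)));
    }
}
```

For a real file, the bytes could come from `java.nio.file.Files.readAllBytes` and be written back with `Files.write`, so that no character encoding is involved on the binary side.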

Java: Having trouble reading from a file

import java.io.FileReader;
import java.io.FileWriter;
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.IOException;
public class TextFile {
private static void doReadWriteTextFile() {
try {
// input/output file names
String inputFileName = "README_InputFile.rtf";
// Create FileReader Object
FileReader inputFileReader = new FileReader(inputFileName);
// Create Buffered/PrintWriter Objects
BufferedReader inputStream = new BufferedReader(inputFileReader);
while ((inLine = inputStream.readLine()) != null) {
System.out.println(inLine);
}
inputStream.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
}
public static void main(String[] args) {
doReadTextFile();
}
}
I'm just learning Java, so take it easy on me. My program's objective is to read a text file and output it into another text file in reverse order. The problem is that the professor taught us how to deal with strings, reverse them and such, but nothing about importing/exporting files. Instead, he gave us the sample code above, which should import a file. Compiling the file returns 3 errors: the first two say that inLine is not a known symbol on lines 24 and 25. The last cannot find the symbol doReadTextFile on line 40.
I have no idea how to read this file and make the necessary changes to reverse and output into a new file. Any help is hugely appreciated.
I also had to change the file type from .txt to .rtf. I'm not sure if that affects how I need to go about this.
EDIT I defined inLine and fixed the doReadWritetextFile naming error, which fixed all my compiling errors. Any help on outputting into new file still appreciated!
I'm also aware he gave me bad sample code. It's supposed to be so we can learn troubleshooting, but with no working code to go off of and very limited knowledge of the language, it's very difficult to see what's wrong. Thanks for the help!
Good practice is to use a BufferedReader:
BufferedReader bf = new BufferedReader(new FileReader(new File("your_file.your_extension")));
Then you can read the lines of your file:
// Initialization of the inLine variable...
String inLine = null;
while((inLine = bf.readLine()) != null){
System.out.println(inLine);
}
To output a file, you can use StringBuilder to hold the file contents:
private static void doReadWriteTextFile()
{
....
StringBuilder sb = new StringBuilder();
while ((inLine = inputStream.readLine()) != null)
{
// readLine() strips the line break, so add it back
sb.append(inLine).append(System.lineSeparator());
}
FileWriter writer = new FileWriter(new File("C:\\temp\\test.txt"));
BufferedWriter bw = new BufferedWriter(writer);
bw.write(sb.toString());
bw.close();
}
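For the stated goal of writing the file out in reverse order, one possible sketch is shown below. It assumes that "reverse order" means reversing the order of the lines and that the input is plain text; `ReverseFile`, `reversedLines` and `reverseFile` are illustrative names, and the demo uses temporary files rather than the file from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch; assumes "reverse order" means last line first.
final class ReverseFile {

    // Returns a reversed copy and leaves the input list unchanged.
    static List<String> reversedLines(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.reverse(copy);
        return copy;
    }

    // Reads all lines of 'in' and writes them to 'out', last line first.
    static void reverseFile(Path in, Path out) throws IOException {
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        Files.write(out, reversedLines(lines), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Temporary files keep the demo runnable without any setup.
        Path in = Files.createTempFile("input", ".txt");
        Path out = Files.createTempFile("reversed", ".txt");
        Files.write(in, Arrays.asList("first", "second", "third"),
                StandardCharsets.UTF_8);
        reverseFile(in, out);
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

An .rtf file contains formatting markup in addition to the visible text, so this approach is only meaningful for a plain .txt input.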

writing a file from Set(Crawler,Jsoup,Java)

I recently wrote a small crawler that searches for links on a page and writes them to a file. I added a collection (HashSet) to the code to avoid writing the same link twice ...
but for some reason the code does not work, and I see a lot of duplicates in the file.
Could you help me fix the bugs in it?
Here is the code of the crawler:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Crawler {
public static void main(String[] args) {
Set<String> setUrlBase = new HashSet<String>();
Document doc;
String BaseUrlTxtT = "C://Search/urlw.txt";
try {
doc = Jsoup.connect("http://stackoverflow.com/").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
String UrlLinkHref = link.attr("href");
if (UrlLinkHref.indexOf("http://") == 0) {
setUrlBase.add(UrlLinkHref);
for (String strUrlHash : setUrlBase) {
writeToBase(BaseUrlTxtT, strUrlHash + "\n");
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
private static void writeToBase(String fileName, String text) {
File file = new File(fileName);
try {
if (!file.exists()) {
file.createNewFile();
}
FileWriter wr = new FileWriter(file.getAbsoluteFile(), true);
try {
wr.write(text + "\n");
} finally {
wr.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
It seems that in this code
String UrlLinkHref = link.attr("href");
if (UrlLinkHref.indexOf("http://") == 0) {
setUrlBase.add(UrlLinkHref);
for (String strUrlHash : setUrlBase) {
writeToBase(BaseUrlTxtT, strUrlHash + "\n");
}
}
you are adding a link to the set and then appending the entire content of the set to the file. I say appending instead of rewriting because you are using
FileWriter wr = new FileWriter(file.getAbsoluteFile(), true);
where true means "don't erase my file, but add new content after the existing one", so your file probably looks like
link1 //part written when added first link
link1 //part written when added second link
link2
link1 //part written when added third link
link2
link3
(actually, since HashSet is unordered, the elements in each part could be written as link2, link3, link1 instead of link1, link2, link3, but you should get the idea)
To avoid it, just use code like
String UrlLinkHref = link.attr("href");
if (UrlLinkHref.indexOf("http://") == 0) {
if (setUrlBase.add(UrlLinkHref)){//will return true if link wasn't in set yet
writeToBase(BaseUrlTxtT, UrlLinkHref + "\n");// so we also want to write
// it to file
}
}
which can be shortened even more using the short-circuit and (&&)
String UrlLinkHref = link.attr("href");
if (UrlLinkHref.startsWith("http://") && setUrlBase.add(UrlLinkHref)){
writeToBase(BaseUrlTxtT, UrlLinkHref + "\n");// link is new, so write it to the file
}
A few more improvements:
To check whether a string starts with "http://", don't use "cryptic" code like UrlLinkHref.indexOf("http://") == 0 but simply UrlLinkHref.startsWith("http://").
Don't hardcode a line separator like \n, because different operating systems can use different line separators (Windows, for instance, uses \r\n). Instead get it from System.lineSeparator(). Or even better, wrap your writer in a PrintWriter, which has a println method (just like System.out.println()) that adds the OS-specific line separator for you.
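Putting these suggestions together, a minimal sketch could look like the following. It leaves out Jsoup and takes the href values as a plain list so that it stays self-contained; `UniqueLinkWriter` and `writeUnique` are illustrative names, and the example URLs are made up.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; Jsoup is omitted and hrefs are passed in directly.
final class UniqueLinkWriter {

    // Writes each http:// link once, in first-seen order,
    // and returns how many links were written.
    static int writeUnique(List<String> hrefs, PrintWriter out) {
        Set<String> seen = new HashSet<>();
        for (String href : hrefs) {
            // add() returns false for a duplicate, so && skips the write.
            if (href.startsWith("http://") && seen.add(href)) {
                out.println(href); // println adds the OS line separator
            }
        }
        out.flush();
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://a.example/", "/relative",
                "http://a.example/", "http://b.example/");
        StringWriter buffer = new StringWriter();
        int written = writeUnique(hrefs, new PrintWriter(buffer));
        System.out.print(buffer);
        System.out.println(written + " unique links");
    }
}
```

To write to a file instead, the PrintWriter could wrap a FileWriter that is opened once before the loop, rather than reopening the file for every link.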
