Extract text from a large pdf with Tika


I am trying to extract text from a large PDF, but I only get the first pages. I need all of the text to be stored in a String variable.
This is the code:
public class ParsePDF {
public static void main(String args[]) throws Exception {
try {
File file = new File("C:/vlarge.pdf");
String content = new Tika().parseToString(file);
System.out.println("The Content: " + content);
}
catch (Exception e) {
e.printStackTrace();
}
}
}

From the Javadocs:
To avoid unpredictable excess memory use, the returned string contains
only up to getMaxStringLength() first characters extracted from the
input document. Use the setMaxStringLength(int) method to adjust this
limitation.
Calling setMaxStringLength(-1) will disable this limit.

Try the Apache Tika API. It also works for large PDFs.
Sample:
InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

Related

convert html to ppt using java aspose library

I am passing HTML code to a variable in Java. Using the Aspose library, the HTML should be rendered into a PPT (the HTML also references a CSS file).
It would be appreciated if the resulting PPT is editable.

Please use the following equivalent Java code on your end.
public static void main(String[] args) throws Exception {
// The path to the documents directory.
String dataDir ="C:\\html\\";
// Create Empty presentation instance
Presentation pres = new Presentation();
// Access the default first slide of presentation
ISlide slide = pres.getSlides().get_Item(0);
// Adding the AutoShape to accommodate the HTML content
IAutoShape ashape = slide.getShapes().addAutoShape(ShapeType.Rectangle, 10, 10, (float) pres.getSlideSize().getSize().getWidth(), (float) pres.getSlideSize().getSize().getHeight());
ashape.getFillFormat().setFillType(FillType.NoFill);
// Adding text frame to the shape
ashape.addTextFrame("");
// Clearing all paragraphs in added text frame
ashape.getTextFrame().getParagraphs().clear();
// Reading the HTML file into a String
String content = ReadFile(dataDir + "file.html");
// Adding text from HTML stream reader in text frame
ashape.getTextFrame().getParagraphs().addFromHtml(content);
// Saving Presentation
pres.save(dataDir + "output.pptx", SaveFormat.Pptx);
}
public static String ReadFile(String FileName) throws Exception {
File file = new File(FileName);
StringBuilder contents = new StringBuilder();
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
String text = null;
// repeat until all lines are read
while ((text = reader.readLine()) != null) {
contents.append(text).append(System.getProperty("line.separator"));
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (reader != null) {
reader.close();
}
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
return contents.toString();
}

@Balchandar Reddy,
I have observed your comments and would like to share that ImportingHTMLTextInParagraphs.class points to the path of the file. I have updated the code accordingly.
Secondly, you need to add import com.aspose.slides.IAutoShape on your end to resolve the issue.

I have observed your requirements and regret to share that Aspose.Slides, which is an API for managing PowerPoint slides, does not support converting HTML to PPT/PPTX. However, it does support importing HTML text into slide text frames, which you may use instead.
// Create Empty presentation instance
using (Presentation pres = new Presentation())
{
// Access the default first slide of presentation
ISlide slide = pres.Slides[0];
// Adding the AutoShape to accommodate the HTML content
IAutoShape ashape = slide.Shapes.AddAutoShape(ShapeType.Rectangle, 10, 10, pres.SlideSize.Size.Width - 20, pres.SlideSize.Size.Height - 10);
ashape.FillFormat.FillType = FillType.NoFill;
// Adding text frame to the shape
ashape.AddTextFrame("");
// Clearing all paragraphs in added text frame
ashape.TextFrame.Paragraphs.Clear();
// Loading the HTML file using stream reader
TextReader tr = new StreamReader(dataDir + "file.html");
// Adding text from HTML stream reader in text frame
ashape.TextFrame.Paragraphs.AddFromHtml(tr.ReadToEnd());
// Saving Presentation
pres.Save("output_out.pptx", Aspose.Slides.Export.SaveFormat.Pptx);
}
I am working as Support developer/ Evangelist at Aspose.

How to write to a StyledDocument with a specific charset?

For a NetBeans plugin I want to change the content of a file (which is opened in the NetBeans editor) with a specific String and a specific charset. In order to achieve that, I open the file (a DataObject) with an EditorCookie and then I change the content by inserting a different string into the StyledDocument of my data object.
However, I have a feeling that the file is always saved as UTF-8, even if I write a byte order mark to the file. Am I doing something wrong?
This is my code:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
NbDocument.runAtomic(cookie.getDocument(), () -> {
try {
StyledDocument document = cookie.openDocument();
document.remove(0, document.getLength());
document.insertString(0, utf16be, null);
cookie.saveDocument();
} catch (BadLocationException | IOException ex) {
Exceptions.printStackTrace(ex);
}
});
I have also tried this approach, which doesn't work either:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
NbDocument.runAtomic(cookie.getDocument(), () -> {
try {
StyledDocument doc = cookie.openDocument();
String utf16be = "\uFEFFHello World!";
InputStream is = new ByteArrayInputStream(utf16be.getBytes(StandardCharsets.UTF_16BE));
FileObject fileObject = dataObject.getPrimaryFile();
String mimePath = fileObject.getMIMEType();
Lookup lookup = MimeLookup.getLookup(MimePath.parse(mimePath));
EditorKit kit = lookup.lookup(EditorKit.class);
try {
kit.read(is, doc, doc.getLength());
} catch (IOException | BadLocationException ex) {
Exceptions.printStackTrace(ex);
} finally {
is.close();
}
cookie.saveDocument();
} catch (Exception ex) {
Exceptions.printStackTrace(ex);
}
});

Your problem is probably here:
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
This won't do what you think it does. It converts your string to a byte array using the UTF-16 big-endian encoding and then creates a String from these bytes using the JRE's default encoding.
So, here's the catch:
A String has no encoding.
The fact that in Java this is a sequence of chars does not matter. Substitute 'char' for 'carrier pigeons', the net effect will be the same.
If you want to write a String to a byte stream with a given encoding, you need to specify the encoding you need on the Writer object you create. Similarly, if you want to read a byte stream into a String using a given encoding, it is the Reader which you need to configure to use the encoding you want.
But the method on your StyledDocument object is called .insertString(). You should pass your String object to .insertString() as is; don't transform it the way you do, since that is misguided, as explained above.
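As a minimal sketch of the point about Writers and encodings (the class and method names here are illustrative and not from the question, and it uses only the JDK, not the NetBeans API), the charset is chosen when characters are turned into bytes, not on the String itself:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Encodes text into bytes; the charset is configured on the Writer.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "Hi";
        byte[] utf16be = encode(text, StandardCharsets.UTF_16BE);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(utf16be.length); // prints 4 (two bytes per char)
        System.out.println(utf8.length);    // prints 2 (one byte per char)

        // Decoding the bytes with a different charset does not give the
        // original String back, which is what the question's code does.
        String wrong = new String(utf16be, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(text)); // prints false

        // Decoding with the matching charset does.
        String right = new String(utf16be, StandardCharsets.UTF_16BE);
        System.out.println(right.equals(text)); // prints true
    }
}
```

The same String produces different bytes depending on the charset given to the Writer, so the String passed to insertString() should be left untouched.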

Why is PDFParser generating special characters instead of spaces?

The following code is generating special characters instead of spaces for one PDF but not another:
String fullText;
BodyContentHandler handler = null;
try {
// size limit is 100M
handler = new BodyContentHandler(100 * 1024 * 1024);
Metadata meta = new Metadata();
PDFParser parser = new PDFParser();
parser.setEnableAutoSpace(false);
parser.parse(new FileInputStream(this.pdf /*always a valid pdf file*/), handler, meta, new ParseContext());
}
catch (SAXException e) {
throw new IOException(e);
} catch (TikaException e) {
throw new IOException(e);
}
fullText = handler.toString();
Depending on the PDF, a substring of fullText will look like:
will*continue*to*be*used*in*support*of*the
when it should look like this:
will continue to be used in support of the
In other places, '%' replaces '-' and '!' replaces spaces within bold text.
This issue occurs only when processing one PDF but not the other. According to pdfinfo, both PDFs were generated by Quartz PDFContext.
The Linux command pdftotext produces the same results.
Is this a problem with how the original PDF was generated? Why is this happening?

Displaying Arabic on Device J2ME

I am using some Arabic text in my app. On the simulator the Arabic text displays fine, but on the device it is not displayed properly.
On the simulator it looks like this: مَرْحَبًا
But on the device it looks like this: مرحبا
What I need is this: مَرْحَبًا

This answer describes how to create text resources for a MIDP application and how to load them at run-time. The technique is Unicode-safe, so it is suitable for all languages. The run-time code is small, fast, and uses relatively little memory.
Creating the Text Source
The process starts with creating a text file. When the file is loaded, each line becomes a separate String object, so you can create a file like:
اَللّٰهُمَّ اِنِّىْ اَسْئَلُكَ رِزْقًاوَّاسِعًاطَيِّبًامِنْ رِزْقِكَ
مَرْحَبًا
This needs to be in UTF-8 format. On Windows, you can create UTF-8 files in Notepad. Make sure you use Save As..., and select UTF-8 encoding.
Name the file arb.utf8.
This needs to be converted to a format that can be read easily by the MIDP application. MIDP does not provide convenient ways to read text files, like J2SE's BufferedReader. Unicode support can also be a problem when converting between bytes and characters. The easiest way to read text is to use DataInput.readUTF(). But to use this, we need to have written the text using DataOutput.writeUTF().
Below is a simple J2SE command-line program that reads the .utf8 file you saved from Notepad and creates a .res file to go in the JAR.
import java.io.*;
import java.util.*;
public class TextConverter {
public static void main(String[] args) {
if (args.length == 1) {
String language = args[0];
List<String> text = new Vector<String>();
try {
// read text from Notepad UTF-8 file
InputStream in = new FileInputStream(language + ".utf8");
try {
BufferedReader bufin = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String s;
while ( (s = bufin.readLine()) != null ) {
// remove the byte order mark (U+FEFF) added by Notepad
s = s.replaceAll("\ufeff", "");
text.add(s);
}
} finally {
in.close();
}
// write it for easy reading in J2ME
OutputStream out = new FileOutputStream(language + ".res");
DataOutputStream dout = new DataOutputStream(out);
try {
// first item is the number of strings
dout.writeShort(text.size());
// then the string themselves
for (String s: text) {
dout.writeUTF(s);
}
} finally {
dout.close();
}
} catch (Exception e) {
System.err.println("TextConverter: " + e);
}
} else {
System.err.println("syntax: TextConverter <language-code>");
}
}
}
To convert arb.utf8 to arb.res, run the converter as:
java TextConverter arb
Using the Text at Runtime
Place the .res file in the JAR.
In the MIDP application, the text can be read with this method:
public String[] loadText(String resName) throws IOException {
String[] text;
InputStream in = getClass().getResourceAsStream(resName);
try {
DataInputStream din = new DataInputStream(in);
int size = din.readShort();
text = new String[size];
for (int i = 0; i < size; i++) {
text[i] = din.readUTF();
}
} finally {
in.close();
}
return text;
}
Load and use text like this:
String[] text = loadText("arb.res");
System.out.println("my arabic word from arb.res file ::"+text[0]+" second from arb.res file ::"+text[1]);
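The write and read halves above can be checked together on a desktop JVM before building the JAR. The following is only a sketch under that assumption: the class name and the in-memory byte arrays are illustrative and stand in for the .res file, and the first sample string is the word مَرْحَبًا written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class ResRoundTrip {
    // Writes a short count followed by each string, like the converter does.
    static byte[] write(String[] lines) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream dout = new DataOutputStream(bytes)) {
            dout.writeShort(lines.length);
            for (String s : lines) {
                dout.writeUTF(s);
            }
        }
        return bytes.toByteArray();
    }

    // Reads the count and then each string, like loadText does.
    static String[] read(byte[] data) throws IOException {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = din.readShort();
            String[] text = new String[size];
            for (int i = 0; i < size; i++) {
                text[i] = din.readUTF();
            }
            return text;
        }
    }

    public static void main(String[] args) throws IOException {
        String[] original = {
            "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627", // Arabic with diacritics
            "hello"
        };
        String[] loaded = read(write(original));
        System.out.println(Arrays.equals(original, loaded)); // prints true
    }
}
```

If the round trip preserves the diacritics here but the device still drops them, the remaining difference is likely in how the device renders the text rather than in how it is loaded.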
Hope this will help you. Thanks

How to extract .gz file Dynamically in Java?

In http://www.newegg.com/Siteindex_USA.xml lots of urls of .gz-files are given, like this:
<loc>
http://www.newegg.com//Sitemap/USA/newegg_sitemap_product01.xml.gz
</loc>
I want to extract these dynamically. I don't want to store them locally, I just want to extract them and store the contained data in a database.
Edit:
I am getting an exception with this code:
private void processGzip(URL url, byte[] response) throws MalformedURLException,
IOException, UnknownFormatException {
if (DEBUG) System.out.println("Processing gzip");
InputStream is = new ByteArrayInputStream(response);
// Remove .gz ending
String xmlUrl = url.toString().replaceFirst("\\.gz$", "");
if (DEBUG) System.out.println("XML url = " + xmlUrl);
InputStream decompressed = new GZIPInputStream(is);
InputSource in = new InputSource(decompressed);
in.setSystemId(xmlUrl);
processXml(url, in);
decompressed.close();
}

Simply wrap the input stream in GZIPInputStream, and it'll decompress the data as you're reading it.
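A minimal sketch of that idea, using only the JDK: the class and method names are illustrative, and the compressed data is built in memory here so the example is self-contained. For a remote sitemap, the stream passed to readGzip would instead come from something like url.openStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipStreamSketch {
    // Decompresses while reading; nothing is written to disk.
    static String readGzip(InputStream compressed) throws IOException {
        try (InputStream in = new GZIPInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Helper for the demo only: produces gzip-compressed bytes in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<urlset><url><loc>http://example.com/</loc></url></urlset>";
        byte[] compressed = gzip(xml);
        String restored = readGzip(new ByteArrayInputStream(compressed));
        System.out.println(restored.equals(xml)); // prints true
    }
}
```

For large sitemaps it may be preferable to hand the GZIPInputStream directly to the XML parser, as the question's code does with InputSource, rather than collecting everything into a String first.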
