I am using Java to get a chunk of HTML from a web page. Right now I am using a URLConnection with getInputStream(), which loads the whole page and takes a little longer than I would like. Is there any way to load just the chunk I need, or to exclude images or anything else that could speed it up? Any help is appreciated. Thank you.
Here is some code:
URL page = new URL("http://www.stackoverflow.com");
URLConnection connection = page.openConnection();
String html = getResponseData(connection);
public static String getResponseData(URLConnection connection) throws IOException {
    StringBuffer sb = new StringBuffer();
    InputStream is = connection.getInputStream();
    int count;
    while ((count = is.read()) != -1) {
        sb.append((char) count);
    }
    return sb.toString();
}
I think you could try to find the actual data in that while loop and abort as soon as you have found it.
Side note: your code will only load the HTML, not the actual images. They are not part of the response you get when requesting the page.
UPDATE: You could also buffer your input stream, which can make reading faster. You can do this as follows:
InputStream is = new BufferedInputStream(connection.getInputStream());
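For the early-abort idea above, here is a minimal sketch. The "</head>" marker is only an example; substitute whatever delimits the chunk you need. Note that the server may still send the rest of the page, but you stop reading and parsing it:

// read line by line and stop once the chunk of interest has been seen
BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream()));
StringBuilder sb = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    sb.append(line).append('\n');
    if (line.contains("</head>")) {
        break; // abort early; the rest of the page is never read
    }
}
reader.close();
String chunk = sb.toString();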
I have successfully implemented a PDF merge solution with PDFBox using InputStreams. However, when I try to merge a document of a very large size I receive the following error:
Caused by: java.io.IOException: Missing root object specification in trailer.
at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2832) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1060) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:379) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280) ~[pdfbox-2.0.11.jar:2.0.11]
Of more importance (I think) are these statements that occur just before the error:
FINE (pdfparser.COSParser) [] - Missing end of file marker '%%EOF'
FINE (pdfparser.COSParser) [] - Set missing offset 388 for object 2 0 R
It seems to me that it can't find the '%%EOF' marker in very large files. I know that it is indeed there, because I can look at the source (unfortunately I can't provide the file itself).
Doing some searching online, I found that there is a setEOFLookupRange() method on the COSParser class. I'm wondering if the lookup range is too small and that is why it can't find the '%%EOF' marker. The problem is that I'm not using the COSParser object at all in my code; I'm only using the PDFMergerUtility class, which seems to use the COSParser under the hood.
So my questions are:
Is my hypothesis about the EOFLookupRange correct?
If so, how can I set that range only having the PDFMergerUtility in my code and not the COSParser object?
Many thanks for your time!
UPDATED with code below
private boolean getCoolDocuments(final String slateId, final String filePathAndName)
throws IOException {
boolean status = false;
InputStream pdfStream = null;
HttpURLConnection connection = null;
final PDFMergerUtility merger = new PDFMergerUtility();
final ByteArrayOutputStream mergedPdfOutputStream = new ByteArrayOutputStream();
try {
final List<SlateDocument> parsedSlateDocuments = this.getSpecificDocumentsFromSlate(slateId);
if (!parsedSlateDocuments.isEmpty()) {
// iterate through each document, adding each pdf stream to the merger utility
int numberOfDocuments = 0;
for (final SlateDocument slateDocument : parsedSlateDocuments) {
final String url = this.getBaseURL() + "/slate/" + slateId + "/documents/"
+ slateDocument.getDocumentId();
/* code for RequestResponseUtil.initializeRequest(...) below */
connection = RequestResponseUtil.initializeRequest(url, "GET", this.getAuthenticationHeader(),
true, MediaType.APPLICATION_PDF_VALUE);
if (RequestResponseUtil.isSuccessful(connection.getResponseCode())) {
pdfStream = connection.getInputStream();
}
else {
/* do various things */
}
merger.addSource(pdfStream);
numberOfDocuments++;
}
merger.setDestinationStream(mergedPdfOutputStream);
// merge all the pdf streams together
merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
status = true;
}
else {
LOG.severe("An error occurred while parsing the slated documents; no documents remain after parsing!");
}
}
finally {
RequestResponseUtil.close(pdfStream);
this.disconnect(connection);
}
return status;
}
public static HttpURLConnection initializeRequest(final String url, final String method,
final String httpAuthHeader, final boolean multiPartFormData, final String responseType) {
HttpURLConnection conn = null;
try {
conn = (HttpURLConnection) new URL(url).openConnection();
conn.setRequestMethod(method);
conn.setRequestProperty("X-Slater-Authentication", httpAuthHeader);
conn.setRequestProperty("Accept", responseType);
if (multiPartFormData) {
conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=BOUNDARY");
conn.setDoOutput(true);
}
else {
conn.setRequestProperty("Content-Type", "application/xml");
}
}
catch (final MalformedURLException e) {
throw new CustomException(e);
}
catch (final IOException e) {
throw new CustomException(e);
}
return conn;
}
As I suspected, this was an issue with the InputStream. It wasn't exactly what I thought, but basically I was making the (very wrong) assumption that I could just do this:
pdfStream = connection.getInputStream();
/* ... */
merger.addSource(pdfStream);
Of course, that's not going to work, because the entire InputStream may or may not have been read by the time the merge happens. It needs to be read explicitly until read() returns -1. I'm pretty sure that on the smaller files this was working fine and the entire stream was actually being read, but on the larger files it simply wasn't making it to the end, hence not finding the %%EOF marker.
The solution was to use an intermediary ByteArrayOutputStream and then convert that back to an InputStream via a ByteArrayInputStream.
So if you replace this line of code:
pdfStream = connection.getInputStream();
above with this code:
final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
final InputStream rawPdfStream = connection.getInputStream();
final byte[] buffer = new byte[8192];
int bytesRead;
while ((bytesRead = rawPdfStream.read(buffer)) != -1) {
    byteArrayOutputStream.write(buffer, 0, bytesRead);
}
pdfStream = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());
you'll end up with a working example.
I may end up changing this implementation to use Pipes or Circular Buffers instead, but at least it is working for now.
While this wasn't necessarily a Java 101 mistake, it was more like a Java 102 mistake and is still shameful. :/ Hopefully it will help someone else.
Thanks to @Tilman Hausherr and @Master_ex for all their help!
I took a look in the code and found out that the default EOFLookupRange in COSParser is 2048 bytes.
I think that your assumption is valid.
Looking at the PDFParser, which extends COSParser and is the parser used internally by the PDFMergerUtility, I see that it is possible to set another EOFLookupRange using a system property. The system property name is org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange and its value should be a valid integer.
Here is a question demonstrating how to set system properties.
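For example, something like this should work (8192 is an arbitrary example value; the default is 2048 bytes):

// must be set before the PDFMergerUtility parses anything,
// since the parser reads the property when it is constructed
System.setProperty(
        "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange",
        "8192");

Alternatively, you can pass it as a JVM flag: -Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange=8192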
I haven't tested the above but I hope it will work :)
The links to the PDFBox code use the 2.0.11 version which is the one that you are using.
I'm unable to save a Data URI in JSP. I am trying it like this; is there any mistake in the following code?
<%@ page import="java.awt.image.*,java.io.*,javax.imageio.*,sun.misc.*" %>
function save_photo()
{
Webcam.snap(function(data_uri)
{
document.getElementById('results').innerHTML =
'<h2>Here is your image:</h2>' + '<img src="'+data_uri+'"/>';
var dat = data_uri;
<%
String st = "document.writeln(dat)";
BufferedImage image = null;
byte[] imageByte;
BASE64Decoder decoder = new BASE64Decoder();
imageByte = decoder.decodeBuffer(st);
ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
image = ImageIO.read(bis);
bis.close();
if (image != null)
ImageIO.write(image, "jpg", new File("d://1.jpg"));
out.println("value=" + st); // here it displays the base64 chars
System.out.println("value=" + st); // but here it displays document.writeln(dat)
%>
}
}
Finally, the image is not saved.
I think you didn't get the difference between JSP and JavaScript. While JSP is executed on the server at the time your browser requests the web page, JavaScript is executed on the client side, i.e. in your browser, when you perform an interaction that causes the JavaScript to run.
Your server (e.g. Apache Tomcat) will first execute your JSP code:
String st = "document.writeln(dat)";
BufferedImage image = null;
byte[] imageByte;
BASE64Decoder decoder = new BASE64Decoder();
imageByte = decoder.decodeBuffer(st);
ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
image = ImageIO.read(bis);
bis.close();
if (image != null)
ImageIO.write(image, "jpg", new File("d://1.jpg"));
out.println("value=" + st);
System.out.println("value=" + st);
As you can see, nowhere is the value of st changed. Your browser will receive the following snippet from your server:
value=document.writeln(dat);
Since your browser is the one that executes JavaScript, it will run that statement and show the Base64-encoded image, but your server won't.
For the exact difference, read this article.
To make the code work, the easiest way is to redirect the page:
function(data_uri)
{
// redirect
document.location.href = 'saveImage.jsp?img='+data_uri;
}
Now you can have a JSP page called saveImage.jsp that saves the image, then returns the webpage you already had and writes the data_uri into the results element.
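A minimal sketch of what saveImage.jsp could look like, using java.util.Base64 rather than the internal sun.misc classes (the img parameter name and the output path are assumptions):

<%@ page import="java.io.FileOutputStream, java.util.Base64" %>
<%
    // the data URI arrives as e.g. "data:image/jpeg;base64,/9j/4AAQ..."
    String dataUri = request.getParameter("img");
    if (dataUri != null && dataUri.contains(",")) {
        // strip the "data:image/jpeg;base64," prefix before decoding
        String base64Data = dataUri.substring(dataUri.indexOf(',') + 1);
        byte[] imageBytes = Base64.getDecoder().decode(base64Data);
        try (FileOutputStream fos = new FileOutputStream("d:/1.jpg")) {
            fos.write(imageBytes);
        }
    }
%>

Be aware that a data URI is usually far too long for a GET query string, so in practice a POST (via a form or AJAX) is more robust.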
Another, but more difficult way is to use AJAX. Here is an introduction to it.
You are trying to use JavaScript variables in Java code. Java code runs on your server, while JavaScript code runs in the user's browser. By the time the JavaScript code executes, your Java code has already been executed. Whatever you're trying to do, you have to do it in pure JavaScript, or send an AJAX call to your server when your JavaScript code has done its thing.
I want to get all the page content of a website, for example: http://academic.research.microsoft.com/Author/1789765/hoang-kiem?query=hoang%20kiem
I used this code:
String getResults(URL source) throws IOException {
InputStream in = source.openStream();
StringBuffer sb = new StringBuffer();
byte[] buffer = new byte[256];
while(true) {
int bytesRead = in.read(buffer);
if(bytesRead == -1) break;
for (int i=0; i<bytesRead; i++)
sb.append((char)buffer[i]);
}
return sb.toString();
}
But the result is missing some information, such as some hints about the author.
Can you give me some advice? Thanks!
The author details are loaded by AJAX calls (click the "Net" tab in Firebug and reload the page). If you want to get these details, you will have to load the page in an environment that executes JavaScript (i.e. a browser).
I am pretty sure these contents are loaded into the page via JavaScript, and there's not really anything you can do about that when retrieving the page text from Java. You'll probably want to use a browser plugin instead (Firefox has the largest repository of add-ons).
I have used iText to parse PDF files. It works well on local files, but I want to parse PDF files which are hosted on web servers, like this one:
"http://protege.stanford.edu/publications/ontology_development/ontology101.pdf"
but I don't know how. Could you please tell me how to do this task using iText or other libraries? Thanks.
You need to download the bytes of the PDF file. You can do this with:
URL url = new URL("http://.....");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) { ..error.. }
if ( ! conn.getContentType().equals("application/pdf")) { ..error.. }
InputStream byteStream = conn.getInputStream();
try {
... // give bytes from byteStream to iText
} finally { byteStream.close(); }
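For the elided part, iText's PdfReader can consume the stream directly; a sketch assuming iText 5, where PdfReader has an InputStream constructor:

PdfReader reader = new PdfReader(byteStream);
try {
    System.out.println("Pages: " + reader.getNumberOfPages());
    // ...parse the document here...
} finally {
    reader.close();
}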
Use the URLConnection class:
URL reqURL = new URL("http://www.mysite.edu/mydoc.pdf" );
URLConnection urlCon = reqURL.openConnection();
Then you can use the URLConnection methods to retrieve the content. The easiest way:
InputStream is = urlCon.getInputStream();
byte[] b = new byte[1024]; // buffer size; can be any reasonable value
int len;
while((len = is.read(b)) != -1){
//Store the content in preferred way
}
is.close();
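The "store the content" step could, for instance, collect the bytes and hand them to iText (again a sketch assuming iText 5, whose PdfReader also accepts a byte[]):

ByteArrayOutputStream baos = new ByteArrayOutputStream();
InputStream is = urlCon.getInputStream();
byte[] b = new byte[1024];
int len;
while ((len = is.read(b)) != -1) {
    baos.write(b, 0, len); // keep only the bytes actually read
}
is.close();
PdfReader reader = new PdfReader(baos.toByteArray());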
Nothing to it. You can pass a URL directly into PdfReader, and let it handle the streaming for you:
URL url = new URL("http://protege.stanford.edu/publications/ontology_development/ontology101.pdf" );
PdfReader reader = new PdfReader(url);
The JavaDoc is your friend.
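For example, combined with text extraction (a sketch assuming iText 5 and its com.itextpdf.text.pdf.parser package):

URL url = new URL("http://protege.stanford.edu/publications/ontology_development/ontology101.pdf");
PdfReader reader = new PdfReader(url); // PdfReader handles the download itself
try {
    // extract the text of the first page
    System.out.println(PdfTextExtractor.getTextFromPage(reader, 1));
} finally {
    reader.close();
}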
I want to play a .wav sound file in the embedded default media player in IE. The sound file is at an HTTP location, but I am unable to play it in that player.
Following is the code.
URL url = new URL("http://www.concidel.com/upload/myfile.wav");
URLConnection urlc = url.openConnection();
InputStream is = (InputStream)urlc.getInputStream();
fileBytes = new byte[is.available()];
while (is.read(fileBytes,0,fileBytes.length)!=-1){}
BufferedOutputStream out = new BufferedOutputStream(response.getOutputStream());
out.write(fileBytes);
Here is embed code of HTML.
<embed src="CallStatesTreeAction.do?ivrCallId=${requestScope.vo.callId}&agentId=${requestScope.vo.agentId}" type="application/x-mplayer2" autostart="0" playcount="1" style="width: 40%; height: 45" />
If I write to a FileOutputStream, then it plays well.
If I replace my code that gets the file from the URL with code that reads it from my local hard disk, then it also works fine.
I don't know why I am unable to play the file over HTTP when it plays well from the local hard disk.
Please help.
Make sure you set the correct response type. IE is very picky in that regard.
[EDIT] Your copy loop is broken. Try this code:
URL url = new URL("http://www.concidel.com/upload/myfile.wav");
URLConnection urlc = url.openConnection();
InputStream is = urlc.getInputStream();
OutputStream out = response.getOutputStream();
byte[] fileBytes = new byte[4096]; // fixed-size buffer; is.available() is not reliable here
int len;
while ((len = is.read(fileBytes, 0, fileBytes.length)) != -1) {
    out.write(fileBytes, 0, len); // write only the bytes actually read
}
The problem with your code is: if the data isn't fetched in a single call to is.read(), it's not appended to fileBytes; instead, the first bytes are overwritten.
Also, the output stream which you get from the response is already buffered.
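As for setting the correct response type mentioned above, a minimal sketch (audio/x-wav is an assumed MIME type for .wav; verify what IE's media player actually expects):

// set the headers before writing any bytes to the response
response.setContentType("audio/x-wav"); // assumed MIME type; IE is picky here
int length = urlc.getContentLength();
if (length > 0) {
    response.setContentLength(length); // lets the player know the full size
}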