Parsing PDF that has been downloaded from internet

I have searched for questions about this topic on Stack Overflow. They really helped, but I am stuck again.
My problem is that I need to write a method that downloads a PDF from a site (for example www.example.com/abc.pdf) and then reads its content. I don't want to save the file, just print its text to System.out, so I don't need to write the bytes to a FileOutputStream. I tried casting the bytes to char to get characters (it may be the dumbest solution), but I got unreadable characters. Any ideas, or have I misunderstood something?
Here is the code and its output:
String textlink="http://www.selab.isti.cnr.it/ws-mate/example.pdf";// it comes from main class
public String HtmlTest(String textLink) throws IOException{
    StringBuilder sd=new StringBuilder();
    URL link=new URL(textLink);
    URLConnection urlConn = link.openConnection();
    BufferedInputStream in = null;
    try
    {
        in = new BufferedInputStream(urlConn.getInputStream());
        byte data[] = new byte[1024];
        in.read(data, 0, 1024);
        for (int j = 0; j < data.length; j++) {
            if(j%100==0){
                sd.append((char)data[j]+"\n"); // i used this for making readable text
            }
            else{
                sd.append((char)data[j]);
            }
        }
    }
    finally
    {
        if (in != null)
            in.close();
    }
    return sd.toString();
}
Output
run:
%
PDF-1.3
%ￇ￬ﾏﾢ
7 0 obj
<</Length 8 0 R/Filter /FlateDecode>>
stream
xﾜﾭY[ﾓￛﾶ￮ﾳ&?BoNf,,q%￠ﾼ4￞x&ﾞ6ﾩﾛlￓ
ﾗﾼ￐ﾽￋZeﾑ￲f￻￫￻ﾁ

You're not going to get very far trying to read a .pdf file as though it were a text file. For starters, the "text" is stored in a compressed binary format (note the /FlateDecode filter in your output); there are other issues you'll probably also have to deal with.
Strong suggestion: use a Java PDF library like Apache PDFBox, IMHO.
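To illustrate why casting the bytes to char produces garbage, here is a minimal sketch using only the JDK's java.util.zip classes. The /FlateDecode filter in the output above means the stream is zlib-compressed, so the sketch compresses a short piece of text and inflates it again. The class name FlateDemo and the sample text are made up for illustration. This is not a PDF parser: a real PDF also has an object structure, fonts and encodings, which is why a library such as PDFBox (with its PDFTextStripper class) is the practical route.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows that Flate (zlib) data is unreadable until inflated.
class FlateDemo {

    // Compresses bytes in zlib format, the format behind PDF's /FlateDecode filter.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(): only after this step do the bytes become readable again.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported stream");
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Printing the result of FlateDemo.deflate(...) byte by byte as chars gives the same kind of unreadable output as in the question; FlateDemo.inflate(...) returns the original bytes.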

Related

Read faster a file & convert it into HEX

I need to read an ASCII file and convert it to hex before applying some functions (searching for a specific character).
To do this, I read the file, convert it to hex, and write the result to a new file. Then I open the new hex file and apply my functions.
My issue is that reading and converting takes far too long (approx. 8 s for a 9 MB file).
My reading method is:
public static void convertToHex2(PrintStream out, File file) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
    int value = 0;
    StringBuilder sbHex = new StringBuilder();
    StringBuilder sbResult = new StringBuilder();
    while ((value = bis.read()) != -1) {
        sbHex.append(String.format("%02X ", value));
    }
    sbResult.append(sbHex);
    out.print(sbResult);
    bis.close();
}
Do you have any suggestions to make it faster?

Did you measure what your actual bottleneck is? Your loop reads a single byte per iteration and processes each one separately. You could instead read larger chunks of data and process those, e.g. using DataInputStream. That way you would benefit more from the optimized reads of your OS, the file system and their caches.
Additionally, you fill sbHex and then append it to sbResult just to print it. That looks like an unnecessary copy to me: sbResult is always empty at that point, and sbHex is already a StringBuilder you can pass to your PrintStream.

Try this:
static String[] xx = new String[256];
static {
    for( int i = 0; i < 256; ++i ){
        xx[i] = String.format("%02X ", i);
    }
}
and use it:
sbHex.append(xx[value]);
Formatting is a heavy operation: it does not only do the conversion, it also has to parse the format string on every call.
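The two suggestions above (read in larger chunks, and look up precomputed strings instead of calling String.format per byte) can be combined. The following is only a sketch: the class name HexDump, the method name toHex and the 8 KB chunk size are my own choices, and whether it removes the asker's actual bottleneck would still need measuring.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper combining chunked reads with a precomputed lookup table.
class HexDump {

    // One preformatted string per possible byte value, built once.
    private static final String[] HEX = new String[256];
    static {
        for (int i = 0; i < 256; ++i) {
            HEX[i] = String.format("%02X ", i);
        }
    }

    // Reads the stream in 8 KB chunks and appends "XX " for every byte.
    static String toHex(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                sb.append(HEX[chunk[i] & 0xFF]); // mask to treat the byte as unsigned
            }
        }
        return sb.toString();
    }
}
```

With a FileInputStream as the argument, the returned string can be printed directly to the PrintStream without the intermediate sbResult copy.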

Java SequenceInputStream

I am trying to send multiple files from my server (NanoHttpd) to my client (Apache DefaultHttpClient).
My approach is to send the files in a single NanoHttpd response.
For this purpose I wanted to use SequenceInputStream.
I am trying to concatenate multiple files, send them via the response (InputStream), and have my client write each file back into a separate file.
On the server side I call this:
List<InputStream> data = new ArrayList<InputStream>(o_file_path.size());
for (String file_name : files)
{
    File file = new File(file_name);
    data.add(new FileInputStream(file));
}
InputStream is = new SequenceInputStream(Collections.enumeration(data));
return new NanoHTTPD.Response(HTTP_OK, "application/octet-stream", is);
Now my question is how to receive and split the files correctly.
I have tried it this way on my client, but it does not work:
int read = 0;
int remaining = 0;
byte[] bytes = new byte[buffer];
// Read till the end of the Stream
while ( (read != -1) && (counter < files.size()))
{
    // Create a .o file for the current file
    read = 0;
    remaining = is.available();
    // Should open each Stream
    while (remaining > 0)
    {
        read = is.read(bytes);
        remaining = remaining - read;
        os.write(bytes, 0, read);
    }
    os.flush();
    os.close();
}
This way I want to go over all the streams (until read == -1, or until I know there are no more files) and write each stream into a file.
I clearly misunderstand something fundamental, since is.available() is always 0.
Could anyone please tell me how to read properly from this SequenceInputStream, or how to solve my problem?
Thanks in advance.

It won't work this way. SequenceInputStream merges all of its input streams into one continuous byte stream. There are no separators or EOF markers between them, so the receiver cannot tell where one file ends and the next begins. I suggest abandoning this idea and looking for a different approach.
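One possible different approach, offered here as an assumption rather than something the answer prescribes, is to frame the data yourself: write the number of payloads, then each payload's length followed by its bytes, so the client knows exactly how much to read for each file. The class FramedFiles and its method names are hypothetical, and the sketch keeps payloads in memory; for large files you would write a long length and stream the bytes to disk instead.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical length-prefixed framing for several payloads in one stream.
class FramedFiles {

    // Sender side: payload count, then for each payload its length and its bytes.
    static void writeAll(List<byte[]> payloads, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(payloads.size());
        for (byte[] payload : payloads) {
            data.writeInt(payload.length);
            data.write(payload);
        }
        data.flush();
    }

    // Receiver side: reads the frames back in the same order.
    static List<byte[]> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int count = data.readInt();
        List<byte[]> payloads = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] payload = new byte[data.readInt()];
            data.readFully(payload); // keeps reading until the whole payload has arrived
            payloads.add(payload);
        }
        return payloads;
    }
}
```

Note that the client side does not rely on available(), which only reports how many bytes can be read without blocking and says nothing about file boundaries.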

Load text file to memory in Java

I have a wiki.txt file whose size is 50 MB.
I need to do several things with the file, so I thought that the best way in terms of performance is to load the file into memory. Is that correct?
This is the code that I have written:
File file = new File("wiki.txt");
FileInputStream fileInputStream = new FileInputStream(file);
FileChannel fileChannel = fileInputStream.getChannel();
MappedByteBuffer mapByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
System.out.println((char)mapByteBuffer.get());
I get an error on this call: mapByteBuffer.get().
I tried several variants of get(), but all of them fail, and e.getMessage() doesn't even give me an error message; I just get null.
Another important point: my text file contains English words, and what I need to do is search for whether an expression exists in this text file.
Thank you.

I would suggest using a memory-mapped file, which reads the file directly from disk on demand instead of loading all of it into memory first.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// The file is opened read-only ("r"), so the mapping must be READ_ONLY as well;
// map the whole file rather than a fixed 50 KB window
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
And then you can read the buffer as usual.
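As a rough sketch of what reading the buffer "as usual" can look like, the following maps a file read-only and scans it byte by byte. The class MappedScan and the method countByte are invented for this example, and it assumes the file is small enough for a single mapping (well under 2 GB, as the 50 MB file here is).

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: scans a read-only memory mapping of a file.
class MappedScan {

    // Counts how often the byte `target` occurs in the file.
    static long countByte(Path file, byte target) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long hits = 0;
            while (buf.hasRemaining()) { // checking first avoids an exception on an empty buffer
                if (buf.get() == target) {
                    hits++;
                }
            }
            return hits;
        }
    }
}
```

Checking hasRemaining() before each get() matters: calling get() on a buffer with nothing left throws BufferUnderflowException, whose message is null.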

My answers for point (1):
It depends on what you want to do with the file. If your processing doesn't involve rewinding (looking back at what was read before), it's best to read the file as a stream and process it in one pass instead of loading all of it into memory.
Even if you need random access across the file, you may want to process it block by block, because loading everything may not scale well when the file grows.
Consider RandomAccessFile if you are on Java 1.4 or above.
For random access, the operating system usually handles file buffer caching quite well, so you don't have to handle it yourself.

It is important to read the whole error, not just the message. Often the real information is in the exception's name, not the text associated with it.
You will get an error if the file is empty, as there is no first byte.
Note: the approach you are using assumes 7-bit ASCII characters. If you want to assume ISO-8859-1 characters you can use (char) (byteBuffer.get() & 0xFF)
However, if you have plain text you may find that using strings is simpler and not much slower; e.g. you can read a 50 MB file as text in less than a second. I would only use a memory-mapped file if this takes far too long.
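A minimal sketch of the string-based route, assuming the file is ISO-8859-1 text and fits comfortably in memory; the class TextSearch and the method containsPhrase are names made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: loads a text file as one String and searches it.
class TextSearch {

    // Returns true if `phrase` occurs anywhere in the file.
    static boolean containsPhrase(Path file, String phrase) throws IOException {
        // Assumption: the file is ISO-8859-1; pass a different charset if it is not.
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        return text.contains(phrase);
    }
}
```

If several searches are needed, it makes sense to keep the loaded String around and call contains or indexOf on it repeatedly rather than re-reading the file each time.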

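I would suggest using BufferedReader. It is fast and requires relatively few resources.
First count the number of lines:
static int countLines(String fileName) throws IOException {
    int count = 0;
    int lastByte = -1; // last byte read, used to detect a missing final newline
    try (InputStream is = new BufferedInputStream(new FileInputStream(fileName))) {
        byte[] chars = new byte[1024];
        int numberOfChars;
        while ((numberOfChars = is.read(chars)) != -1)
        {
            for (int i = 0; i < numberOfChars; ++i)
            {
                if (chars[i] == '\n')
                {
                    ++count;
                }
            }
            if (numberOfChars > 0)
            {
                lastByte = chars[numberOfChars - 1] & 0xFF;
            }
        }
    }
    if (lastByte != -1 && lastByte != '\n')
    {
        count++; // the last line has no trailing newline
    }
    return count; // number of lines
}
Then read the lines:
BufferedReader in = new BufferedReader(new FileReader(fileName));
int endLine = countLines(fileName);
for (int i = 0; i < endLine; i++)
{
    String oneLine = in.readLine();
    // search oneLine here
}
in.close();
You can then search each of these strings for what you need.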
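If the goal is only to search, the counting pass is not strictly needed, because readLine() returns null at the end of the input. A small sketch under that assumption follows; the class LineSearch and the method firstLineContaining are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: searches line by line without loading the whole file.
class LineSearch {

    // Returns the 1-based number of the first line containing `phrase`, or -1 if none does.
    static int firstLineContaining(Reader source, String phrase) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            int lineNumber = 0;
            while ((line = in.readLine()) != null) {
                lineNumber++;
                if (line.contains(phrase)) {
                    return lineNumber;
                }
            }
        }
        return -1;
    }
}
```

For a file, pass a FileReader (or an InputStreamReader with an explicit charset) as the source. A phrase that spans a line break will not be found this way.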

java: error checking php output

Hi, I have a problem I'm not able to solve.
In my Android/Java application I call a script, download.php. Basically it outputs a file that I download and save on my device. I had to add a check to all my PHP scripts that consists of sending a token to the script and checking whether it is valid. If the token is valid I get the output (in this case a file; in the other scripts, a JSON file); if it is not, I get back the string "false".
To check this condition in my other Java files I used an IOUtils method to turn the input stream into a String, check it, and then
InputStream newInputStream = new ByteArrayInputStream(mystring.getBytes("UTF-8"));
to get a valid input stream again and read it. It works with my JSON files, but not in this case. I get this error:
11-04 16:50:31.074: ERROR/AndroidRuntime(32363):
java.lang.OutOfMemoryError
when I try IOUtils.toString(inputStream, "UTF-8");
I think it's because in this case I'm trying to download a really long file.
fileOutput = new BufferedOutputStream(new FileOutputStream(file,false));
inputStream = new BufferedInputStream(conn.getInputStream());
String result = IOUtils.toString(inputStream, "UTF-8");
if(result.equals("false"))
{
    return false;
}
else
{
    Reader r = new InputStreamReader(MyMethods.stringToInputStream(result));
    int totalSize = conn.getContentLength();
    int downloadedSize = 0;
    byte[] buffer = new byte[1024];
    int bufferLength = 0;
    while ( (bufferLength = inputStream.read(buffer)) > 0 )
    {
        fileOutput.write(buffer, 0, bufferLength);
        downloadedSize += bufferLength;
    }
    fileOutput.flush();
    fileOutput.close();
}

Don't read the stream as a string to start with. Keep it as binary data, and start off by just reading the first 5 bytes. You can then check whether those 5 bytes are the 5 bytes used to encode "false" in UTF-8, and act accordingly if so. Otherwise, write those 5 bytes to the output file and then do the same looping/reading/writing as before. Note that to read those 5 bytes you may need to loop (however unlikely that seems). Perhaps your IOUtils class has something to say "read at least 5 bytes"? Will the real content ever be smaller than 5 bytes?
To be honest, it would be better if you could use a header in the response to indicate the different result, instead of just a body with "false" - are you in control of the PHP script?
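The byte-level check described above could look roughly like this. It is a sketch with assumptions of my own: the class TokenCheckedDownload and the method saveUnlessRejected are invented names, and it reads one byte more than the length of "false" so that a real file which merely starts with those five bytes is not mistaken for a rejection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: checks for a "false" body without buffering the whole download.
class TokenCheckedDownload {

    private static final byte[] FALSE_MARKER = "false".getBytes(StandardCharsets.UTF_8);

    // Returns false if the body is exactly "false"; otherwise copies the body to `out` and returns true.
    static boolean saveUnlessRejected(InputStream in, OutputStream out) throws IOException {
        // Read up to marker length + 1 bytes; a single read may return fewer, so loop.
        byte[] head = new byte[FALSE_MARKER.length + 1];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n == -1) {
                break; // stream ended early
            }
            filled += n;
        }
        if (filled == FALSE_MARKER.length
                && Arrays.equals(Arrays.copyOf(head, filled), FALSE_MARKER)) {
            return false;
        }
        // Not a rejection: the bytes already read belong to the file.
        out.write(head, 0, filled);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
        return true;
    }
}
```

Only a small fixed-size buffer is held in memory at any time, which avoids the OutOfMemoryError caused by turning the whole download into a String.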

Reading and writing binary file in Java (seeing half of the file being corrupted)

I have some working code in Python that I need to convert to Java.
I have read quite a few threads on this forum but could not find an answer. I am reading in a JPG image and converting it into a byte array. I then write this buffer to a different file. When I compare the files written by the Java and Python code, the bytes at the end do not match. Please let me know if you have a suggestion. I need to use the byte array to pack the image into a message that is sent to a remote server.
Java code (Running on Android)
Reading the file:
File queryImg = new File(ImagePath);
int imageLen = (int)queryImg.length();
byte [] imgData = new byte[imageLen];
FileInputStream fis = new FileInputStream(queryImg);
fis.read(imgData);
Writing the file:
FileOutputStream f = new FileOutputStream(new File("/sdcard/output.raw"));
f.write(imgData);
f.flush();
f.close();
Thanks!

InputStream.read is not guaranteed to read any particular number of bytes and may read fewer than you asked for. It returns the actual number read, so you can have a loop that keeps track of progress:
public void pump(InputStream in, OutputStream out, int size) throws IOException {
    byte[] buffer = new byte[4096]; // Or whatever constant you feel like using
    int done = 0;
    while (done < size) {
        // Never ask for more than the bytes still missing
        int read = in.read(buffer, 0, Math.min(buffer.length, size - done));
        if (read == -1) {
            throw new IOException("Something went horribly wrong");
        }
        out.write(buffer, 0, read);
        done += read;
    }
    // Maybe put cleanup code in here if you like, e.g. in.close, out.flush, out.close
}
I believe Apache Commons IO has classes for doing this kind of stuff so you don't need to write it yourself.
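Within the JDK itself, DataInputStream.readFully does the same kind of looping internally. A minimal sketch, assuming the size is known up front and fits in an int; the class ExactRead and the method readExactly are names chosen only for this example.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads exactly `size` bytes or fails.
class ExactRead {

    // Returns exactly `size` bytes from the stream; throws EOFException if it ends early.
    static byte[] readExactly(InputStream in, int size) throws IOException {
        byte[] data = new byte[size];
        new DataInputStream(in).readFully(data); // loops until the array is full
        return data;
    }
}
```

In the question's code this would replace the single fis.read(imgData) call, e.g. imgData = ExactRead.readExactly(fis, imageLen). For a whole file, java.nio.file.Files.readAllBytes is another standard option.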

Your file length might be more than an int can hold, in which case the cast gives you the wrong array length and you do not read the entire file into the buffer.
