import java.io.*;
class BS{
public void pStr(){
try{
String command="cat /usr/share/doc/bash/rbash.pdf";
Process ps=Runtime.getRuntime().exec(command);
InputStream in = ps.getInputStream();
int c;
while((c=in.read())!=-1){
System.out.print((char)c);
}
}catch(Exception e){
e.printStackTrace();
}
}
public static void main(String args[]){
new BS().pStr();
}
}
jabira-whosechild-lm.local 23:54:00 % java BS|wc
384 2003 43885
jabira-whosechild-lm.local 23:54:05 % wc /usr/share/doc/bash/rbash.pdf
384 2153 43885 /usr/share/doc/bash/rbash.pdf
Why do I see a difference in the number of characters that are read and printed to the console?
The method InputStream.read() reads only one byte.
Your source code line System.out.print((char)c); is wrong: it calls PrintStream.print(char c), and that method encodes the char through the platform's default charset, which can produce a different byte value (or two bytes) for some non-ASCII characters.
You need to call a method that always writes exactly the byte value it is given. The correct method is System.out.write(c);.
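For reference, a minimal corrected loop could look like the sketch below (the same program with print((char)c) swapped for write(c) plus an explicit flush; untested, and it assumes the same try/catch block as the original):

    // Copy the process output to stdout byte-for-byte, with no charset conversion.
    Process ps = Runtime.getRuntime().exec(command);
    try (InputStream in = ps.getInputStream()) {
        int c;
        while ((c = in.read()) != -1) {
            System.out.write(c); // writes the raw byte, no encoding step
        }
        System.out.flush();      // PrintStream.write(int) does not flush on its own
    }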
Isn't it that the number of characters is the same, but the number of words is different?
I'm guessing that somewhere in your c=in.read() and print((char)c) code there is an encoding issue going on.
Can you save the output to another PDF file and do a binary compare of them? If they are identical then that's really weird! If they're not, then you might find a clue in the differences.
A log file is created per day; each file is about 400 MB, and the JVM has about 2 GB of memory.
One process writes the large log file in append ('a') mode.
I want to read this file and support the following:
Read newly appended data as it is written
Store the read offset so that reading can resume after a JVM restart
This is my simple implementation, but I don't know whether the time and memory consumption are acceptable. I want to know if there is a better way to solve this problem.
public static void main(String[] args) throws IOException {
String filePath = "D://test.log";
long restoreOffset = resotoreOffset();
RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
randomAccessFile.seek(restoreOffset);
while (true) {
String line = randomAccessFile.readLine();
if(line != null) {
// doSomething(line);
restoreOffset = randomAccessFile.getFilePointer();
//storeOffset(restoreOffset);
}
}
}
It's not, unfortunately.
There are two major problems with this code. I'll tackle the simpler one first, but the more important one is the second point.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters; converting from one to the other requires a charset encoding.
This method (readLine() from RAF) is a bizarre case, probably because RandomAccessFile is an incredibly old API. Using it applies a bizarro ISO-8859-1-esque charset: it converts bytes to chars by treating each byte as a complete char, assuming the byte value is the Unicode code point, which isn't actually a sane encoding, just a lazy one.
The upshot for you is: unless you can guarantee that this log file will only ever contain ASCII characters, this code is broken and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
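For illustration only (the buffered solution further down is the one you actually want), the unbuffered version of that idea looks roughly like this, assuming java.io and java.nio.charset imports:

    // Hypothetical helper: read one '\n'-terminated line as UTF-8, one byte at a time.
    static String readLineUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) return null; // end of stream, nothing buffered
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }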
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that ends up setting off a chain of events that causes the SSD to read the entire block, then pass that entire block to the operating system, which will then disregard everything except the one byte you wanted, and returns just that.
If your code then asks for the next byte, we do that routine again.
So, if you read 1024 bytes consecutively from an SSD that has 1024-byte blocks, doing so by calling read() 1024 times causes the SSD to perform 1024 reads, whereas calling read(byteArr) once, passing it a 1024-byte array, causes the SSD to perform a single read.
Yes, that means the byte-array solution is roughly 1000 times faster.
The same applies to networking, too. Sending 1 byte a thousand times is usually nearly 1000 times slower than sending 1000 bytes once; a typical TCP/IP packet carries on the order of 1500 bytes of payload, so sending any less than that gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: it reads bytes one at a time until it hits a newline character. Thus, to read a 100-character line, it is roughly 100x slower than knowing you need 100 bytes and reading them in one go.
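To make the contrast concrete, here is a rough sketch of the two styles (the file name is made up, and the snippet assumes the enclosing method declares IOException; the second form is the one you want):

    // Slow: one request to the OS/disk per byte.
    try (InputStream in = new FileInputStream("big.log")) {
        int b;
        while ((b = in.read()) != -1) {
            // handle one byte
        }
    }

    // Fast: one request per chunk, then work on the chunk in memory.
    try (InputStream in = new FileInputStream("big.log")) {
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            // handle chunk[0..n)
        }
    }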
The solution
You may want to abandon RandomAccessFile entirely; it's quite an old API.
A major issue with buffering is that it's a lot harder unless you know beforehand how many bytes to read. Here, you don't know that: you want to keep reading until you hit a newline character, but you have no idea how long it'll take to get there. Furthermore, buffering APIs tend to return whatever is convenient, and may therefore read fewer bytes than we ask for (they'll always read at least 1 byte, though, unless we hit end of file). So we need code that repeatedly reads a chunk's worth of data, scans the chunk for a newline, and if it isn't there, keeps reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogLineReader implements AutoCloseable {
    private final byte[] buffer = new byte[1024];
    private final ByteBuffer bb = ByteBuffer.wrap(buffer);
    private final SeekableByteChannel channel;
    private final Charset charset = StandardCharsets.UTF_8;

    public LogLineReader(Path p) throws IOException {
        channel = Files.newByteChannel(p, StandardOpenOption.READ);
        channel.position(111L); // you seek to pos 111 in your code...
    }

    @Override public void close() throws IOException {
        channel.close();
    }

    // This code buffers: first, our internal buffer is scanned
    // for a newline. If there is no full line in the buffer,
    // we read bytes from the file and check again until we find one.
    public String readLine() throws IOException {
        if (!channel.isOpen()) return null;
        int scanStart = 0;
        while (true) {
            // Scan through the bytes we have buffered for a newline.
            // bb.position() is the number of bytes currently buffered.
            for (int i = scanStart; i < bb.position(); i++) {
                if (buffer[i] == '\n') {
                    // Found it. Take all bytes up to the newline, turn them into a string.
                    String res = new String(buffer, 0, i, charset);
                    // Copy all bytes from _after_ the newline to the front.
                    System.arraycopy(buffer, i + 1, buffer, 0, bb.position() - i - 1);
                    // Adjust the position (which represents how many bytes are buffered).
                    bb.position(bb.position() - i - 1);
                    return res;
                }
            }
            scanStart = bb.position();
            // If we get here, the buffer contains no newline yet.
            if (scanStart == bb.limit()) {
                throw new IOException("Log line too long");
            }
            int read = channel.read(bb); // let's fetch more bytes!
            if (read == -1) {
                // We've reached the end of the file.
                if (bb.position() == 0) return null;
                String res = new String(buffer, 0, bb.position(), charset);
                bb.position(0); // don't hand out the same trailing data twice
                return res;
            }
        }
    }
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 bytes; feel free to raise that number. If you need to handle arbitrarily long log lines, at some point a gigantic buffer becomes a problem: you could write code that grows the buffer when you hit 1024, or change the code so that it keeps reading but returns a string truncated to the first 1024 characters. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// do something with line
}
}
You must ensure the LLR object is closed, hence, use try-with-resources.
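The question also asked about restoring the read offset after a JVM restart. The class above doesn't expose that, but as one possible addition (my assumption, not part of the original answer) you could add an accessor like the one below, persist its result after each readLine(), and pass the stored value to channel.position(...) on startup instead of the hard-coded 111L:

    // Hypothetical addition to LogLineReader: file offset of the next byte not yet returned.
    // channel.position() is how far into the file we've read; bb.position() is how many of
    // those bytes are still sitting unconsumed in the buffer.
    public long position() throws IOException {
        return channel.position() - bb.position();
    }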
I am trying to read 2 input files containing integers (duplicates included), find the integers common to both, and write them to an output file.
input1.txt
01
21
14
27
31
20
31
input2.txt
14
21
27
08
09
14
Following is the code I tried:
public static void main(String[] args) throws NumberFormatException {
try {
BufferedReader inputFile1 = new BufferedReader(new FileReader(new File("src/input1.txt")));
BufferedReader inputFile2 = new BufferedReader(new FileReader(new File("src/input2.txt")));
FileWriter fileCommon = new FileWriter("src/common.txt");
String lineInput1;
String lineInput2;
int inputArray1[] = new int[10];
int inputArray2[] = new int[10];
int index = 0;
while ((lineInput1 = inputFile1.readLine()) != null) {
inputArray1[index] = Integer.parseInt(lineInput1);
index++;
}
index = 0;
while((lineInput2 = inputFile2.readLine()) != null) {
inputArray2[index] = Integer.parseInt(lineInput2);
index++;
}
for (int a = 0; a < inputArray1.length; a++) {
for (int b = 0;b < inputArray2.length; b++) {
if(inputArray1[a] == inputArray2[b]) {
fileCommon.write(inputArray1[a]);
}
}
}
inputFile1.close();
inputFile2.close();
fileCommon.close();
} catch (IOException e) {
e.printStackTrace();
}
}
I don't understand where I am making a mistake. I am not getting any errors, and the output file that is generated is empty.
The expected output is the integers common to both files:
14
21
27
Remember that FileWriter's write(int c) accepts an integer representing a character code in either a specified charset or the platform's default charset, which is usually an extension of ASCII (for example, on Windows the default charset is Windows-1252, an extension of ASCII).
This means you don't actually have a syntactic or semantic problem per se: you are writing to the file successfully, but you're writing special characters that you can't see afterwards.
If you invoke write(..) with an integer representing a Latin character (or symbol) in the ASCII table, you'll see that it writes the actual letter (or symbol) into your file.
For instance:
fileCommon.write(37); //will write `%` into your file.
fileCommon.write(66); //will write `B` into your file.
In your code, you're only writing 21, 14 and 27 into your file, and as you can see from the ASCII table:
Decimal 21 represents Negative Acknowledgment
Decimal 14 represents Shift-out
Decimal 27 represents Escape
FileWriter.write(int) will write a single character; in your case 14, 21, and 27 are all control characters that would not be visible in a text file.
fileCommon.write("" + inputArray1[a]);
should write the string representation. You'll find some other problems though, such as missing line endings and repeated values, but this should get you started.
Here's the thing.
The write(int c) method of FileWriter does not actually write an int value; it writes the single character whose code is that int. For example, write(53) will write "5" to a file.
In your code, you are actually writing some symbols. You can use the write(String str) method of FileWriter, or just use the BufferedWriter class, to achieve your goal.
The value actually written by your code is "21141427" (the character codes 21, 14, 14, 27), so you also have to skip the repeated value and write a line feed after each value.
Sorry for the poor English.
You can read Strings from the original input files instead of ints, and use the String.equals(Object) method to compare them.
Then you won't need to parse each String to an int, or convert the int back to a String when writing to the file.
Also note that writing an int writes the corresponding Unicode character to the file, not the number as a string.
The problem is the fileCommon.write line. It should be as follows:
fileCommon.write(String.valueOf(inputArray1[a]) + "\n");
Additionally, this would perform much better if you put all of the data from the first file into a Map (or a Set) instead of an array; then, when reading the second file, just check it for each value and, if present, write that value to common.txt.
If you are dead set on using an array, you can sort the first array and use a binary search. That would also perform much better than looping over everything again and again.
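As a rough sketch of that suggestion (using a HashSet rather than a Map, de-duplicating the output, and assuming the file names from the question and an enclosing method that declares IOException):

    // Load the first file into a set, then stream the second file against it.
    Set<Integer> first = new HashSet<>();
    try (BufferedReader r = new BufferedReader(new FileReader("src/input1.txt"))) {
        String line;
        while ((line = r.readLine()) != null) {
            first.add(Integer.parseInt(line.trim()));
        }
    }
    try (BufferedReader r = new BufferedReader(new FileReader("src/input2.txt"));
         PrintWriter out = new PrintWriter(new FileWriter("src/common.txt"))) {
        Set<Integer> written = new HashSet<>(); // avoid writing duplicates
        String line;
        while ((line = r.readLine()) != null) {
            int value = Integer.parseInt(line.trim());
            if (first.contains(value) && written.add(value)) {
                out.println(value); // writes the digits plus a newline, not a raw char code
            }
        }
    }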
If the read() method of FileInputStream returns one byte, and a char in Java occupies 2 bytes, how does casting the int returned by read() to char produce a character? Below is the program.
import java.io.File;
import java.io.FileInputStream;
public class ReadFile {
public static void main(String[] args) throws Exception {
File file = new File("J:\\Java\\Programs\\xanadu.txt");
FileInputStream stream = new FileInputStream(file);
int i, iteration = 0;
while ((i = stream.read()) != -1) {
System.out.print((char) i);
iteration++;
}
System.out.println("\nNo of Iteration :" + iteration);
}
}
Content of file is : StackOverFlow
Output is :
StackOverflow
No of Iteration :13
So the file contains 13 characters, which (I assumed) means 26 bytes. How is the number of iterations 13?
If there is a link where this behaviour is explained, please share it.
The file contains 13 ASCII characters, and one ASCII character is one byte on disk. A Java char takes 2 bytes in memory, but the file does not store Java chars; it stores encoded bytes, and for ASCII text (or UTF-8, which encodes these characters identically) each character is a single byte. read() returns that one byte as an int in the range 0-255, and casting it to char works here because the ASCII values 0-127 map directly to the same Unicode code points. Characters outside the ASCII range can take more than one byte per character when encoded as UTF-8, and characters from the supplementary planes even take two Java chars inside a String.
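A tiny illustration of why the cast works for ASCII values (the numbers are just for demonstration):

    int i = 83;                            // the int read() returns for the ASCII byte 'S'
    char c = (char) i;                     // code point U+0053
    System.out.println(c);                 // prints: S
    System.out.println('S' == (char) 83);  // prints: true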
I have a project in which I have to write to a random access file.
I am reading a country with some information including: id, name, year of independence, etc. That information is what I have to write to the file.
My questions are:
How can I measure the size of the record I'm writing to on the random access file?
I know how to write a value via something like file.writeInt(variable), but the project requires me to somehow keep a constant size for each field I write.
I know one character is 2 bytes, but how can I say "write this line (where each line is a country with its information) from byte 1 to byte 15" so that every record has a constant size?
Thanks!
You shouldn't have to measure the size of a record. In this instance, I'd say you should define a fixed length for each field of a record. This way, you will always know how long a record will be. You should be able to use RandomAccessFile. Read through the documentation on that class. If you want to make your life easier, write a service class for your particular file to wrap the RandomAccessFile methods.
An example of a method signature for the service class would be:
void writeCountry(int recordNumber, String country) throws IllegalArgumentException;
Then when you implement this, you will do some math to figure out where to seek to, and what to write to file.
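As a rough sketch of such a service class (the field width, class name, and record layout below are made up for illustration; the real project would pick its own):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Fixed-length records: every record occupies RECORD_LENGTH bytes, so record N
    // starts at byte N * RECORD_LENGTH and seek() can jump straight to it.
    public class CountryFile implements AutoCloseable {
        private static final int COUNTRY_CHARS = 15;                 // fixed field width (made up)
        private static final int RECORD_LENGTH = COUNTRY_CHARS * 2;  // writeChars uses 2 bytes per char

        private final RandomAccessFile file;

        public CountryFile(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
        }

        public void writeCountry(int recordNumber, String country) throws IOException {
            if (country.length() > COUNTRY_CHARS) {
                throw new IllegalArgumentException("country name longer than " + COUNTRY_CHARS + " chars");
            }
            // Pad to the fixed width so every record is exactly RECORD_LENGTH bytes.
            StringBuilder padded = new StringBuilder(country);
            while (padded.length() < COUNTRY_CHARS) {
                padded.append(' ');
            }
            file.seek((long) recordNumber * RECORD_LENGTH);
            file.writeChars(padded.toString());
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }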
You can define a fixed length for each field. Then format your values before writing them out. Since each line is a fixed number of characters you can determine the number of bytes based on the number of lines.
You could use the Formatter class to help.
http://docs.oracle.com/javase/6/docs/api/java/util/Formatter.html
Specifically take a look at the Width and Precision sections of the doc.
Take a look at the output of something like this:
public static void main(String[] args) {
    Formatter formatter = new Formatter(System.out);
    formatter.format("%5.5s %3.3s %3.3s %3.3s", "012345678901", "b", "c", "d");
    formatter.flush(); // make sure the formatted text actually reaches the console
    // Prints: 01234   b   c   d  (first value truncated to 5 chars, the rest right-justified in 3-char fields)
}
Keep in mind what should happen if the value is larger than the field.
import java.io.IOException;
import java.io.RandomAccessFile;
public class RandomAccessDemo
{
public static void main(String[] args)
{
try
{
RandomAccessFile raf = new RandomAccessFile("test.txt", "rw");
raf.writeInt(10);
raf.writeInt(20);
raf.writeInt(30);
raf.writeInt(400);
raf.seek((3 - 1) * 4); // Seek to the 3rd int: skip 2 ints of 4 bytes each.
raf.writeInt(99);
raf.seek(0); // Going back to start point
int i = 0;
while(raf.length()>raf.getFilePointer())
{
i = raf.readInt();
System.out.println(i);
}
raf.close();
}
catch (Exception e)
{
System.out.println(e);
}
}
}
View my post on Random Access Files in Java at ankit.co
I'm trying to output an integer array to a file and have hit a snag. The code executes properly and no errors are thrown, but instead of giving me a file containing the numbers 1-30, it gives me a file filled with [] [] [] [] []. I have isolated the problem to the code segment below.
try
{
BufferedWriter bw = new BufferedWriter(new FileWriter(filepath));
int test=0;
int count=0;
while(count<temps.length)
{
test=temps[count];
bw.write(test);
bw.newLine();
bw.flush();
count++;
}
}
catch(IOException e)
{
System.out.println("IOException: "+e);
}
filepath refers to the location of the output file. temps is an array containing the values 1-30. If any more information is necessary, I will be happy to provide it.
BufferedWriter.write(int) writes the character represented by the int, not the int value itself. So writing 65 puts the letter A into the file, 66 prints B, etc. You need to write the String value, not the int value, to the stream.
Use BufferedWriter.write(java.lang.String) instead
bw.write(String.valueOf(test));
I suggest using PrintStream or PrintWriter instead:
PrintStream ps = new PrintStream(new FileOutputStream(filePath), true); // wrap a FileOutputStream; true enables auto-flush
int test = 0;
int count = 0;
while(count < temps.length)
{
test = temps[count];
ps.println(test);
count++;
}
ps.close();
The problem you are having is that you are using the BufferedWriter.write(int) method. What is confusing you is that while the method signature indicates it's writing an int, it's actually expecting that int to represent an encoded character. In other words, writing 0 is writing NUL, and writing 65 would output 'A'.
From Writer's javadoc:
public void write(int c) throws IOException
Writes a single character. The character to be written is contained in the 16 low-order bits of the given integer value; the 16 high-order bits are ignored.
A simple way to correct your problem is to convert the number to a String before writing. There are numerous ways to achieve this, including:
int test = 42;
bw.write(test+"");
You could convert the integer array to a byte array and do something like this:
public void saveBytes(byte[] bytes) throws FileNotFoundException, IOException {
    try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(new File(filepath)))) {
        out.write(bytes);
    }
}
You write the number as an Integer to the file, but you want it to be a string.
Change bw.write(test); to bw.write(Integer.toString(test));