I'm doing some Java classes to read informations from Git object. Every class works in the same way: the file is retrieved using the repo path and the hash, then it is opened, inflated and read a line at time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.
The code I use to read the files is the same everywhere:
FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));
and it works without issues for every object beside trees. When I try to read a tree this way I get this:
tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�
It seems that the file names and the octal mode are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in the other git objects?
The core of the problem is that there are two encoding inside a git tree file (and it isn't so clear from the documentation). Most of the file is encoded in ASCII, which means it can be read with whatever you like but the hashes are not encoded, they are simply raw bytes.
Since there are two differend encodings, the best solution is to read the file byte by byte, keeping in mind what's where.
My solution (I'm only interested in the name and hashes of the contents, so the rest is simply thrown away):
FileInputStream fis = new FileInputStream(this.filepath);
InflaterInputStream inStream = new InflaterInputStream(fis);
int i = -1;
while((i = inStream.read()) != 0){
//First line
}
//Content data
while((i = inStream.read()) != -1){
while((i = inStream.read()) != 0x20){ //0x20 is the space char
//Permission bytes
}
//Filename: 0-terminated
String filename = "";
while((i = inStream.read()) != 0){
filename += (char) i;
}
//Hash: 20 byte long, can contain any value, the only way
// to be sure is to count the bytes
String hash = "";
for(int count = 0; count < 20 ; count++){
i = inStream.read();
hash += Integer.toHexString(i);
}
}
OID's are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes."
To answer a why do it that way? follow-up, it's got its upsides and downsides, you hit a downside. Not much point talking about it, the pain/gain ratio on any change to that decision would be horrendous.
and read a line at time.
Don't Do That. One upside of the store-as-binary call is it breaks code that relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".
Related
Create log file by day, one file about 400MB,JVM memory about 2GB。
Have one process write a large log file with 'a' mode。
I want to read this file and be able to achieve some functions:
Append read newly written data
I will store the offset to restore the read after jvm restart
This is my simple implementation, but I don't know if the time and memory consumption are good. I want to know if there is a better way to solve this problem
public static void main(String[] args) throws IOException {
String filePath = "D://test.log";
long restoreOffset = resotoreOffset();
RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
randomAccessFile.seek(restoreOffset);
while (true) {
String line = randomAccessFile.readLine();
if(line != null) {
// doSomething(line);
restoreOffset = randomAccessFile.getFilePointer();
//storeOffset(restoreOffset);
}
}
}
It's not, unfortunately.
There are 2 major problems with this code. First I'll tackle the simple one, but the most important one is the second point.
Encoding issues
String line = randomAccessFile.readLine();
This line converts bytes to characters implicitly, and that's generally a bad idea, because bytes aren't characters, and converting from one to the other requires a charset encoding.
This method (readLine() from RAF) is a bizarre case - probably because RandomAccessFile is incredibly old API. Using this method will apply some bizarro ISO-8859-1 esque charset encoding: It converts bytes to chars by taking each byte as a complete char, assuming the byte represents the unicode character as listed, which isn't actually a sane encoding, just a lazy programmer.
The upshot for you is: Unless you can guarantee that this log file shall always only ever contain ASCII characters, this code is broken, and readLine cannot be used at all. Instead you'll have to do considerably more work: read bytes until you hit a newline, then turn the bytes so gathered into a string with new String(byteArray, StandardCharsets.UTF_8), or use ByteBuffer and apply similar tactics. But keep reading, because solving the second problem kinda solves this one automatically.
Buffering
Modern computer systems tend to like 'packeting'. You can't really operate on a single byte. Take SSDs (though this applies to spinning platter disks as well): The actual SSD hardware can't read single bytes. It can only read entire blocks worth of data.
When you therefore ask the OS explicitly for a single byte, that ends up setting off a chain of events that causes the SSD to read the entire block, then pass that entire block to the operating system, which will then disregard everything except the one byte you wanted, and returns just that.
If your code then asks for the next byte, we do that routine again.
So, if you read 1024 bytes consecutively from an SSD that has 1024-byte blocks, doing so by calling read() 1024 times causes the SSD to perform 1024 reads, whereas calling read(byteArr) once, passing it a 1024-byte array, causes the SSD to perform a single read.
Yup, that means the byte array solution is literally 1000 times faster.
The same applies to networking, too. Sending 1 byte a thousand times is usually nearly 1000 times slower than sending 1000 bytes once; TCP/IP packets can carry about 1800 bytes worth of data, so sending any less than that gains you almost nothing.
RAF's readLine() works like the first (bad) scenario: It reads bytes one at a time until it hits a newline character. Thus, to read a 100 character string, it's 100x slower than just knowing you need to read 100 characters and reading them in one go.
The solution
You may want to abandon RandomAccessFile entirely, it's quite old API.
A major issue with buffering is that it's a lot harder unless you know how many bytes to read beforehand. Here, you don't know that: You want to keep reading until you hit a newline character, but you have no idea how long it'll be until we get there. Furthermore, buffering APIs tend to just return what's convenient, and may therefore read fewer bytes than we ask for (it'll always read at least 1, though, unless we hit end of file). So, we need to write code that will repeatedly read entire chunk's worth of data, analyse the chunk for a newline, and if it's not there, keep reading.
Furthermore, opening channels and such is expensive. So, if you want to dig through all log lines, writing code that opens a new channel every time is suboptimal.
How about this, using the newer file API from java.nio.file:
public class LogLineReader implements AutoCloseable {
private final byte[] buffer = new byte[1024];
private final ByteBuffer bb = wrap(buffer);
private final SeekableByteChannel channel;
private final Charset charset = StandardCharsets.UTF_8;
public LogLineReader(Path p) {
channel = Files.newByteChannel(p, StandardOpenOption.READ);
channel.position(111L); // you seek to pos 111 in your code...
}
#Override public void close() throws IOException {
channel.close();
}
// This code buffers: First, our internal buffer is scanned
// for a new line. If there is no full line in the buffer,
// we read bytes from the file and check again until we find one.
public String readLine() {
int len = 0;
if (!channel.isOpen()) return null;
int scanStart = 0;
while (true) {
// Scan through the bytes we have buffered for a newline.
for (int i = scanStart; i < buffer.position(); i++) {
if (buffer[i] == '\n') {
// Found it. Take all bytes up to the new line, turn into
// a string.
String res = new String(buffer, 0, i, charset);
// Copy all bytes from _after_ the newline to the front.
System.arraycopy(buffer, i + 1, buffer, 0, buffer.position() - i - 1);
// Adjust the position (which represents how many bytes are buffered).
buffer.position(buffer.position() - i - 1);
return res;
}
}
scanStart = buffer.position();
// If we get here, the buffer is empty or contains no newline.
if (scanStart == buffer.limit()) {
throw new IOException("Log line too long");
}
int read = channel.read(buffer); // let's fetch more bytes!
if (read == -1) {
// we've reached the end of the file.
if (buffer.position() == 0) return null;
return new String(buffer, 0, buffer.position(), charset);
}
}
}
}
For the sake of efficiency, this code cannot deal with log lines longer than 1024 in length; feel free to up that number. If you want to be capable of reading infinite size loglines, at some point a gigantic buffer is a problem. If you must, you could write code that resizes the buffer if you hit 1024, or you can update this code that it'll keep reading, but only returns a truncated string with the first 1024 characters. I'll leave that as an exercise for you.
NB: I also didn't test this, but at the very least it should give you the general gist of using SeekableByteChannel, and the concept of buffers.
To use:
Path p = Paths.get("D://logfile.txt");
try (LogLineReader reader = new LogLineReader(p)) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// do something with line
}
}
You must ensure the LLR object is closed, hence, use try-with-resources.
I need to know the number of lines of a file before processing it, because I need to know the number of lines before read it, or in the worst case escenario read it twice..... so I made this code but It not works.. so maybe is just not possible ?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection (numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
StringTokenizer st = new StringTokenizer(reader2.readLine(), ",");
while (st.hasMoreElements()) {
System.out.println(st.nextElement());
}
}
Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere in order to ensure your ability to do so. For example, you might copy it to a temporary file, or if you're prepared to rely on the input not being too large for it then you might store the contents in memory as a List of byte, byte[], char, or String.
I have a big file and I want to do some „operations” on it.(find some text, check if some text exists, get the offset of some text, maybe changing the file).
My current aproach is this:
public ResultSet getResultSet(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] buffer = new byte[CAPACITY];
byte[] doubleBuffer = new byte[2 * CAPACITY];
long len = in.read(doubleBuffer);
while (true) {
String reconstitutedString = new String(doubleBuffer, 0 ,doubleBuffer.length);
//...do stuff
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write(doubleBuffer, CAPACITY, CAPACITY);
readUntilNow += len;
len = in.read(buffer);
if (len <= 0) {
break;
}
os.write(buffer, 0, CAPACITY);
doubleBuffer = os.toByteArray();
os.close();
}
in.close();
return makeResult();
}
I would like to change the String reconstitutedString into something else. What would be the best alternative considering I want to be able to get some information about the content of that data, information that I may get calling an IndexOf on a String
You may use StringBuffer or StringBuilder . This two class has almost like String class with the advantage of mutability.
Moreover you can easily convert them to String whenever you required some functionality that only String provides. To convert them you can just use the toString() method.
You may use some other data type as an alternative to String based on your situation. But in general StringBuffer and StringBuilder is the best alternative instead of string. Use StringBuffer for synchronization and StringBuilder in other case.
The best type to do split or indexOf on is String. Just use it.
The most natural choice would be CharBuffer. Like String and StringBuilder it implements the CharSequence interface, therefore it can be used with a lot of text oriented APIs, most notably the regex engine which is the back-end for most search, split, and replacing operations.
What makes CharBuffer the natural choice is that it is also the type that is used by the charset package which provides the necessary operations for converting characters from and to bytes. By dealing with this API you can do the conversion directly from and to CharBuffers without additional data copying steps.
Note that Java’s regex API is prepared for processing buffers containing partially read files and can report whether reading more data might change the result (see hitEnd() and requireEnd()).
These are the necessary tools for building applications which can process large files in smaller chunks and without creating String instance out of it (or only when necessary, e.g. when extracting a matching subsequence).
The default implementation of RandomAccessFile is 'broken', in the sense that you can't specify which encoding your file is in.
I'm looking for an alternative which matches the following criteria:
Encoding-aware
Random access! (dealing with very big files, need to be able to position the cursor using a byte offset without streaming the whole thing).
I had a poke around in Commons IO, but there's nothing there. I'd rather not have to implement this myself, because there are entirely too many places it could go wrong.
RandomAccessFile is intended for accessing binary data. It is not possible to efficiently create a random access encoded file which is appropriate in all situations.
Even if you find such a solution I would check it carefully to ensure it suits your needs.
If you were to write it, I would suggest considering a random position of row and column rather than character offset from the start of the file.
This has the advantage that you only have to remember where the start of each line is and you can scan the line to get your character. If you index the position of every character, this could use 4 bytes for every character (assuming the file is < 4 GB)
The answer turned out to be less painful than I assumed:
// This gives me access to buffering and charset magic
new BufferedReader(new InputStreamReader(Channels.newInputStream(randomAccessFile.getChannel()), encoding)), encoding
....
I can then implement a readLine() method which reads character by character. Using String.getBytes(encoding) I can keep track of the offset in the file. Calling seek() on the underlying RandomAccessFile allows me to reposition the cursor at will. There are probably some bugs lurking in there, but the basic tests seem to work.
public String readLine() throws IOException {
eol = "";
lastLineByteCount = 0;
StringBuilder builder = new StringBuilder();
char[] characters = new char[1];
int status = reader.read(characters, 0, 1);
if (status == -1) {
return null;
}
char c = characters[0];
while (status != -1) {
if (c == '\n') {
eol += c;
break;
}
if (c == '\r') {
eol += c;
} else {
builder.append(c);
}
status = reader.read(characters, 0, 1);
c = characters[0];
}
String line = builder.toString();
lastLineByteCount = line.getBytes(encoding).length + eol.getBytes(encoding).length;
return line;
}
I'm trying to use java to read a string from a file that was written with a .net binaryWriter.
I think the problem is because the .net binary writer uses some 7 bit format for it's strings. By researching online, I came across this code that is supposed to function like the binary reader's readString() method. This is in my CSDataInputStream class that extends DataInputStream.
public String readStringCS() throws IOException {
int stringLength = 0;
boolean stringLengthParsed = false;
int step = 0;
while(!stringLengthParsed) {
byte part = readByte();
stringLengthParsed = (((int)part >> 7) == 0);
int partCutter = part & 127;
part = (byte)partCutter;
int toAdd = (int)part << (step*7);
stringLength += toAdd;
step++;
}
char[] chars = new char[stringLength];
for(int i = 0; i < stringLength; i++) {
chars[i] = readChar();
}
return new String(chars);
}
The first part seems to be working as it is returning the correct amount of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar() but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.
Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate coding based on whatever was written from .NET... it will default to UTF-8, but it can be specified as something else.
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.