Using something else instead of String - java

I have a big file and I want to do some "operations" on it (find some text, check whether some text exists, get the offset of some text, maybe change the file).
My current approach is this:
public ResultSet getResultSet(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] buffer = new byte[CAPACITY];
byte[] doubleBuffer = new byte[2 * CAPACITY];
long len = in.read(doubleBuffer);
while (true) {
String reconstitutedString = new String(doubleBuffer, 0 ,doubleBuffer.length);
//...do stuff
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write(doubleBuffer, CAPACITY, CAPACITY);
readUntilNow += len;
len = in.read(buffer);
if (len <= 0) {
break;
}
os.write(buffer, 0, CAPACITY);
doubleBuffer = os.toByteArray();
os.close();
}
in.close();
return makeResult();
}
I would like to replace the String reconstitutedString with something else. What would be the best alternative, considering that I want to get information about the content of that data, the kind of information I could get by calling indexOf on a String?

You may use StringBuffer or StringBuilder. These two classes are almost like the String class, with the advantage of mutability.
Moreover, you can easily convert them to String whenever you need functionality that only String provides. To convert them, just call the toString() method.
You may use some other data type as an alternative to String depending on your situation, but in general StringBuffer and StringBuilder are the best alternatives to String. Use StringBuffer when you need synchronization and StringBuilder otherwise.

The best type to do split or indexOf on is String. Just use it.

The most natural choice would be CharBuffer. Like String and StringBuilder it implements the CharSequence interface, therefore it can be used with a lot of text oriented APIs, most notably the regex engine which is the back-end for most search, split, and replacing operations.
What makes CharBuffer the natural choice is that it is also the type that is used by the charset package which provides the necessary operations for converting characters from and to bytes. By dealing with this API you can do the conversion directly from and to CharBuffers without additional data copying steps.
Note that Java’s regex API is prepared for processing buffers containing partially read files and can report whether reading more data might change the result (see hitEnd() and requireEnd()).
These are the necessary tools for building applications which can process large files in smaller chunks and without creating String instance out of it (or only when necessary, e.g. when extracting a matching subsequence).
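A minimal sketch of that idea, assuming UTF-8 input; the class and method names (ChunkSearch, decode, firstMatch, needsMoreInput) are made up for illustration. It decodes bytes directly into a CharBuffer and searches it through the CharSequence interface, without building a String of the content:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ChunkSearch {
    /** Decodes the first `length` bytes of a chunk (assumed UTF-8) into a CharBuffer. */
    static CharBuffer decode(byte[] chunk, int length) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(chunk, 0, length));
    }

    /** Returns the offset of the first match in the chunk, or -1 if there is none. */
    static int firstMatch(CharSequence chunk, Pattern pattern) {
        Matcher m = pattern.matcher(chunk);
        return m.find() ? m.start() : -1;
    }

    /** True if the search touched the end of the chunk, so more input might change the result. */
    static boolean needsMoreInput(CharSequence chunk, Pattern pattern) {
        Matcher m = pattern.matcher(chunk);
        m.find();
        return m.hitEnd();
    }
}
```

Charset.decode replaces a multi-byte character that is cut off at the end of a chunk, so a real chunked reader would more likely drive a CharsetDecoder with endOfInput set to false and carry the leftover bytes into the next chunk.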

Related

How to convert clob to string with encoding in java

We are doing massive batch of xml processing and the logic to convert clob to string is shown below.
import java.sql.Clob
import org.apache.commons.io.IOUtils
String extractXml(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
String sourceXml
try {
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream()), encoding) // 1. Encoding not working
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream(), encoding), encoding) // 2. Encoding working
} catch (Exception e) {
...
}
return sourceXml
}
My questions:
a. I am not sure why (1) doesn't work even though I am using getCharacterStream() instead of getAsciiStream(),
but (2) seems to work fine. Maybe that is because I am explicitly overriding the system encoding?
b. Solution (2) looks a bit odd, as you are specifying the encoding twice (once for the byte array and once for the string creation).
I am not sure whether there are any performance issues, and I wonder whether there are better ways to write it.
c. I thought of dropping the Apache Commons libraries and using a plain JDK solution.
The surprising thing is that I did not give any explicit encoding, yet it seems to work perfectly.
Is it because it does "stream characters -> straight into a string buffer"?
/*
* working perfectly and returns encoding correctly
*/
String extractXmlWithoutApacheCommons(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
StringBuffer sb = new StringBuffer((int) xmlClob.length())
try {
Reader r = xmlClob.getCharacterStream()
char[] cbuf = new char[2048]
int n = 0
while ((n = r.read(cbuf, 0, cbuf.length)) != -1) {
if (n > 0) {
sb.append(cbuf, 0, n)
}
}
} catch (Exception e) {
...
}
return sb.toString()
}
Can you please shed some light on these?
The Clob already has an encoding. It's whatever you've specified in the database, and once you read it on Java side it'll be a String (with the implicit UTF-16 encoding, not that it matters at all).
Whatever you think you're doing with all those encoding tricks is wrong and useless. You only need to specify an encoding when turning bytes to chars or the other way around. You're dealing with chars only (except in your first example where you for some unknown reason want to turn them to bytes).
If you want to use IOUtils, then readFully(Reader input, char[] buffer) would be the method to use.
The platform default encoding has no effect in this whole question, since you shouldn't be working with bytes at all.
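To illustrate why approach (1) can corrupt text: the single-argument IOUtils.toByteArray(Reader) encodes the characters with the platform default charset, and the bytes are then decoded with a different one. The sketch below reproduces that mismatch with plain JDK calls; RoundTrip is a made-up name, and the charsets in the usage note are only examples.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: a charset only matters where chars become bytes or bytes become chars.
final class RoundTrip {
    /** Encodes text with one charset and decodes the bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);   // chars -> bytes
        return new String(bytes, decodeWith);       // bytes -> chars
    }
}
```

Calling roundTrip with the same charset on both sides returns the original text. Encoding "é" with ISO-8859-1 and decoding with UTF-8 does not, which is what (1) does whenever the platform default differs from the chosen encoding.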
Edit:
A slightly more modern way with the standard JDK classes would be to use Reader.read(CharBuffer target) like
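Reader r = xmlClob.getCharacterStream();
CharBuffer cb = CharBuffer.allocate((int) xmlClob.length());
// stop at end of stream or once the buffer is full, otherwise a full buffer can loop forever
while (cb.hasRemaining() && r.read(cb) != -1)
;
cb.flip(); // make the filled part readable; toString() only covers position..limit
return cb.toString();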
but it doesn't really make a huge difference (it's a bit nicer looking).

Java: number of lines in a file without processing it

I need to know the number of lines in a file before processing it, that is, before reading it, or in the worst-case scenario I would read it twice. I wrote this code but it doesn't work, so maybe it's just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection (numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
StringTokenizer st = new StringTokenizer(reader2.readLine(), ",");
while (st.hasMoreElements()) {
System.out.println(st.nextElement());
}
}
Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
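As a rough sketch of the mark()/reset() route, assuming the whole input fits within the chosen read limit: BufferedInputStream is one implementation that supports marking. The class name MarkResetCount and the readLimit parameter are illustrative only.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper showing mark()/reset() on a stream that supports it.
final class MarkResetCount {
    /**
     * Counts '\n' bytes, then rewinds so the caller can read the data again.
     * readLimit must be at least the number of bytes in the stream.
     */
    static int countLinesAndRewind(BufferedInputStream in, int readLimit) throws IOException {
        in.mark(readLimit);
        int count = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                count++;
            }
        }
        in.reset(); // throws IOException if the mark was invalidated
        return count;
    }
}
```

This only helps when an upper bound on the input size is known in advance, which is the limitation described above.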
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere in order to ensure your ability to do so. For example, you might copy it to a temporary file, or if you're prepared to rely on the input not being too large for it then you might store the contents in memory as a List of byte, byte[], char, or String.
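The in-memory variant of that suggestion could look roughly like this. StashedInput is a hypothetical class, and it assumes the input is small enough to hold in a byte[]:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: drain the stream once, then hand out as many fresh streams as needed.
final class StashedInput {
    private final byte[] data;

    /** Reads the stream to its end and keeps the bytes in memory. */
    StashedInput(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        data = out.toByteArray();
    }

    /** Returns a new stream positioned at the start of the stashed bytes. */
    InputStream open() {
        return new ByteArrayInputStream(data);
    }

    /** Counts newline bytes; a last line without a trailing newline is not counted. */
    int newlineCount() {
        int count = 0;
        for (byte b : data) {
            if (b == '\n') {
                count++;
            }
        }
        return count;
    }
}
```

With this, the line count is available from newlineCount() before the processing pass, and open() supplies the stream for the second read.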

How to inflate a git tree object?

I'm writing some Java classes to read information from Git objects. Every class works in the same way: the file is retrieved using the repo path and the hash, then it is opened, inflated and read a line at a time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.
The code I use to read the files is the same everywhere:
FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));
and it works without issues for every object besides trees. When I try to read a tree this way I get this:
tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�
It seems that the file names and the octal mode are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in the other git objects?
The core of the problem is that there are two encodings inside a git tree file (and it isn't so clear from the documentation). Most of the file is encoded in ASCII, which means it can be read with whatever you like, but the hashes are not encoded; they are simply raw bytes.
Since there are two different encodings, the best solution is to read the file byte by byte, keeping in mind what's where.
My solution (I'm only interested in the names and hashes of the contents, so the rest is simply thrown away):
FileInputStream fis = new FileInputStream(this.filepath);
InflaterInputStream inStream = new InflaterInputStream(fis);
int i = -1;
while((i = inStream.read()) != 0){
//First line
}
//Content data
while((i = inStream.read()) != -1){
while((i = inStream.read()) != 0x20){ //0x20 is the space char
//Permission bytes
}
//Filename: 0-terminated
String filename = "";
while((i = inStream.read()) != 0){
filename += (char) i;
}
//Hash: 20 byte long, can contain any value, the only way
// to be sure is to count the bytes
String hash = "";
for(int count = 0; count < 20 ; count++){
i = inStream.read();
hash += String.format("%02x", i); // pad to two hex digits; Integer.toHexString drops leading zeros
}
}
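The same byte-by-byte approach can be packaged so that it keeps the mode and stops cleanly at the end of the object. This is only a sketch: TreeParser and Entry are made-up names, the stream is assumed to be already inflated, and file names are assumed to be UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for an inflated tree object:
// "tree <size>\0" followed by repeated "<mode> <name>\0<20 raw hash bytes>".
final class TreeParser {
    static final class Entry {
        final String mode;
        final String name;
        final String hash; // 40 hex digits

        Entry(String mode, String name, String hash) {
            this.mode = mode;
            this.name = name;
            this.hash = hash;
        }
    }

    static List<Entry> parse(InputStream in) throws IOException {
        readUntil(in, 0, true); // skip the "tree <size>" header
        List<Entry> entries = new ArrayList<>();
        String mode;
        while ((mode = readUntil(in, ' ', false)) != null) {
            String name = readUntil(in, 0, true);
            StringBuilder hash = new StringBuilder(40);
            for (int i = 0; i < 20; i++) {
                int b = in.read();
                if (b < 0) {
                    throw new EOFException("truncated object id");
                }
                hash.append(String.format("%02x", b));
            }
            entries.add(new Entry(mode, name, hash.toString()));
        }
        return entries;
    }

    /** Reads up to the delimiter; returns null at a clean end of stream unless the field is required. */
    private static String readUntil(InputStream in, int delimiter, boolean required) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != delimiter) {
            if (b < 0) {
                if (!required && buf.size() == 0) {
                    return null;
                }
                throw new EOFException("unexpected end of tree object");
            }
            buf.write(b);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

It would be called with the InflaterInputStream from the snippet above, for example TreeParser.parse(inStream).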
OIDs are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes."
To answer the follow-up "why do it that way?": it has its upsides and downsides, and you hit a downside. There's not much point discussing it; the pain/gain ratio of any change to that decision would be horrendous.
and read a line at a time.
Don't Do That. One upside of the store-as-binary call is it breaks code that relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".

Proper decoding of byte array to char array

I'm making password handling more secure by eliminating storage in Strings (which end up on the heap). I have the following existing code:
String pw = new String(buffer, 0, len, "UTF-32LE");
I came up with:
Charset charSet = Charset.forName("UTF-32LE");
ByteBuffer byteBuffer = ByteBuffer.wrap(buffer, 0, len);
CharBuffer charBuffer = charSet.decode(byteBuffer);
charArray = new char[charBuffer.length()];
for (int i = 0; i < charBuffer.length(); ++i)
{
charArray[i] = charBuffer.charAt(i);
}
Note that we support many different languages, so I'm not quite sure how best to thoroughly test this approach.
Is this correct? Are there caveats to this approach?
Is this the best approach or am I missing something simpler?
Thanks for any feedback or advice.
I am not sure what you are trying to achieve. At first I thought you wanted to get rid of data being stored on the heap, but then I saw the array of chars. In Java every array is an object, and every object is stored on the heap. Reference variables can live on the stack, but they are only handles, not the object itself.
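A char[] still lives on the heap, but unlike an immutable String it can be overwritten once it is no longer needed. A minimal sketch under that assumption follows; PasswordDecoder and its methods are hypothetical names. The bulk get replaces the charAt loop from the question, and the decoder's intermediate buffer is cleared as well.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: decode UTF-32LE bytes to char[] without creating a String.
final class PasswordDecoder {
    static char[] decode(byte[] buffer, int len) {
        CharBuffer decoded = Charset.forName("UTF-32LE").decode(ByteBuffer.wrap(buffer, 0, len));
        char[] result = new char[decoded.remaining()];
        decoded.get(result); // bulk copy instead of a charAt loop
        if (decoded.hasArray()) {
            Arrays.fill(decoded.array(), '\0'); // wipe the intermediate copy
        }
        return result;
    }

    /** Overwrites the secret once the caller is done with it. */
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }
}
```

Code points outside the Basic Multilingual Plane decode to two chars (a surrogate pair), so the array can be longer than the number of code points in the input. That is a useful case to include when testing with many languages.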

Reading characters from a file written with .net

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET BinaryWriter uses some 7-bit format for its strings. By researching online, I came across this code that is supposed to function like the BinaryReader's ReadString() method. This is in my CSDataInputStream class that extends DataInputStream.
public String readStringCS() throws IOException {
int stringLength = 0;
boolean stringLengthParsed = false;
int step = 0;
while(!stringLengthParsed) {
byte part = readByte();
stringLengthParsed = (((int)part >> 7) == 0);
int partCutter = part & 127;
part = (byte)partCutter;
int toAdd = (int)part << (step*7);
stringLength += toAdd;
step++;
}
char[] chars = new char[stringLength];
for(int i = 0; i < stringLength; i++) {
chars[i] = readChar();
}
return new String(chars);
}
The first part seems to be working as it is returning the correct amount of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar() but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.
Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate charset for whatever was written from .NET; it defaults to UTF-8, but it can be specified as something else.
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.
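Putting both suggestions together, a composed reader might look like the sketch below. DotNetStringReader is a made-up name, and UTF-8 is assumed because that is the default mentioned above; pass a different charset if the writer used one.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that takes any InputStream rather than extending DataInputStream.
final class DotNetStringReader {
    /** Reads a length stored 7 bits per byte, low-order group first; a set high bit means more bytes follow. */
    static int read7BitEncodedInt(InputStream in) throws IOException {
        int value = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended inside the length prefix");
            }
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IOException("length prefix longer than 5 bytes");
    }

    /** Treats the prefix as a byte count, reads that many bytes and decodes them (UTF-8 assumed). */
    static String readString(InputStream in) throws IOException {
        int byteLength = read7BitEncodedInt(in);
        if (byteLength < 0) {
            throw new IOException("negative string length");
        }
        byte[] bytes = new byte[byteLength];
        new DataInputStream(in).readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Because the prefix counts bytes, a string with non-ASCII characters has a prefix larger than its character count, which is why reading that many chars with readChar() runs past the data.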
