Proper decoding of byte array to char array - java

I'm making password handling more secure by eliminating storage in Strings (which linger on the heap until garbage collected). I have the following existing code:
String pw = new String(buffer, 0, len, "UTF-32LE");
I came up with:
Charset charSet = Charset.forName("UTF-32LE");
ByteBuffer byteBuffer = ByteBuffer.wrap(buffer, 0, len);
CharBuffer charBuffer = charSet.decode(byteBuffer);
char[] charArray = new char[charBuffer.length()];
for (int i = 0; i < charBuffer.length(); ++i) {
    charArray[i] = charBuffer.charAt(i);
}
Note that we support many different languages, so I'm not quite sure how best to thoroughly test this approach.
Is this correct? Are there caveats to this approach?
Is this the best approach or am I missing something simpler?
Thanks for any feedback or advice.

I am not sure what you are trying to achieve. At first I thought you wanted to get rid of data being stored on the heap, but then I saw the array of chars. In Java, every array is an object, and every object is stored on the heap. Reference variables can land on the stack, but they are only handles, not the objects themselves.
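For what it's worth, the usual argument for char[] over String in password handling is not where the data lives but that an array can be explicitly wiped when you are done with it. A minimal sketch along those lines, where the wiping steps are my addition rather than part of the question:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

// Decode the raw bytes, then zero every intermediate buffer so the
// password survives only in the char[] the caller controls
static char[] decodePassword(byte[] buffer, int len) {
    Charset charSet = Charset.forName("UTF-32LE");
    CharBuffer charBuffer = charSet.decode(ByteBuffer.wrap(buffer, 0, len));
    char[] charArray = new char[charBuffer.remaining()];
    charBuffer.get(charArray);
    if (charBuffer.hasArray()) {
        Arrays.fill(charBuffer.array(), '\0'); // wipe the decoder's backing array
    }
    Arrays.fill(buffer, (byte) 0); // wipe the original encoded bytes
    return charArray;
}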

Related

How to inflate a git tree object?

I'm writing some Java classes to read information from Git objects. Every class works in the same way: the file is located using the repo path and the hash, then it is opened, inflated, and read a line at a time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.
The code I use to read the files is the same everywhere:
FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));
and it works without issues for every object besides trees. When I try to read a tree this way, I get this:
tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�
It seems that the file names and the octal mode are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in the other git objects?
The core of the problem is that there are two encodings inside a git tree file (and this isn't so clear from the documentation). Most of the file is ASCII, which means it can be read however you like, but the hashes are not encoded at all; they are simply raw bytes.
Since there are two different encodings, the best solution is to read the file byte by byte, keeping in mind what's where.
My solution (I'm only interested in the names and hashes of the contents, so the rest is simply thrown away):
FileInputStream fis = new FileInputStream(this.filepath);
InflaterInputStream inStream = new InflaterInputStream(fis);

// Header ("tree <size>"), terminated by a 0 byte: skip it
int i = -1;
while ((i = inStream.read()) != 0) {
    // first line, discarded
}
// Content data: entries of "<mode> <filename>\0<20 raw hash bytes>"
while ((i = inStream.read()) != -1) {
    while ((i = inStream.read()) != 0x20) { // 0x20 is the space char
        // permission bytes, discarded
    }
    // Filename: 0-terminated
    StringBuilder filename = new StringBuilder();
    while ((i = inStream.read()) != 0) {
        filename.append((char) i);
    }
    // Hash: 20 bytes long, can contain any value; the only way
    // to be sure is to count the bytes
    StringBuilder hash = new StringBuilder();
    for (int count = 0; count < 20; count++) {
        i = inStream.read();
        hash.append(String.format("%02x", i)); // zero-pad: Integer.toHexString would drop leading zeros
    }
}
OIDs are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes".
To answer a "why do it that way?" follow-up: it has its upsides and downsides, and you hit a downside. There's not much point talking about it, though; the pain/gain ratio on any change to that decision would be horrendous.
"and read a line at a time."
Don't Do That. One upside of the store-as-binary call is that it breaks code which relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".

How to reverse string which place 2/3 of heap?

Recently I had an interview and I was asked a strange (at least for me) question:
I should write a method which reverses a string.
public static String myReverse(String str){
...
}
The problem is that str is a very, very large object (it takes up 2/3 of the memory).
I can think of only one solution:
create a new String to store the result, reverse the first half of the source String into it, then use reflection to clear the (already reversed) second half of the source string's underlying array, and continue reversing.
Am I right?
Any other solutions?
If you are using reflection anyway, you could access the underlying character array of the string and reverse it in place, by traversing from both ends and swapping the chars at each end.
public static String myReverse(String str) throws Exception {
    // Fill content with reflection; this assumes a pre-Java 9 JVM where
    // String keeps its data in a char[] field named "value"
    java.lang.reflect.Field valueField = String.class.getDeclaredField("value");
    valueField.setAccessible(true);
    char[] content = (char[]) valueField.get(str);
    for (int a = 0, b = content.length - 1; a < b; a++, b--) {
        char temp = content[b];
        content[b] = content[a];
        content[a] = temp;
    }
    return str;
}
I unfortunately can't think of a way that doesn't use reflection.
A String is internally a 16-bit char array. If we know the character set to be ASCII, meaning each char maps to a single byte, we can encode the string into an 8-bit byte array at only 50% of the memory cost. This fully utilizes the available memory during the transition (2/3 for the string plus 1/3 for the bytes). Then we let go of the input string to reclaim its 2/3 of the memory, reverse the byte array, and reconstruct the string.
public static String myReverse(String str) {
    // each ASCII char fits in one byte, so this copy costs half the string's size
    byte[] bytes = str.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    // memory at full capacity
    str = null;
    // memory at 1/3 capacity once the original string is collected
    for (int i = 0; i < bytes.length / 2; i++) {
        byte tmp = bytes[i];
        bytes[i] = bytes[bytes.length - i - 1];
        bytes[bytes.length - i - 1] = tmp;
    }
    return new String(bytes, java.nio.charset.StandardCharsets.US_ASCII);
}
This, of course, assumes you have a little extra memory available for temporary objects created by the encoding process, array headers, etc.
It's unlikely that you can do this without using tricks like reflection or assuming that the String is stored in an efficient way (for example, knowing it contains only ASCII characters). The problems in your way are that in Java Strings are immutable, and the likely implementation of garbage collection.
The problem with the likely implementation of garbage collection is that memory is reclaimed only after an object can no longer be accessed. This means there would be a brief period where both the input and the output of a transformation occupy memory.
For example, one could try to reverse the string by successively building the result and cutting down the original string:
rev = rev + orig.substring(0, 1);
orig = orig.substring(1);
But this relies on the previous incarnation of rev or orig being collected just as the new incarnation is created, so that old and new copies together never exceed the available memory.
To be more general, one would study such a process. During the process there would be a set of objects that evolve: both the set itself and (some of) the objects. At the start the original string would be in the set, and at the end the reversed string would be there. Due to information content, the total size of the objects in the set can never be lower than the original. The crucial point is that the original string has to be deleted at some point, and before that time at most 50% of its information may exist in the other objects (the string fills 2/3 of memory, leaving room for at most half its size). So we would need a construct that deletes a String object while retaining more than half of the information in it.
Such a construct would basically require calling a method on an object that returns another object and destroys the original in the process, as the result is being constructed. It's unlikely that any implementation works that way.
Your approach seems to rely on Strings being mutable somehow; if they were, there would be no problem in just reversing the string in place without using much extra memory. You wouldn't need to copy anything out: swap [j] and [len-1-j] for all j < len/2.

Using something else instead of String

I have a big file and I want to do some operations on it (find some text, check if some text exists, get the offset of some text, maybe change the file).
My current approach is this:
public ResultSet getResultSet(String fileName) throws IOException {
    InputStream in = new FileInputStream(fileName);
    byte[] buffer = new byte[CAPACITY];
    byte[] doubleBuffer = new byte[2 * CAPACITY];
    long readUntilNow = 0;
    int len = in.read(doubleBuffer);
    while (true) {
        String reconstitutedString = new String(doubleBuffer, 0, doubleBuffer.length);
        // ...do stuff
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // carry the second half over so matches spanning a chunk boundary aren't lost
        os.write(doubleBuffer, CAPACITY, CAPACITY);
        readUntilNow += len;
        len = in.read(buffer);
        if (len <= 0) {
            break;
        }
        os.write(buffer, 0, len);
        doubleBuffer = os.toByteArray();
        os.close();
    }
    in.close();
    return makeResult();
}
I would like to change the String reconstitutedString into something else. What would be the best alternative, considering I want to be able to get information about the content of that data, the kind of information I'd get by calling indexOf on a String?
You may use StringBuffer or StringBuilder. These two classes offer almost everything the String class does, with the advantage of mutability.
Moreover, you can easily convert them to String whenever you need some functionality that only String provides; to convert, just call the toString() method.
You may use some other data type as an alternative to String depending on your situation, but in general StringBuffer and StringBuilder are the best alternatives to String. Use StringBuffer when you need synchronization and StringBuilder otherwise.
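For instance, a StringBuilder supports indexOf directly and converts back to String only when needed; a small illustrative snippet (the text and search term are placeholders):
StringBuilder sb = new StringBuilder("find the needle in the haystack");
int offset = sb.indexOf("needle");                         // search works directly on the builder
sb.replace(offset, offset + "needle".length(), "thread");  // mutate in place, no new String
String result = sb.toString();                             // convert only when a String is required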
The best type to do split or indexOf on is String. Just use it.
The most natural choice would be CharBuffer. Like String and StringBuilder, it implements the CharSequence interface, so it can be used with a lot of text-oriented APIs, most notably the regex engine, which is the back-end for most search, split, and replace operations.
What makes CharBuffer the natural choice is that it is also the type that is used by the charset package which provides the necessary operations for converting characters from and to bytes. By dealing with this API you can do the conversion directly from and to CharBuffers without additional data copying steps.
Note that Java’s regex API is prepared for processing buffers containing partially read files and can report whether reading more data might change the result (see hitEnd() and requireEnd()).
These are the necessary tools for building applications which can process large files in smaller chunks and without creating String instances out of them (or only when necessary, e.g. when extracting a matching subsequence).
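A minimal sketch of that combination, decoding a chunk directly into a CharBuffer and running the regex engine over it (the pattern and text here are placeholders):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkSearch {
    public static void main(String[] args) {
        // Pretend this chunk was just read from the file
        ByteBuffer bytes = ByteBuffer.wrap("alpha beta gamma".getBytes(StandardCharsets.UTF_8));
        // Decode straight into a CharBuffer; no String of the file data is created
        CharBuffer chars = StandardCharsets.UTF_8.decode(bytes);
        // CharBuffer is a CharSequence, so the regex engine accepts it directly
        Matcher m = Pattern.compile("\\bgamma\\b").matcher(chars);
        while (m.find()) {
            System.out.println("match at offset " + m.start());
        }
        // hitEnd() == true means more input could change the result,
        // so the caller should append the next chunk and retry
        System.out.println("need more data? " + m.hitEnd());
    }
}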

Fastest way to store char[][] in shared preferences

In my app, the main set of data is a two-dimensional char array (char[][]), in which some of the values may be non-printable characters and even \0 characters. What would be the fastest way to store this array in the shared prefs and retrieve it later? Speed of retrieval is a lot more important to me than the speed of saving it. The arrays are not particularly large, probably no more than 100x100.
Presently, I'm converting it into a string by simply concatenating all characters, row-by-row, column-by-column, and storing the string along with the dimensions (as int).
I have also considered just serialising the array (writeObject into a ByteArrayOutputStream and then using the stream's toString method), but haven't tried it yet.
Any other suggestions? Again, the fastest possible retrieval (and recreation as the char[][] array) is my primary concern.
Here is a way which uses mostly built-in functions and is therefore likely fast. Be aware that this is untested and should only be taken as inspiration.
public void save(char[][] chars) {
    Set<String> strings = new LinkedHashSet<String>(chars.length);
    for (int i = 0, len = chars.length; i < len; i++) {
        strings.add(new String(chars[i])); // note: a Set silently drops duplicate rows
    }
    getSharedPreferences().edit().putStringSet("data", strings).commit();
}
public char[][] read() {
    Set<String> strings = getSharedPreferences().getStringSet("data", new LinkedHashSet<String>());
    char[][] chars = new char[strings.size()][];
    int i = 0;
    for (String line : strings) {
        chars[i++] = line.toCharArray();
    }
    return chars;
}
Because the StringSet methods (put and get) are only available from Android 3.0, and also because I found the preferences to be less than reliable when storing long strings, especially ones containing 0-chars, I use a different way of storing data in the app.
I use internal files (fileGetInput and fileGetOutput); I create a HashMap<Integer, char[][]> and write it to the file using writeObject. As I have a few of those char arrays, identified by integer IDs, this way I save them all in one go.
I do realise that I may be losing something in terms of performance, but in this case reliability comes first.
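A rough sketch of that file-based approach, assuming an Android Context (the class and file name are hypothetical):
import android.content.Context;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class GridStore {
    private static final String FILE_NAME = "grids.bin"; // hypothetical file name

    // Serialize every board in one go; char[][] is Serializable
    static void save(Context context, HashMap<Integer, char[][]> grids) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(
                context.openFileOutput(FILE_NAME, Context.MODE_PRIVATE))) {
            out.writeObject(grids);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<Integer, char[][]> load(Context context) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(
                context.openFileInput(FILE_NAME))) {
            return (HashMap<Integer, char[][]>) in.readObject();
        }
    }
}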

Reading characters from a file written with .net

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET binary writer uses a 7-bit encoded length prefix for its strings. Researching online, I came across this code, which is supposed to work like the binary reader's ReadString() method. It lives in my CSDataInputStream class, which extends DataInputStream.
public String readStringCS() throws IOException {
    int stringLength = 0;
    boolean stringLengthParsed = false;
    int step = 0;
    while (!stringLengthParsed) {
        byte part = readByte();
        stringLengthParsed = (((int) part >> 7) == 0);
        int partCutter = part & 127;
        part = (byte) partCutter;
        int toAdd = (int) part << (step * 7);
        stringLength += toAdd;
        step++;
    }
    char[] chars = new char[stringLength];
    for (int i = 0; i < stringLength; i++) {
        chars[i] = readChar();
    }
    return new String(chars);
}
The first part seems to be working, as it returns the correct number of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar(), but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.
Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate encoding based on whatever was written from .NET... it will default to UTF-8, but it can be specified as something else.
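Putting that together, a hedged sketch of the corrected method (assuming the .NET side used its default UTF-8 encoding):
public String readStringCS() throws IOException {
    // Parse the 7-bit encoded length prefix written by .NET's BinaryWriter
    int stringLength = 0;
    int step = 0;
    int part;
    do {
        part = readByte() & 0xFF;
        stringLength |= (part & 0x7F) << (step * 7);
        step++;
    } while ((part & 0x80) != 0);
    // The prefix counts bytes, not chars: read the raw bytes and decode them
    byte[] bytes = new byte[stringLength];
    readFully(bytes);
    return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
}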
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.
