XMLStreamReader doesn't read complete tag - java

I'm parsing XML using XMLStreamReader. The <dbresponse> tag contains data loaded from a database (a WebRowSet object). The problem is that the content of this tag is very long (several hundred kilobytes; the data is Base64-encoded), but input.getText() reads only 16,394 characters of it.
I'm 100% sure the data coming into the XMLStreamReader is OK.
I found another answer here, but it doesn't solve my problem. I could of course read the data some other way, but I'd like to know what the problem is with this one.
Does somebody know how to get the whole content?
My code:
input = xmlFactory.createXMLStreamReader(new ByteArrayInputStream(xmlData.getBytes("UTF-8")));
while (input.hasNext()) {
    if (input.getEventType() == XMLStreamConstants.START_ELEMENT) {
        element = input.getName().getLocalPart();
        switch (element.toLowerCase()) {
            case "transactionresponse":
                int transactionStatus = 0;
                transactionResponse = new TransactionResponse();
                for (int i = 0; i < input.getAttributeCount(); i++) {
                    switch (input.getAttributeLocalName(i)) {
                        case "status": transactionStatus = TransactionResponse.getStatusFromName(input.getAttributeValue(i));
                    }
                }
                transactionResponse.setStatus(transactionStatus);
                break;
            case "dbresponse":
                for (int i = 0; i < input.getAttributeCount(); i++) {
                    switch (input.getAttributeLocalName(i)) {
                        case "request_id": id = Integer.parseInt(input.getAttributeValue(i)); break;
                        case "status": status = Response.getStatusFromName(input.getAttributeValue(i));
                    }
                }
                break;
        }
    } else if (input.getEventType() == XMLStreamConstants.CHARACTERS) {
        switch (element.toLowerCase()) {
            case "dbresponse":
                String data = input.getText();
                if (!data.equals("\n")) {
                    data = new String(Base64.decode(data), "UTF-8");
                }
                Response response = new Response(data, status, id);
                if (transactionResponse != null) {
                    transactionResponse.addResponse(response);
                } else {
                    this.addResponse(response);
                }
                id = -1;
                status = -1;
                break;
        }
        element = "";
    } else if (input.getEventType() == XMLStreamConstants.END_ELEMENT) {
        switch (input.getLocalName().toLowerCase()) {
            case "transactionresponse": this.addTransactionResponse(transactionResponse); transactionResponse = null; break;
        }
    }
    input.next();
}

Event-driven XML parsers such as XMLStreamReader are designed to let you parse XML without reading it into memory all at once, which is essential when the XML is very large.
The design is such that it reads a certain buffer of data, and gives you events as it runs into "interesting" stuff, such as the beginning of a tag, the end of a tag, and so on.
But the buffer it reads into is finite, precisely because it is meant to handle large XML files, exactly like the one you have. For this reason, a long text in a tag may be delivered as several consecutive CHARACTERS events.
That is, when you get a CHARACTERS event, there is no guarantee that it contains the whole text. If the text is too long for the reader's buffer, you will simply get more CHARACTERS events that follow.
Since you are only reading the data from the first CHARACTERS event, you are not getting the whole text.
The proper way to work with such a file is:
When you get a START_ELEMENT event for the element you are interested in, you make preparations for storing the text. For example, create a StringBuilder, or open a file for writing, etc.
For each CHARACTERS event that follows, you append the text to your storage (the StringBuilder, the file).
Once you get the END_ELEMENT event for the same element, you finish accumulating your data, and do whatever you need to do with it.
In fact, this is what the getElementText() method does for you - accumulates the data in a StringBuffer while going through CHARACTERS events until it hits the END_ELEMENT.
Bottom line: you only know you got the whole data when you hit the END_ELEMENT event. There is no guarantee that the text will be in a single CHARACTERS event.
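A minimal sketch of that accumulation pattern, reusing the question's reader (input) and element name; everything else here is illustrative:
// Hedged sketch: collect every CHARACTERS event between the
// START_ELEMENT and END_ELEMENT of <dbresponse> before decoding.
StringBuilder text = null;
while (input.hasNext()) {
    switch (input.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            if ("dbresponse".equalsIgnoreCase(input.getLocalName())) {
                text = new StringBuilder(); // start accumulating
            }
            break;
        case XMLStreamConstants.CHARACTERS:
            if (text != null) {
                text.append(input.getText()); // may fire several times
            }
            break;
        case XMLStreamConstants.END_ELEMENT:
            if ("dbresponse".equalsIgnoreCase(input.getLocalName()) && text != null) {
                String whole = text.toString(); // only now is the text complete
                // ... decode the Base64 payload and build the Response here ...
                text = null;
            }
            break;
    }
    input.next();
}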

I think the XMLStreamReader chunks the data, so maybe try looping over getText() to concatenate all the chunks?
What about the getElementText() method?
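For completeness, a tiny sketch of that approach with the question's reader (it assumes the cursor is on the <dbresponse> START_ELEMENT and that the element contains only text):
if (input.getEventType() == XMLStreamConstants.START_ELEMENT
        && "dbresponse".equalsIgnoreCase(input.getLocalName())) {
    // getElementText() concatenates all the CHARACTERS events of a
    // text-only element and leaves the cursor on its END_ELEMENT.
    String data = input.getElementText();
}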

Related

StringBuilder.append outofmemory

I'm using StringBuilder.append() to parse and process a file as follows:
StringBuilder csvString = new StringBuilder();
bufferedReader.lines().filter(line -> !line.startsWith(HASH) && !line.isEmpty()).map(line -> line.trim())
.forEachOrdered(line -> csvString.append(line).append(System.lineSeparator()));
int startOfFileTagIndex = csvString.indexOf(START_OF_FILE_TAG);
int startOfFieldsTagIndex = csvString.indexOf(START_OF_FIELDS_TAG, startOfFileTagIndex);
int endOfFieldsTagIndex = csvString.indexOf(END_OF_FIELDS_TAG, startOfFieldsTagIndex);
int startOfDataTagIndex = csvString.indexOf(START_OF_DATA_TAG, endOfFieldsTagIndex);
int endOfDataTagIndex = csvString.indexOf(END_OF_DATA_TAG, startOfDataTagIndex);
int endOfFileTagIndex = csvString.indexOf(END_OF_FILE_TAG, endOfDataTagIndex);
int timeStartedIndex = csvString.indexOf("TIMESTARTED", endOfFieldsTagIndex);
int dataRecordsIndex = csvString.indexOf("DATARECORDS", endOfDataTagIndex);
int timeFinishedIndex = csvString.indexOf("TIMEFINISHED", endOfDataTagIndex);
if (startOfFileTagIndex != 0 || startOfFieldsTagIndex == -1 || endOfFieldsTagIndex == -1
|| startOfDataTagIndex == -1 || endOfDataTagIndex == -1 || endOfFileTagIndex == -1) {
log.error("not in correct format");
throw new Exception("not in correct format.");
}
The problem is that when the file is quite large, I get an OutOfMemoryError.
Can you help me transform my code to avoid that exception with large files?
Edit:
As I understand it, loading a huge file into a StringBuilder is not a good idea and won't work.
So the question is: which structure in Java is the most appropriate for parsing my huge file, deleting some lines, finding the index of some lines, and separating the file into parts according to the found indexes (where do I store those parts, which can also be huge?), then creating an output file at the end?
The OOM seems to be due to the fact that you are storing all the lines in the StringBuilder. When the file has too many lines, it takes up a huge amount of memory and may lead to OOM.
The strategy to avoid this depends on what you are doing with the appended strings.
As far as I can see from your code, you are only trying to verify the structure of the input file. In that case, you don't need to store all the lines in a StringBuilder instance. Instead (see the sketch below):
Have one int to hold each index you are interested in (or an array of ints).
Instead of adding the line to the StringBuilder, detect the presence of the "tag" you are looking for and save its position in its designated int variable.
Finally, the check you are already doing may need to change to test not against -1 but relative to the other indices. (This you are currently achieving by passing a start index to indexOf().)
If there is a risk of a tag spanning lines, then you may not be able to use streams; instead, use a simple loop in which you save a few previous lines, append them, and check. (Just one idea; you may have a better one.)
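A minimal sketch of that idea, assuming each tag appears on its own line and reusing the question's names (bufferedReader, log, HASH, and the tag constants); it records line numbers instead of character offsets and omits the ordering checks for brevity:
// Hedged sketch: scan line by line, recording where each tag occurs,
// without ever holding the whole file in memory.
int startOfFileLine = -1, startOfFieldsLine = -1, endOfFieldsLine = -1;
int startOfDataLine = -1, endOfDataLine = -1, endOfFileLine = -1;
int lineNo = 0;
String line;
while ((line = bufferedReader.readLine()) != null) {
    line = line.trim();
    if (line.startsWith(HASH) || line.isEmpty()) {
        continue; // same filter as the original stream pipeline
    }
    if (line.contains(START_OF_FILE_TAG) && startOfFileLine == -1) {
        startOfFileLine = lineNo;
    } else if (line.contains(START_OF_FIELDS_TAG) && startOfFieldsLine == -1) {
        startOfFieldsLine = lineNo;
    } else if (line.contains(END_OF_FIELDS_TAG) && endOfFieldsLine == -1) {
        endOfFieldsLine = lineNo;
    } else if (line.contains(START_OF_DATA_TAG) && startOfDataLine == -1) {
        startOfDataLine = lineNo;
    } else if (line.contains(END_OF_DATA_TAG) && endOfDataLine == -1) {
        endOfDataLine = lineNo;
    } else if (line.contains(END_OF_FILE_TAG) && endOfFileLine == -1) {
        endOfFileLine = lineNo;
    }
    lineNo++; // counts only the lines the filter keeps
}
if (startOfFileLine != 0 || startOfFieldsLine == -1 || endOfFieldsLine == -1
        || startOfDataLine == -1 || endOfDataLine == -1 || endOfFileLine == -1) {
    log.error("not in correct format");
    throw new Exception("not in correct format.");
}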

Processing bufferUntil() method only works with '\n'

TL;DR: bufferUntil() and readStringUntil() work fine when set to '\n' but create problems for other characters.
The code that sends data to pc is below;
Serial.print(rollF);
Serial.print("/");
Serial.println(pitchF);
And the relevant parts from processing are;
myPort = new Serial(this, "COM3", 9600); // starts the serial communication
myPort.bufferUntil('\n');
void serialEvent (Serial myPort) {
// reads the data from the Serial Port up to the character '\n' and puts it into the String variable "data".
data = myPort.readStringUntil('\n');
// if you got any bytes other than the linefeed:
if (data != null) {
data = trim(data);
// split the string at "/"
String items[] = split(data, '/');
if (items.length > 1) {
//--- Roll,Pitch in degrees
roll = float(items[0]);
pitch = float(items[1]);
}
}
}
A picture from my incoming data(from arduino serial monitor):
0.62/-0.52
0.63/-0.52
0.63/-0.52
0.64/-0.53
0.66/-0.53
0.67/-0.53
0.66/-0.54
Up to here, everything is fine, as it should be. Nothing special. The problem occurs when I change the parameter of bufferUntil() and readStringUntil() to anything other than '\n'. Of course when I do that, I also change the corresponding parts of the Arduino code. For example, when replacing '\n' with 'k', the incoming data seen in the Arduino serial monitor looks like
45.63/22.3k21.51/77.32k12.63/88.90k
and goes on like that. But Processing cannot get the second value in each buffer. When I check by also printing the values to the Processing console, the first value (roll) comes out right, but the second value (pitch) is shown as NaN. So what is the problem? Why does it only work with '\n'?
I cannot check it right now but I think you might have two issues.
First off, you don't need to use bufferUntil() and readStringUntil() at the same time.
And second, and more important, both functions take the character as an int, so if you want to read until the character 'k' you should do:
data = myPort.readStringUntil(int('k'));
Or, since k is ASCII code 107:
data = myPort.readStringUntil(107);
If you call the function with the wrong type, as you are doing, nothing will happen and the port will keep reading until it finds the default line feed.
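Putting the two points together, a sketch of the adapted handler (untested; it reuses the question's variables). One more thing worth noting: readStringUntil() includes the terminating character in the returned string, and trim() only strips whitespace, so a trailing 'k' has to be removed by hand or float() on the last item will return NaN:
void serialEvent(Serial myPort) {
  // read up to and including 'k' (ASCII 107), per the advice above
  data = myPort.readStringUntil(107);
  if (data != null) {
    // drop the trailing 'k' terminator, then trim whitespace
    data = trim(data.substring(0, data.length() - 1));
    String items[] = split(data, '/');
    if (items.length > 1) {
      roll = float(items[0]);
      pitch = float(items[1]);
    }
  }
}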

Checking each line of data in a text file and identifying invalid data

So I've looked around and couldn't find anything specifically related to what I'm wanting to accomplish, so I'm here to ask some of you folks if y'all could help. I am a uni student, and am struggling to wrap my head around a specific task.
The task revolves around the following:
Being able to have the program we develop check each line of data in a file we input, and report any errors (such as missing data) to the console via messages.
I am currently using Scanner to scan the file and split() to split the text at each hyphen that it finds, and then placing that data into a String[] splitText array. The code for that is as follows:
File Fileobject = new File(importFile);
Scanner fileReader = new Scanner(Fileobject);
while (fileReader.hasNext()) {
    String line = fileReader.nextLine();
    String[] splitText = line.split("-");
}
The text contained within the file we are scanning, is formatted as follows:
Title - Author - Price - Publisher - ISBN
Title, Author and Publisher are of varying lengths, ISBN is 11 characters, and Price is to two decimal places. I am able to easily print valid data to the console, but it's the whole business of validating and printing errors (such as: "The book title may be missing.") to the console that has my head twisted.
Would IF statements be suited to checking each line of data? And if so, how would those be structured?
If you want to check the length/presence of each of the five columns, then consider the following:
while (fileReader.hasNext()) {
    String line = fileReader.nextLine();
    String[] splitText = line.split("-");
    if (splitText.length < 5) {
        System.out.println("One or more columns is entirely missing.");
        continue; // skip this line
    }
    if (splitText[0].length() == 0) {
        System.out.println("Title is missing");
    }
    if (splitText[1].length() == 0) {
        System.out.println("Author is missing");
    }
    boolean isValidPrice = true;
    try {
        Double.parseDouble(splitText[2]);
    } catch (Exception e) {
        isValidPrice = false;
    }
    if (!isValidPrice) {
        System.out.println("Found an invalid price " + splitText[2] + " but expected a decimal.");
    }
    if (splitText[4].length() != 11) {
        System.out.println("Found an invalid ISBN.");
    }
}
I do a two-level validation above. If splitting the current line on a dash does not yield 5 terms, then we have missing columns and we do not even attempt to guess what data might actually be there. If the 5 expected columns are there, then we validate each field by length and/or by expected value.
Yes, your best bet is to use if statements (I can't think of another way?). For cleanliness, I recommend you create a validateData(String data) method, or multiple validator functions.
For example, because you know each line is going to be in the Title - Author - Price - Publisher - ISBN format, you can write code like this:
public void validatePrice(String data) {
//Write your logic to validate.
}
public void validateAuthor(String data) {
//Write your logic to validate.
}
...
Then in your while loop you can call
validatePrice(splitText[2]);
validateAuthor(splitText[1]);
for each validator method.
Depending on your needs you can turn this more a bit more OOP style, but this is one cleanish way to do it.
The first thing you want to check for validation is that you have the proper number of entries (in this case, check that the array is of size 5); after that, you check each piece of data.
If statements are a good way to go, and you can do something as simple as:
if(title.isBad()) print("error");
if(author.isBad()) print("error");
if(price.isBad()) print("error");
if(publisher.isBad()) print("error");
if(isbn.isBad()) print("error");
Replacing the .isBad with which ever clauses you are checking, such as string[i].isEmpty(), the length of the ISBN, etc.
For ones that take longer to check, such as the Price, you'll want to write some nested loops, checking whether it contains a period, contains only numbers, and has only 2 digits after the period.
Something helpful to know is the Wrapper classes for the primitive data types, if allows you to do
Character.isLetter(strings[i].charAt(j))
in the place of
(strings[i].charAt(j) >= 'A' && strings[i].charAt(j) <= 'Z') ||
(strings[i].charAt(j) >= 'a' && strings[i].charAt(j) <= 'z')
and
try{
Double.parseDouble(strings[i]);
}
instead of manually checking the price.
Hope this helps!
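As an illustration, here is a sketch of what two such validator methods could look like, combining the suggestions above (the method bodies and the validateIsbn name are assumptions, not part of the original task):
// Hedged sketch: column layout assumed to be
// Title - Author - Price - Publisher - ISBN.
public void validatePrice(String data) {
    try {
        Double.parseDouble(data.trim()); // must be numeric at all
        // and must have exactly two digits after the decimal point
        if (!data.trim().matches("\\d+\\.\\d{2}")) {
            System.out.println("The price is not to two decimal places.");
        }
    } catch (NumberFormatException e) {
        System.out.println("The book price may be missing or is not a number.");
    }
}

public void validateIsbn(String data) {
    if (data.trim().length() != 11) {
        System.out.println("The ISBN may be missing or is not 11 characters.");
    }
}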

How to inflate a git tree object?

I'm writing some Java classes to read information from Git objects. Every class works the same way: the file is retrieved using the repo path and the hash, then it is opened, inflated, and read a line at a time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.
The code I use to read the files is the same everywhere:
FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));
and it works without issues for every object beside trees. When I try to read a tree this way I get this:
tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�
It seems that the file names and the octal modes are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in other git objects?
The core of the problem is that there are two encodings inside a git tree file (and this isn't so clear from the documentation). Most of the file is ASCII, which means it can be read with whatever you like, but the hashes are not encoded as text; they are simply raw bytes.
Since there are two different encodings, the best solution is to read the file byte by byte, keeping in mind what's where.
My solution (I'm only interested in the names and hashes of the contents, so the rest is simply thrown away):
FileInputStream fis = new FileInputStream(this.filepath);
InflaterInputStream inStream = new InflaterInputStream(fis);
int i = -1;
// Header: "tree <size>", terminated by a NUL byte
while ((i = inStream.read()) != 0) {
    // first line
}
// Entries: "<octal mode> <filename>\0<20 raw hash bytes>", repeated
while ((i = inStream.read()) != -1) {
    while ((i = inStream.read()) != 0x20) { // 0x20 is the space char
        // permission bytes
    }
    // Filename: NUL-terminated
    StringBuilder filename = new StringBuilder();
    while ((i = inStream.read()) != 0) {
        filename.append((char) i);
    }
    // Hash: 20 bytes long, can contain any value; the only way
    // to be sure is to count the bytes
    StringBuilder hash = new StringBuilder();
    for (int count = 0; count < 20; count++) {
        i = inStream.read();
        hash.append(String.format("%02x", i)); // zero-pad each byte
    }
}
OIDs are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes."
To answer a why do it that way? follow-up: it has its upsides and downsides, and you hit a downside. There's not much point talking about it; the pain/gain ratio of any change to that decision would be horrendous.
and read a line at a time.
Don't Do That. One upside of the store-as-binary decision is that it breaks code that relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".

Java iteration reading & parsing

I have a log file that I am reading into a string:
public static String read(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    FileInputStream fs = new FileInputStream(path);
    InputStream in = new BufferedInputStream(fs);
    int r;
    while ((r = in.read()) != -1) {
        sb.append((char) r);
    }
    in.close(); // closing the wrapper also closes the underlying stream
    return sb.toString();
}
Then I have a parser that iterates over the entire string once
void parse() throws IOException {
    String con = read("log.txt");
    for (int i = 0; i < con.length(); i++) {
        /* parsing action */
    }
}
This is a huge waste of CPU cycles: I loop over all the content in read(), then I loop over all of it again in parse(). I could just place the /* parsing action */ inside the while loop in the read() method, which would be fine, but I don't want to copy the same code all over the place.
How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading?
In C# I understand there is some sort of yield return mechanism, but I'm stuck with Java.
What are my options in Java?
This is a huge waste of CPU cycles: I loop over all the content in read(), then I loop over all of it again in parse(). I could just place the /* parsing action */ inside the while loop in the read() method, which would be fine, but I don't want to copy the same code all over the place.
It's worse than just a huge waste of CPU cycles. It's a huge waste of memory to read the entire file into a string if you're only going to use it once, and that use looks at one character at a time moving forward, as your code indicates. And if your file is large, you'll exhaust memory.
You should parse as you read, and never have the entire file loaded into memory at once.
If the parsing action needs to be called from more than one place, make it a method and call it rather than copying the same code all over the place. Copying a single-line method call is fine.
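One way to keep reading and parsing in separate methods while still making a single pass is to hand the reader a callback; a minimal sketch (the class name and the use of java.util.function.IntConsumer are illustrative choices, not the only option):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.IntConsumer;

public class StreamingParse {
    // Read the file once, feeding each character to the supplied callback
    // instead of accumulating the whole file into a String first.
    public static void read(String path, IntConsumer parser) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int r;
            while ((r = in.read()) != -1) {
                parser.accept(r); // parsing happens during the single pass
            }
        }
    }

    public static void main(String[] args) throws IOException {
        read("log.txt", c -> {
            /* parsing action, one char at a time */
        });
    }
}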
