Java split xml file - java

I'm working on a piece of code to split files.
I want to split flat files (that part works fine) and XML files.
The idea is to split based on a number of output files:
I have a file, and I want to split it into x files (x is a parameter).
I do the split by taking the size of the file and dividing it by the number of files to produce.
My solution was to use a BufferedReader like this:
while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
The main problem is that I cannot split the XML file at an arbitrary point; I have to split it on a block delimited by a start XML tag and an end XML tag:
<start tag>
bla bla xml stuff
</end tag>
So I cannot cut a block in the middle. If I am halfway through a block when the size of my new file exceeds the maximum, I have to read until the end tag and only then start the next file.
The problem is that there are all sorts of cases, and it is a bit difficult to search for the end tag:
- the buffer ends in the middle of the end tag
- the buffer ends exactly at the end of the end tag, with no other character after it
- etc.
and at the same time I have to loop and read the next buffer.
Sometimes the end XML tag only appears once the end of one buffer is concatenated with the start of the next one.
I hope you get the idea.
My question is: does anyone have an algorithm that does this more accurately and handles all the special cases?
The goal is to split the file as quickly as possible.
I did not want to use a library that treats the file as XML, because a block can be small or very large, and I don't know whether there will be enough memory. Or is there a library that does not load everything into memory?
Thanks a lot.
Below is an example of my XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<myTag service="toto" version="1.5.18" >
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<inventoryDate>2009-12-31</inventoryDate>
<!-- reporting date -->
<processingDate>2010-01-29T00:00:00</processingDate>
</myTag>
I forgot one thing: my XML file could be written entirely on the first line,
so I cannot assume that each line holds one tag.

Although you have stated that you don't want to use a library that treats the file as XML, you might want to consider using SAX.
With SAX, rather than DOM, your fears about memory are allayed: the whole file is not loaded into memory. Instead, events fire as your application reads the file and encounters XML landmarks such as start and end tags.
SAX is also pretty fast.
This quickstart guide should help: http://www.saxproject.org/quickstart.html
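To give an idea of what the event model looks like, here is a minimal sketch (not a full splitter). It assumes the repeated block is a direct child of the root element, as in the sample file, and only counts the points where a block ends, which are the safe places to switch to a new output file. The class name BlockCounter is made up for the example.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts complete blocks while streaming through the file.
final class BlockCounter extends DefaultHandler {
    private final String blockName;
    private int depth;   // current element nesting depth (root = 1)
    private int blocks;  // number of complete blocks seen so far

    BlockCounter(String blockName) {
        this.blockName = blockName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Depth 2 means "direct child of the root" (an assumption about the layout).
        if (depth == 2 && qName.equals(blockName)) {
            blocks++; // a real splitter could start a new output file here
        }
        depth--;
    }

    // Parses the input as a stream; the document is never held in memory as a tree.
    static int count(InputSource in, String blockName)
            throws ParserConfigurationException, SAXException, IOException {
        BlockCounter handler = new BlockCounter(blockName);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.blocks;
    }
}
```

For a file on disk, the InputSource could wrap a FileInputStream. Writing the blocks back out means re-serializing the events yourself, which this sketch leaves out.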

Provided the end tags that you're after are on lines by themselves, you could simply do
String line;
while ((line = reader.readLine()) != null)
instead of:
while ((n = reader.read(buffer, 0, buffer.length)) != -1)
and then split into a new file whenever the line matches an end tag and the current file is large enough.
If they are not on lines by themselves, you could use line.indexOf(...) to find the tag instead, split the line, put the first part in the current file, and save the second part for the next file.
However, as pointed out in the comments, the split XML files will be far from valid XML unless you take care of a few things. For instance, the first part may look like:
<?xml version="1.0" encoding="UTF-8" ?>
<myTag service="toto" version="1.5.18" >
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<endOfPeriodTradeNotification version="1.5.18">
and that's not valid XML. Neither is
<inventoryDate>2009-12-31</inventoryDate>
<!-- reporting date -->
<processingDate>2010-01-29T00:00:00</processingDate>
</myTag>
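For the case where the whole file is on one line, the same idea can be applied to raw character buffers. Here is a rough sketch under a few assumptions: the class name EndTagSplitter and the minChars threshold are invented for the example, and the pieces are returned as strings only to keep it short (a real splitter would write each piece to its own file). It cuts only right after an end tag, and it re-scans the tail of the previous read so that an end tag spanning two reads is still found. The pieces are still not well-formed XML on their own, as noted above.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a character stream right after occurrences of endTag.
final class EndTagSplitter {
    static List<String> split(Reader reader, String endTag, int minChars, int bufferSize)
            throws IOException {
        List<String> pieces = new ArrayList<>();
        StringBuilder current = new StringBuilder(); // text of the piece being built
        char[] buffer = new char[bufferSize];         // bufferSize must be positive
        int searchFrom = 0;                           // where to resume looking for endTag
        int n;
        while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
            current.append(buffer, 0, n);
            int idx;
            while ((idx = current.indexOf(endTag, searchFrom)) != -1) {
                int cut = idx + endTag.length();
                if (cut >= minChars) {
                    // The piece is large enough: cut right after the end tag.
                    pieces.add(current.substring(0, cut));
                    current.delete(0, cut);
                    searchFrom = 0;
                } else {
                    // Too small so far: keep going and look for the next end tag.
                    searchFrom = cut;
                }
            }
            // Next time, re-scan the last endTag.length() - 1 characters so that
            // an end tag straddling two reads is not missed.
            searchFrom = Math.max(searchFrom, current.length() - endTag.length() + 1);
        }
        if (current.length() > 0) {
            pieces.add(current.toString()); // whatever follows the last cut
        }
        return pieces;
    }
}
```

Note that a plain text search like this can be fooled by the end tag appearing inside a comment or CDATA section; a streaming XML parser does not have that problem.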

The best tool for splitting XML files is, hands down, vtd-xml. Not only is it very fast, it also makes the application easy to code, e.g. using XPath.

Related

Windows Bat Find and Replace Line Breaks in Specific Lines

I am not a pro developer and need a simple solution. I have tried using fart.exe within a Windows Bat file to accomplish this, but I am having trouble finding the exact lines where I need to replace line breaks. In an XML file, here is what I am trying to do.
I need to go from this (a few lines in the middle of a larger file):
<meta name="xyz:moreinfohere" content="some content"/>
<meta name="abc:evenmoreinfo" content="more content
and here is where
the problem lies"/>
<meta name="abc:infoagain" content="this is confusing"/>
<meta name="xyz:blahblah" content="please help"/>
to this:
<meta name="xyz:moreinfohere" content="some content"/>
<meta name="abc:evenmoreinfo" content="more content&#xA;and here is where&#xA;the problem lies"/>
<meta name="abc:infoagain" content="this is confusing"/>
<meta name="xyz:blahblah" content="please help"/>
The data filled in these fields will be variable, and this is a fictitious example. Basically, I am trying to replace the line breaks with the &#xA; code, but only on certain lines, as you can see. I have managed to use fart.exe to replace all instances of \n\r, but I can't figure out how to replace only the needed ones. Not every line starts with "meta...". However, every line in the files is supposed to end with ">"; it's the only constant/fixed character on every line in the files. Please help! I am open to anything that works in a standard Windows Bat file (fart, java, etc.)
As you found out, a standard-compliant XML parser will replace a line feed in an attribute's value with a space unless the line feed is encoded using a character reference (e.g. &#xA;). (Reference)
So while I would normally recommend using a proper XML parser, that won't work here because we're trying to fix broken XML (i.e. XML that means something different than what we want it to mean).
We could write a proper XML parser that simply doesn't perform the line feed to space substitution and use that to fix the file, but that's a lot of work. The following is probably sufficient.
Assumptions:
All attribute values that need fixing use double-quotes (not single-quotes).
Double-quotes are always found in pairs in the documents to be fixed.
fix.pl:
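use strict;
use warnings;
local $/;                            # slurp mode: read each file in one go
while (<>) {
    while (1) {
        /\G ( [^"]+ ) /xgc           # text outside quotes: copy as-is
            and print $1;
        /\G \z /xgc                  # end of input: done with this file
            and last;
        /\G ( " [^"]* " ) /xgc       # a quoted value: encode its line feeds
            and do {
                print $1 =~ s/\n/&#xA;/rg;
                next;
            };
        die("Unbalanced quotes");
    }
}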
Usage:
perl fix.pl file_to_fix.xml >fixed_file.xml
or
perl -i.bak fix.pl file_to_fix.xml
The latter modifies the file in-place after making a backup.
After you use this tool, use a file comparison tool (e.g. Beyond Compare) to make sure the fix was properly applied.
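Since Java was mentioned as an option, here is a rough Java port of the same idea, under the same two assumptions as above. It is a sketch, not a drop-in replacement: the class name AttributeNewlineFixer is made up, it works on a string already read into memory, and, unlike the Perl script, it explicitly drops the carriage return of a CRLF pair inside quotes.

```java
// Hypothetical helper: encodes line feeds found between double quotes as &#xA;.
final class AttributeNewlineFixer {
    static String fix(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        boolean inQuotes = false;
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // quotes are assumed to come in pairs
                out.append(c);
            } else if (inQuotes && c == '\n') {
                out.append("&#xA;");  // encode the line feed as a character reference
            } else if (inQuotes && c == '\r'
                    && i + 1 < xml.length() && xml.charAt(i + 1) == '\n') {
                // Design choice of this sketch: drop the CR of a CRLF pair
                // so that only the encoded line feed remains.
            } else {
                out.append(c);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unbalanced quotes");
        }
        return out.toString();
    }
}
```

Line breaks outside quotes are left untouched, so the file keeps its one-element-per-line layout.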

Write stylesheet tag with XML API (STaX/DOM/..)

I'm having some trouble writing a particular XML tag (using an XMLStreamWriter).
Basically, we have an XMLWriter based on "javax.xml.stream.XMLStreamWriter" (StAX), which is working fine.
All the XML files that are written automatically begin with the declaration:
<?xml version="1.0" encoding="ISO-8859-1"?>
What we need now is to add a stylesheet line, so that every XML file begins with:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="myXsl.xsl"?>
I tried to do it the hard-coded way, using XMLStreamWriter.writeCharacters(String), but "<" and ">" are special characters, so the output in the XML file is "&lt;"/"&gt;".
Also, this is not very clean coding.
In the same way that StAX writes the first line using "XMLStreamWriter.writeStartDocument(String encoding, String version)", does anyone know an XML (XSL/XSLT?) API whose writer can write the line:
<?xml-stylesheet type="text/xsl" href="myXsl.xsl"?>
Any help would be much appreciated :)
It is called a processing instruction.
See XMLStreamWriter.writeProcessingInstruction, for instance.
In your case:
writer.writeProcessingInstruction("xml-stylesheet",
"type=\"text/xsl\" href=\"myXsl.xsl\"");
(Not tested.)
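For completeness, a small self-contained sketch of where that call goes: after writeStartDocument and before the root element. The class name StylesheetPiDemo and the root element name "root" are placeholders, and a StringWriter is used only to keep the example short. When writing to a real file, the underlying stream should be opened with the same encoding that is declared in the XML declaration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class: writes the XML declaration, then the stylesheet
// processing instruction, then a placeholder root element.
final class StylesheetPiDemo {
    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument("ISO-8859-1", "1.0");
        // The processing instruction: target first, then its data.
        writer.writeProcessingInstruction("xml-stylesheet",
                "type=\"text/xsl\" href=\"myXsl.xsl\"");
        writer.writeStartElement("root"); // placeholder root element
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```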

java: Adding recursively to an xml

I need help. How can I add content recursively to an XML file? I have a program which processes a file and sends 'line information'. This line information needs to be written to an XML file, as shown below. What I do now is read each line info and then send it to a function which writes the XML. I want to know if there is any way to buffer the Document object and keep appending to it as each new line comes in.
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Rev_1.28>
  <OP type="SAM">
    <SRC_LN_FROM>612612</SRC_LN_FROM>
    <SRC_LN_TO>703703</SRC_LN_TO>
    <NO_LINES>92</NO_LINES>
  </OP>
  <OP type="MOV">
    <SRC_LN_FROM>6122</SRC_LN_FROM>
    <SRC_LN_TO>7033</SRC_LN_TO>
    <NO_LINES>9</NO_LINES>
  </OP>
</Rev_1.28>
You can use a DOM parser to create an org.w3c.dom.Document object, as shown here.
The data is held in main memory, so this approach is acceptable if the data to be written is relatively small.
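A minimal sketch of that approach, assuming the element names from the question: the Document is created once, each incoming line info appends one OP element to it, and the XML is serialized only at the end. The class name OpLog and its method names are invented for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: keeps one Document in memory and appends to it.
final class OpLog {
    private final Document doc;
    private final Element root;

    OpLog(String rootName) throws ParserConfigurationException {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        root = doc.createElement(rootName);
        doc.appendChild(root);
    }

    // Called once per line info; only the in-memory tree is touched.
    void addOp(String type, long from, long to, long lines) {
        Element op = doc.createElement("OP");
        op.setAttribute("type", type);
        op.appendChild(textElement("SRC_LN_FROM", from));
        op.appendChild(textElement("SRC_LN_TO", to));
        op.appendChild(textElement("NO_LINES", lines));
        root.appendChild(op);
    }

    private Element textElement(String name, long value) {
        Element element = doc.createElement(name);
        element.setTextContent(Long.toString(value));
        return element;
    }

    // Serializes the whole tree once, after all lines have been added.
    String toXml() throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Usage would be along the lines of new OpLog("Rev_1.28"), then addOp("SAM", 612612, 703703, 92) for each line, then toXml() once at the end. To write to a file instead, the StreamResult can wrap a File.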

java reads a weird character at the beginning of the file which doesn't exist

I have a simple xml file on my hard drive.
When I open it with notepad++ this is what I see:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>
... more stuff here ...
</content>
But when I read it using a FileInputStream I get:
?<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>...
I'm using JAXB to parse XML files, and it throws a "content not allowed in prolog" exception because of that "?" sign.
What is this extra "?" sign? Why is it there, and how do I get rid of it?
That extra character is a byte order mark, a special Unicode character code which lets the XML parser know what the byte order (little endian or big endian) of the bytes in the file is.
Normally, your XML parser should be able to understand this. (If it doesn't, I would regard that a bug in the XML parser).
As a workaround, make sure that the program that produces this XML leaves off the BOM.
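If the producing program cannot be changed, another option is to skip the BOM yourself before handing the stream to the parser. A minimal sketch, assuming the file is UTF-8 (so the BOM is the three bytes EF BB BF); the class name BomSkipper is made up for the example:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper: returns a stream positioned after a UTF-8 BOM, if any.
final class BomSkipper {
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until we
        // have three bytes or hit the end of the stream.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pushback;
    }
}
```

The returned stream can then be passed to the unmarshaller in place of the raw FileInputStream.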
Check the encoding of the file. I've seen a similar thing: the file looked fine when opened in most editors, but it turned out it was encoded as UTF-8 without BOM (or with, I can't recall off the top of my head). Notepad++ can switch between the two.
You can use Notepad++ to show all symbols via the View > Show Symbol > Show All Characters menu. It would show you any extra bytes present at the beginning. There is a possibility that it is the byte order mark. If the extra bytes are indeed a byte order mark, this approach will not help. In that case, you will need to download a hex editor or, if you have Cygwin installed, follow the steps in the last paragraph of this response. Once you can see the file as hex codes, look at the first two or three bytes. Do they match one of the codes listed at http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
If they are indeed a byte order mark, or if you are unable to determine the cause of the error, just try this:
From the menu, select Encoding > Encode in UTF-8 without BOM, and then save the file.
(On Linux, one can use command line tools to check what is at the beginning of the file, e.g. xxd -g1 filename | head or od -t cx1 filename | head.)
You might be having a newline. Delete that.
Select View > Show Symbol > Show All Characters in Notepad++ to see what's happening.
This is not a JAXB problem; the problem lies in the way you read the XML. Try using an InputStream:
...
Unmarshaller u = jaxbContext.createUnmarshaller();
XmlDataObject xmlDataObject = (XmlDataObject) u.unmarshal(new FileInputStream("foo.xml"));
...
Besides the FileInputStream, a ByteArrayInputStream also worked for me:
JAXB.unmarshal(new ByteArrayInputStream(string.getBytes("UTF-8")), Delivery.class);
=> No unmarshaling error anymore.

Load txt's file into Java application and save it to XML's file

I read the following answer about loading a file into a Java application.
I need to write a program that loads a .txt file containing a list of records. After parsing it, I need to match the records (against conditions that I will check) and save the result to an XML file.
I am stuck on this, and I would be glad to get answers to the following questions:
How do I load the .txt file into Java?
After I load the file, how can I access the information in it? For example, how can I check whether the first line of one of the records is equal to "1"?
How do I export the result to an XML file?
one: you need sample code for reading a file line by line
two: the split method of String might be helpful, for instance to get the number in the first element if the information is separated by spaces
String myLine = "1 some record"; // stands in for a line read from the file
String[] components = myLine.split(" ");
if (components != null && components.length >= 1) {
    int num = Integer.parseInt(components[0]);
    // ....
}
three: you can just write it like any text file, or use any XML writer you want
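Putting the three steps together, here is a rough sketch. The record format (one record per line, fields separated by whitespace), the filter on a first field equal to "1", and the element names records/record/field are all assumptions made for the example, as is the class name TxtToXml.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: reads records line by line and writes matching ones as XML.
final class TxtToXml {
    static void convert(Reader txt, Writer xml) throws IOException, XMLStreamException {
        BufferedReader reader = new BufferedReader(txt);
        XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(xml);
        out.writeStartDocument();
        out.writeStartElement("records");
        String line;
        while ((line = reader.readLine()) != null) {          // one: read line by line
            String[] components = line.trim().split("\\s+");  // two: split into fields
            if (!components[0].equals("1")) {
                continue; // example condition: keep only records whose first field is "1"
            }
            out.writeStartElement("record");                  // three: export as XML
            out.writeAttribute("id", components[0]);
            for (int i = 1; i < components.length; i++) {
                out.writeStartElement("field");
                out.writeCharacters(components[i]);           // escapes <, > and &
                out.writeEndElement();
            }
            out.writeEndElement();
        }
        out.writeEndElement();
        out.writeEndDocument();
        out.close();
    }
}
```

With real files, the arguments could be a FileReader on the .txt file and a FileWriter on the target .xml file, both closed by the caller.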
Basic I/O
Integer.parseInt(firstLine)
There are a plethora of choices:
Create POJOs to represent the records and write them using XMLEncoder
SAX
DOM
