Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?
I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.
So my idea is this: Create an index and then quickly access data from the large file.
I can't just split the file because it's an official federal database that I want to use unaltered.
Using an XMLStreamReader I can quickly find an element and then use JAXB to unmarshal it.
final XMLInputFactory xf = XMLInputFactory.newInstance();
final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
r.nextTag();
while (r.hasNext()) {
    final int eventType = r.next();
    if (eventType == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("foo")
            && Long.parseLong(r.getAttributeValue(null, "bla")) == bla) {
        // JAX-B works just fine:
        final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
        System.out.println(foo.getValue().getName());
        // But how do I get the offset?
        // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
        break;
    }
}
But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)
Then I should be able to use the offset to unmarshal from there: open a file stream, skip that many bytes, unmarshal.
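The retrieval step I have in mind would be something like this (only a hypothetical sketch, reusing xf and unmarshaller from the snippet above and assuming the cached offset is a byte offset of the element start):
try (FileInputStream in = new FileInputStream(filename)) {
    long skipped = 0;
    while (skipped < offset) {                 // InputStream.skip may skip fewer bytes than asked for
        skipped += in.skip(offset - skipped);
    }
    final XMLStreamReader r = xf.createXMLStreamReader(in);
    r.nextTag();                               // should now be positioned on <foo ...>
    final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
}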
I can't find a library that does this. And I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.
Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long, but it hardly needs any RAM. Loading everything with JAX-B alone requires 300 MB of RAM. Using some embedded DB system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway; anything else would be useless for me since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.
I can't find a DB that just needs an XSD to create an in-memory DB and doesn't use that much RAM. Everything is either made for servers or requires defining a schema and mapping the XML. So I assume it just doesn't exist.
You could work with a generated XML parser using ANTLR4.
The following works very well on a ~17 GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size with -Xmx6g.
1. Get XML Grammar
cd /tmp
git clone https://github.com/antlr/grammars-v4
2. Generate Parser
cd /tmp/grammars-v4/xml/
mvn clean install
3. Copy Generated Java files to your Project
cp -r target/generated-sources/antlr4 /path/to/your/project/gen
4. Hook in with a Listener to collect character offsets
package stack43366566;

import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;

public class FindXmlOffset {

    List<Integer> offsets = null;
    String searchForElement = null;

    public class MyXMLListener extends XMLParserBaseListener {
        public void enterElement(XMLParser.ElementContext ctx) {
            String name = ctx.Name().get(0).getText();
            if (searchForElement.equals(name)) {
                offsets.add(ctx.start.getStartIndex());
            }
        }
    }

    public List<Integer> createOffsets(String file, String elementName) {
        searchForElement = elementName;
        offsets = new ArrayList<>();
        try {
            XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XMLParser parser = new XMLParser(tokens);
            DocumentContext ctx = parser.document();
            ParseTreeWalker walker = new ParseTreeWalker();
            MyXMLListener listener = new MyXMLListener();
            walker.walk(listener, ctx);
            return offsets;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] arg) {
        System.out.println("Search for offsets.");
        List<Integer> offsets = new FindXmlOffset().createOffsets(
                "/tmp/dewiki-20170501-pages-articles-multistream.xml", "page");
        System.out.println("Offsets: " + offsets);
    }
}
5. Result
Prints:
Offsets: [2441, 10854, 30257, 51419 ....
6. Read from Offset Position
To test the code I've written a class that reads each Wikipedia page into a Java object
@JacksonXmlRootElement
class Page {
    public Page() {}
    public String title;
}
using basically this code
private Page readPage(Integer offset, String filename) {
try (Reader in = new FileReader(filename)) {
in.skip(offset);
ObjectMapper mapper = new XmlMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
Page object = mapper.readValue(in, Page.class);
return object;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
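For instance, reading back the first page found above might look like this (the offset value is taken from the sample output in step 5):
Page first = readPage(2441, "/tmp/dewiki-20170501-pages-articles-multistream.xml");
System.out.println(first.title);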
Find the complete example on GitHub.
I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.
The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
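As a minimal sketch of what that looks like with plain StAX and Woodstox on the classpath (not taken from the library mentioned below; the element and attribute names mirror the question's snippet), the standard Location object reports usable character offsets:
XMLInputFactory xif = XMLInputFactory.newInstance();   // resolves to Woodstox when it is on the classpath
XMLStreamReader reader = xif.createXMLStreamReader(new FileInputStream(filename));
Map<String, Integer> index = new HashMap<>();
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT && "foo".equals(reader.getLocalName())) {
        // getCharacterOffset() reports where the parser currently is (typically just past the
        // start tag), so you may need to adjust it to point at the '<' of the element
        index.put(reader.getAttributeValue(null, "id"), reader.getLocation().getCharacterOffset());
    }
}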
The second problem is the actual type of offset you use. You have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file using the provided offsets is not going to be very efficient - you can't just set a pointer into the file at your offset and start reading, you have to read through until you get to the offset (that's what skip does under the covers in a Reader), then start extracting. If you're dealing with very large files, that means retrieval of content near the end of the file is too slow.
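A rough sketch of that read-through, char-offset style of retrieval (file is the XML file, charOffset a value from the index):
try (Reader in = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
    long remaining = charOffset;
    while (remaining > 0) {
        remaining -= in.skip(remaining);   // skip() decodes and discards every preceding character
    }
    // the reader is now positioned at the element; parse/extract from here
}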
I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile, which means it's super fast at any point in the file.
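The retrieval side of that, sketched generically rather than with the library's actual API: once you have the byte offsets of an element's start and end, the bytes can be pulled straight out of the file.
String extractElement(File file, long startByte, long endByte, Charset charset) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        byte[] buf = new byte[(int) (endByte - startByte)];
        raf.seek(startByte);        // jump directly to the element, no read-through required
        raf.readFully(buf);
        return new String(buf, charset);
    }
}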
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
Some people have commented that this whole thing is a bad idea and asked why you would want to do it: XML is a transport mechanism, you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents; having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.
Related
I'm not going to lie, I'm really bad at making regular expressions. I'm currently trying to parse a text file that is giving me a lot of issues. The goal is to extract the data between the respective "tags/titles". The file in question is a .qbo file laid out as follows (personal information replaced with "DATA"). The parts I care about retrieving are between the "STMTTRN" and "/STMTTRN" tags; the rest I don't plan on putting in my database, but I figured it would help others see the file content I'm working with. I apologize for any confusion prior to this update.
FXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
<SIGNONMSGSRSV1><SONRS>
<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<DTSERVER>20190917133617.000[-4:EDT]</DTSERVER>
<LANGUAGE>ENG</LANGUAGE>
<FI>
<ORG>DATA</ORG>
<FID>DATA</FID>
</FI>
<INTU.BID>DATA</INTU.BID>
<INTU.USERID>DATA</INTU.USERID>
</SONRS></SIGNONMSGSRSV1>
<BANKMSGSRSV1>
<STMTTRNRS>
<TRNUID>0</TRNUID>
<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<STMTRS>
<CURDEF>USD</CURDEF>
<BANKACCTFROM>
<BANKID>DATA</BANKID>
<ACCTID>DATA</ACCTID>
<ACCTTYPE>CHECKING</ACCTTYPE>
<NICKNAME>FREEDOM CHECKING</NICKNAME>
</BANKACCTFROM>
<BANKTRANLIST>
<DTSTART>20190717</DTSTART><DTEND>20190917</DTEND>
<STMTTRN><TRNTYPE>POS</TRNTYPE><DTPOSTED>20190717071500</DTPOSTED><TRNAMT>-5.81</TRNAMT><FITID>3893120190717WO</FITID><NAME>DATA</NAME><MEMO>POS Withdrawal</MEMO></STMTTRN>
<STMTTRN><TRNTYPE>DIRECTDEBIT</TRNTYPE><DTPOSTED>20190717085000</DTPOSTED><TRNAMT>-728.11</TRNAMT><FITID>4649920190717WE</FITID><NAME>CHASE CREDIT CRD</NAME><MEMO>DATA</MEMO></STMTTRN>
<STMTTRN><TRNTYPE>ATM</TRNTYPE><DTPOSTED>20190717160900</DTPOSTED><TRNAMT>-201.99</TRNAMT><FITID>6674020190717WA</FITID><NAME>DATA</NAME><MEMO>ATM Withdrawal</MEMO></STMTTRN>
</BANKTRANLIST>
<LEDGERBAL><BALAMT>2024.16</BALAMT><DTASOF>20190917133617.000[-4:EDT]</DTASOF></LEDGERBAL>
<AVAILBAL><BALAMT>2020.66</BALAMT><DTASOF>20190917133617.000[-4:EDT]</DTASOF></AVAILBAL>
</STMTRS>
</STMTTRNRS>
</BANKMSGSRSV1>
</OFX>
I want to be able to end with data that looks or acts like the following so that each row of data can easily be added to a database:
Example Parse
As David has already answered, it is good to parse the POS output XML using Java. If you are more interested in a regex to get all the information, you can use this regular expression.
<[^>]+>|\\n+
You can test it on the following sites.
https://rubular.com/
https://www.regextester.com/
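A short sketch of how that regex might be applied in Java, where stmttrnBlock is a hypothetical string holding the text of a single <STMTTRN>...</STMTTRN> element; splitting on the tags/newlines leaves just the field values:
String[] values = stmttrnBlock.trim().split("<[^>]+>|\\n+");
for (String value : values) {
    if (!value.isEmpty()) {
        System.out.println(value);   // e.g. POS, 20190717071500, -5.81, 3893120190717WO, ...
    }
}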
Given this is XML, I would do one of two things:
either use the Java DOM objects to marshal/unmarshal to/from Java objects (nodes and elements), or
use JAXB to achieve something similar but with better POJO representation.
Mkyong has tutorials for both. Try the DOM parsing or JAXB ones. His tutorials are simple and easy to follow.
JAXB requires more work and dependencies. So try DOM first.
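A minimal DOM sketch of that first option, assuming the plain-text header lines (FXHEADER:100 ... NEWFILEUID:NONE) are stripped off first so that only the <OFX>...</OFX> markup (held here in a hypothetical ofxXml string) is fed to the parser:
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(ofxXml)));
NodeList txns = doc.getElementsByTagName("STMTTRN");
for (int i = 0; i < txns.getLength(); i++) {
    Element txn = (Element) txns.item(i);
    String type = txn.getElementsByTagName("TRNTYPE").item(0).getTextContent();
    String amount = txn.getElementsByTagName("TRNAMT").item(0).getTextContent();
    System.out.println(type + " " + amount);   // one row per transaction, ready for the database
}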
I would propose the following approach.
Read file line by line with Files:
final List<String> lines = Files.readAllLines(Paths.get("/path/to/file"));
At this point you would have all the file lines separated and ready to be converted into something more useful. But you should create a class beforehand.
Create a class for your data in line, something like:
public class STMTTRN {
private String TRNTYPE;
private String DTPOSTED;
...
...
//constructors
//getters and setters
}
Now that you have the data in separate strings and a class to hold it, you can convert the lines to objects with Jackson:
final XmlMapper xmlMapper = new XmlMapper();
final STMTTRN stmttrn = xmlMapper.readValue(lines.get(0), STMTTRN.class);
You may want to create a loop or make use of a stream with a mapper and a collector to get the list of STMTTRN objects:
final List<STMTTRN> stmttrnData = lines.stream().map(this::mapLine).collect(Collectors.toList());
Where the mapper might be:
private STMTTRN mapLine(final String line) {
final XmlMapper xmlMapper = new XmlMapper();
try {
return xmlMapper.readValue(line, STMTTRN.class);
} catch (IOException e) {
throw new RuntimeException(e);
}
}
I have a piece of Legacy software called Mixmeister that saved off playlist files in an MMP format.
This format appears to contain binary as well as file paths.
I am looking to extract the file paths along with any additional information I can from these files.
I see this has been done using Java (I do not know Java) here (see around line 56):
https://github.com/liesen/CueMeister/blob/master/src/mixmeister/mmp/MixmeisterPlaylist.java
and Haskell here:
https://github.com/larjo/MixView/blob/master/ListFiles.hs
So far, I have tried reading the file as binary (got stuck), using regular expressions (messy output with moderate success), and attempting some code to read chunks (beyond my skill level).
The code I am using with moderate success for Regex is:
import re

file = 'C:\\Users\\xxx\\Desktop\\mixmeisterfile.mmp'

with open(file, 'r', encoding="Latin-1") as filehandle:
# with open(file, 'rb') as filehandle:
    for text in filehandle:
        b = re.search('TRKF(.*)TKLYTRKM', text)
        if b:
            print(b.group())
Again, this gets me close but is messy (the data is not all intact and is surrounded by ASCII and binary characters). Basically, my logic is just searching between two strings to attempt to extract the filenames. What I am really trying to do is get closer to what the Java code on GitHub does, which is (the code below is sampled from the GitHub link):
List<Track> tracks = new ArrayList<Track>();
Marker trks = null;
for (Chunk chunk : trkl.getChunks()) {
TrackHeader header = new TrackHeader();
String file = "";
List<Marker> meta = new LinkedList<Marker>();
if (chunk.canContainSubchunks()) {
for (Chunk chunk2 : ((ChunkContainer) chunk).getChunks()) {
if ("TRKH".equals(chunk2.getIdentifier())) {
header = readTrackHeader(chunk2);
} else if ("TRKF".equals(chunk2.getIdentifier())) {
file = readTrackFile(chunk2);
} else {
if (chunk2.canContainSubchunks()) {
for (Chunk chunk3 : ((ChunkContainer) chunk2).getChunks()) {
if ("TRKM".equals(chunk3.getIdentifier())) {
meta.add(readTrackMarker(chunk3));
} else if ("TRKS".equals(chunk3.getIdentifier())) {
trks = readTrackMarker(chunk3);
}
}
}
}
}
}
Track tr = new Track(header, file, meta);
I am guessing this would use either RIFF or the chunk library in Python if not done using a regex? Although I read the documentation at https://docs.python.org/2/library/chunk.html, I am not sure I understand how to go about something like this - mainly I do not understand how to properly read the binary file, which has file paths visibly mixed in.
I don't really know what's going on here, but I'll try my best, and if it doesn't work out then please excuse my stupidity. When I had a project parsing weather data for a METAR, I realized that my main issue was that I was trying to turn everything into a String type, which wasn't suitable for all the data, so it would just come out as nothing. Your for loop should work just fine. However, when you traverse, have you tried making everything the same type, such as a Character/String type? Perhaps certain elements are messed up simply because they don't match the type you are going for.
I have a CSV file of US population data for every county in the US. I need to get each population from the 8th column of the file. I'm using a FileReader() and a BufferedReader() and am not sure how to use the split method to accomplish this. I know this isn't much information, but I know that I'll be using my args[0] as the destination in my class.
I'm at a loss as to where to begin, to be honest.
import java.io.BufferedReader;
import java.io.FileReader;

public class Main {
    public static void main(String[] args) {
        try {
            BufferedReader buff = new BufferedReader(new FileReader(args[0]));
            String line;
            // not sure where to go from here
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The output should be an integer of the total US population. Any help with pointing me in the right direction would be great.
Don't reinvent the wheel, don't parse CSV yourself: use a library. Even such a simple format as CSV has nuances: fields can be escaped with quotes or unescaped, the file may or may not have a header, and so on. Besides that, you have to test and maintain the code you've written. So writing less code and reusing libraries is good.
There are plenty of CSV libraries for Java:
Apache Commons CSV
OpenCSV
Super CSV
Univocity
flatpack
IMHO, the first two are the most popular.
Here is an example for Apache Commons CSV:
final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);
for (final CSVRecord record : records) { // Simply iterate over the records via a foreach loop. All the parsing is handled for you
    String populationString = record.get(7); // Indexes are zero-based
    // String populationString = record.get("population"); // Or, if your file has headers, you can use them instead
… // Do whatever you want with the population
}
Look how easy it is! And it will be similar with other parsers.
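For the total-US-population goal in the question, the same loop could simply accumulate the 8th column; the column index and the assumption that it parses as a whole number are guesses about the file:
long totalPopulation = 0;
for (final CSVRecord record : records) {
    totalPopulation += Long.parseLong(record.get(7).trim()); // 8th column, zero-based index 7
}
System.out.println("Total US population: " + totalPopulation);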
I'm getting the error java.lang.OutOfMemoryError: GC overhead limit exceeded when I try to execute the pipeline if the GATE Document I use is slightly large.
The code works fine if the GATE Document is small.
My Java code is something like this:
TestGate Class:
public void gateProcessor(Section section) throws Exception {
Gate.init();
Gate.getCreoleRegister().registerDirectories(....
SerialAnalyserController pipeline .......
pipeline.add(All the language analyzers)
pipeline.add(My Jape File)
Corpus corpus = Factory.newCorpus("Gate Corpus");
Document doc = Factory.newDocument(section.getContent());
corpus.add(doc);
pipeline.setCorpus(corpus);
pipeline.execute();
}
The Main Class Contains:
StringBuilder body = new StringBuilder();
int character;
FileInputStream file = new FileInputStream(
new File(
"filepath\\out.rtf")); //The Document in question
while (true)
{
character = file.read();
if (character == -1) break;
body.append((char) character);
}
Section section = new Section(body.toString()); //Creating object of Type Section with content field = body.toString()
TestGate testgate = new TestGate();
testgate.gateProcessor(section);
Interestingly, this fails in the GATE Developer tool as well; the tool basically gets stuck if the document exceeds a specific limit, say more than one page.
This proves that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE Document?
You need to call
corpus.clear();
Factory.deleteResource(doc);
after each document, otherwise you'll eventually get an OutOfMemory error on any size of docs if you run it enough times (although, by the way you initialize GATE in the method, it seems like you really only need to process a single document once).
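A rough sketch of that point, reusing the question's variables and a hypothetical list of sections to process; the pipeline is created once, and each document is released after it has been processed:
for (Section section : sections) {                  // "sections" is a hypothetical list of inputs
    Document doc = Factory.newDocument(section.getContent());
    corpus.add(doc);
    pipeline.setCorpus(corpus);
    pipeline.execute();
    corpus.clear();                                 // detach the document from the corpus
    Factory.deleteResource(doc);                    // free the document's memory
}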
Besides that, annotations and features usually take lots of memory. If you have an annotation-intensive pipeline, i.e. you generate lots of annotations with lots of features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially - for instance a JAPE or Groovy resource that generates n to the power of W annotations, where W is the number of words in your doc. Or if you have a feature for each possible word combination in your doc, that would generate a factorial of W strings.
The pipeline object is created every time, and that's why it takes huge amounts of memory. That's why you should clean up every time you use 'Annie':
pipeline.cleanup();
pipeline=null;
I have offline JSON definitions (in assets folder) - and with them I create my data model. It has like 8 classes which all inherit (extend) one abstract Model class.
Would it be better solution if I parse the JSON and keep the model in memory (more or less everything is Integer or String) through the whole life cycle of the App or would it be smarter if I parse the JSON files as they are needed?
thanks
Parsing the files and storing all the data in memory will definitely give you a speed advantage. The problem with this solution is that if your application goes to the background (the user receives a phone call or just leaves the app on purpose), no one can guarantee that the data will stay intact in memory.
This data can be cleared by the GC if the system decides that it needs more memory.
This means that when the user comes back to the application, if you rely on the data being in memory, you might face an exception. So you need to consider this situation.
From that point of view it is good to store your data in a file that can be parsed at the desired time, even though this might be a slower solution.
Another solution you may look at is to parse this data at first application start-up into an SQLite DB and use it from there, or even store it in the DB in the first place. This gives you the advantages of both worlds: you would not have to parse the data before using it, you will have quick access to it using a Cursor, and you will not face the problem of data deletion in case of insufficient memory in the system.
I'd read all the file content at once and keep it as a static String somewhere in my application that is available to all application components (Singleton pattern), since maintaining a small string in memory is usually much cheaper than opening and closing files frequently.
To address the GC point @Emil raised, you can write your code something like this:
public class DataManager {

    private static String myData;

    public static String getData(Context context) {
        if (myData == null) {
            loadData(context);
        }
        return myData;
    }

    private static void loadData(Context context) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(context.getAssets().open("data.txt"), "UTF-8"));
            StringBuilder builder = new StringBuilder();
            String mLine;
            while ((mLine = reader.readLine()) != null) {
                builder.append(mLine);
            }
            reader.close();
            myData = builder.toString();
        } catch (IOException e) {
            // ignore or log; myData stays null and will be retried on the next call
        }
    }
}
And from any class in your application that has a valid Context reference:
String data = DataManager.getData(context);