Iterating massive CSVs for comparisons - java

I have two very large CSV files that will only continue to get larger with time. The documents I'm using to test are 170 columns wide and roughly 57,000 rows. This uses data from 2018 to now; ideally the end result will be able to run on CSVs with data going as far back as 2008, which will make the files massive.
Currently I'm using Univocity, but the creator has been inactive on answering questions for quite some time and their website has been down for weeks, so I'm open to changing parsers if need be.
Right now I have the following code:
public void test() throws IOException {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    parserSettings.setHeaderExtractionEnabled(false);

    CsvParser sourceParser = new CsvParser(parserSettings);
    sourceParser.beginParsing(sourceFile);

    Writer writer = new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8);
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    CsvWriter csvWriter = new CsvWriter(writer, writerSettings);
    csvWriter.writeRow(headers);

    String[] sourceRow;
    String[] compareRow;
    while ((sourceRow = sourceParser.parseNext()) != null) {
        CsvParser compareParser = new CsvParser(parserSettings);
        compareParser.beginParsing(Path.of("src/test/resources/" + compareCsv + ".csv").toFile());
        while ((compareRow = compareParser.parseNext()) != null) {
            if (Arrays.equals(sourceRow, compareRow)) {
                break;
            } else {
                if (compareRow[KEY_A].trim().equals(sourceRow[KEY_A].trim()) &&
                        compareRow[KEY_B].trim().equals(sourceRow[KEY_B].trim()) &&
                        compareRow[KEY_C].trim().equals(sourceRow[KEY_C].trim())) {
                    for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
                        csvWriter.writeRow(result);
                    }
                    break;
                }
            }
        }
        compareParser.stopParsing();
    }
}
This all works exactly as I need it to, but as you can tell it takes forever. I'm stopping and restarting the parsing of the compare file because order is not guaranteed in these files, so what is in row 1 of the source CSV could be in row 52,000 of the compare CSV.
The Question:
How do I get this faster? Here are my requirements:
Print row under following conditions:
KEY_A, KEY_B, KEY_C are equal but any other column is not equal
Source row is not found in compare CSV
Compare row is not found in source CSV
Presently I only have the first requirement working, but I need to tackle the speed issue first and foremost. Also, if I try to parse the file into memory I immediately run out of heap space and the application laughs at me.
Thanks in advance.

Also, if I try to parse the file into memory I immediately run out of heap space
Have you tried increasing the heap size? You don't say how large your data file is, but 57,000 rows * 170 columns * 100 bytes per cell = 1 GB, which should pose no difficulty on modern hardware. You can then keep the comparison file in a HashMap for efficient lookup by key.
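A minimal sketch of that approach, reusing the parserSettings, sourceFile, csvWriter, key indices, and getOnlyDifferentValues helper from the question's code, and assuming the joined key values never contain the "|" separator:

// Sketch: index the compare CSV by its composite key once, then stream the source CSV.
Map<String, String[]> compareByKey = new HashMap<>();

CsvParser compareParser = new CsvParser(parserSettings);
compareParser.beginParsing(Path.of("src/test/resources/" + compareCsv + ".csv").toFile());
String[] row;
while ((row = compareParser.parseNext()) != null) {
    compareByKey.put(row[KEY_A].trim() + "|" + row[KEY_B].trim() + "|" + row[KEY_C].trim(), row);
}

CsvParser sourceParser = new CsvParser(parserSettings);
sourceParser.beginParsing(sourceFile);
while ((row = sourceParser.parseNext()) != null) {
    String key = row[KEY_A].trim() + "|" + row[KEY_B].trim() + "|" + row[KEY_C].trim();
    String[] match = compareByKey.remove(key);       // whatever is left over was never matched
    if (match == null) {
        csvWriter.writeRow(row);                     // requirement 2: source row not in compare CSV
    } else if (!Arrays.equals(row, match)) {
        for (String[] result : getOnlyDifferentValues(row, match)) {
            csvWriter.writeRow(result);              // requirement 1: keys equal, other columns differ
        }
    }
}
for (String[] unmatched : compareByKey.values()) {
    csvWriter.writeRow(unmatched);                   // requirement 3: compare row not in source CSV
}

This turns the O(n*m) nested scan into two single passes plus constant-time lookups.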
Alternatively, you could import the CSVs into a database and make use of its join algorithms.
Or if you'd rather reinvent the wheel while scrupulously avoiding memory use, you could first sort the CSVs (by partitioning them into sets small enough to sort in memory, and then doing a k-way merge of the sublists), and then do a merge join. But the other solutions are likely to be a lot easier to implement :-)

Related

Parsing a massive CSV into JSON using Java

I'm trying to parse a huge CSV (56595 lines) into a JSONArray but it's taking a considerable amount of time. This is what my code looks like and it takes ~17 seconds to complete. I'm limiting my results based on one of the columns but the code still has to go through the entire CSV file.
Is there a more efficient way to do this? I've excluded the catch/finally blocks and throws clauses to save space.
File
Code
...
BufferedReader reader = null;
String line = "";
//jArray is retrieved by an ajax call and used in a graph
JSONArray jArray = new JSONArray();
HttpClient httpClient = new DefaultHttpClient();
try {
    //url = CSV file
    HttpGet httpGet = new HttpGet(url);
    HttpResponse response = httpClient.execute(httpGet);
    int responseCode = response.getStatusLine().getStatusCode();
    if (responseCode == 200) {
        try {
            reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
            while ((line = reader.readLine()) != null) {
                JSONObject json = new JSONObject();
                String[] row = line.split(",");
                //skips first three rows
                if (row.length > 2) {
                    //map = 4011
                    if (row[1].equals(map)) {
                        json.put("col0", row[0]);
                        json.put("col1", row[1]);
                        json.put("col2", row[2]);
                        json.put("col3", row[3]);
                        json.put("col4", row[4]);
                        json.put("col5", row[5]);
                        json.put("col6", row[6]);
                        jArray.put(json);
                    }
                }
            }
            return jArray;
...
Unfortunately, the main delay will predictably be the HTTP download of the file itself, so your gains will have to come from optimizing your code. Based upon the info you provided, I can suggest some enhancements to your algorithm:
It was a good idea to process the input file in streaming mode, reading line by line with a BufferedReader. It is usually good practice to set an explicit buffer size (BufferedReader's default is 8 KB), but since the source is a network connection, I doubt it will make much difference in this case. Still, you could try 16 KB, for instance.
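For instance (just the constructor call from your code with an explicit buffer size):

// 16 KB buffer instead of BufferedReader's 8 KB default
reader = new BufferedReader(
        new InputStreamReader(response.getEntity().getContent()), 16 * 1024);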
Since the number of output items is very low (49, you said), it doesn't matter much that you store them in an array (for a larger amount I would have recommended another collection, like LinkedList), but it is always useful to pre-size it with an estimated capacity. With JSONArray, I suppose it would be enough to put a null item at position 100 (for example) at the beginning of your method.
The biggest issue I can see is the call line.split(","), because it makes the program walk the whole line and copy its contents character by character into an array and, worst of all, the result ends up being used in only about 0.1% of cases.
And there might be an even worse drawback: merely splitting by comma might not be a proper way to parse a CSV line. I mean: are you sure the field values cannot contain a comma as part of user data?
To solve this, I suggest you code your own custom parsing algorithm, which might be a little hard, but it will be worth the effort. You need a small state machine in which you detect the second value and, only if it matches the filtering value ("4011"), continue parsing the rest of the line. This way, you will save a large amount of time and memory.
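A rough sketch of that idea, assuming the filter value sits in the second comma-separated field and that no fields are quoted (map and jArray are the variables from the question):

// Sketch: find the second field without splitting the whole line,
// and only split fully when it matches the filter value.
int firstComma = line.indexOf(',');
int secondComma = firstComma >= 0 ? line.indexOf(',', firstComma + 1) : -1;
if (secondComma > firstComma + 1) {
    String secondField = line.substring(firstComma + 1, secondComma);
    if (secondField.equals(map)) {              // map = "4011" in the question
        String[] row = line.split(",");         // full split only for the ~0.1% of matching lines
        JSONObject json = new JSONObject();
        for (int i = 0; i <= 6 && i < row.length; i++) {
            json.put("col" + i, row[i]);
        }
        jArray.put(json);
    }
}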

Using StAX to create index for XML for quick access

Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?
I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.
So my idea is this: Create an index and then quickly access data from the large file.
I can't just split the file because it's an official federal database that I want to use unaltered.
Using a XMLStreamReader I can quickly find some element and then use JAXB for unmarshalling the element.
final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
r.nextTag();
while (r.hasNext()) {
    final int eventType = r.next();
    if (eventType == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("foo")
            && Long.parseLong(r.getAttributeValue(null, "bla")) == bla) {
        // JAX-B works just fine:
        final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
        System.out.println(foo.getValue().getName());
        // But how do I get the offset?
        // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
        break;
    }
}
But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)
Then I should be able use the offset to just unmarshall from there: Open file stream, skip that many bytes, unmarshall.
I can't find a library that does this. And I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.
Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long, but it hardly requires any RAM, whereas just using JAX-B on the whole file requires 300 MB of RAM. Using some embedded db system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway. Anything else would be useless for me since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.
I can't find a DB that just needs an XSD to create an in-memory DB, which doesn't use that much RAM. It's all made for servers or it's required to define a schema and map the XML. So I assume it just doesn't exist.
You could work with a generated XML parser using ANTLR4.
The following works very well on a ~17 GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.
1. Get XML Grammar
cd /tmp
git clone https://github.com/antlr/grammars-v4
2. Generate Parser
cd /tmp/grammars-v4/xml/
mvn clean install
3. Copy Generated Java files to your Project
cp -r target/generated-sources/antlr4 /path/to/your/project/gen
4. Hook in with a Listener to collect character offsets
package stack43366566;

import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;

public class FindXmlOffset {

    List<Integer> offsets = null;
    String searchForElement = null;

    public class MyXMLListener extends XMLParserBaseListener {
        public void enterElement(XMLParser.ElementContext ctx) {
            String name = ctx.Name().get(0).getText();
            if (searchForElement.equals(name)) {
                offsets.add(ctx.start.getStartIndex());
            }
        }
    }

    public List<Integer> createOffsets(String file, String elementName) {
        searchForElement = elementName;
        offsets = new ArrayList<>();
        try {
            XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XMLParser parser = new XMLParser(tokens);
            DocumentContext ctx = parser.document();
            ParseTreeWalker walker = new ParseTreeWalker();
            MyXMLListener listener = new MyXMLListener();
            walker.walk(listener, ctx);
            return offsets;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] arg) {
        System.out.println("Search for offsets.");
        List<Integer> offsets = new FindXmlOffset().createOffsets(
                "/tmp/dewiki-20170501-pages-articles-multistream.xml", "page");
        System.out.println("Offsets: " + offsets);
    }
}
5. Result
Prints:
Offsets: [2441, 10854, 30257, 51419 ....
6. Read from Offset Position
To test the code, I've written a class that reads each Wikipedia page into a Java object
@JacksonXmlRootElement
class Page {
    public Page() {}
    public String title;
}
using basically this code
private Page readPage(Integer offset, String filename) {
    try (Reader in = new FileReader(filename)) {
        in.skip(offset);
        ObjectMapper mapper = new XmlMapper();
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        Page object = mapper.readValue(in, Page.class);
        return object;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
Find the complete example on GitHub.
I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.
The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. You have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file using the provided offsets is not going to be very efficient - you can't just set a pointer into the file at your offset and start reading, you have to read through until you get to the offset (that's what skip does under the covers in a Reader), then start extracting. If you're dealing with very large files, that means retrieval of content near the end of the file is too slow.
I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile, which means it's super fast at any point in the file.
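As a rough illustration of the final retrieval step (a sketch only; it assumes you have already recorded the element's start and end byte offsets and that the file is UTF-8):

// Sketch: read one element's raw bytes directly, without scanning the file from the start.
static String extractElement(File xmlFile, long startByte, long endByte) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(xmlFile, "r")) {
        raf.seek(startByte);
        byte[] buf = new byte[(int) (endByte - startByte)];
        raf.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }
}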
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
Some people have commented that this whole thing is a bad idea, asking why you would want to do it: XML is a transport mechanism, you should just import it into a DB and work with the data with more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents; having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.

How to handle importing a CSV file with differing column lengths

I'm working on a project for school and am having a really hard time figuring out how to import a CSV file and get it into a usable format. The CSV contains a movie name in the first column and showtimes in the rest of the row, so it would look something like this:
movie1, 7pm, 8pm, 9pm, 10pm
movie2, 5pm, 8pm
movie3, 3pm, 7pm, 10pm
I think I want to split each row into its own array, maybe an ArrayList of the arrays? I really don't know where to even start, so any pointers would be appreciated.
I'd prefer not to use any external libraries.
I would go with a Map having movie name as key and timings as list like the one below:
Map<String, List<String>> movieTimings = new HashMap<>();
It will read through the CSV file and put the values into this map. If the key already exists, then we just need to add the value to the list. You can use the computeIfAbsent method of Map (Java 8) to handle whether the entry exists or not, e.g.:
public static void main(String[] args) {
    Map<String, List<String>> movieTimings = new HashMap<>();
    String timing = "7pm"; // It will be read from the csv
    movieTimings.computeIfAbsent("test", s -> new ArrayList<>()).add(timing);
    System.out.println(movieTimings);
}
This will populate your map once the file is read. As far as reading the file is concerned, you can use BufferedReader or OpenCSV (if your project allows you to use third-party libraries).
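Putting the two pieces together, a minimal sketch that assumes the file looks like the sample in the question (no quoted fields; the file name showtimes.csv is made up):

// Sketch: read the CSV line by line and collect showtimes per movie.
Map<String, List<String>> movieTimings = new HashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader("showtimes.csv"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] parts = line.split(",");
        String movie = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            movieTimings.computeIfAbsent(movie, k -> new ArrayList<>()).add(parts[i].trim());
        }
    }
}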
I have no affiliation with Univocity - but their Java CSV parser is amazing and free. When I had a question, one of the developers got back to me immediately. http://www.univocity.com/pages/about-parsers
You read in a line and then cycle through the fields. Since you know the movie name is always there, along with at least one movie time, you can set it up any way you like, including an ArrayList of ArrayLists (so both are variable-length).
It works well with or without quotes around the fields (necessary when there are apostrophes or commas in the movie names). In the problem I solved, all rows had the same number of columns, but I did not know the number of columns before I parsed the file, and each file often had a different number of columns and column names; it worked perfectly.
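For illustration, a small sketch of reading such a file with Univocity into variable-length rows (the file name showtimes.csv is assumed):

// Sketch: each String[] may have a different length, matching the ragged rows.
CsvParserSettings settings = new CsvParserSettings();
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("showtimes.csv"));
for (String[] row : rows) {
    System.out.println(row[0] + " has " + (row.length - 1) + " showtimes");
}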
You can use opencsv to read the CSV file and add each String[] to an ArrayList. There are examples in the FAQ section of opencsv's website.
Edit: If you don't want to use external libraries you can read the CSV using a BufferedReader and split the lines by commas.
BufferedReader br = null;
try {
    List<String[]> data = new ArrayList<String[]>();
    br = new BufferedReader(new FileReader(new File("csvfile")));
    String line;
    while ((line = br.readLine()) != null) {
        String[] lineData = line.split(",");
        data.add(lineData);
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (br != null) try { br.close(); } catch (Exception e) {}
}

How do we deal with a large GATE Document

I'm getting java.lang.OutOfMemoryError: GC overhead limit exceeded when I try to execute the pipeline if the GATE Document I use is slightly large.
The code works fine if the GATE Document is small.
My Java code is something like this:
TestGate Class:
public void gateProcessor(Section section) throws Exception {
    Gate.init();
    Gate.getCreoleRegister().registerDirectories(....
    SerialAnalyserController pipeline .......
    pipeline.add(All the language analyzers)
    pipeline.add(My Jape File)
    Corpus corpus = Factory.newCorpus("Gate Corpus");
    Document doc = Factory.newDocument(section.getContent());
    corpus.add(doc);
    pipeline.setCorpus(corpus);
    pipeline.execute();
}
The Main Class Contains:
StringBuilder body = new StringBuilder();
int character;
FileInputStream file = new FileInputStream(
        new File("filepath\\out.rtf")); // The Document in question
while (true) {
    character = file.read();
    if (character == -1) break;
    body.append((char) character);
}
Section section = new Section(body.toString()); // Creating object of type Section with content field = body.toString()
TestGate testgate = new TestGate();
testgate.gateProcessor(section);
Interestingly, this also fails in the GATE Developer tool; the tool basically gets stuck if the document exceeds a specific limit, say more than one page.
This proves that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE Document?
You need to call
corpus.clear();
Factory.deleteResource(doc);
after each document, otherwise you'll eventually get OutOfMemory on docs of any size if you run it enough times (although, given the way you initialize GATE in the method, it seems like you really only need to process a single document once).
Besides that, annotations and features usually take lots of memory. If you have an annotation-intensive pipeline, i.e. you generate lots of annotations with lots of features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially - for instance a JAPE or Groovy PR that generates n to the power of W annotations, where W is the number of words in your doc. Or if you have a feature for each possible word combination in your doc, that would generate factorial-of-W strings.
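A sketch of where that cleanup could go inside the question's gateProcessor method:

// Sketch: release the document and corpus after every run so repeated calls
// don't accumulate GATE resources in memory.
try {
    pipeline.setCorpus(corpus);
    pipeline.execute();
} finally {
    corpus.clear();
    Factory.deleteResource(doc);
}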
A new pipeline object is created every time, which is why it takes up so much memory. That's why you should clean up after every use of ANNIE:
pipeline.cleanup();
pipeline=null;

Sorting a 100MB XML file with Java?

How long does sorting a 100 MB XML file with Java take?
The file has items with the following structure, and I need to sort them by event:
<doc>
<id>84141123</id>
<title>kk+ at Hippie Camp</title>
<description>photo by SFP</description>
<time>18945840</time>
<tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
<geo></geo>
<event>47409</event>
</doc>
I'm on an Intel Dual Duo Core with 4 GB of RAM.
Minutes ? Hours ?
thanks
Here are the timings for a similar task executed using Saxon XQuery on a 100 MB input file.
Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816
So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
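For reference, a sketch of invoking such a query from Java via Saxon's s9api (the query text, element names, and file paths are placeholders, and the exact API may differ slightly between Saxon versions):

// Sketch: sort <doc> elements by <event> with Saxon s9api; adjust names and paths as needed.
Processor processor = new Processor(false);          // false = no licensed features required
XQueryCompiler compiler = processor.newXQueryCompiler();
XQueryExecutable executable = compiler.compile(
        "<docs>{ for $d in /docs/doc order by xs:integer($d/event) return $d }</docs>");
XQueryEvaluator evaluator = executable.load();
evaluator.setSource(new StreamSource(new File("input.xml")));
Serializer out = processor.newSerializer(new File("sorted.xml"));
evaluator.run(out);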
I would say minutes - you should be able to do that completely in memory, so with a SAX parser it would be read-sort-write; that should not be a problem for your hardware.
I think a problem like this would be better sorted using serialisation.
Deserialise the XML file into an ArrayList of 'doc'.
Using straight Java code, apply a sort on the event attribute (see the sketch after this list) and store the sorted ArrayList in another variable.
Serialise the sorted 'doc' ArrayList back out to a file.
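A minimal sketch of the sort step, assuming a hypothetical Doc class that the deserialisation step produces:

// Sketch: sort the deserialised items by their numeric event value.
static class Doc {
    long event;        // value of the <event> element
    String rawXml;     // original <doc>...</doc> fragment, kept for re-serialisation
}

static void sortByEvent(List<Doc> docs) {
    docs.sort(Comparator.comparingLong(d -> d.event));
}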
If you do it in memory, you should be able to do this in under 10 seconds. You would be pushing it to do it in under 2 seconds, because it will spend that much time reading/writing to disk.
This program should use no more than 4-5x the original file size, about 500 MB in your case.
// FileUtils is from Apache Commons IO.
String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");

Map<Long, String> recordMap = new TreeMap<Long, String>();
for (int i = 1; i < records.length; i += 2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1 + 4);
    long num = Long.parseLong(record.substring(pos1 + 4, pos2)); // id text starts right after "<id>"
    recordMap.put(num, record);
}

StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length - 1]);

FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());
