Parsing a massive CSV into JSON using Java

I'm trying to parse a huge CSV (56595 lines) into a JSONArray but it's taking a considerable amount of time. This is what my code looks like and it takes ~17 seconds to complete. I'm limiting my results based on one of the columns but the code still has to go through the entire CSV file.
Is there a more efficient way to do this? I've excluded the catch/finally blocks and throws clauses to save space.
File
Code
...
BufferedReader reader = null;
String line = "";
// jArray is retrieved by an ajax call and used in a graph
JSONArray jArray = new JSONArray();
HttpClient httpClient = new DefaultHttpClient();
try {
    // url = CSV file
    HttpGet httpGet = new HttpGet(url);
    HttpResponse response = httpClient.execute(httpGet);
    int responseCode = response.getStatusLine().getStatusCode();
    if (responseCode == 200) {
        try {
            reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
            while ((line = reader.readLine()) != null) {
                JSONObject json = new JSONObject();
                String[] row = line.split(",");
                // skips the first three rows
                if (row.length > 2) {
                    // map = 4011
                    if (row[1].equals(map)) {
                        json.put("col0", row[0]);
                        json.put("col1", row[1]);
                        json.put("col2", row[2]);
                        json.put("col3", row[3]);
                        json.put("col4", row[4]);
                        json.put("col5", row[5]);
                        json.put("col6", row[6]);
                        jArray.put(json);
                    }
                }
            }
            return jArray;
        }
...

Unfortunately, the main delay will predictably be the HTTP download of the file, so your remaining options rest on optimizing the parsing code. Based on the info you provided, I can suggest some enhancements to your algorithm:
It was a good idea to process the input in streaming mode, reading line by line with a BufferedReader. It is usually good practice to set an explicit buffer size (BufferedReader's default is 8 KB), but since the source is a network connection, I doubt it will make much difference here. Anyway, you could try 16 KB, for instance.
Since the number of output items is very low (49, you said), it doesn't matter much that you store them in an array (for a larger amount I would have recommended another collection, like LinkedList), but it is always useful to pre-size it with an estimated capacity. With JSONArray, I suppose it would be enough to put a null item at position 100 (for example) at the beginning of your method.
The biggest cost I can think of is the call to line.split(","), because it makes the program walk the whole line and copy its contents, character by character, into an array, and, worst of all, only to actually use the result in a tiny fraction of cases.
And there might be an even worse drawback: merely splitting by comma might not be a proper way to parse a CSV line. I mean: are you sure the values cannot contain a comma as part of user data?
To solve this, I suggest you code your own custom parsing algorithm, which might be a little hard, but will be worth the effort. You must code a state machine in which you detect the second value and, only if the key coincides with the filtering value ("4011"), continue parsing the rest of the line. This way you will save a big amount of time and memory.
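For illustration only (this is not from the original answer), here is a minimal sketch of the "check the key before paying for the split" idea. It assumes plain comma-separated lines with no quoted fields, and reuses the reader, map and jArray variables from the question:

// Sketch: look at the second column with indexOf/regionMatches and only split
// the line when it matches the filter value. Assumes no quoted fields.
String line;
while ((line = reader.readLine()) != null) {
    int firstComma = line.indexOf(',');
    if (firstComma < 0) {
        continue; // header or malformed line
    }
    int secondComma = line.indexOf(',', firstComma + 1);
    if (secondComma < 0) {
        continue;
    }
    // does the second column equal map ("4011") exactly?
    if (secondComma - firstComma - 1 == map.length()
            && line.regionMatches(firstComma + 1, map, 0, map.length())) {
        String[] row = line.split(","); // only now pay for the full split
        JSONObject json = new JSONObject();
        for (int i = 0; i < 7 && i < row.length; i++) {
            json.put("col" + i, row[i]);
        }
        jArray.put(json);
    }
}

This is not the full state machine described above, but it avoids the split() and the JSONObject allocation for the large majority of lines that do not match.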

Related

Iterating massive CSVs for comparisons

I have two very large CSV files that will only continue to get larger with time. The documents I'm using to test are 170 columns wide and roughly 57,000 rows. They use data from 2018 to now; ideally the end result will be able to run on CSVs with data going as far back as 2008, which will make them massive.
Currently I'm using Univocity, but the creator has been inactive in answering questions for quite some time and their website has been down for weeks, so I'm open to changing parsers if need be.
Right now I have the following code:
public void test() {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    parserSettings.setHeaderExtractionEnabled(false);
    CsvParser sourceParser = new CsvParser(parserSettings);
    sourceParser.beginParsing(sourceFile);
    Writer writer = new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8);
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    CsvWriter csvWriter = new CsvWriter(writer, writerSettings);
    csvWriter.writeRow(headers);
    String[] sourceRow;
    String[] compareRow;
    while ((sourceRow = sourceParser.parseNext()) != null) {
        CsvParser compareParser = new CsvParser(parserSettings);
        compareParser.beginParsing(Path.of("src/test/resources/" + compareCsv + ".csv").toFile());
        while ((compareRow = compareParser.parseNext()) != null) {
            if (Arrays.equals(sourceRow, compareRow)) {
                break;
            } else {
                if (compareRow[KEY_A].trim().equals(sourceRow[KEY_A].trim()) &&
                        compareRow[KEY_B].trim().equals(sourceRow[KEY_B].trim()) &&
                        compareRow[KEY_C].trim().equals(sourceRow[KEY_C].trim())) {
                    for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
                        csvWriter.writeRow(result);
                    }
                    break;
                }
            }
        }
        compareParser.stopParsing();
    }
}
This all works exactly as I need it to, but of course as you can obviously tell it takes forever. I'm stopping and restarting the parsing of the compare file because order is not guaranteed in these files, so what is in row 1 in the source CSV could be in row 52,000 in the compare CSV.
The Question:
How do I get this faster? Here are my requirements:
Print row under following conditions:
KEY_A, KEY_B, KEY_C are equal but any other column is not equal
Source row is not found in compare CSV
Compare row is not found in source CSV
Presently I only have the first requirement working, but I need to tackle the speed issue first and foremost. Also, if I try to parse the file into memory I immediately run out of heap space and the application laughs at me.
Thanks in advance.
Also, if I try to parse the file into memory I immediately run out of heap space
Have you tried increasing the heap size? You don't say how large your data file is, but 57,000 rows * 170 columns * 100 bytes per cell is about 1 GB, which should pose no difficulty on modern hardware. You could then keep the comparison file in a HashMap for efficient lookup by key.
Alternatively, you could import the CSVs into a database and make use of its join algorithms.
Or, if you'd rather reinvent the wheel while scrupulously avoiding memory use, you could first sort the CSVs (by partitioning them into sets small enough to sort in memory and then doing a k-way merge of the sublists), and then do a merge join. But the other solutions are likely to be a lot easier to implement :-)
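To make the HashMap suggestion concrete, here is a rough sketch, not a drop-in implementation. It reuses the parserSettings and KEY_A/KEY_B/KEY_C constants from the question; compareFile and the compositeKey helper are hypothetical names introduced for the example:

// Sketch: index the compare CSV once by its composite key, then stream the
// source CSV and look each row up in O(1) instead of re-parsing the compare
// file once per source row.
Map<String, String[]> indexCompareFile(File compareFile, CsvParserSettings parserSettings) {
    Map<String, String[]> index = new HashMap<>(70_000); // pre-size for ~57k rows
    CsvParser compareParser = new CsvParser(parserSettings);
    compareParser.beginParsing(compareFile);
    String[] row;
    while ((row = compareParser.parseNext()) != null) {
        index.put(compositeKey(row), row);
    }
    compareParser.stopParsing();
    return index;
}

// Hypothetical helper: composite lookup key built from the three key columns.
String compositeKey(String[] row) {
    return row[KEY_A].trim() + "\u0001" + row[KEY_B].trim() + "\u0001" + row[KEY_C].trim();
}

// In test(): look up each source row instead of re-parsing the compare file.
//   String[] compareRow = compareIndex.remove(compositeKey(sourceRow));
//   compareRow == null                        -> source row not in compare CSV (requirement 2)
//   keys equal but !Arrays.equals(rows)       -> print the differing columns (requirement 1)
//   entries left in the map after the loop    -> compare rows not in source CSV (requirement 3)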

How to parse a text file with separated json objects in Java? [closed]

I have a text file which gets updated every 15-16 minutes with some JSON data. The JSON entries are separated by lines of #### characters. A snippet of the file:
[{"accountId":"abc","items":[{"serviceName":"XYZ","dataCenter":"TG","startTimeUtc":"2017-04-05T19:57:33.509+0000","endTimeUtc":"2017-04-05T19:57:33.509+0000","usage":[{"resourceName":"XYZ_EE_PAAS_GATEWAYS","quantity":7,"units":"number"}]}]},{"accountId":"XYZp1cm9mbe","items":[{"serviceName":"XYZ","dataCenter":"TG","startTimeUtc":"2017-04-05T19:57:33.509+0000","endTimeUtc":"2017-04-05T19:57:33.509+0000","usage":[{"resourceName":"XYZ_EE_PAAS_GATEWAYS","quantity":6,"units":"number"}]}]}]
######################
[{"accountId":"abc","items":[{"serviceName":"XYZ","dataCenter":"TG","startTimeUtc":"2017-04-05T19:59:33.523+0000","endTimeUtc":"2017-04-05T19:59:33.523+0000","usage":[{"resourceName":"XYZ_EE_PAAS_GATEWAYS","quantity":7,"units":"number"}]}]},{"accountId":"XYZp1cm9mbe","items":[{"serviceName":"XYZ","dataCenter":"TG","startTimeUtc":"2017-04-05T19:59:33.523+0000","endTimeUtc":"2017-04-05T19:59:33.523+0000","usage":[{"resourceName":"XYZ_EE_PAAS_GATEWAYS","quantity":6,"units":"number"}]}]}]
######################
[{"accountId":"abc","items":[{"serviceName":"XYZ","dataCenter":"TG","startTimeUtc":"2017-04-05T20:01:33.531+0000","endTimeUtc":"2017-04-05T20:01:33.531+0000","usage":[{"resourceName":"XYZ_EE_PAAS_GATEWAYS","quantity":7,"units":"number"}]}]},{"accountId":"XYZp1cm9mbe","items":[{"serviceName":"XYZ","dataCenter":"TG","startTimeUtc":"2017-04-05T20:01:33.531+0000","endTimeUtc":"2017-04-05T20:01:33.531+0000","usage":[{"resourceName":"XYZ_EE_PAAS_GATEWAYS","quantity":6,"units":"number"}]}]}]
######################
The file gets a new entry every 15-16 minutes. I want to read the file and store the latest entry, excluding the #### line, in a JSON object. How do I do that in Java? I don't want to rely on the 15-minute interval, as it is not constant.
My simple requirement is that at any point in time I can read the file and retrieve the last JSON entry above the final ### line.
With Java 8, you can do it like this:
public JsonObject retrieveLastEntry(Path path) throws IOException {
    String[] jsonLines = Files.lines(path)
            .filter(line -> !line.equals("######################"))
            .toArray(String[]::new);
    String lastJsonLine = jsonLines[jsonLines.length - 1];
    return MyFavoriteJsonParser.parse(lastJsonLine);
}
MyFavoriteJsonParser refers to whatever JSON library you want to use (maybe have a look at this question). There may be a few performance considerations here. If your file is very large (considerably more than a few MB), then the .toArray() call may not be right for you. In fact, if performance is extremely crucial, you might even need to consider parsing the file backwards. But the golden rule of performance optimization is to go with a simple solution first and see whether (and where) it is not performant enough.
If your JSON goes across lines, however, the Stream API is not the best choice. In that case, a regular iteration comes to the rescue:
public JsonObject retrieveLastEntry(File file) throws IOException {
    String lastJson = "";
    StringBuffer sb = new StringBuffer();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals("######################")) {
                lastJson = sb.toString();
                sb.setLength(0);
            } else {
                sb.append(line).append('\n');
            }
        }
    }
    return MyFavoriteJsonParser.parse(lastJson);
}
The basic idea is to aggregate lines between the ###... and put them into a variable whenever a new separator is reached. You still might want to consider the case of having no entry at all and handle IOExceptions properly.
I think this is pretty much the idiomatic way of doing it.

Why we need to use BufferedReader instead of String while getting the response from server

I am building an Android application which gets some JSON values from a server. I did some studying and wrote the code to get data from the server into my app.
I am simply using the code below for that.
HttpResponse response;
Object content = null;
HttpGet httpget = new HttpGet(url);
response = client.execute(targetHost, httpget,localContext);
HttpEntity entity = response.getEntity();
content = EntityUtils.toString(entity);
Log.d("content", "OK: " + content.toString());
When I looked into some tutorials, they used a BufferedReader for the same operation, like this:
HttpResponse httpResponse = httpClient.execute(httpPost);
InputStream inputStream = httpResponse.getEntity().getContent();
InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
StringBuilder stringBuilder = new StringBuilder();
String bufferedStrChunk = null;
while((bufferedStrChunk = bufferedReader.readLine()) != null){
stringBuilder.append(bufferedStrChunk);
}
Log.d("content", "OK: " + stringBuilder.toString());
So my questions are:
Why use the BufferedReader method? Is there any advantage to using it?
Till now the first method has been working fine for me; is there any chance of errors or trouble because of the first method in the future?
Thanks :).
Why use the BufferedReader method? Is there any advantage to using it?
The java.io.BufferedReader class reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
Buffering can speed up IO quite a bit. Rather than read one character at a time from the network or disk, you read a larger block at a time.
Is there any chance of errors or trouble because of the first method in the future?
No, there will not be any error if you do it directly in your code. But it may degrade your performance a little bit (maybe by milliseconds), and another thing is that if you don't use BufferedReader, the size of the stream coming from the server will not be known.
The BufferedReader class provides buffering for your Readers.
Buffering can speed up IO quite a bit. Rather than read one character
at a time from the network or disk, you read a larger block at a time.
This is typically much faster, especially for disk access and larger
data amounts.
The main difference between BufferedReader and BufferedInputStream is
that Readers work on characters (text), whereas InputStreams work on
raw bytes.
Great explanation is here: Java IO: BufferedReader
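For illustration only (not part of either answer), here is a minimal sketch of reading a response body with a BufferedReader and an explicit buffer size; the InputStream is assumed to come from httpResponse.getEntity().getContent():

// Sketch: read the entity stream line by line with an explicit 16 KB buffer
// (BufferedReader's default is 8 KB).
static String readBody(InputStream in) throws IOException {
    StringBuilder body = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, "UTF-8"), 16 * 1024)) {
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line).append('\n');
        }
    }
    return body.toString();
}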
The StringBuffer class is used when there is a need to make a lot of modifications to a string of characters.
Unlike Strings, objects of type StringBuffer can be modified over and over again without leaving behind a lot of new unused objects.
An analogy for the line in question: a StringBuffer/StringBuilder append is fast compared to concatenating two String objects.
stringBuilder.append(bufferedStrChunk); // fast: appends into the existing buffer
String str = "";
str = str + "some string"; // slow: each concatenation creates a new String object

Google Custom Search API, how can I traverse google result pages programmatically using Java?

The following code, taken from Java code for using google custom search API, works correctly to extract the first 10 results from the first page of Google results.
public static void main(String[] args) throws Exception {
    String key = "YOUR KEY";
    String qry = "Android";
    URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key
            + "&cx=013036536707430787589:_pqjad5hr1a&q=" + qry + "&alt=json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");
    BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String output;
    System.out.println("Output from Server .... \n");
    while ((output = br.readLine()) != null) {
        if (output.contains("\"link\": \"")) {
            String link = output.substring(
                    output.indexOf("\"link\": \"") + ("\"link\": \"").length(),
                    output.indexOf("\","));
            System.out.println(link); // will print the google search links
        }
    }
    conn.disconnect();
}
I'm trying to figure out how I can traverse all the result pages. Searching https://developers.google.com/custom-search/v1/using_rest I found that the start parameter in the query refers to the index, so obviously changing this value in a loop would do the job, but it will cost me a query for each page (which, it seems to me, should not be the case, as it is not a new query, just a new page of the same one). Also, Google mentions that if the query succeeds, the response contains a totalResults value for the total number of results, but they say it is an estimate. So, how can one benefit from this service and get the actual number of results, or the number of pages, in order to traverse them all? It does not make sense to me to issue a new query for every page.
You should use a JSON parser to extract data from the results, rather than parsing the result yourself.
Google won't return all the results at once for a single query. If you search for Java, there are approximately 214,000,000 results. Returning them all would take days, and you couldn't do anything meaningful with them anyway. So if there are several pages, you must do a new query for each page, just as you do when making a Google search with your browser. Most of the time the interesting results are on the first or second page; returning more than that would waste resources.
Google doesn't know the exact number of results; it returns an estimate. Counting the exact number of results would be too hard. Knowing that there are 214,000,001 results and not 214,000,002 doesn't add any value, and the exact number would be immediately obsolete anyway.
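To illustrate both points of the answer above, here is a rough sketch of paging with the start parameter and a proper JSON parser (org.json is assumed; key and cx are placeholders; each page is still a separate request):

// Sketch: request a few pages by incrementing the start parameter (1, 11, 21, ...)
// and extract the links with a JSON parser instead of substring matching.
public static void printLinks(String key, String cx, String query, int pages) throws Exception {
    for (int page = 0; page < pages; page++) {
        int start = 1 + page * 10;
        URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key
                + "&cx=" + cx + "&q=" + query + "&alt=json&start=" + start);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        StringBuilder body = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                body.append(line);
            }
        }
        conn.disconnect();
        JSONObject json = new JSONObject(body.toString());
        JSONArray items = json.optJSONArray("items");
        if (items == null) {
            break; // no more results for this query
        }
        for (int i = 0; i < items.length(); i++) {
            System.out.println(items.getJSONObject(i).getString("link"));
        }
    }
}

The sketch simply stops when a page comes back without an items array; each iteration still counts as a query against your quota, as explained above.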

Doing HTTP post of JSON object every second

I have to do an HTTP POST in Java every second after building a JSON object.
The JSON object is built from reading a CSV file which is huge (200 MB+), so my problem is:
how do I read x lines, build x objects, post them every second (as it is not possible to parse the whole 200 MB file in less than a second), and then continue reading the next x lines?
Please let me know your thoughts.
Can I use the Java Timer class and keep reading the CSV file while at the same time posting the formed JSON object to the server every second?
It is hardly possible to read, parse, convert and send a 200 MB file once per second.
So you need to change your design:
My suggestion would be to only send changed lines, something like this:
{
"1" : {"field1":"value1","field2":"value2"},
"17" : {"field1":"value1","field2":"value2"}
}
Which of course gives you new problems:
The client needs to figure out which lines have changed, and the server needs to integrate the changed lines with the existing data.
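As a rough sketch of the client side of that idea (not from the original answer; it assumes the previous snapshot of the file is kept in memory and uses Gson's JsonObject):

// Sketch: build the "changed lines only" payload shown above by diffing the
// current CSV lines against the previous snapshot. Line numbers are 1-based.
static JsonObject changedLines(List<String> previous, List<String> current) {
    JsonObject delta = new JsonObject();
    for (int i = 0; i < current.size(); i++) {
        String oldLine = i < previous.size() ? previous.get(i) : null;
        if (!current.get(i).equals(oldLine)) {
            // For brevity the whole raw line is sent; splitting it into
            // field1/field2 objects as in the example above is left out.
            delta.addProperty(String.valueOf(i + 1), current.get(i));
        }
    }
    return delta;
}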
I would make it depend on the file size, not on time.
BufferedReader fin = null; // create it for your CSV file
Gson gson = new Gson();    // Google's open source JSON library for Java
ArrayList<JSONObject> jsonList = new ArrayList<JSONObject>();
String line;
JSONObject currJSON;
while ((line = fin.readLine()) != null) {
    if (line.length() == 0) {
        // blank line, skip it
    } else {
        currJSON = loadJSON(line); // you have to convert the line into a Java object
        if (jsonList.size() < MAX_JSON) {
            jsonList.add(currJSON);
        }
        if (jsonList.size() == MAX_JSON) { // MAX_JSON = maximum size of the list you want to post
            gson.toJson(jsonList);         // convert the batch to JSON
            // post the JSON to your server with an HTTP connection here
            jsonList.clear();
        }
    }
}
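On the Timer question: yes, a scheduled task can handle the once-per-second posting while another thread keeps reading the file. A rough sketch with ScheduledExecutorService follows; postBatch() is a hypothetical placeholder for your HTTP client call, and the queue is assumed to be filled by the reader thread with Gson JsonObjects:

// Sketch: the CSV-reading thread converts lines to JsonObjects and puts them
// on the queue; this scheduled task drains up to MAX_JSON of them every second
// and posts the batch.
BlockingQueue<JsonObject> queue = new LinkedBlockingQueue<>();
Gson gson = new Gson();
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    List<JsonObject> batch = new ArrayList<>();
    queue.drainTo(batch, MAX_JSON);       // take at most MAX_JSON items
    if (!batch.isEmpty()) {
        postBatch(gson.toJson(batch));    // hypothetical HTTP POST helper
    }
}, 1, 1, TimeUnit.SECONDS);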
