How to efficiently handle a huge JSON file, need some ideas - Java

This is a question about the general approach, so please don't just tell me to use a third-party library for this.
Recently I had a job interview where there was a question like the one below:
There is a huge JSON file whose structure is like a database:
{
  "tableName1": [
    {"t1field1": "value1"},
    {"t1field2": "value2"},
    ...
    {"t1fieldN": "valueN"}
  ],
  "tableName2": [
    {"t2field1": "value1"},
    {"t2field2": "value2"},
    ...
    {"t2fieldN": "valueN"}
  ],
  ...
  "tableNameN": [
    {"tNfield1": "value1"},
    {"tNfield2": "value2"},
    ...
    {"tNfieldN": "valueN"}
  ]
}
And the requirements are:
find a particular child node by the given child node's name, update its field's value, and then save it to a new JSON file;
count the occurrences of a given field name and value.
When it is a normal-size JSON file, I wrote a utility class to load the JSON file from disk and parse it into a JSON object. Then I wrote two methods to handle the two requirements:
void updateAndSaveJson(JSONObject json, String nodeName,
        Map<String, Object> map, Map<String, Object> updateMap,
        String outPath) {
    // map holds the conditions identifying the target child node
    // updateMap holds the update conditions
    // first find the target child node, update it, and finally save it
    // ...code...
}

int getCount(JSONObject json, Map<String, Object> map) {
    // map holds the target field/value pairs
    // ...code...
}
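For context, here is a hypothetical sketch of what getCount might look like in the normal-size case. It assumes org.json's JSONObject/JSONArray (the question doesn't say which JSON library is used) and the table-of-arrays structure shown above:

import java.util.Map;
import org.json.JSONArray;
import org.json.JSONObject;

int getCount(JSONObject json, Map<String, Object> map) {
    int count = 0;
    for (String tableName : json.keySet()) {            // each top-level "table"
        JSONArray rows = json.getJSONArray(tableName);
        for (int i = 0; i < rows.length(); i++) {       // each object in the table
            JSONObject row = rows.getJSONObject(i);
            for (Map.Entry<String, Object> cond : map.entrySet()) {
                // count a match when the row has the field with the expected value
                if (row.has(cond.getKey())
                        && row.get(cond.getKey()).equals(cond.getValue())) {
                    count++;
                }
            }
        }
    }
    return count;
}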
But then the interviewer asked me to consider the situation where the JSON file is very large, and to modify my code to make it more efficient.
My idea is to write a tool to split the JSON file first. Because I ultimately need a JSON object to invoke the two methods above, I already know their parameters before splitting the huge file: a Map (holding the target child node's conditions, or the target field/value pairs) and nodeName (the child node's name).
So while loading the JSON file as a stream, I compare the input against the target node name and then start counting the objects inside that child node; if the rule is 100, then once 100 objects have accumulated I split them off into a new, smaller JSON file and remove them from the source JSON file.
Like below:
while ((line = reader.readLine()) != null) {
    for (String nodeName : nodeNames) {
        // check if it's the target node
        if (line.indexOf(nodeName) != -1) {
            // count the target child node's objects
            // and then split them into a smaller JSON file
        }
    }
}
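A slightly fleshed-out version of that loop, as a minimal sketch: the chunk size of 100, the output file naming, and the assumption that each object sits on its own line are all hypothetical, and comma handling between chunked objects is left aside:

import java.io.*;
import java.util.*;

public class JsonSplitter {

    // Streams through the source file and writes every 100 objects of the
    // target child node into a new, smaller JSON file.
    public static void split(String sourcePath, String targetNode) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(sourcePath))) {
            boolean insideTarget = false;
            int chunkIndex = 0;
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (!insideTarget && line.contains("\"" + targetNode + "\"")) {
                    insideTarget = true;               // entered the target array
                } else if (insideTarget) {
                    if (line.trim().startsWith("]")) { // end of the target array
                        break;
                    }
                    buffer.add(line);
                    if (buffer.size() == 100) {        // flush every 100 objects
                        writeChunk(buffer, targetNode, chunkIndex++);
                        buffer.clear();
                    }
                }
            }
            if (!buffer.isEmpty()) {
                writeChunk(buffer, targetNode, chunkIndex);
            }
        }
    }

    private static void writeChunk(List<String> objects, String node, int index)
            throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(node + "_chunk" + index + ".json"))) {
            out.println("[");
            for (String obj : objects) {
                out.println(obj);
            }
            out.println("]");
        }
    }
}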
After that, I can use multiple threads to load the smaller JSON files created above and invoke the two methods to process the JSON objects.
Again, it's a question about the general approach, so please don't tell me to just use a third-party library to deal with this problem.
So, is my approach feasible? Or if you have some other idea, please share it.
Thanks.

Related

Best way to parse JSON with an unknown structure for comparison with a known structure?

I have a YAML file which I convert to JSON, and then to a Java object using Gson. This will be used as the standard definition which I will compare other YAML files against. The YAML files I will be validating should contain fields with structures identical to my definition. However, it is very possible that they contain fields with a different structure, or fields that don't exist in my definition, as it is ultimately up to the user to create these fields before I receive the file. A field in the YAML to be validated can look like this, with as many levels of nesting as the user wishes to define:
LBU:
  type: nodes.Compute
  properties:
    name: LBU
    description: LBU
    configurable_properties:
      test: {"additional_configurable_properties":{"aaa":"1"}}
    vdu_profile:
      min_number_of_instances: 1
      max_number_of_instances: 4
  capabilities:
    virtual_compute:
      properties:
        virtual_memory:
          virtual_mem_size: 8096 MB
        virtual_cpu:
          cpu_architecture: x86
          num_virtual_cpu: 2
          virtual_cpu_clock: 1800 MHz
  requirements:
    - virtual_storage:
        capability: capabilities.VirtualStorage
        node: LBU_Storage
Currently, I receive this YAML file and convert it to a JsonObject with Gson. It is not possible to map it to a Java object because of the possibly unknown fields. My goal is to run through the file and validate every single field against a matching one in my definition. If a field is present that does not exist in the definition, or does exist but has properties that differ, I need to inform the user with specific information about the field.
So far, I am going the route of getting fields like this:
for (String field : obj.get("topology_template").getAsJsonObject()
        .get("node_template").getAsJsonObject()
        .get(key).getAsJsonObject()
        .get(obj.get("topology_template").getAsJsonObject()
                .get("node_templates").getAsJsonObject()
                .get(key).getAsJsonObject()
                .keySet().toArray()[i].toString())
        .getAsJsonObject().keySet()) {
However, it seems that this is rather excessive and is very hard to follow for some deeply nested fields.
What I want to know is if there is a simpler way to traverse every field of a JsonObject, without mapping it to a Java object, and without explicitly accessing each field by name?
I think you are looking for something like a streaming JSON parser. Here's an example:
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.util.LinkedList;
import java.util.List;

String json
        = "{\"name\":\"Tom\",\"age\":25,\"address\":[\"Poland\",\"5th avenue\"]}";

JsonFactory jfactory = new JsonFactory();
JsonParser jParser = jfactory.createParser(json);

String parsedName = null;
Integer parsedAge = null;
List<String> addresses = new LinkedList<>();

while (jParser.nextToken() != JsonToken.END_OBJECT) {
    String fieldname = jParser.getCurrentName();
    if ("name".equals(fieldname)) {
        jParser.nextToken();        // move from the field name to its value
        parsedName = jParser.getText();
    }
    if ("age".equals(fieldname)) {
        jParser.nextToken();
        parsedAge = jParser.getIntValue();
    }
    if ("address".equals(fieldname)) {
        jParser.nextToken();        // move to START_ARRAY
        while (jParser.nextToken() != JsonToken.END_ARRAY) {
            addresses.add(jParser.getText());
        }
    }
}
jParser.close();
Please find the documentation here:
https://github.com/FasterXML/jackson-docs/wiki/JacksonStreamingApi
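For the validation use case in the question, another option is a recursive walk over the Gson JsonObject you already have, which visits every field without mapping to a Java class. A minimal sketch; the dotted-path convention for reporting field locations is an assumption:

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import java.util.Map;

public class JsonWalker {

    // Recursively visits every field, building an "a.b.c"-style path as it goes.
    public static void walk(JsonElement element, String path) {
        if (element.isJsonObject()) {
            for (Map.Entry<String, JsonElement> entry : element.getAsJsonObject().entrySet()) {
                walk(entry.getValue(), path.isEmpty() ? entry.getKey() : path + "." + entry.getKey());
            }
        } else if (element.isJsonArray()) {
            int i = 0;
            for (JsonElement item : element.getAsJsonArray()) {
                walk(item, path + "[" + i++ + "]");
            }
        } else {
            // leaf value: this is where you would look up the same path in the
            // definition and report a mismatch to the user
            System.out.println(path + " = " + element);
        }
    }
}

Called as walk(obj, ""), this prints every leaf with its full path, so comparing against the definition reduces to a path lookup.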

Parsing nested JSON in Java without knowing the structure of the JSON

I have a use case where I get a random JSON string and a variable name. I need to verify whether that particular variable is present in the JSON, and if present, fetch its value. For example, let us say the JSON is as follows:
{
  "a": {
    "b": 1,
    "c": 2
  }
}
Along with the above JSON string, say I get the input "a.b". Now I need to return 1.
Is there any library to achieve this in Java directly?
JsonPath is a library that provides the functionality you're after.
You will have to do some conversion between your input and the library's input.
As per your example, if your input is "a.b" (note that JsonPath expressions are rooted at "$"):

String convertedInput = "$." + input;
Integer result = JsonPath.read(json, convertedInput); // returns 1
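A self-contained version of the same idea, as a sketch; this assumes the Jayway JsonPath library (Maven artifact com.jayway.jsonpath:json-path):

import com.jayway.jsonpath.JsonPath;

public class PathLookup {
    public static void main(String[] args) {
        String json = "{\"a\":{\"b\":1,\"c\":2}}";
        String input = "a.b";                       // the dotted input from the question
        Integer value = JsonPath.read(json, "$." + input);
        System.out.println(value);                  // prints 1
    }
}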

Parse JSON without knowing format (can be one of two different objects)

I have two concrete objects with known schemas (totally different). I get JSON from a client and want to map it into one of these objects.
Is it possible to somehow check the type before conversion, or do I have to try to convert it into each object and check whether the parsing was correct?
EDIT:
For example:
{"id":"1","name":"oneone"}
and second
{"age":50,"type":"elephant"}
Personally, I would parse the JSON using Gson or something similar and look for a key that is unique to one of the JSON formats, for instance "age". In reality, you could probably also do this on the raw String, as @user743414 mentioned.
UPDATE:
Here is some code to reflect what I'm talking about:
JsonParser jsonParser = new JsonParser();
JsonObject jsonObject = jsonParser.parse(jsonString).getAsJsonObject();
Set<String> keys = jsonObject.keySet();
if (keys.contains("age")) {
    // Map to one object
} else {
    // Map to the other object
}
If you are sure the schema is constant for both JSONs, then simply take a unique parameter, like age in this example, and check whether it exists in the JSON string:

if (jsonString.contains("age")) {
    // then it's the 2nd JSON
} else {
    // then it's the 1st JSON
}
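Putting the key-based dispatch together with the actual mapping, a minimal sketch; the Person and Animal POJOs are hypothetical stand-ins for the two concrete objects, and JsonParser.parseString is the non-deprecated form of new JsonParser().parse(...) in recent Gson versions:

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

class Person { String id; String name; }   // matches {"id":"1","name":"oneone"}
class Animal { int age; String type; }     // matches {"age":50,"type":"elephant"}

public class TypeDispatcher {
    private static final Gson GSON = new Gson();

    static Object parse(String jsonString) {
        JsonObject obj = JsonParser.parseString(jsonString).getAsJsonObject();
        if (obj.has("age")) {              // "age" is unique to the second format
            return GSON.fromJson(obj, Animal.class);
        }
        return GSON.fromJson(obj, Person.class);
    }
}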

Spark : How to merge the transformations

I have 1000 JSON files, and I need to apply some transformations to each file and then create a merged output file, which may involve overlapping operations on the values (for example, it should not contain repeated values).
So, if I read the files with wholeTextFiles as (title, content) pairs, and then in the map function parse the content as a JSON tree and perform the transformation, where and how do I merge the output?
Do I need another transformation on the resultant RDDs to merge the values, and how would that work? Can I have a shared object (a List, a Map, or an RDD?) among all map blocks, which would be updated as part of the transformation, so that I can check for repeated values there?
P.S.: Even if the output creates part files, I would still like to have no repetitions.
Code:
// read the files as a JavaPairRDD, which gives <filename, content> pairs
String filename = "/sample_jsons";
JavaPairRDD<String, String> distFile = sc.wholeTextFiles(filename);

// then create a JavaRDD from the content
JavaRDD<String> jsonContent = distFile.map(x -> x._2);

// apply transformations; the map function returns an ArrayList
// holding the property names found in each file
JavaRDD<ArrayList<String>> apm = jsonContent.map(
    new Function<String, ArrayList<String>>() {
        @Override
        public ArrayList<String> call(String arg0) throws Exception {
            JsonNode rootNode = mapper.readTree(arg0); // mapper is a Jackson ObjectMapper
            return parseJsonAndFindKey(rootNode, "type", "rootParent");
        }
    });
This way I am able to get all first-level properties in an ArrayList from each JSON file.
Now I need a final ArrayList as the union of all these ArrayLists, with duplicates removed. How can I achieve that?
Why do you need 1000 RDDs for 1000 json files?
Do you see any issue with merging the 1000 json files in the input stage into one RDD?
If you'll be using one RDD from the input stage, it shouldn't be hard to perform all the needed actions on this RDD.
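As a minimal sketch of that single-RDD approach, assuming Spark 2.x's Java lambda API and reusing the question's own mapper and parseJsonAndFindKey helper:

// read all 1000 files into one RDD of <filename, content> pairs
JavaPairRDD<String, String> files = sc.wholeTextFiles("/sample_jsons");

// extract the property names from every file, flatten, and de-duplicate
JavaRDD<String> properties = files
        .flatMap(pair -> parseJsonAndFindKey(
                mapper.readTree(pair._2), "type", "rootParent").iterator())
        .distinct();                            // removes duplicates across all files

// collect the final, de-duplicated list on the driver
List<String> merged = properties.collect();

The distinct() transformation replaces the shared-object idea from the question: Spark handles the cross-partition de-duplication itself, so no mutable state needs to be shared between map blocks.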

Android usage of the Jackson library: how to load objects with indexes in a given range (from, to)

I have a really big JSON file to parse and manage. My JSON file has a structure like this:
[
{"id": "11040548","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
},
{"id": "11040549","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
},
....
{"id": "11040548","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
}
]
My JSON file contains data about topics from a news website, and practically every day this JSON file grows dramatically.
To parse that file I use:
URL urlLinkSource = new URL(OUTBOX_URL);
BufferedReader urlLinkSourceReader = new BufferedReader(new InputStreamReader(
        urlLinkSource.openStream(), "UTF-8"));
ObjectMapper mapper = new ObjectMapper();
// DataContainerList contains id, key1, key2 ... key15
List<DataContainerList> dataContainerListData = mapper.readValue(
        urlLinkSourceReader, new TypeReference<List<DataContainerList>>() { });
My problem is that in this line
List<DataContainerList> dataContainerListData = mapper.readValue(urlLinkSourceReader, new TypeReference<List<DataContainerList>>() { });
I want to load only a range of the JSON objects - just the first ten objects, then the second ten objects - because I need to display just 10 news items at a time in my app, in paging mode (I always know the indexes of the 10 I need to display). It is totally pointless to load 10,000 objects and then iterate over just the first 10 of them. So my question is: how can I load, in a similar way to the line above, only the objects with indexes FROM-TO (for example from 30 to 40), without loading all the objects in the entire JSON file?
Regards
It depends on what you mean by "load objects with indexes from-to". You can either:
Read everything but bind only a sublist
The solution in that case is to read the full stream and bind only the values within those indexes.
You can use Jackson's streaming API and do it yourself: parse the stream, use a counter to keep track of the actual index, and bind to POJOs only what you need.
However, this is not a good solution if your file is large and this runs in real time.
Read only the data between those indexes
You should do this if your file is big and performance matters. Instead of having a single big file, do the pagination by splitting your JSON array into multiple files matching your ranges, and then just deserialize the specific file's content into your array.
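For the first option, a minimal sketch using Jackson's streaming API; DataContainerList is the question's own POJO, and the readRange method with its [fromIndex, toIndex) range semantics is hypothetical:

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class PagedJsonReader {

    // Binds only the array elements whose index falls in [fromIndex, toIndex),
    // skipping all other objects without materializing them.
    public static List<DataContainerList> readRange(Reader reader,
            int fromIndex, int toIndex) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        List<DataContainerList> page = new ArrayList<>();
        try (JsonParser parser = mapper.getFactory().createParser(reader)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("Expected a JSON array");
            }
            int index = 0;
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                if (index >= fromIndex && index < toIndex) {
                    page.add(mapper.readValue(parser, DataContainerList.class));
                } else {
                    parser.skipChildren();  // skip this object cheaply
                }
                index++;
            }
        }
        return page;
    }
}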
Hope this helps...
