Reading JSON data through Apache Spark

I am trying to read a sample JSON file through Apache Spark. In the process I noticed that each JSON object has to be kept on a single line. If I keep the entire JSON object on a single line, the code works well; otherwise I get an exception.
This is my json data:
[
{
"id": 2,
"name": "An ice sculpture",
"price": 12.50,
"tags": ["cold", "ice"],
"dimensions": {
"length": 7.0,
"width": 12.0,
"height": 9.5
},
"warehouseLocation": {
"latitude": -78.75,
"longitude": 20.4
}
},
{
"id": 3,
"name": "A blue mouse",
"price": 25.50,
"dimensions": {
"length": 3.1,
"width": 1.0,
"height": 1.0
},
"warehouseLocation": {
"latitude": 54.4,
"longitude": -32.7
}
}
]
This is my code:
SparkSession session = new SparkSession.Builder().appName("JsonRead").master("local").getOrCreate();
Dataset<Row> json = session.read().json("/Users/mac/Desktop/a.json");
json.select("tags").show();
That is fine for small datasets, but is there any other way to process large JSON datasets?

See the documentation:
http://spark.apache.org/docs/2.0.1/sql-programming-guide.html#json-datasets
JSON Datasets
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
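For a file that is already pretty-printed, one option is to rewrite it so that each top-level object sits on its own line before handing it to Spark. Below is a minimal sketch of that idea using only the Java standard library. The names JsonLinesSketch and toJsonLines are made up for illustration, and the code assumes the input is a valid JSON array that fits in memory. A real JSON library would be the more robust choice, and Spark 2.2 and later also offer a multiLine read option for such files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Spark: rewrites a JSON array as one element per line. */
final class JsonLinesSketch {

    /**
     * Splits the top-level elements of a JSON array and drops whitespace that
     * lies outside string literals, so that each element fits on one line.
     * Assumption: the input is a syntactically valid JSON array.
     */
    static List<String> toJsonLines(String jsonArray) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;            // nesting level of [] and {}
        boolean inString = false; // true while inside a "..." literal
        boolean escaped = false;  // true right after a backslash inside a string

        for (int i = 0; i < jsonArray.length(); i++) {
            char c = jsonArray.charAt(i);

            if (inString) {
                current.append(c);
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }

            if (Character.isWhitespace(c)) {
                continue; // insignificant whitespace outside strings
            }
            if (c == '"') {
                inString = true;
                current.append(c);
            } else if (c == '[' || c == '{') {
                depth++;
                if (depth > 1) {
                    current.append(c); // the outermost '[' is not copied
                }
            } else if (c == ']' || c == '}') {
                depth--;
                if (depth >= 1) {
                    current.append(c); // the outermost ']' is not copied
                }
            } else if (c == ',' && depth == 1) {
                // A comma at depth 1 separates two top-level elements.
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String pretty = "[\n  {\"id\": 2, \"tags\": [\"cold\", \"ice\"]},\n"
                + "  {\"id\": 3, \"name\": \"A blue mouse\"}\n]";
        toJsonLines(pretty).forEach(System.out::println);
    }
}
```

Each returned string can be written as one line of a new file, which session.read().json(...) should then accept in its default line-delimited mode.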

Related

Flattening a heavily nested JSON in Java - Time Complexity

{
"id": "12345678",
"data": {
"address": {
"street": "Address 1",
"locality": "test loc",
"region": "USA"
},
"country_of_residence": "USA",
"date_of_birth": {
"month": 2,
"year": 1988
},
"links": {
"self": "https://testurl"
},
"name": "John Doe",
"nationality": "XY",
"other": [
{
"key1": "value1",
"key2": "value2"
},
{
"key1": "value1",
"key2": "value2"
}
],
"notified_on": "2016-04-06"
}
}
I am trying to read data from a GraphQL API that returns a paginated JSON response, and I need to write it to a CSV. I have been exploring Spring Batch for the implementation: I would read the JSON data in the ItemReader, flatten each JSON entry in the ItemProcessor, and then write the flattened data to a CSV in the ItemWriter. While I could use something like Jackson for flattening the JSON, I am concerned about possible performance implications if the JSON data is heavily nested.
expected output:
id, data.address.street, data.address.locality, data.address.region, data.country_of_residence, data.date_of_birth.month, data.date_of_birth.year, data.links.self, data.name, data.nationality, data.other (using jsonPath), data.notified_on
I need to process more than a million records. While I believe flattening the JSON would be a linear operation, O(n), I was still wondering whether there could be other caveats if the JSON structure gets severely nested.
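On the complexity point, a flattening pass visits each node once, so the work grows with the number of nodes; with very deep nesting the more likely caveats are recursion depth and the growing length of the dotted paths. Below is a minimal sketch, assuming each record has already been parsed into nested java.util.Map values (as a JSON library could produce) and that lists are kept as single leaf values, as with data.other in the expected output. It uses an explicit stack instead of recursion. JsonFlattenSketch and flatten are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: flattens nested maps into dotted column names. */
final class JsonFlattenSketch {

    /** A node still waiting to be visited: its dotted path and its value. */
    private static final class Pending {
        final String path;
        final Object value;

        Pending(String path, Object value) {
            this.path = path;
            this.value = value;
        }
    }

    static Map<String, Object> flatten(Map<String, Object> root) {
        Map<String, Object> flat = new LinkedHashMap<>(); // keeps column order
        Deque<Pending> stack = new ArrayDeque<>();
        stack.push(new Pending("", root));

        while (!stack.isEmpty()) {
            Pending node = stack.pop();
            if (node.value instanceof Map) {
                Map<?, ?> map = (Map<?, ?>) node.value;
                List<Object> keys = new ArrayList<>(map.keySet());
                // Push children in reverse so they are popped in the map's own order.
                for (int i = keys.size() - 1; i >= 0; i--) {
                    Object key = keys.get(i);
                    String path = node.path.isEmpty()
                            ? String.valueOf(key)
                            : node.path + "." + key;
                    stack.push(new Pending(path, map.get(key)));
                }
            } else {
                // Scalars, nulls and lists are treated as leaves.
                flat.put(node.path, node.value);
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<String, Object> birth = new LinkedHashMap<>();
        birth.put("month", 2);
        birth.put("year", 1988);
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("date_of_birth", birth);
        data.put("name", "John Doe");
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", "12345678");
        record.put("data", data);
        System.out.println(flatten(record).keySet());
    }
}
```

The column order follows the iteration order of the input maps, so an order-preserving map type such as LinkedHashMap is assumed on the parsing side.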

How to get a value from a JSON object with the Java object mapper

I want to get the value of the field first inside name.
How can I access this field using a HashMap in Java?
{ "payload":{
"name": {
"first": "jean",
"last": "bob"
},
"address": {
"code": "75",
"city": "paris",
"country": "France"
}
}}

Use one of the available Java libraries for handling JSON, e.g. Gson (from Google) or Jackson. They are pretty straightforward.
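Once a library has parsed the document into nested maps, reaching payload.name.first is a matter of walking the maps key by key. Below is a minimal sketch using only java.util. The maps are built by hand as a stand-in for parser output, and NestedLookupSketch and getPath are hypothetical names, not part of Gson or Jackson.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: looks up a value in nested maps by a sequence of keys. */
final class NestedLookupSketch {

    /** Walks nested maps along the given keys; returns null if any step is missing. */
    static Object getPath(Map<String, Object> root, String... keys) {
        Object current = root;
        for (String key : keys) {
            if (!(current instanceof Map)) {
                return null; // reached a leaf or a missing node before the path ended
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the parsed JSON from the question.
        Map<String, Object> name = new HashMap<>();
        name.put("first", "jean");
        name.put("last", "bob");
        Map<String, Object> payload = new HashMap<>();
        payload.put("name", name);
        Map<String, Object> root = new HashMap<>();
        root.put("payload", payload);

        System.out.println(getPath(root, "payload", "name", "first")); // prints jean
    }
}
```

Returning null for a missing step keeps the sketch short; a caller that needs to tell "absent" from "present but null" would need a different return type.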

Jackson mapping same JSON nodes with different names as key

I'm working with a RESTful web service on Android, and I'm using Spring for Android with Jackson for the first time.
I'm using this generator to generate the Java classes, but I sometimes run into trouble when objects of the same kind inside the JSON appear under different names:
"a2e4ea4a-0a29-4385-b510-2ca6df65db1c": {
"url": "//url1.jpg",
"ext": "jpg",
"name": "adobe xm0 ",
"children": {},
"tree_key": []
},
"d3ff3921-e084-4812-bc49-6a7431b6ce52": {
"url": "https://www.youtube.com/watch?v=myvideo",
"ext": "video",
"name": "youtube example",
"children": {},
"tree_key": []
},
"151b5d60-8f41-4f38-8b67-fe875c3f0381": {
"url": "https://vimeo.com/channels/staffpicks/something",
"ext": "video",
"name": "vimeo example",
"children": {},
"tree_key": []
}
All 3 nodes are of the same kind and could be mapped to the same class, but the generator creates a separate class for each node because the names differ.
Thanks for the help.

With Jackson, you can use Map map = new ObjectMapper().readValue(<insert object here>, Map.class);
as mentioned by Programmer Bruce.

Json Array Parsing issue

I am getting a JSON array as a response, like:
[
{
"status": "Active",
"entityName": "fghgfhfghfgh",
"entityCode": 14,
"children": [],
"attributes": [
{
"attributeValue": "500 michigan ave"
}
],
"deviceList": [],
"entityId": "64eab9299eed9455b3683da074cf175c",
"customerId": 2006546,
"type": "7dad308f82b41e02fe8959c05b631bd7"
}
,
{
"status": "Active",
"entityName": "ghghhguyutgh6re58rrt",
"entityCode": 13,
"children": [],
"attributes": [
{
"attributeValue": "500 michigan ave"
}
],
"deviceList": [],
"entityId": "912eff0613fa140c100af435c033e195",
"customerId": 2006546,
"type": "7dad308f82b41e02fe8959c05b631bd7"
}
]
I want to split this JSON into two, like
{
"status": "Active",
"entityName": "fghgfhfghfgh",
"entityCode": 14,
"children": [],
"attributes": [
{
"attributeValue": "500 michigan ave"
}
],
"deviceList": [],
"entityId": "64eab9299eed9455b3683da074cf175c",
"customerId": 2006546,
"type": "7dad308f82b41e02fe8959c05b631bd7" }
and the other one. I am using GSON and simplejson; when I try to remove the delimiters ([ and ]), the JSON comes out malformed. Is there any better option for splitting the JSON array into two or more JSON strings, depending on the response that comes back?

Is there any reason you can't parse the entire array, and then iterate over/process each element individually?
Just removing the brackets does make the JSON invalid, and splitting at the comma is going to be unreliable. The parser exists to figure out where to split the array and turn each element into an object for you.
Assuming you have some data structure defined to hold a single element once you've broken the array down, you should be able to parse the array into a list of those and step through them (or pick one out) as needed.
After that point, you can do whatever you want with the data (including formatting it back into JSON). I would definitely recommend using a proper parser to break the array down, though; it will be a lot simpler and more reliable, and should work unless you have serious performance concerns.

How to index and search nested documents/structured data in Lucene or similar libraries?

I have structured JSON data in the following form:
{
"id": 42,
"name": "hand",
"quantity": 2,
"digits": [
{
"id": 43,
"name": "thumb",
"quantity": 1,
"components": [
{
"id": 44,
"name": "thumb",
"position": 0
}
]
},
{
"id": 45,
"name": "fingers",
"quantity": 4,
"components": [
{
"id": 46,
"name": "index",
"position": 1
},
{
"id": 47,
"name": "middle",
"position": 2
},
{
"id": 48,
"name": "ring",
"position": 3
},
{
"id": 49,
"name": "little",
"position": 4
}
]
}
]
}
I need to index these data, using Java, so that it would be possible afterwards to make queries to get the needed information.
To that end a solution could be using Apache Lucene which supports nested documents since version 3.4. However, I could not find any tutorial nor a simple example on how a nested document can be created.
Can anyone explain how to create a Lucene document for structured data?
Alternatively, are there other libraries similar to Lucene which better support indexing and searching of structured data?

A simple solution in your case is to use path enumerations ("Dewey Decimal"). For example, your first three items would be "42", "42.43", and "42.43.44", while your fourth item would be "42.45". Make sure your slots are wide enough for the largest number of items you will need, for example "042.043.044".
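The path keys described above can be sketched in plain Java as follows. The class and method names (PathEnumerationSketch, pathKey, isWithin) and the fixed slot width are illustrative assumptions, not part of Lucene. The point is that zero-padded keys turn "is this node inside that subtree?" into a simple prefix test on a string.

```java
import java.util.Locale;

/** Hypothetical helper: builds zero-padded path-enumeration keys for nested items. */
final class PathEnumerationSketch {

    /**
     * Joins the ids from the root down to a node into one dotted key.
     * Each id is left-padded with zeros to the given slot width.
     */
    static String pathKey(int width, int... ids) {
        StringBuilder key = new StringBuilder();
        for (int id : ids) {
            if (key.length() > 0) {
                key.append('.');
            }
            key.append(String.format(Locale.ROOT, "%0" + width + "d", id));
        }
        return key.toString();
    }

    /** True if the candidate is the ancestor node itself or lies somewhere beneath it. */
    static boolean isWithin(String ancestorKey, String candidateKey) {
        return candidateKey.equals(ancestorKey)
                || candidateKey.startsWith(ancestorKey + ".");
    }

    public static void main(String[] args) {
        String hand = pathKey(3, 42);            // "042"
        String thumbGroup = pathKey(3, 42, 43);  // "042.043"
        String thumb = pathKey(3, 42, 43, 44);   // "042.043.044"
        String fingers = pathKey(3, 42, 45);     // "042.045"

        System.out.println(isWithin(thumbGroup, thumb));   // true
        System.out.println(isWithin(thumbGroup, fingers)); // false
        System.out.println(isWithin(hand, fingers));       // true
    }
}
```

Stored as a single unanalyzed field per document, such a key would let a prefix search return a node together with everything nested under it.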
