Get JSON into Apache Spark from a web source in Java

I have a web server which returns JSON data that I would like to load into an Apache Spark DataFrame. Right now I have a shell script that uses wget to write the JSON data to file and then runs a Java program that looks something like this:
DataFrame df = sqlContext.read().json("example.json");
I have looked at the Apache Spark documentation and there doesn't seem to be a way to join these two steps together automatically. There must be a way of requesting JSON data in Java, storing it as an object, and then converting it to a DataFrame, but I haven't been able to figure it out. Can anyone help?

You could store JSON data into a list of Strings like:
final String JSON_STR0 = "{\"name\":\"0\",\"address\":{\"city\":\"0\",\"region\":\"0\"}}";
final String JSON_STR1 = "{\"name\":\"1\",\"address\":{\"city\":\"1\",\"region\":\"1\"}}";
List<String> jsons = Arrays.asList(JSON_STR0, JSON_STR1);
where each String represents a JSON object.
Then you could transform the list to an RDD:
JavaRDD<String> jsonRDD = sc.parallelize(jsons);
Once you have the RDD, it's easy to get a DataFrame:
DataFrame data = sqlContext.read().json(jsonRDD);
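Putting the two steps together: you can fetch the JSON over HTTP in plain Java and feed it straight to Spark, with no intermediate file. A minimal sketch, assuming Spark 1.x, a JavaSparkContext named sc, and java.net.HttpURLConnection (the URL is a placeholder):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Fetch the JSON from the web server, one JSON object per line.
URL url = new URL("http://example.com/data.json");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
List<String> jsons = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        jsons.add(line);
    }
}
// Parallelize the strings and parse them into a DataFrame as above.
JavaRDD<String> jsonRDD = sc.parallelize(jsons);
DataFrame df = sqlContext.read().json(jsonRDD);
This replaces the wget step entirely; the JSON never touches the local filesystem.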

Related

Consume the Json in kafka topic using tJava in Talend

I am currently trying to create an ingestion job workflow using Kafka in Talend Studio. The job will read the JSON data in the topic "work" and store it into a Hive table.
Snippet of json:
{"Header": {"Vers":"1.0","Message": "318","Owner": {"ID": 102,"FID": 101},"Mode":"8"},"Request": {"Type":"4","ObjType": "S","OrderParam":[{"Code": "OpType","Value": "30"},{"Code": "Time","Value": "202"},{"Code": "AddProperty","ObjParam": [{"Param": [{"Code": "Sync","Value": "Y"}]}]}]}}
{"Header": {"Vers":"2.0","Message": "318","Owner": {"ID": 103,"FID": 102},"Mode":"8"},"Request": {"Type":"5","ObjType": "S","OrderParam":[{"Code": "OpType","Value": "90"},{"Code": "Time","Value": "203"},{"Code": "AddProperty","ObjParam": [{"Param": [{"Code": "Sync","Value": "Y"}]}]}]}}
My focus in this question is not the Talend components, but the Java code in the tJava component that fetches and reads the JSON.
Java code:
String output=((String)globalMap.get("tLogRow_1_OUTPUT"));
JSONObject jsonObject = new JSONObject(output);
System.out.println(jsonObject);
String sourceDBName=(jsonObject.getString("Vers"));
The code above is able to get the data from tLogRow into the "output" variable. However, it gives an error where it reads a null value for the JSON object. What should I do to correctly get the data from the JSON?
You can use a tExtractJsonFields instead of a tJava. This component extracts data from your input String following a JSON schema that you can define in the metadata. With this you could extract all the fields from your input.
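If you do want to stay in tJava, note that in the sample JSON "Vers" is nested inside the "Header" object, so it cannot be read from the top level. A sketch, assuming the org.json JSONObject used in the question:
String output = ((String) globalMap.get("tLogRow_1_OUTPUT"));
JSONObject jsonObject = new JSONObject(output);
// "Vers" lives inside the "Header" object, so navigate there first.
String vers = jsonObject.getJSONObject("Header").getString("Vers");
System.out.println(vers);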

Convert one json format to another in java

I am looking for a utility which converts one JSON format to another, following conversion definitions from (preferably) an XML file. Is there any library in Java that does something like this?
For example source json is:
{"name":"aa","surname":"bb","accounts":[{"accountid":10,"balance":100}]}
target json is :
{"owner":"aa-bb","accounts":[{"accountid":10,"balance":100}]}
sample config xml :
t.owner = s.name.concat("-").concat(s.surname)
t.accounts = s.accounts
PS: Please don't post solutions for this example; it is just to give an idea. There will be quite different scenarios in the mapping.
Is this what you need?
Open input file.
Read / parse JSON from file using a JSON library.
Convert in-memory data structure to new structure.
Open output file.
Unparse in-memory data structure to file using JSON library.
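A minimal sketch of those five steps with Jackson (the field names follow the example above; a real solution would drive step 3 from your XML mapping file instead of hard-coding it):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.File;

ObjectMapper mapper = new ObjectMapper();
// 1-2. Open the input file and parse the JSON into a tree.
JsonNode source = mapper.readTree(new File("source.json"));
// 3. Convert the in-memory tree to the new structure.
ObjectNode target = mapper.createObjectNode();
target.put("owner", source.get("name").asText() + "-" + source.get("surname").asText());
target.set("accounts", source.get("accounts"));
// 4-5. Open the output file and unparse the tree to it.
mapper.writerWithDefaultPrettyPrinter().writeValue(new File("target.json"), target);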

How can I efficiently read multiple json files into a Dataframe or JavaRDD?

I can use the following code to read a single JSON file, but I need to read multiple JSON files and merge them into a single DataFrame. How can I do this?
DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");
Or is there a way to read multiple JSON files into a JavaRDD and then convert it to a DataFrame?
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.
context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")
You can use exactly the same code to read multiple JSON files. Just pass a path-to-a-directory / path-with-wildcards instead of path to a single file.
DataFrameReader also provides a json method with the following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON that is already loaded into a JavaRDD.
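For example, a short Java sketch that loads all the article files as raw text first and then parses them (assuming a JavaSparkContext named sc; textFile also accepts wildcards and comma-separated paths):
// Load every article as a JavaRDD of raw JSON strings, then parse them.
JavaRDD<String> jsonLines = sc.textFile("/home/spark/articles/*.json");
DataFrame articles = sqlContext.read().json(jsonLines);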
The function spark.read.json accepts a list of files as a parameter:
spark.read.json(list_of_json_files)
This will read all the files in the list and return a single data frame for all the information in the files.
Using PySpark, if you have all the JSON files in the same folder, you can use df = spark.read.json('folder_path'). This will load all the JSON files inside the folder.
For reading performance, I recommend providing the DataFrame schema:
import pyspark.sql.types as T
billing_schema = T.StructType([
T.StructField('accountId', T.LongType(),True),
T.StructField('accountName',T.StringType(),True),
T.StructField('accountOwnerEmail',T.StringType(),True),
T.StructField('additionalInfo',T.StringType(),True),
T.StructField('chargesBilledSeparately',T.BooleanType(),True),
T.StructField('consumedQuantity',T.DoubleType(),True),
T.StructField('consumedService',T.StringType(),True),
T.StructField('consumedServiceId',T.LongType(),True),
T.StructField('cost',T.DoubleType(),True),
T.StructField('costCenter',T.StringType(),True),
T.StructField('date',T.StringType(),True),
T.StructField('departmentId',T.LongType(),True),
T.StructField('departmentName',T.StringType(),True),
T.StructField('instanceId',T.StringType(),True),
T.StructField('location',T.StringType(),True),
T.StructField('meterCategory',T.StringType(),True),
T.StructField('meterId',T.StringType(),True),
T.StructField('meterName',T.StringType(),True),
T.StructField('meterRegion',T.StringType(),True),
T.StructField('meterSubCategory',T.StringType(),True),
T.StructField('offerId',T.StringType(),True),
T.StructField('partNumber',T.StringType(),True),
T.StructField('product',T.StringType(),True),
T.StructField('productId',T.LongType(),True),
T.StructField('resourceGroup',T.StringType(),True),
T.StructField('resourceGuid',T.StringType(),True),
T.StructField('resourceLocation',T.StringType(),True),
T.StructField('resourceLocationId',T.LongType(),True),
T.StructField('resourceRate',T.DoubleType(),True),
T.StructField('serviceAdministratorId',T.StringType(),True),
T.StructField('serviceInfo1',T.StringType(),True),
T.StructField('serviceInfo2',T.StringType(),True),
T.StructField('serviceName',T.StringType(),True),
T.StructField('serviceTier',T.StringType(),True),
T.StructField('storeServiceIdentifier',T.StringType(),True),
T.StructField('subscriptionGuid',T.StringType(),True),
T.StructField('subscriptionId',T.LongType(),True),
T.StructField('subscriptionName',T.StringType(),True),
T.StructField('tags',T.StringType(),True),
T.StructField('unitOfMeasure',T.StringType(),True)
])
billing_df = spark.read.json('/mnt/billingsources/raw-files/202106/', schema=billing_schema)
The function json(String... paths) takes variable arguments (see the documentation). So you can change your code like this:
sqlContext.read().json(file1, file2, ...)

How to convert IPentahoResultSet to JSON object

I am writing Java code trying to convert an IPentahoResultSet to JSON so I can send it to a server using Apache Commons HttpClient. I could not find any way to convert this Pentaho result set to JSON. Any help will be appreciated.
I have tried the following code to serialize it, but it does not work. I think it is meant to serialize classes, not result sets.
import flexjson.JSONSerializer;
import org.pentaho.commons.connection.marshal.MarshallableResultSet;
// ...
IPentahoResultSet data;
// data will contain the result of executing an MDX query against Mondrian
MarshallableResultSet result = new MarshallableResultSet();
result.setResultSet(data);
JSONSerializer serializer = new JSONSerializer();
String json = serializer.deepSerialize( result );
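One alternative is to walk the result set and build the JSON by hand. A sketch, assuming the usual IPentahoResultSet accessors (getMetaData, getRowCount, getColumnCount, getValueAt) and org.json on the classpath:
import org.json.JSONArray;
import org.json.JSONObject;
import org.pentaho.commons.connection.IPentahoResultSet;

// Walk the result set cell by cell and build a JSON array of row objects.
// Column names come from the metadata headers (first header row assumed).
public static String toJson(IPentahoResultSet data) {
    Object[][] headers = data.getMetaData().getColumnHeaders();
    JSONArray rows = new JSONArray();
    for (int r = 0; r < data.getRowCount(); r++) {
        JSONObject row = new JSONObject();
        for (int c = 0; c < data.getColumnCount(); c++) {
            row.put(String.valueOf(headers[0][c]), data.getValueAt(r, c));
        }
        rows.put(row);
    }
    return rows.toString();
}
The resulting String can then be posted with Apache Commons HttpClient.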

how to get the data from xml feeds

I have the following feeds from my vendor,
http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule
I want to get the data from those XML feeds as Java objects, so that I can insert them into my database regularly.
The data is just regular updates from the vendor, which I use to update my website.
Can you please suggest what options are available to get this working? Should I use a web service, or just XStream, to get my final output? Please advise, as I am a newcomer to this concept.
The vendor has said he can give me the data in three formats: RSS, XML, or JSON. I am not sure which is easiest and cheapest to consume.
I would suggest just writing a program that parses the XML and inserts the data directly into your database.
Example
This groovy script inserts data into a H2 database.
//
// Dependencies
// ============
@Grapes([
    @Grab(group='com.h2database', module='h2', version='1.3.163'),
    @GrabConfig(systemClassLoader=true)
])
import groovy.sql.Sql
//
// Main program
// ============
def sql = Sql.newInstance("jdbc:h2:db/cricket", "user", "pass", "org.h2.Driver")
def dataUrl = new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule")
dataUrl.withReader { reader ->
    def feeds = new XmlSlurper().parse(reader)
    feeds.matches.match.each {
        def data = [
            it.id,
            it.name,
            it.type,
            it.tournamentId,
            it.location,
            it.date,
            it.GMTTime,
            it.localTime,
            it.description,
            it.team1,
            it.team2,
            it.teamId1,
            it.teamId2,
            it.tournamentName,
            it.logo
        ].collect { it.text() }
        sql.execute(
            "INSERT INTO matches (id,name,type,tournamentId,location,date,GMTTime,localTime,description,team1,team2,teamId1,teamId2,tournamentName,logo) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
            data
        )
    }
}
Well... you could use an XML parser (stream or DOM), or a JSON parser (again, stream or DOM), and build the objects on the fly. But with this data, which seems to consist of records of cricket matches, why not go with a CSV format?
This seems to be your basic 'datum':
<id>1263</id>
<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>
<type>TestMatch</type>
<tournamentId>137</tournamentId>
<location>Perth</location>
<date>2012-01-14</date>
<GMTTime>02:30:00</GMTTime>
<localTime>10:30:00</localTime>
<description>3rd Test day 2</description>
<team1>Australia</team1>
<team2>India</team2>
<teamId1>7</teamId1>
<teamId2>1</teamId2>
<tournamentName>India tour of Australia 2011-12</tournamentName>
<logo>/cricket/137/tournament.png</logo>
Of course you would still have to parse the CSV and deal with character delimiting (such as when you have a ' or a " in a string), but it will reduce your network traffic quite substantially and will likely parse much faster on the client. Of course, this depends on what your client is.
Actually, you have a RESTful source that can return data in several formats; you only need to read from it, and no further interaction is needed.
So you can use any XML parser to parse the XML data and put the extracted data into whatever data structure you want or already have.
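For example, a minimal Java sketch with the built-in DOM parser (element names are taken from the sample record above; error handling omitted):
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.net.URL;

// Parse the feed and read one field from each match record.
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule").openStream());
NodeList matches = doc.getElementsByTagName("match");
for (int i = 0; i < matches.getLength(); i++) {
    Element match = (Element) matches.item(i);
    System.out.println(match.getElementsByTagName("name").item(0).getTextContent());
}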
I had not heard of XStream before, but you can find more information about selecting the best parser for your situation in this StackOverflow question.
