I am currently trying to create an ingestion job workflow using kafka in Talend Studio. The job will read the json data in topic "work" and store into the hive table.
Snippet of json:
{"Header": {"Vers":"1.0","Message": "318","Owner": {"ID": 102,"FID": 101},"Mode":"8"},"Request": {"Type":"4","ObjType": "S","OrderParam":[{"Code": "OpType","Value": "30"},{"Code": "Time","Value": "202"},{"Code": "AddProperty","ObjParam": [{"Param": [{"Code": "Sync","Value": "Y"}]}]}]}}
{"Header": {"Vers":"2.0","Message": "318","Owner": {"ID": 103,"FID": 102},"Mode":"8"},"Request": {"Type":"5","ObjType": "S","OrderParam":[{"Code": "OpType","Value": "90"},{"Code": "Time","Value": "203"},{"Code": "AddProperty","ObjParam": [{"Param": [{"Code": "Sync","Value": "Y"}]}]}]}}
Talend workflow:
My focus in this question is not the talend component. But the java code in tJava component that uses the java to fetch and read the json.
Java code:
String output=((String)globalMap.get("tLogRow_1_OUTPUT"));
JSONObject jsonObject = new JSONObject(output);
System.out.println(jsonObject);
String sourceDBName=(jsonObject.getString("Vers"));
The code above able to get the data from tLogRow in "output" variable. However, it gives an error where it read null value for json object. What should I do to correctly get the data from json accordingly?
You can use a tExtractJsonFields instead of a tJava. This component extracts data from your input String following a json schema that you can define in metadata. With this you could extract all fields from your input .
I am looking for a utility which converts one json format to another by respecting at the conversion definitions from a preferably xml file. Is there any library doing something like this in java ?
For example source json is:
{"name":"aa","surname":"bb","accounts":[{"accountid":10,"balance":100}]}
target json is :
{"owner":"aa-bb","accounts":[{"accountid":10,"balance":100}]}
sample config xml :
t.owner = s.name.concat("-").concat(surname)
t.accounts = t.accounts
Ps:Please dont post solutions for this example, it is just for giving an idea, there will be quite different scenarios in mapping.
Is this what u need?
Open input file.
Read / parse JSON from file using a JSON library.
Convert in-memory data structure to new structure.
Open output file
Unparse in-memory data structure to file using JSON library.
I can use the following code to read a single json file but I need to read multiple json files and merge them into one Dataframe. How can I do this?
DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");
Or is there a way to read multiple json files into JavaRDD then convert to Dataframe?
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.
context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")
You can use exactly the same code to read multiple JSON files. Just pass a path-to-a-directory / path-with-wildcards instead of path to a single file.
DataFrameReader also provides json method with a following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON already loaded into JavaRDD.
function spark.read.json accepts list of file as a parameter.
spark.read.json(List_all_json file)
This will read all the files in the list and return a single data frame for all the information in the files.
Using pyspark, if you have all the json files in the same folder, you can use df = spark.read.json('folder_path'). This instruction will load all the json files inside the folder.
For reading performance, I recommend you for providing dataframe the schema:
import pyspark.sql.types as T
billing_schema = billing_schema = T.StructType([
T.StructField('accountId', T.LongType(),True),
T.StructField('accountName',T.StringType(),True),
T.StructField('accountOwnerEmail',T.StringType(),True),
T.StructField('additionalInfo',T.StringType(),True),
T.StructField('chargesBilledSeparately',T.BooleanType(),True),
T.StructField('consumedQuantity',T.DoubleType(),True),
T.StructField('consumedService',T.StringType(),True),
T.StructField('consumedServiceId',T.LongType(),True),
T.StructField('cost',T.DoubleType(),True),
T.StructField('costCenter',T.StringType(),True),
T.StructField('date',T.StringType(),True),
T.StructField('departmentId',T.LongType(),True),
T.StructField('departmentName',T.StringType(),True),
T.StructField('instanceId',T.StringType(),True),
T.StructField('location',T.StringType(),True),
T.StructField('meterCategory',T.StringType(),True),
T.StructField('meterId',T.StringType(),True),
T.StructField('meterName',T.StringType(),True),
T.StructField('meterRegion',T.StringType(),True),
T.StructField('meterSubCategory',T.StringType(),True),
T.StructField('offerId',T.StringType(),True),
T.StructField('partNumber',T.StringType(),True),
T.StructField('product',T.StringType(),True),
T.StructField('productId',T.LongType(),True),
T.StructField('resourceGroup',T.StringType(),True),
T.StructField('resourceGuid',T.StringType(),True),
T.StructField('resourceLocation',T.StringType(),True),
T.StructField('resourceLocationId',T.LongType(),True),
T.StructField('resourceRate',T.DoubleType(),True),
T.StructField('serviceAdministratorId',T.StringType(),True),
T.StructField('serviceInfo1',T.StringType(),True),
T.StructField('serviceInfo2',T.StringType(),True),
T.StructField('serviceName',T.StringType(),True),
T.StructField('serviceTier',T.StringType(),True),
T.StructField('storeServiceIdentifier',T.StringType(),True),
T.StructField('subscriptionGuid',T.StringType(),True),
T.StructField('subscriptionId',T.LongType(),True),
T.StructField('subscriptionName',T.StringType(),True),
T.StructField('tags',T.StringType(),True),
T.StructField('unitOfMeasure',T.StringType(),True)
])
billing_df = spark.read.json('/mnt/billingsources/raw-files/202106/', schema=billing_schema)
Function json(String... paths) takes variable arguments. (documentation)
So you can change your code like this:
sqlContext.read().json(file1, file2, ...)
I have the following feeds from my vendor,
http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule
I wanted to get the data from that xml files as java objects, so that I can insert into my database regularly.
The above data is nothing but regular updates from the vendor, so that I can update in my website.
can you please suggest me what are my options available to get this working
Should I use any webservices or just Xstream
to get my final output.. please suggest me as am a new comer to this concept
Vendor has suggested me that he can give me the data in following 3 formats rss, xml or json, I am not sure what is easy and less consumable to get it working
I would suggest just write a program that parses the XML and inserts the data directly into your database.
Example
This groovy script inserts data into a H2 database.
//
// Dependencies
// ============
import groovy.sql.Sql
#Grapes([
#Grab(group='com.h2database', module='h2', version='1.3.163'),
#GrabConfig(systemClassLoader=true)
])
//
// Main program
// ============
def sql = Sql.newInstance("jdbc:h2:db/cricket", "user", "pass", "org.h2.Driver")
def dataUrl = new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule")
dataUrl.withReader { reader ->
def feeds = new XmlSlurper().parse(reader)
feeds.matches.match.each {
def data = [
it.id,
it.name,
it.type,
it.tournamentId,
it.location,
it.date,
it.GMTTime,
it.localTime,
it.description,
it.team1,
it.team2,
it.teamId1,
it.teamId2,
it.tournamentName,
it.logo
].collect {
it.text()
}
sql.execute("INSERT INTO matches (id,name,type,tournamentId,location,date,GMTTime,localTime,description,team1,team2,teamId1,teamId2,tournamentName,logo) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", data)
}
}
Well... you could use an XML Parser (stream or DOM), or a JSON parser (again stream of 'DOM'), and build the objects on the fly. But with this data - which seems to consist of records of cricket matches, why not go with a csv format?
This seems to be your basic 'datum':
<id>1263</id>
<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>
<type>TestMatch</type>
<tournamentId>137</tournamentId>
<location>Perth</location>
<date>2012-01-14</date>
<GMTTime>02:30:00</GMTTime>
<localTime>10:30:00</localTime>
<description>3rd Test day 2</description>
<team1>Australia</team1>
<team2>India</team2>
<teamId1>7</teamId1>
<teamId2>1</teamId2>
<tournamentName>India tour of Australia 2011-12</tournamentName>
<logo>/cricket/137/tournament.png</logo>
Of course you would still have to parse a csv, and deal with character delimiting (such as when you have a ' or a " in a string), but it will reduce your network traffic quite substantially, and likely parse much faster on the client. Of course, this depends on what your client is.
Actually you have RESTful store that can return data in several formats and you only need to read from this source and no further interaction is needed.
So, you can use any XML Parser to parse XML data and put the extracted data in whatever data structure that you want or you have.
I did not hear about XTREME, but you can find more information about selecting the best parser for your situation at this StackOverflow question.