I am interested in flattening JSON with multiple layers of nested arrays of objects. I would ideally like to do this in Java, but it seems like the pandas library in Python might be good for this.
Does anyone know a good java library for this?
I found this article (Create a Pandas DataFrame from deeply nested JSON) using pandas and jq, and my solution almost works, but the output I am receiving is not quite what I expected. Here is my code sample:
import json
import pandas as pd
from sh import jq  # assuming the sh package's jq wrapper, as used in the linked article

json_data = '''{ "id": 1,
"things": [
{
"tId": 1,
"objs": [{"this": 99},{"this": 100}]
},
{
"tId": 2,
"objs": [{"this": 222},{"this": 22222}]
}
]
}'''
rule = """[{id: .id,
tid: .things[].tId,
this: .things[].objs[].this}]"""
out = jq(rule, _in=json_data).stdout
res = pd.DataFrame(json.loads(out))
The problem is that the output I am receiving is this:
id this tid
0 1 99 1
1 1 100 1
2 1 222 1
3 1 22222 1
4 1 99 2
5 1 100 2
6 1 222 2
7 1 22222 2
I am expecting to see
id this tid
0 1 99 1
1 1 100 1
3 1 222 2
4 1 22222 2
Any tips on how to make this work, different solutions, or a Java option would be great!
Thanks in advance!
Craig
The problem is that your "rule" creates a Cartesian product: .things[].tId and .things[].objs[].this are each iterated independently inside the same object constructor, so every tId gets paired with every this value. What you want, in effect, is nested iteration.
With your input, the following jq expression, which makes the nested iteration reasonably clear, produces the output shown below:
.id as $id
| .things[] as $thing
| $thing.objs[]
| [$id, .this, $thing.tId]
| @tsv
Output
1 99 1
1 100 1
1 222 2
1 22222 2
Rule
So presumably your rule should look something like this:
[{id} + (.things[] | {tid: .tId} + (.objs[] | {this}))]
or if you want to make the nested iteration clearer:
[ .id as $id
| .things[] as $thing
| $thing.objs[]
| {id: $id, this, tid: $thing.tId} ]
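Alternatively, if you would rather stay entirely in pandas and skip jq, a possible sketch (not from the original question; assumes pandas 1.0+ where pd.json_normalize is available) flattens the nested arrays directly:
import json
import pandas as pd

data = json.loads(json_data)  # json_data as defined in the question
# record_path walks things -> objs; meta pulls id and the enclosing tId onto each row
res = pd.json_normalize(data,
                        record_path=["things", "objs"],
                        meta=["id", ["things", "tId"]])
This should give one row per objs entry, with columns this, id and things.tId, which you can rename as needed.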
Running jq in Java
Besides ProcessBuilder, you might like to take a look at these wrappers:
https://github.com/eiiches/jackson-jq
https://github.com/arakelian/java-jq
Related
Hi, I already have something to get the lines of code, but it still outputs a count that includes empty lines and comments.
git ls-files | grep "\.java$" | xargs wc -l
Can you modify this to skip comments and blank lines?
Thanks in advance
Try CLOC; it can list the numbers in great detail.
You need to install CLOC first, e.g. with brew install cloc on macOS.
cloc $(git ls-files)
Sample output for reference:
20 text files.
20 unique files.
6 files ignored.
http://cloc.sourceforge.net v 1.62 T=0.22 s (62.5 files/s, 2771.2 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Javascript 2 13 111 309
JSON 3 0 0 58
HTML 2 7 12 50
Handlebars 2 0 0 37
CoffeeScript 4 1 4 12
SASS 1 1 1 5
-------------------------------------------------------------------------------
SUM: 14 22 128 471
-------------------------------------------------------------------------------
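If you only want the git-tracked Java files counted (as in the original command), one option, just a sketch combining the two commands above, is to feed the filtered file list to cloc:
cloc $(git ls-files | grep "\.java$")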
Most test coverage tools also calculate LOC; for instance, JaCoCo reports this to Jenkins.
I am very new to the Cypher query language and I am working on relationships between nodes.
I have a CSV file containing multiple columns and 1000 rows.
The template of my table is:
cdrType ANUMBER BNUMBER DURATION
2 123 456 10
2 890 456 5
2 123 666 2
2 123 709 7
2 345 789 20
I have used these commands to create nodes and property keys.
LOAD CSV WITH HEADERS FROM "file:///2.csv" AS ROW
CREATE (:ANUMBER {aNumber:ROW.aNumber} ),
CREATE (:BNUMBER {bNumber:ROW.bNumber} )
Now I need to create relationships between all the rows in the table, and I think a FOREACH loop is best in my case. I created this query, but it gives me an error. The query is:
MATCH (a:ANUMBER),(b:BNUMBER)
FOREACH(i in RANGE(0, length(ANUMBER)) |
CREATE UNIQUE (ANUMBER[i])-[s:CALLED]->(BNUMBER[i]))
and the error is :
Invalid input '[': expected an identifier character, whitespace,
NodeLabel, a property map, ')' or a relationship pattern (line 3,
column 29 (offset: 100)) " CREATE UNIQUE
(a:ANUMBER[i])-[s:CALLED]->(b:BNUMBER[i]))"
I need a relationship for every row, e.g. 123 -[:CALLED]-> 456 and 890 -[:CALLED]-> 456. I need a visual representation of this calling data showing which number called which one, so I need to create a relationship for every row.
Does anyone have an idea how to solve this?
What about:
LOAD CSV WITH HEADERS FROM "file:///2.csv" AS ROW
CREATE (a:ANUMBER {aNumber:ROW.aNumber} )
CREATE (b:BNUMBER {bNumber:ROW.bNumber} )
MERGE (a)-[:CALLED]->(b);
It's not more complex than that, IMO.
Hope this helps !
Regards,
Tom
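One caveat, not from the answer above: CREATE makes a new ANUMBER/BNUMBER node for every row, so a number that appears in several rows (like 123 in the sample) will be duplicated. If you want a single node per number, a hedged variant of the same query (assuming the CSV headers are ANUMBER and BNUMBER as in the table template) merges the nodes instead:
LOAD CSV WITH HEADERS FROM "file:///2.csv" AS ROW
MERGE (a:ANUMBER {aNumber: ROW.ANUMBER})
MERGE (b:BNUMBER {bNumber: ROW.BNUMBER})
MERGE (a)-[:CALLED]->(b);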
I am very new to Spark. Below is the requirement I am trying to meet:
1st RDD
empno first-name last-name
0 fname lname
1 fname1 lname1
2nd RDD
empno dept-no dept-code
0 1 a
0 1 b
1 1 a
1 2 a
3rd RDD
empno history-no address
0 1 xyz
0 2 abc
1 1 123
1 2 456
1 3 a12
I have to generate a file combining all the RDDs for each employee; the average employee count is 200k.
Desired output:
seg-start emp-0
seg-emp 0-fname-lname
seg-dept 0-1-a
seg-dept 0-1-b
seg-his 0-1-xyz
seg-his 0-2-abc
seg-end emp-0
seg-start emp-1
......
seg-end emp-1
How can I achieve this by combining the RDDs? Please note that the data is not written out exactly as shown here; we convert it to a business-specific format (e.g. e0xx5fname5lname becomes 0-fname-lname). The current batch program runs for hours to write the data, so I am thinking of using Spark to process this more efficiently, and I would appreciate help from the experts here.
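For reference, here is a minimal PySpark sketch of one possible approach (not the existing batch program): key each RDD by empno, combine them with groupWith (a multi-RDD cogroup), and emit one block of seg-* lines per employee. The inlined sample data and the output path are illustrative assumptions based on the tables above, and the business-format conversion is left out.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Sample data keyed by empno, mirroring the three RDDs above
emp = sc.parallelize([(0, ("fname", "lname")), (1, ("fname1", "lname1"))])
dept = sc.parallelize([(0, (1, "a")), (0, (1, "b")), (1, (1, "a")), (1, (2, "a"))])
hist = sc.parallelize([(0, (1, "xyz")), (0, (2, "abc")),
                       (1, (1, "123")), (1, (2, "456")), (1, (3, "a12"))])

def to_segments(record):
    # record is (empno, (emp rows, dept rows, history rows)) after groupWith
    empno, (emps, depts, hists) = record
    lines = ["seg-start emp-%d" % empno]
    for first, last in emps:
        lines.append("seg-emp %d-%s-%s" % (empno, first, last))
    for dept_no, dept_code in depts:
        lines.append("seg-dept %d-%d-%s" % (empno, dept_no, dept_code))
    for hist_no, address in hists:
        lines.append("seg-his %d-%d-%s" % (empno, hist_no, address))
    lines.append("seg-end emp-%d" % empno)
    return lines

# groupWith cogroups more than two RDDs; one output element per empno
segments = emp.groupWith(dept, hist).sortByKey().flatMap(to_segments)
segments.saveAsTextFile("emp-segments")  # hypothetical output directory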
I have a Spark DataFrame (oldDF) that looks like:
Id | Category | Count
898989 5 12
676767 12 1
334344 3 2
676767 13 3
And I want to create a new DataFrame whose columns are the Category values and whose cell values are the Counts, grouped by Id.
The reason I can't (or would rather not) specify a schema is that the categories change a lot. Is there any way to do it dynamically?
The output I would like to see as a DataFrame from the one above:
Id | V3 | V5 | V12 | V13
898989 0 12 0 0
676767 0 0 1 3
334344 2 0 0 0
With Spark 1.6
oldDf.groupBy("Id").pivot("Category").sum("Count")
You need to do your groupBy operation first, then you can apply a pivot operation as explained here.
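As a small follow-up sketch (not part of the original answer; shown in PySpark, though the equivalent Scala chain is the same): filling the nulls that pivot produces for missing Id/Category combinations with 0 gives the exact shape requested above.
newDf = oldDf.groupBy("Id").pivot("Category").sum("Count").na.fill(0)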
I am starting to work with Weka in R and I got stuck at the first step. I converted my CSV file into an ARFF file using an online converter, but when I tried to read it into R I got the following error message:
require(RWeka)
A <- read.arff("Environmental variables all overviewxlsx.arff")
Error in .jnew("weka/core/Instances", .jcast(reader, "java/io/Reader")) :
java.io.IOException: no valid attribute type or invalid enumeration, read Token[[°C]], line 6
Does anyone have an idea that could help me?
Thanks!
P.S. The proper package (RWeka) is already installed.
Because read.arff() returns a data frame, you could skip the conversion process and use read.csv() instead.
train_arff<-read.arff(file.choose())
str(train_arff)
'data.frame': 14 obs. of 5 variables:
$ outlook : Factor w/ 3 levels "sunny","overcast",..: 1 1 2 3 3 3 2 1 1 3 ...
$ temperature: Factor w/ 3 levels "hot","mild","cool": 1 1 1 2 3 3 3 2 3 2 ...
$ humidity : Factor w/ 2 levels "high","normal": 1 1 1 1 2 2 2 1 2 2 ...
$ windy : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
$ play : Factor w/ 2 levels "yes","no": 2 2 1 1 1 2 1 2 1 1 ...
train_csv<-read.csv(file.choose())
str(train_csv)
'data.frame': 14 obs. of 5 variables:
$ outlook : Factor w/ 3 levels "overcast","rainy",..: 3 3 1 2 2 2 1 3 3 2 ...
$ temperature: Factor w/ 3 levels "cool","hot","mild": 2 2 2 3 1 1 1 3 1 3 ...
$ humidity : Factor w/ 2 levels "high","normal": 1 1 1 1 2 2 2 1 2 2 ...
$ windy : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
$ play : Factor w/ 2 levels "no","yes": 1 1 2 2 2 1 2 1 2 2 ...
Otherwise, your .arff file should have this format:
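For example, the standard Weka weather data (sketched here from the factor levels shown above, not from your file) looks like this. Note that every @attribute must declare a valid type such as numeric, string, or a nominal list in braces; a unit token like [°C] in that position is exactly what triggers the "no valid attribute type" error on line 6 of your file.
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes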