Apache Beam - Reading JSON and Stream - java

I am writing Apache Beam code where I have to read a JSON file that is placed in the project folder, read its data, and stream it.
This is the sample code to read JSON. Is this the correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
Or should I use:
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the JSON file below, read the complete testdata from this file, and then stream it.
{
  "testdata": {
    "siteOwner": "xxx",
    "siteInfo": {
      "siteID": "id_member",
      "siteplatform": "web",
      "siteType": "soap",
      "siteURL": "www"
    }
  }
}
The above code is not reading the JSON file; it just prints
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?

This is the sample code to read JSON. Is this the correct way of doing it?
To quickly answer your question, yes. Your sample code is the correct way to read a file containing JSON, where each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, then it will not be parseable.
The second code sample matches the same file, but note that FileIO.match() only produces file metadata (MatchResult.Metadata); you would still need to apply FileIO.readMatches() and a parsing step to get at the file's contents.
The above code is not reading the JSON file; it just prints
The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied. Accessing elements in the pipeline is done by applying subsequent transforms; the actual JSON string can be accessed in the implementation of a transform.
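For example, here is a minimal sketch of a subsequent transform that accesses and prints each line, using ParDo and DoFn from the Beam SDK (the DoFn below is illustrative, not the only option):
lines.apply("PrintLines", ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
        // 'line' is one line of the file, i.e. one JSON string
        System.out.println(line);
        out.output(line);
    }
}));
p.run().waitUntilFinish();
Nothing happens until p.run() executes the pipeline; that is why printing the PCollection itself only shows its name.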

Related

How to log the content of csv in Apache Camel?

I have the following code
DataFormat bindy = new BindyCsvDataFormat(Employee.class);
from("file:src/main/resources/csv2?noop=true").routeId("route3").unmarshal(bindy).to("mock:result").log("${body[0].name}");
I am trying to log every line of the CSV file; currently I am only able to hardcode which line to print.
Do I have to use a loop even though I don't know the number of lines in the CSV? Or do I have to use a processor? What's the easiest way to achieve what I want?
The unmarshalling step produces an exchange whose body is a list of objects (one per CSV line). For that reason you can simply use Camel's splitter to slice the original exchange into 1..N sub-exchanges (one per line/item of the list) and then log each of these lines:
from("file:src/main/resources/csv2?noop=true")
    .unmarshal(bindy)
    .split().body()
    .log("${body.name}");
If you do not want to alter the original message, you can use the wiretap pattern in order to log a copy of the exchange:
from("file:src/main/resources/csv2?noop=true")
    .unmarshal(bindy)
    .wireTap("direct:logBody")
    .to("mock:result");

from("direct:logBody")
    .split().body()
    .log("Row# ${exchangeProperty.CamelSplitIndex} : ${body.name}");

How can I efficiently read multiple json files into a Dataframe or JavaRDD?

I can use the following code to read a single JSON file, but I need to read multiple JSON files and merge them into one DataFrame. How can I do this?
DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");
Or is there a way to read multiple JSON files into a JavaRDD and then convert that to a DataFrame?
To read multiple inputs in Spark, use wildcards. That's true whether you're constructing a DataFrame or an RDD.
context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")
You can use exactly the same code to read multiple JSON files. Just pass a path to a directory or a path with wildcards instead of a path to a single file.
DataFrameReader also provides a json method with the following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON already loaded into JavaRDD.
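For instance, a minimal sketch of that approach (jsc is an assumed JavaSparkContext, and each line of the input files is assumed to hold one complete JSON object):
// Load the raw JSON lines into a JavaRDD<String>, then parse them
JavaRDD<String> jsonLines = jsc.textFile("/home/spark/articles/*.json");
DataFrame articles = sqlContext.read().json(jsonLines);
articles.printSchema();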
The function spark.read.json accepts a list of files as a parameter:
spark.read.json(list_of_json_files)
where list_of_json_files is a Python list of file paths. This will read all the files in the list and return a single DataFrame with all the information in the files.
Using PySpark, if you have all the JSON files in the same folder, you can use df = spark.read.json('folder_path'). This will load all the JSON files inside the folder.
For better read performance, I recommend providing the schema to the reader:
import pyspark.sql.types as T
billing_schema = T.StructType([
T.StructField('accountId', T.LongType(),True),
T.StructField('accountName',T.StringType(),True),
T.StructField('accountOwnerEmail',T.StringType(),True),
T.StructField('additionalInfo',T.StringType(),True),
T.StructField('chargesBilledSeparately',T.BooleanType(),True),
T.StructField('consumedQuantity',T.DoubleType(),True),
T.StructField('consumedService',T.StringType(),True),
T.StructField('consumedServiceId',T.LongType(),True),
T.StructField('cost',T.DoubleType(),True),
T.StructField('costCenter',T.StringType(),True),
T.StructField('date',T.StringType(),True),
T.StructField('departmentId',T.LongType(),True),
T.StructField('departmentName',T.StringType(),True),
T.StructField('instanceId',T.StringType(),True),
T.StructField('location',T.StringType(),True),
T.StructField('meterCategory',T.StringType(),True),
T.StructField('meterId',T.StringType(),True),
T.StructField('meterName',T.StringType(),True),
T.StructField('meterRegion',T.StringType(),True),
T.StructField('meterSubCategory',T.StringType(),True),
T.StructField('offerId',T.StringType(),True),
T.StructField('partNumber',T.StringType(),True),
T.StructField('product',T.StringType(),True),
T.StructField('productId',T.LongType(),True),
T.StructField('resourceGroup',T.StringType(),True),
T.StructField('resourceGuid',T.StringType(),True),
T.StructField('resourceLocation',T.StringType(),True),
T.StructField('resourceLocationId',T.LongType(),True),
T.StructField('resourceRate',T.DoubleType(),True),
T.StructField('serviceAdministratorId',T.StringType(),True),
T.StructField('serviceInfo1',T.StringType(),True),
T.StructField('serviceInfo2',T.StringType(),True),
T.StructField('serviceName',T.StringType(),True),
T.StructField('serviceTier',T.StringType(),True),
T.StructField('storeServiceIdentifier',T.StringType(),True),
T.StructField('subscriptionGuid',T.StringType(),True),
T.StructField('subscriptionId',T.LongType(),True),
T.StructField('subscriptionName',T.StringType(),True),
T.StructField('tags',T.StringType(),True),
T.StructField('unitOfMeasure',T.StringType(),True)
])
billing_df = spark.read.json('/mnt/billingsources/raw-files/202106/', schema=billing_schema)
The function json(String... paths) takes variable arguments (documentation).
So you can change your code like this:
sqlContext.read().json(file1, file2, ...)

How to fully read a file with delimited messages in Google Protobufs?

I'm trying to read a file, which has multiple delimited messages in it (in the thousands), how can I do this properly using Google protobufs?
This is how I'm writing the delimited:
MyMessage myMessage = MyMessage.parseFrom(msg); // msg is a byte[]
myMessage.writeDelimitedTo(fileOutputStream);
And this is how I'm reading the delimited file:
CodedInputStream is = CodedInputStream.newInstance(new FileInputStream("/location/to/file"));
while (!is.isAtEnd()) {
    int size = is.readRawVarint32();
    MyMessage msg = MyMessage.parseFrom(is.readRawBytes(size));
    // do stuff with your messages
}
I'm kind of confused because the accepted answer in this question says to use .parseDelimitedFrom() to read the delimited bytes: Google Protocol Buffers - Storing messages into file
However, when using .parseDelimitedFrom(), it only reads the first message. (I don't know how to read the whole file using parseDelimitedFrom()).
This comment says to write the delimited messages using CodedOutputStream: Google Protocol Buffers - Storing messages into file (i.e. writer.writeRawVarint32()). I'm currently using the implementation of this comment to read the whole file. Does writeDelimitedTo() basically do the same thing as
writer.writeRawVarint32(bytes.length);
and
writer.writeRawBytes(bytes);
Also, if my way isn't the proper way of reading a whole file consisting of delimited messages, can you please show me what is?
Thank you.
Yes, writeDelimitedTo() simply writes the length as a varint followed by the bytes. There's no need to use CodedOutputStream directly if you're working in Java.
parseDelimitedFrom() parses one message, but you may call it repeatedly to parse all the messages in the InputStream. The method will return null when you reach the end of the stream.
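For instance, a minimal sketch of such a loop (the path is taken from your example; exception handling is up to you):
try (FileInputStream in = new FileInputStream("/location/to/file")) {
    MyMessage msg;
    // parseDelimitedFrom returns null once the end of the stream is reached
    while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
        // do stuff with each message
        System.out.println(msg);
    }
}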

JSON to CSV with Java using CDL: possible to replace comma-separated by semicolon-separated values?

Everything is in the title :)
I'm using org.json.CDL to convert a JSONArray into CSV data, but it renders a string with ',' as the separator.
I'd like to know if it's possible to replace it with ';'.
Here is a simple example of what I'm doing:
public String exportAsCsv() throws Exception {
    return CDL.toString(
        new JSONArray(
            mapper.writeValueAsString(extractAccounts()))
    );
}
Thanks in advance for any advice on that question.
Edit: No string-replacement solution, of course, as this could have an impact on large data; ideally the library used would let me specify the field separator.
Edit 2: In the end, the solution of extracting the data as a JSONArray (and a String...) was not very good, especially for large data files.
So I made the following changes:
use a Java CSV library (for example: http://www.csvreader.com/java_csv_samples.php)
refactor the code to stream data from the JSON input source to the CSV output source
This is nicer for processing large data. If you have comments, do not hesitate.
String output = "Hello,This,is,separated,by,a,comma";
// Simply call the replace method.
output = output.replace(',', ';');
I found this in the String documentation.
Example
String value = "Hello,this,is,a,string";
value = value.replace(',', ';');
System.out.println(value);
// Outputs: Hello;this;is;a;string

Load a .txt file into a Java application and save it to an XML file

I read the following answer about loading a file into a Java application.
I need to write a program that loads a .txt file, which contains a list of records. After I parse it, I need to match the records (with conditions that I will check) and save the result to an XML file.
I am stuck on this issue, and I would be happy for answers to the following questions:
How do I load the .txt file into Java?
After I load the file, how can I access the information in it? For example, how can I check whether the first line of one of the records is equal to "1"?
How do I export the result to an XML file?
One: you need sample code for reading a file line by line (see the sketch after point three).
Two: the split method of String might be helpful, for instance for getting the number out of the first element if the information is separated by a space:
String myLine;
String[] components = myLine.split(" ");
if (components != null && components.length >= 1) {
    int num = Integer.parseInt(components[0]);
    ...
}
Three: you can just write it like any text file, or use any XML writer you want.
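As mentioned in point one, here is a minimal sketch of reading a file line by line (the file name records.txt is an assumption):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineReaderSketch {
    public static void main(String[] args) throws IOException {
        // Read the file line by line using a BufferedReader
        try (BufferedReader reader = new BufferedReader(new FileReader("records.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // process each record line here
            }
        }
    }
}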
Basic I/O
Integer.parseInt(firstLine)
There are a plethora of choices.
Create POJOs to represent the records and write them using XMLEncoder (a sketch follows below)
SAX
DOM
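As a minimal sketch of the XMLEncoder option (the Record POJO and its single field are assumptions for illustration):
import java.beans.XMLEncoder;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;

public class XmlExportSketch {
    // Hypothetical POJO representing one parsed record from the .txt file
    public static class Record {
        private String id;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    public static void main(String[] args) throws Exception {
        Record record = new Record();
        record.setId("1");
        // XMLEncoder serializes JavaBeans (no-arg constructor + getters/setters) to XML
        try (XMLEncoder encoder = new XMLEncoder(
                new BufferedOutputStream(new FileOutputStream("records.xml")))) {
            encoder.writeObject(record);
        }
    }
}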
