Read and write to different Mongo collections using Spark with Java

I am a relative newbie to Spark. I need to read from a Mongo collection in Java using Spark, change some field values (say, appending "123" to one field value), and write the result into another collection. Accordingly, I have two separate Mongo URIs configured in Spark as the input and output URIs. I then proceed to read from the input collection. What I do not understand, however, is how I would write the same RDD of documents out to another collection. This is the input code:
String inputUri = "mongodb://" + kp.getProperty("source.mongo.userid") + ":"
+ Encryptor.decrypt(kp.getProperty("source.mongo.cache")) + "@"
+ kp.getProperty("source.mongo.bootstrap-servers") + "/" + kp.getProperty("source.mongo.database")
+ "." + kp.getProperty("source.mongo.inputCollection") + "?ssl=true&connectTimeoutMS="
+ kp.getProperty("source.mongo.connectTimeoutMS") + "&socketTimeoutMS="
+ kp.getProperty("source.mongo.socketTimeoutMS") + "&maxIdleTimeMS="
+ kp.getProperty("source.mongo.maxIdleTimeMS");
String outputUri = "mongodb://" + kp.getProperty("source.mongo.userid") + ":"
+ Encryptor.decrypt(kp.getProperty("source.mongo.cache")) + "@"
+ kp.getProperty("source.mongo.bootstrap-servers") + "/" + kp.getProperty("source.mongo.database")
+ "." + kp.getProperty("source.mongo.outputCollection") + "?ssl=true&connectTimeoutMS="
+ kp.getProperty("source.mongo.connectTimeoutMS") + "&socketTimeoutMS="
+ kp.getProperty("source.mongo.socketTimeoutMS") + "&maxIdleTimeMS="
+ kp.getProperty("source.mongo.maxIdleTimeMS");
SparkSession spark = SparkSession.builder().master("local[3]").appName(kp.getProperty("spark.app.name"))
.config("spark.mongodb.input.uri", inputUri)
.config("spark.mongodb.output.uri", outputUri)
...;
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaMongoRDD<Document> rdd = MongoSpark.load(sc);
System.out.println("Count: " + rdd.count());
System.out.println(rdd.first().toJson());
Please help me in this regard.

I have got the answer myself. I went the Dataset route instead of RDDs, which made the modification simpler. So, to load the Mongo collection, I use
Dataset<Row> df = MongoSpark.load(sc).toDF();
Then I create a temporary view on it in order to be able to use Spark SQL:
df.createOrReplaceTempView("Customer");
I register a UDF for operating on each column value:
spark.udf().register("Test", new TestUDF(), DataTypes.StringType);
The UDF definition is as follows:
public class TestUDF implements UDF1<String, String> {
    @Override
    public String call(String customer) throws Exception {
        return customer + "123";
    }
}
Then I call the UDF using the same column name as the original so that the values in the original dataset are replaced:
df = df.withColumn("CustomerName", functions.callUDF("Test", functions.col("CustomerName")));
Then I write it back to Mongo in a separate collection:
MongoSpark.write(df).option("collection", "myCollection").save();
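For completeness, the RDD route asked about in the question also works without converting to a Dataset. A minimal sketch, assuming the official MongoDB Spark connector's MongoSpark.save and WriteConfig API and reusing the rdd and sc variables from the question:

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.WriteConfig;
import org.apache.spark.api.java.JavaRDD;
import org.bson.Document;
import java.util.HashMap;
import java.util.Map;

// Modify each document, e.g. append "123" to a field value.
JavaRDD<Document> modified = rdd.map(doc -> {
    Object name = doc.get("CustomerName");
    doc.put("CustomerName", name + "123");
    return doc;
});

// Override only the target collection; everything else comes from spark.mongodb.output.uri.
Map<String, String> overrides = new HashMap<>();
overrides.put("collection", "myCollection");
WriteConfig writeConfig = WriteConfig.create(sc).withOptions(overrides);

MongoSpark.save(modified, writeConfig);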

Related

HBase loading .opv and .ope gives hexadecimal output

I'm using Oracle Big Data Spatial & Graph v2.5 and following the official guide to load a graph on HBase through Java.
This is my code:
public class Main {
    public static void main(String[] arg) throws Exception {
        org.apache.log4j.BasicConfigurator.configure();
        OraclePropertyGraphDataLoader opgdl = OraclePropertyGraphDataLoader.getInstance();
        String vfile = "/root/oracle_property_files/connections.opv";
        String efile = "/root/oracle_property_files/connections.ope";
        PgHbaseGraphConfig cfg = GraphConfigBuilder.forPropertyGraphHbase()
                .setName("config").setZkQuorum("zk01node,zk02node,zk03node").build();
        OraclePropertyGraph opg = OraclePropertyGraph.getInstance(cfg);
        opgdl.loadData(opg, vfile, efile, 48);
    }
}
This is my .opv file:
1,name,1,Alice,,
1,age,2,,31,
2,name,1,Bob,,
2,age,2,,27,
And this is my .ope file:
1,1,2,knows,type,1,friends,,
My code creates the following tables on HBase:
configEI.
configGE.
configIT.
configVI.
configVT.
The problem is that if I launch the command scan 'configVT.', the output is a mix of hexadecimal and ASCII values:
hbase(main):003:0> scan 'configVT.'
ROW COLUMN+CELL
3v\x93ur|\xD7\xD3\x00\x00\x00\x00\x00\x00\x00\x02 column=v:i\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01, timestamp=1624009988902, value=knows
3v\x93ur|\xD7\xD3\x00\x00\x00\x00\x00\x00\x00\x02 column=v:kage, timestamp=1624009989001, value=\x00\x00\x00\x1B\x02
3v\x93ur|\xD7\xD3\x00\x00\x00\x00\x00\x00\x00\x02 column=v:kname, timestamp=1624009989001, value=Bob\x01
\xCB\xFC%\xA7qt\x02\x84\x00\x00\x00\x00\x00\x00\x00 column=v:kage, timestamp=1624009988909, value=\x00\x00\x00\x1F\x02
\x01
\xCB\xFC%\xA7qt\x02\x84\x00\x00\x00\x00\x00\x00\x00 column=v:kname, timestamp=1624009988909, value=Alice\x01
\x01
\xCB\xFC%\xA7qt\x02\x84\x00\x00\x00\x00\x00\x00\x00 column=v:o\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x01, timestamp=1624009988909, value=knows
\x01
2 row(s) in 0.0490 seconds
I would like to have a more readable result.
Edit: It seems that String and Date types are stored correctly (though with some hex escape characters, such as Alice\x01). The integers, however, are converted entirely to their hex values.
I figured it out. Using the scan command, I was reading the tables as if they were plain HBase tables, but they aren't; they are Oracle Big Data Spatial & Graph tables stored in HBase. So my configVT. table is only one of the five tables created by the Java method opgdl.loadData, and reading just that one is not enough.
In order to get a readable result, I should read the data as edges and vertices:
opg.getVertices().forEach(e -> {
    System.out.println("id vertex: " + e.getId());
    e.getPropertyKeys().forEach(p -> {
        System.out.println("property: " + p);
        System.out.println("value: " + e.getProperty(p));
    });
});
opg.getEdges().forEach(e -> {
    System.out.println("label: " + e.getLabel());
    System.out.println("id edge: " + e.getId());
    Vertex vIn = e.getVertex(Direction.IN);
    Vertex vOut = e.getVertex(Direction.OUT);
    System.out.println("edge from: " + vOut.getId());
    System.out.println("edge to: " + vIn.getId());
    e.getPropertyKeys().forEach(p -> {
        System.out.println("property: " + p);
        System.out.println("value: " + e.getProperty(p));
    });
});

Randomly changing the JSON Values for every "Post" Request Body using Java

This could be a duplicate question, but I couldn't find my solution anywhere. Hence, posting it.
I am trying to simply POST a request for a student account creation scenario. I have a JSON file which comprises all the key:value pairs required for student account creation.
This is what the file student_Profile.json looks like:
{
"FirstName":"APi1-Stud-FN",
"MiddleInitial":"Q",
"LastName":"APi1-Stud-LN",
"UserAlternateEmail":"",
"SecretQuestionId":12,
"SecretQuestionAnswer":"Scot",
"UserName":"APi1-stud#xyz.com",
"VerifyUserName":"APi1-stud#xyz.com",
"Password":"A123456",
"VerifyPassword":"A123456",
"YKey":"123xyz",
"YId":6,
"Status":false,
"KeyCode":"",
"SsoUserName":"APi1-stud#xyz.com",
"SsoPassword":"",
"BirthYear":2001
}
So everything about posting the request from a Rest Assured point of view looks fine; it's just that I want to update a few values in the above JSON body using Java, so that I can create a new student profile every time I run my function and don't have to change the body manually.
For every POST student account creation scenario, I need to update the values for the following keys so that a new test student user account can be created:
First Name
Last Name
Username ("VerifyUserName" and "SsoUserName" will remain the same as the username)
I modified the answer to get random values and pass them into the JSON body. The random value generation was taken from the accepted answer of this question.
public void testMethod() {
List<String> randomValueList = new ArrayList<>();
for (int i = 0; i < 3; i++) {
String SALTCHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
StringBuilder salt = new StringBuilder();
Random rnd = new Random();
while (salt.length() < 18) { // length of the random string.
int index = (int) (rnd.nextFloat() * SALTCHARS.length());
salt.append(SALTCHARS.charAt(index));
}
randomValueList.add(salt.toString());
}
String jsonBody = "{\n" +
" \"FirstName\":\"" + randomValueList.remove(0) + "\",\n" +
" \"MiddleInitial\":\"Q\",\n" +
" \"LastName\":\"" + randomValueList.remove(0) + "\",\n" +
" \"UserAlternateEmail\":\"\",\n" +
" \"SecretQuestionId\":12,\n" +
" \"SecretQuestionAnswer\":\"Scot\",\n" +
" \"UserName\":\"" + randomValueList.remove(0) + " \",\n" +
" \"VerifyUserName\":\"APi1-stud#xyz.com\",\n" +
" \"Password\":\"A123456\",\n" +
" \"VerifyPassword\":\"A123456\",\n" +
" \"YKey\":\"123xyz\",\n" +
" \"YId\":6,\n" +
" \"Status\":false,\n" +
" \"KeyCode\":\"\",\n" +
" \"SsoUserName\":\"APi1-stud#xyz.com\",\n" +
" \"SsoPassword\":\"\",\n" +
" \"BirthYear\":2001\n" +
"}";
Response response = RestAssured
.given()
.body(jsonBody)
.when()
.post("api_url")
.then()
.extract()
.response();
// Do what you need to do with the response body
}
We can use a POJO-based approach to do such things very easily. No matter how complex the payload is, serialization and deserialization are the best answer. I have created a framework template for API automation that can be used by putting the required POJOs in the path:
https://github.com/tanuj-vishnoi/pojo_api_automation
To create the POJOs, I also have a ready-made generator for you:
https://github.com/tanuj-vishnoi/pojo_generator_using_jsonschema2pojo
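As a rough, hypothetical sketch of that approach (the class below is illustrative and not taken from the linked repos), you can map the JSON keys to fields with Jackson annotations and let Rest Assured serialize the object:

import com.fasterxml.jackson.annotation.JsonProperty;
import io.restassured.RestAssured;
import io.restassured.http.ContentType;
import io.restassured.response.Response;

public class StudentProfile {
    // Only the fields that change per run are shown; add the rest the same way.
    @JsonProperty("FirstName") private String firstName;
    @JsonProperty("LastName") private String lastName;
    @JsonProperty("UserName") private String userName;

    public void setFirstName(String firstName) { this.firstName = firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public void setUserName(String userName) { this.userName = userName; }
}

// Usage: populate the POJO per test run and let Rest Assured serialize it to JSON.
StudentProfile profile = new StudentProfile();
profile.setFirstName("Stud-FN-" + System.currentTimeMillis());
profile.setLastName("Stud-LN-" + System.currentTimeMillis());
profile.setUserName("stud" + System.currentTimeMillis() + "@xyz.com");

Response response = RestAssured.given()
        .contentType(ContentType.JSON)
        .body(profile)   // Jackson (or Gson) on the classpath handles the serialization
        .post("api_url");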
For the above problem you can refer to the JsonPath lib https://github.com/json-path/JsonPath and use this code:
String mypayload = "{\n" +
" \"FirstName\":\"APi1-Stud-FN\",\n" +
" \"MiddleInitial\":\"Q\",\n" +
" \"LastName\":\"APi1-Stud-LN\"}";
Map map = JsonPath.parse(mypayload).read("$", Map.class);
System.out.println(map);
Once the payload is converted into a map, you can change only the required values as needed.
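For instance, a small sketch using JsonPath's DocumentContext (and the generateUniqueString helper shown just below): overwrite the fields in place and take the updated JSON string as the request body.

import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;

// Parse the stored payload once, overwrite only the per-run fields, and reuse the result.
DocumentContext ctx = JsonPath.parse(mypayload);
ctx.set("$.FirstName", generateUniqueString(10));
ctx.set("$.LastName", generateUniqueString(10));
String updatedPayload = ctx.jsonString();   // pass this to the Rest Assured .body(...)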
To generate random strings, you can refer to the lib org.apache.commons.lang3.RandomStringUtils:
public static String generateUniqueString(int lengthOfString) {
    return RandomStringUtils.randomAlphabetic(lengthOfString).toLowerCase();
}
I recommend storing the payload in a separate file and loading it at runtime.
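A minimal sketch of that idea, assuming the payload lives at src/test/resources/student_Profile.json (the path is illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Load the payload template from disk at runtime, then adjust the per-run fields as shown above.
String payloadTemplate = new String(
        Files.readAllBytes(Paths.get("src/test/resources/student_Profile.json")),
        StandardCharsets.UTF_8);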

Convert HOCON string into Java object

One of my web services returns the Java string below:
[
{
id=5d93532e77490b00013d8862,
app=null,
manufacturer=pearsonEducation,
bookUid=bookIsbn,
model=2019,
firmware=[1.0],
bookName=devotional,
accountLinking=mandatory
}
]
I have the equivalent Java object for the above string, and I would like to typecast or convert the above Java string into that Java object.
I couldn't type-cast it since it's a String, not an object. So I tried to convert the Java string to a JSON string and then map that onto the Java object, but no luck; I get an invalid character "=" exception.
Can you change the web service to return JSON?
That's not possible. They are not changing their contracts. It would be super easy if they returned JSON.
The format your web service returns has its own name: HOCON. (You can read more about it here.)
You do not need a custom parser; do not try to reinvent the wheel.
Use an existing one instead.
Add this maven dependency to your project:
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.0</version>
</dependency>
Then parse the response as follows:
Config config = ConfigFactory.parseString(text);
String id = config.getString("id");
Long model = config.getLong("model");
There is also an option to parse the whole string into a POJO:
MyResponsePojo response = ConfigBeanFactory.create(config, MyResponsePojo.class);
Unfortunately, this parser does not allow null values, so you'll need to handle exceptions of type com.typesafe.config.ConfigException.Null.
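For example, since app is null in the response above, a defensive read could look like this (sketch only):

import com.typesafe.config.Config;
import com.typesafe.config.ConfigException;

// "app" is explicitly null in the example response, so guard the read.
String app;
try {
    app = config.getString("app");
} catch (ConfigException.Null e) {
    app = null;   // or substitute a default value
}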
Another option is to convert the HOCON string into JSON:
String hoconString = "...";
String jsonString = ConfigFactory.parseString(hoconString)
.root()
.render(ConfigRenderOptions.concise());
Then you can use any JSON-to-POJO mapper.
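For example, with Jackson that last step could look like the sketch below (MyResponsePojo is the same placeholder class as above; if the rendered string is still the bracketed array from the question, map it to a List instead):

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Map the rendered JSON onto your own class, ignoring any fields the POJO does not declare.
ObjectMapper mapper = new ObjectMapper()
        .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
MyResponsePojo response = mapper.readValue(jsonString, MyResponsePojo.class);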
Well, this is definitely not the best answer to be given here, but it is possible, at least…
Manipulate the String in small steps like this in order to get a Map<String, String> which can be processed. See this example; it's very basic:
public static void main(String[] args) {
String data = "[\r\n"
+ " {\r\n"
+ " id=5d93532e77490b00013d8862, \r\n"
+ " app=null,\r\n"
+ " manufacturer=pearsonEducation, \r\n"
+ " bookUid=bookIsbn, \r\n"
+ " model=2019,\r\n"
+ " firmware=[1.0], \r\n"
+ " bookName=devotional, \r\n"
+ " accountLinking=mandatory\r\n"
+ " }\r\n"
+ "]";
// manipulate the String in order to have
String[] splitData = data
// no leading and trailing [ ] - cut the first and last char
.substring(1, data.length() - 1)
// no linebreaks
.replace("\n", "")
// no windows linebreaks
.replace("\r", "")
// no opening curly brackets
.replace("{", "")
// and no closing curly brackets.
.replace("}", "")
// Then split it by comma
.split(",");
// create a map to store the keys and values
Map<String, String> dataMap = new HashMap<>();
// iterate the key-value pairs connected with '='
for (String s : splitData) {
// split them by the equality symbol
String[] keyVal = s.trim().split("=");
// then take the key
String key = keyVal[0];
// and the value
String val = keyVal[1];
// and store them in the map ——> could be done directly, of course
dataMap.put(key, val);
}
// print the map content
dataMap.forEach((key, value) -> System.out.println(key + " ——> " + value));
}
Please note that I just copied your example String, which may have caused the line breaks, and I think it is not smart to just replace() all square brackets, because the value of firmware seems to include them as content.
In my opinion, we should split the parsing process into two steps:
Format the output data to JSON.
Parse the text with JSON utilities.
In this demo code, I choose a regex as the formatting method and fastjson as the JSON tool; you can choose Jackson or Gson instead. Furthermore, I removed the [ ]; you can put it back and then parse the result into an array.
import com.alibaba.fastjson.JSON;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SerializedObject {
    private String id;
    private String app;

    static Pattern compile = Pattern.compile("([a-zA-Z0-9.]+)");

    // fastjson needs accessible setters (or public fields) to populate the object
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getApp() { return app; }
    public void setApp(String app) { this.app = app; }

    @Override
    public String toString() {
        return "SerializedObject{id='" + id + "', app='" + app + "'}";
    }

    public static void main(String[] args) {
        String str =
            " {\n" +
            " id=5d93532e77490b00013d8862, \n" +
            " app=null,\n" +
            " manufacturer=pearsonEducation, \n" +
            " bookUid=bookIsbn, \n" +
            " model=2019,\n" +
            " firmware=[1.0], \n" +
            " bookName=devotional, \n" +
            " accountLinking=mandatory\n" +
            " }\n";
        String s1 = str.replaceAll("=", ":");
        StringBuffer sb = new StringBuffer();
        Matcher matcher = compile.matcher(s1);
        while (matcher.find()) {
            matcher.appendReplacement(sb, "\"" + matcher.group(1) + "\"");
        }
        matcher.appendTail(sb);
        System.out.println(sb.toString());
        SerializedObject serializedObject = JSON.parseObject(sb.toString(), SerializedObject.class);
        System.out.println(serializedObject);
    }
}

How to create a REPEATED type in a Parquet file schema with Avro?

We are creating a Dataflow pipeline; we read the data from Postgres and write it to a Parquet file. ParquetIO.Sink allows you to write a PCollection of GenericRecord into a Parquet file (from here https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). But the Parquet file schema is not what I expected.
Here is my schema:
schema = new org.apache.avro.Schema.Parser().parse("{\n" +
" \"type\": \"record\",\n" +
" \"namespace\": \"com.example\",\n" +
" \"name\": \"Patterns\",\n" +
" \"fields\": [\n" +
" { \"name\": \"id\", \"type\": \"string\" },\n" +
" { \"name\": \"name\", \"type\": \"string\" },\n" +
" { \"name\": \"createdAt\", \"type\": {\"type\":\"string\",\"logicalType\":\"timestamps-millis\"} },\n" +
" { \"name\": \"updatedAt\", \"type\": {\"type\":\"string\",\"logicalType\":\"timestamps-millis\"} },\n" +
" { \"name\": \"steps\", \"type\": [\"null\",{\"type\":\"array\",\"items\":{\"type\":\"string\",\"name\":\"json\"}}] },\n" +
" ]\n" +
"}");
This is my code so far:
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(JdbcIO.<GenericRecord> read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"org.postgresql.Driver", "jdbc:postgresql://localhost:port/database")
.withUsername("username")
.withPassword("password"))
.withQuery("select * from table limit(10)")
.withCoder(AvroCoder.of(schema))
.withRowMapper((JdbcIO.RowMapper<GenericRecord>) resultSet -> {
GenericRecord record = new GenericData.Record(schema);
ResultSetMetaData metadata = resultSet.getMetaData();
int columnsNumber = metadata.getColumnCount();
for(int i=0; i<columnsNumber; i++) {
Object columnValue = resultSet.getObject(i+1);
if(columnValue instanceof UUID) columnValue=columnValue.toString();
if(columnValue instanceof Timestamp) columnValue=columnValue.toString();
if(columnValue instanceof PgArray) {
Object[] array = (Object[]) ((PgArray) columnValue).getArray();
List list=new ArrayList();
for (Object d : array) {
if(d instanceof PGobject) {
list.add(((PGobject) d).getValue());
}
}
columnValue = list;
}
record.put(i, columnValue);
}
return record;
}))
.apply(FileIO.<GenericRecord>write()
.via(ParquetIO.sink(schema).withCompressionCodec(CompressionCodecName.SNAPPY))
.to("something.parquet")
);
p.run();
This is what I get:
message com.example.table {
required binary id (UTF8);
required binary name (UTF8);
required binary createdAt (UTF8);
required binary updatedAt (UTF8);
optional group someArray (LIST) {
repeated binary array (UTF8);
}
}
This is what I expected:
message com.example.table {
required binary id (UTF8);
required binary name (UTF8);
required binary createdAt (UTF8);
required binary updatedAt (UTF8);
optional repeated binary someArray(UTF8);
}
Please help.
I did not find a way to create a repeated element from Avro that isn't in a GroupType.
The ParquetIO in Beam uses a "standard" avro conversion defined in the parquet-mr project, which is implemented here.
It appears that there are two ways to turn an Avro ARRAY field into a Parquet message, but neither of them creates what you are looking for.
Currently, the Avro conversion is the only way to interact with ParquetIO. I saw the JIRA "Use Beam schema in ParquetIO", which extends this to Beam Rows and might permit a different Parquet message strategy.
Alternatively, you could create a JIRA feature request for ParquetIO to support Thrift structures, which should allow finer control over the Parquet structure.
Is it a protobuf message you used to describe the expected schema? I think what you got is correctly generated from the specified JSON schema. "optional repeated" does not make sense in the protobuf language specification: https://developers.google.com/protocol-buffers/docs/reference/proto2-spec
You can remove the null and the square brackets to generate a simply repeated field; it's semantically equivalent to optional repeated, since repeated already means zero or more times.
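Concretely, the suggestion means declaring steps as a plain array of strings rather than a union with null. A trimmed sketch of the schema from the question with only that field changed:

// Sketch: only the "steps" field is shown changed; the other fields stay as in the question.
org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse(
    "{\n" +
    "  \"type\": \"record\",\n" +
    "  \"namespace\": \"com.example\",\n" +
    "  \"name\": \"Patterns\",\n" +
    "  \"fields\": [\n" +
    "    { \"name\": \"id\", \"type\": \"string\" },\n" +
    "    { \"name\": \"steps\", \"type\": {\"type\":\"array\",\"items\":\"string\"} }\n" +
    "  ]\n" +
    "}");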

Convert Loadrunner File Parameter to Java string for payload

I have a Java virtual user script that sends a payload request. I am trying to send values from a file via a LoadRunner file parameter.
Here is the payload:
private static final String PAYLOAD =
"<ips_cad_mdt>\n" +
" <SignOnRequest>\n" +
" <DestApplication>hhhh</DestApplication>\n" +
" <OrigApplication>hhh</OrigApplication>\n" +
" <SessionRef>3</SessionRef>\n" +
" <Aliasing>1234</Aliasing>\n" +
" </SignOnRequest>\n" +
"</ips_cad_mdt>";
I would like to use something like the following:
private static final String PAYLOAD =
"<ips_cad_mdt>\n" +
" <SignOnRequest>\n" +
" <DestApplication>hhh</DestApplication>\n" +
" <OrigApplication>hhh</OrigApplication>\n" +
" <SessionRef>3</SessionRef>\n" +
" <Aliasing>”+lr.eval_string(“{AliasId}”)+”</Aliasing>\n" +
" </SignOnRequest>\n" +
"</ips_cad_mdt>";
For some reason I can't see any output for this value. Do I need to declare a variable, e.g. lr.save_string("AliasId", "{AliasId}");?
An example of this would help loads. Many thanks.
There seems to be an error in the code completion in VuGen. The parameters should be reversed and without the {} in save_string.
lr.save_string("1234","myId");
lr.message(lr.eval_string("{myId}"));
In the documentation it is correct - https://admhelp.microfocus.com/lr/en/12.55/help/function_reference/FuncRef.htm#FuncRef/c_vuser/lrFr_lr_save_string.htm?Highlight=lr_save_string
I asked the responsible team to fix the code completion in VuGen so you will be able to see this change in one of the future releases.
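Putting that together with the payload from the question, a sketch of the relevant part (assuming AliasId is defined as a file parameter in VuGen, so eval_string substitutes it at runtime):

// AliasId comes from the VuGen file parameter; eval_string resolves "{AliasId}" at runtime.
String payload =
    "<ips_cad_mdt>\n" +
    "   <SignOnRequest>\n" +
    "      <Aliasing>" + lr.eval_string("{AliasId}") + "</Aliasing>\n" +
    "   </SignOnRequest>\n" +
    "</ips_cad_mdt>";

// If you want to set a parameter from code instead, note the (value, name) order:
lr.save_string("1234", "myId");   // then reference it as lr.eval_string("{myId}")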
