How to set Amazon AMI's Hadoop configuration using Java code - java

I want to set the configuration textinputformat.record.delimiter=; on Hadoop.
Right now I use the following code to run a Pig script on the AMI. Does anyone know how to set this configuration using the code below?
Code:
StepConfig installPig = new StepConfig()
        .withName("Install Pig")
        .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW.name())
        .withHadoopJarStep(stepFactory.newInstallPigStep());

// Configure the Pig script step
String[] scriptArgs = new String[] { "-p", input, "-p", output };
StepConfig runPigLatinScript = new StepConfig()
        .withName("Run Pig Script")
        .withActionOnFailure(ActionOnFailure.CANCEL_AND_WAIT.name())
        .withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://pig/script.pig", scriptArgs));

// Configure the JobFlow
RunJobFlowRequest request = new RunJobFlowRequest()
        .withName(jobFlowName)
        .withSteps(installPig, runPigLatinScript)
        .withLogUri(logUri)
        .withAmiVersion("2.3.2")
        .withInstances(new JobFlowInstancesConfig()
                .withEc2KeyName(this.ec2KeyName)
                .withInstanceCount(this.count)
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withMasterInstanceType(this.masterType)
                .withSlaveInstanceType(this.slaveType));

// Run the JobFlow
RunJobFlowResult runJobFlowResult = this.amazonEmrClient.runJobFlow(request);

What you need to do is create a BootstrapActionConfig and add it to the RunJobFlowRequest you are building; the bootstrap action then applies the custom Hadoop configuration to the cluster.
Here is the complete code, written by editing the code above:
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;

/**
 * @author amar
 */
public class RunEMRJobFlow {

    private static final String CONFIG_HADOOP_BOOTSTRAP_ACTION = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop";

    public static void main(String[] args) {
        String accessKey = "";
        String secretKey = "";
        AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);
        StepFactory stepFactory = new StepFactory();

        StepConfig enabledebugging = new StepConfig().withName("Enable debugging")
                .withActionOnFailure("TERMINATE_JOB_FLOW").withHadoopJarStep(stepFactory.newEnableDebuggingStep());
        StepConfig installHive = new StepConfig().withName("Install Hive").withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(stepFactory.newInstallHiveStep());

        // Arguments for the configure-hadoop bootstrap action: set the desired Hadoop property
        List<String> setMappersArgs = new ArrayList<String>();
        setMappersArgs.add("-s");
        setMappersArgs.add("textinputformat.record.delimiter=;");
        BootstrapActionConfig mappersBootstrapConfig = createBootstrapAction("Set Hadoop Config",
                CONFIG_HADOOP_BOOTSTRAP_ACTION, setMappersArgs);

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withBootstrapActions(mappersBootstrapConfig)
                .withName("Hive Interactive")
                .withSteps(enabledebugging, installHive)
                .withLogUri("s3://myawsbucket/")
                .withInstances(
                        new JobFlowInstancesConfig().withEc2KeyName("keypair").withHadoopVersion("0.20")
                                .withInstanceCount(5).withKeepJobFlowAliveWhenNoSteps(true)
                                .withMasterInstanceType("m1.small").withSlaveInstanceType("m1.small"));

        RunJobFlowResult result = emr.runJobFlow(request);
    }

    private static BootstrapActionConfig createBootstrapAction(String bootstrapName, String bootstrapPath,
            List<String> args) {
        ScriptBootstrapActionConfig bootstrapScriptConfig = new ScriptBootstrapActionConfig();
        bootstrapScriptConfig.setPath(bootstrapPath);
        if (args != null) {
            bootstrapScriptConfig.setArgs(args);
        }
        BootstrapActionConfig bootstrapConfig = new BootstrapActionConfig();
        bootstrapConfig.setName(bootstrapName);
        bootstrapConfig.setScriptBootstrapAction(bootstrapScriptConfig);
        return bootstrapConfig;
    }
}

Related

Flink Inner Join Missing records and adding duplicates

We are running a Flink application on AWS Kinesis Analytics.
We use Kafka as our source and sink, and event time for watermark generation. We have a window of 5 seconds, and we perform an inner join on a common field.
The Kafka topics have 12 partitions and Flink runs with 3-way parallelism.
Issues observed: for some windows we are missing records (records that should join based on event time do not), and for other windows we are seeing duplicate records.
Sample records:
{"empName":"ted","timestamp":"0","uuid":"f2c2e48a44064d0fa8da5a3896e0e42a","empId":"23698"}
{"empName":"ted","timestamp":"1","uuid":"069f2293ad144dd38a79027068593b58","empId":"23145"}
{"empName":"john","timestamp":"2","uuid":"438c1f0b85154bf0b8e4b3ebf75947b6","empId":"23698"}
{"empName":"john","timestamp":"0","uuid":"76d1d21ed92f4a3f8e14a09e9b40a13b","empId":"23145"}
{"empName":"ted","timestamp":"0","uuid":"bbc3bad653aa44c4894d9c4d13685fba","empId":"23698"}
{"empName":"ted","timestamp":"0","uuid":"530871933d1e4443ade447adc091dcbe","empId":"23145"}
{"empName":"ted","timestamp":"1","uuid":"032d7be009cb448bb40fe5c44582cb9c","empId":"23698"}
{"empName":"john","timestamp":"1","uuid":"e5916821bd4049bab16f4dc62d4b90ea","empId":"23145"}
{"empId":"23698","timestamp":"0","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"34"}
{"empId":"23698","timestamp":"1","expense":"234"}
{"empId":"23145","timestamp":"2","expense":"234"}
{"empId":"23698","timestamp":"2","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"234"}
{"empId":"23145","timestamp":"0","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"34"}
{"empId":"23145","timestamp":"1","expense":"34"}
Below is the code for reference.
As you can see, across the two streams many event timestamps repeat. There can be thousands of employee and empId combinations (in the real data there are many more dimensions), and they all arrive on a single Kafka topic.
import java.text.SimpleDateFormat;
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.std.StringDeserializer;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
private static final Logger LOG = LoggerFactory.getLogger(Main.class);
// static String TOPIC_IN = "event_hub_all-mt-partitioned";
static String TOPIC_ONE = "kafka_one_multi";
static String TOPIC_TWO = "kafka_two_multi";
static String TOPIC_OUT = "final_join_topic_multi";
static String BOOTSTRAP_SERVER = "localhost:9092";
public static void main(String[] args) {
Producer<String> emp = new Producer<String>(BOOTSTRAP_SERVER, StringSerializer.class.getName());
Producer<String> dept = new Producer<String>(BOOTSTRAP_SERVER, StringSerializer.class.getName());
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
Properties props = new Properties();
props.put("bootstrap.servers", BOOTSTRAP_SERVER);
props.put("client.id", "flink-example1");
FlinkKafkaConsumer<Employee> kafkaConsumerOne = new FlinkKafkaConsumer<>(TOPIC_ONE, new EmployeeSchema(),
props);
LOG.info("Coming to main function");
//Commenting event timestamp for watermark generation!!
var empDebugStream = kafkaConsumerOne.assignTimestampsAndWatermarks(
WatermarkStrategy.<Employee>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((employee, timestamp) -> employee.getTimestamp().getTime())
.withIdleness(Duration.ofSeconds(1)));
// for allowing Flink to handle late elements
kafkaConsumerOne.setStartFromLatest();
FlinkKafkaConsumer<EmployeeExpense> kafkaConsumerTwo = new FlinkKafkaConsumer<>(TOPIC_TWO,
new DepartmentSchema(), props);
//Commenting event timestamp for watermark generation!!
kafkaConsumerTwo.assignTimestampsAndWatermarks(
WatermarkStrategy.<EmployeeExpense>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((employeeExpense, timestamp) -> employeeExpense.getTimestamp().getTime())
.withIdleness(Duration.ofSeconds(1)));
kafkaConsumerTwo.setStartFromLatest();
// EventSerializationSchema<EmployeeWithExpenseAggregationStats> employeeWithExpenseAggregationSerializationSchema = new EventSerializationSchema<EmployeeWithExpenseAggregationStats>(
// TOPIC_OUT);
EventSerializationSchema<EmployeeWithExpense> employeeWithExpenseSerializationSchema = new EventSerializationSchema<EmployeeWithExpense>(
TOPIC_OUT);
// FlinkKafkaProducer<EmployeeWithExpenseAggregationStats> sink = new FlinkKafkaProducer<EmployeeWithExpenseAggregationStats>(
// TOPIC_OUT,
// employeeWithExpenseAggregationSerializationSchema,props,
// FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
FlinkKafkaProducer<EmployeeWithExpense> sink = new FlinkKafkaProducer<EmployeeWithExpense>(TOPIC_OUT,
employeeWithExpenseSerializationSchema, props, FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
DataStream<Employee> empStream = env.addSource(kafkaConsumerOne)
.transform("debugFilter", empDebugStream.getProducedType(), new StreamWatermarkDebugFilter<>())
.keyBy(emps -> emps.getEmpId());
DataStream<EmployeeExpense> expStream = env.addSource(kafkaConsumerTwo).keyBy(exps -> exps.getEmpId());
// DataStream<EmployeeWithExpense> aggInputStream = empStream.join(expStream)
empStream.join(expStream).where(new KeySelector<Employee, Tuple1<Integer>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple1<Integer> getKey(Employee value) throws Exception {
return Tuple1.of(value.getEmpId());
}
}).equalTo(new KeySelector<EmployeeExpense, Tuple1<Integer>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple1<Integer> getKey(EmployeeExpense value) throws Exception {
return Tuple1.of(value.getEmpId());
}
}).window(TumblingEventTimeWindows.of(Time.seconds(5))).allowedLateness(Time.seconds(15))
.apply(new JoinFunction<Employee, EmployeeExpense, EmployeeWithExpense>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public EmployeeWithExpense join(Employee first, EmployeeExpense second) throws Exception {
return new EmployeeWithExpense(second.getTimestamp(), first.getEmpId(), second.getExpense(),
first.getUuid(), LocalDateTime.now()
.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'+0000'")));
}
}).addSink(sink);
// KeyedStream<EmployeeWithExpense, Tuple3<Integer, Integer,Long>> inputKeyedByGWNetAccountProductRTG = aggInputStream
// .keyBy(new KeySelector<EmployeeWithExpense, Tuple3<Integer, Integer,Long>>() {
//
// /**
// *
// */
// private static final long serialVersionUID = 1L;
//
// @Override
// public Tuple3<Integer, Integer,Long> getKey(EmployeeWithExpense value) throws Exception {
// return Tuple3.of(value.empId, value.expense,Instant.ofEpochMilli(value.timestamp.getTime()).truncatedTo(ChronoUnit.SECONDS).toEpochMilli());
// }
// });
//
// inputKeyedByGWNetAccountProductRTG.window(TumblingEventTimeWindows.of(Time.seconds(2)))
// .aggregate(new EmployeeWithExpenseAggregator()).addSink(sink);
// streamOne.print();
// streamTwo.print();
// DataStream<KafkaRecord> streamTwo = env.addSource(kafkaConsumerTwo);
//
// streamOne.connect(streamTwo).flatMap(new CoFlatMapFunction<KafkaRecord, KafkaRecord, R>() {
// })
//
// // Create Kafka producer from Flink API
// Properties prodProps = new Properties();
// prodProps.put("bootstrap.servers", BOOTSTRAP_SERVER);
//
// FlinkKafkaProducer<KafkaRecord> kafkaProducer =
//
// new FlinkKafkaProducer<KafkaRecord>(TOPIC_OUT,
//
// ((record, timestamp) -> new ProducerRecord<byte[], byte[]>(TOPIC_OUT, record.key.getBytes(), record.value.getBytes())),
//
// prodProps,
//
// Semantic.EXACTLY_ONCE);;
//
// DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);
//
// stream.filter((record) -> record.value != null && !record.value.isEmpty()).keyBy(record -> record.key)
// .timeWindow(Time.seconds(15)).allowedLateness(Time.milliseconds(500))
// .reduce(new ReduceFunction<KafkaRecord>() {
// /**
// *
// */
// private static final long serialVersionUID = 1L;
// KafkaRecord result = new KafkaRecord();
// @Override
// public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
// {
// result.key = "outKey";
//
// result.value = record1.value + " " + record2.value;
//
// return result;
// }
// }).addSink(kafkaProducer);
// produce a number as string every second
new MessageGenerator(emp, TOPIC_ONE, "EMP").start();
new MessageGenerator(dept, TOPIC_TWO, "EXP").start();
// for visual topology of the pipeline. Paste the below output in
// https://flink.apache.org/visualizer/
// System.out.println(env.getExecutionPlan());
// start flink
try {
env.execute();
LOG.debug("Starting flink application!!");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Questions
How do we debug when a window is emitted? Is there a way to add both streams to a sink (Kafka) and see when the records are emitted, window by window?
Can we route late-arriving records to a sink to inspect them? (A sketch of one approach follows below.)
What is the cause of the duplicates, and how do we debug them?
Any help in this direction is greatly appreciated. Thanks in advance.
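As a minimal sketch for the late-records question, and assuming the Employee type and the empStream variable from the code above, Flink's side-output mechanism can expose late data on a keyed tumbling window (JoinedStreams itself does not expose side outputs); the tag name and the placeholder reduce below are illustrative only:
// Requires two extra imports in Main:
// import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
// import org.apache.flink.util.OutputTag;
final OutputTag<Employee> lateEmployees = new OutputTag<Employee>("late-employees") {};
// Records arriving after watermark + allowed lateness are routed to the side output instead of being dropped.
SingleOutputStreamOperator<Employee> windowedEmps = empStream
        .keyBy(emp -> emp.getEmpId())
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .allowedLateness(Time.seconds(15))
        .sideOutputLateData(lateEmployees)
        .reduce((a, b) -> b); // placeholder aggregation, only used to materialize the window
// Print the late records (or write them to a dedicated Kafka topic) for inspection.
windowedEmps.getSideOutput(lateEmployees).print("LATE-EMP");
On the duplicates: allowedLateness(Time.seconds(15)) on the join window means every late-but-in-time record re-fires the window and re-emits its results, and with an AT_LEAST_ONCE Kafka producer those repeated firings arrive downstream as apparent duplicates; that is one plausible source of the behaviour observed.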

Java HttpClient - converting HttpResponse to String[]

I am new to the Java HttpClient and am currently trying to create a RestServiceApplication, but I am unable to convert an HttpResponse to a String array in order to access the elements of the array.
package com.example.restservice;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class RestServiceApplication {
public static void main(String[] args) {
SpringApplication.run(RestServiceApplication.class, args);
}
}
I have implemented the following controller for the application, whose methods each return an array of Strings.
package com.example.restservice;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
@RestController
public class ConfigurationController {
@GetMapping("/getlenkertypen")
public String[] getLenkertypen() {
String[] lenker = new String[3];
lenker[0] = "Flatbarlenker";
lenker[1] = "Rennradlenker";
lenker[2] = "Bullhornlenker";
return lenker;
}
#GetMapping("/getmaterial")
public String[] getMaterial() {
String[] material = new String[3];
material[0] = "Aluminium";
material[1] = "Stahl";
material[2] = "Kunststoff";
return material;
}
#GetMapping("/getschaltung")
public String[] getSchaltung() {
String[] schaltung = new String[3];
schaltung[0] = "Kettenschaltung";
schaltung[1] = "Nabenschaltung";
schaltung[2] = "Tretlagerschaltung";
return schaltung;
}
#GetMapping("/getgriff")
public String[] getGriff() {
String[] griff = new String[3];
griff[0] = "Ledergriff";
griff[1] = "Schaumstoffgriff";
griff[2] = "Kunststoffgriff";
return griff;
}
#GetMapping("/test")
public String test() {
String[] griff = new String[3];
griff[0] = "Ledergriff";
griff[1] = "Schaumstoffgriff";
griff[2] = "Kunststoffgriff";
return "test";
}
}
Now I want the HttpClient to request those endpoints and access the Strings of the array.
My problem is that the returned String[] arrays arrive not as arrays but as a single String (BodyHandlers.ofString()), so I cannot access the elements anymore. I hoped to solve this with the BodyHandlers.ofLines() method, converting the response to a Stream and then calling toArray() on the stream, but that still creates an array with only one element, containing the same String I get from the ofString() method.
Here is my client:
package com.example.restservice;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Scanner;
import java.util.stream.Stream;
public class GetRequest {
public static void main(String[] args) throws IOException, InterruptedException {
HttpClient client = HttpClient.newHttpClient();
/*
* Abfrage Lenkertypen
*/
HttpRequest requestLenkertypen = HttpRequest.newBuilder()
.uri(URI.create("http://localhost:8081/getlenkertypen"))
.build();
HttpResponse<Stream<String>> lenkerResponse = client.send(requestLenkertypen,
HttpResponse.BodyHandlers.ofLines());
String[] lenkerArray = lenkerResponse.body().toArray(String[]::new);
System.out.println("Bitte wählen Sie als erstes den Lenker aus: ");
for (int i = 0; i < lenkerArray.length; i++) {
System.out.println(lenkerArray[i] + " " + i);
}
System.out.println(lenkerArray[0]);
Scanner scanner = new Scanner(System.in);
int answer = scanner.nextInt();
System.out.println(answer);
HttpRequest requestSchaltung = HttpRequest.newBuilder()
.uri(URI.create("http://localhost:8081/getschaltung"))
.build();
HttpResponse<String> response = client.send(requestSchaltung,
HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());
}
}
By googling I found out that I somehow need to parse the response JSON, but I do not know how. I downloaded the project structure from spring.io, so I think the response is returned as JSON, but I am new to this and any help would be appreciated.
Thanks!
If you want to get the lines of the response one at a time, use a BufferedReader and its lines() method.
However, it seems what you really want is to parse the JSON data that is returned. For that there are many possibilities. Consider using Jackson: https://www.baeldung.com/jackson
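For example, here is a minimal sketch assuming jackson-databind is on the classpath; since the controller serializes a String[], Jackson can bind the JSON array straight back to a String[]. It reuses the client variable from the question's main method, which already declares throws IOException:
// Requires: import com.fasterxml.jackson.databind.ObjectMapper;
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8081/getlenkertypen"))
        .build();
// Read the whole body as one String, then let Jackson map the JSON array onto String[]
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
String[] lenkerArray = new ObjectMapper().readValue(response.body(), String[].class);
for (String lenker : lenkerArray) {
    System.out.println(lenker);
}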

Saving an H2O model directly from Java

I'm trying to create and save a generated model directly from Java. The documentation specifies how to do this in R and Python, but not in Java. A similar question was asked before, but no real answer was provided (beyond linking to H2O doc, which doesn't contain a code example).
It'd be sufficient for my present purpose to get some pointers that help me translate the following reference code to Java. I'm mainly looking for guidance on the relevant JAR(s) to import from the Maven repository.
import h2o
h2o.init()
path = h2o.system_file("prostate.csv")
h2o_df = h2o.import_file(path)
h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
model = h2o.glm(y = "CAPSULE",
x = ["AGE", "RACE", "PSA", "GLEASON"],
training_frame = h2o_df,
family = "binomial")
h2o.download_pojo(model)
I think I've figured out an answer to my question; a self-contained code sample follows. However, I'd still appreciate an answer from the community, since I don't know whether this is the best or most idiomatic way to do it.
package org.name.company;
import hex.glm.GLMModel;
import water.H2O;
import water.Key;
import water.api.StreamWriter;
import water.api.StreamingSchema;
import water.fvec.Frame;
import water.fvec.NFSFileVec;
import hex.glm.GLMModel.GLMParameters.Family;
import hex.glm.GLMModel.GLMParameters;
import hex.glm.GLM;
import water.util.JCodeGen;
import java.io.*;
import java.util.Map;
public class Launcher
{
public static void initCloud(){
String[] args = new String [] {"-name", "h2o_test_cloud"};
H2O.main(args);
H2O.waitForCloudSize(1, 10 * 1000);
}
public static void main( String[] args ) throws Exception {
// Initialize the cloud
initCloud();
// Create a Frame object from CSV
File f = new File("/path/to/data.csv");
NFSFileVec nfs = NFSFileVec.make(f);
Key frameKey = Key.make("frameKey");
Frame fr = water.parser.ParseDataset.parse(frameKey, nfs._key);
// Create a GLM and output coefficients
Key modelKey = Key.make("modelKey");
try {
GLMParameters params = new GLMParameters();
params._train = frameKey;
params._response_column = fr.names()[1];
params._intercept = true;
params._lambda = new double[]{0};
params._family = Family.gaussian;
GLMModel model = new GLM(params).trainModel().get();
Map<String, Double> coefs = model.coefficients();
for(Map.Entry<String, Double> entry : coefs.entrySet()) {
System.out.format("%s: %f\n", entry.getKey(), entry.getValue());
}
String filename = JCodeGen.toJavaId(model._key.toString()) + ".java";
StreamingSchema ss = new StreamingSchema(model.new JavaModelStreamWriter(false), filename);
StreamWriter sw = ss.getStreamWriter();
OutputStream os = new FileOutputStream("/base/path/" + filename);
sw.writeTo(os);
} finally {
if (fr != null) {
fr.remove();
}
}
}
}
Would something like this do the trick?
public void saveModel(URI uri, Keyed<Frame> model)
{
Persist p = H2O.getPM().getPersistForURI(uri);
OutputStream os = p.create(uri.toString(), true);
model.writeAll(new AutoBuffer(os, true)).close();
}
Make sure the URI has a proper form, otherwise H2O will break with an NPE. As for Maven, you should be able to get away with just h2o-core.
<dependency>
<groupId>ai.h2o</groupId>
<artifactId>h2o-core</artifactId>
<version>3.14.0.2</version>
</dependency>

ContentModel cannot be resolved to a variable

I'm getting this error:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
ContentModel cannot be resolved to a variable
at test2CMIS.Test.main(Test.java:39)
and I don't understand where it comes from. Here is my code:
public class Test {
public static void main(String[] args){
Test atest = new Test();
Session session = atest.iniSession();
AuthenticationService authenticationService=null;
PersonService personService = null;
if (authenticationService.authenticationExists("test") == false)
{
authenticationService.createAuthentication("test", "changeMe".toCharArray());
PropertyMap ppOne = new PropertyMap(4);
ppOne.put(ContentModel.PROP_USERNAME, "test");
ppOne.put(ContentModel.PROP_FIRSTNAME, "firstName");
ppOne.put(ContentModel.PROP_LASTNAME, "lastName");
ppOne.put(ContentModel.PROP_EMAIL, "test"+"@example.com");
personService.createPerson(ppOne);
}
}
}
I did import org.alfresco.model.ContentModel, along with a lot of other libraries for my code.
Thanks for the help.
Here is the code I'm using; I left some of the things I tried in comments so you can see what I have done:
import java.io.File;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import org.alfresco.service.cmr.security.*;
import org.alfresco.error.AlfrescoRuntimeException;
import org.alfresco.model.ContentModel;
import java.util.Iterator;
import org.alfresco.repo.jscript.People;
import org.alfresco.repo.security.authentication.AuthenticationException;
import org.alfresco.service.cmr.security.AuthenticationService;
import org.alfresco.service.cmr.security.PersonService;
import org.alfresco.service.namespace.QName;
import org.alfresco.util.PropertyMap;
import org.apache.chemistry.opencmis.client.api.CmisObject;
import org.apache.chemistry.opencmis.client.api.Document;
import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.commons.PropertyIds;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.enums.BindingType;
import org.apache.chemistry.opencmis.commons.enums.VersioningState;
import org.apache.chemistry.opencmis.commons.exceptions.CmisContentAlreadyExistsException;
import org.apache.chemistry.opencmis.commons.exceptions.CmisUnauthorizedException;
import org.apache.chemistry.opencmis.client.util.FileUtils;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
public class Test {
public static void main(String[] args){
Test atest = new Test();
Session session = atest.iniSession();
AuthenticationService authenticationService=new AuthenticationServiceImpl();
PersonService personService = new PersonServiceImpl();
HashMap<QName, Serializable> properties = new HashMap<QName, Serializable>();
properties.put(ContentModel.PROP_USERNAME, "test");
properties.put(ContentModel.PROP_FIRSTNAME, "test");
properties.put(ContentModel.PROP_LASTNAME, "qsdqsd");
properties.put(ContentModel.PROP_EMAIL, "wshAlors@gmail.com");
properties.put(ContentModel.PROP_ENABLED, Boolean.valueOf(true));
properties.put(ContentModel.PROP_ACCOUNT_LOCKED, Boolean.valueOf(false));
personService.createPerson(properties);
authenticationService.createAuthentication("test", "changeme".toCharArray());
authenticationService.setAuthenticationEnabled("test", true);
authenticationService.getAuthenticationEnabled("Admin");
//String testAuthen = authenticationService.getCurrentTicket();
//System.out.println(testAuthen);
//QName username = QName.createQName("test");
//Map<QName,Serializable> propertiesUser = new HashMap<QName,Serializable>();
//propertiesUser.put(ContentModel.PROP_USERNAME,username);
//propertiesUser.put(ContentModel.PROP_FIRSTNAME,"test");
//propertiesUser.put(ContentModel.PROP_LASTNAME,"test");
//propertiesUser.put(ContentModel.PROP_EMAIL, "test@example.com");
//propertiesUser.put(ContentModel.PROP_PASSWORD,"0000");
//personService.createPerson(propertiesUser);
//if (authenticationService.authenticationExists("test") == false)
//{
// authenticationService.createAuthentication("test", "changeMe".toCharArray());
// PropertyMap ppOne = new PropertyMap(4);
// ppOne.put(ContentModel.PROP_USERNAME, "test");
// ppOne.put(ContentModel.PROP_FIRSTNAME, "test");
// ppOne.put(ContentModel.PROP_LASTNAME, "test");
// ppOne.put(ContentModel.PROP_EMAIL, "test@example.com");
//ppOne.put(ContentModel.PROP_JOBTITLE, "jobTitle");
// personService.createPerson(ppOne);
//}
}
public Session iniSession() {
Session session;
SessionFactoryImpl sf = SessionFactoryImpl.newInstance();
Map<String, String> parameters = new HashMap<String, String>();
Scanner reader = new Scanner(System.in);
System.out.println("Enter your logging : ");
String log = reader.nextLine();
System.out.println("Enter your password : ");
String pass = reader.nextLine();
parameters.put(SessionParameter.USER, log);
parameters.put(SessionParameter.PASSWORD, pass);
parameters.put(SessionParameter.BROWSER_URL, "http://127.0.0.1:8080/alfresco/api/-default-/public/cmis/versions/1.1/browser");
parameters.put(SessionParameter.BINDING_TYPE, BindingType.BROWSER.value());
parameters.put(SessionParameter.REPOSITORY_ID, "-default-");
try{
session = sf.createSession(parameters);
}catch(CmisUnauthorizedException cue){
session = null;
System.out.println("Wrong logging OR password !");
}
return session;
}
}
You are writing a runnable class that is not running in the same process as Alfresco. In that sense, your class is running "remotely".
Because your class is running remotely to Alfresco, you are correct in using CMIS. But CMIS will only allow you to perform Create, Read, Update, and Delete (CRUD) operations against documents and folders in Alfresco. CMIS does not know how to create users or groups.
Your class will not be able to instantiate the AuthenticationService or PersonService. Those are part of the Alfresco Foundation API, which only works when you are running in the same process as Alfresco, such as in an Action, a Behavior, or a Java-backed web script. In those cases, you use Spring dependency injection to inject those services into your Java class. You would then put your class in a JAR that gets deployed into the Alfresco web application and loaded by the same classloader as Alfresco's.
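For illustration only, a minimal sketch of that in-process variant, assuming the class is packaged into the Alfresco web application and declared as a Spring bean (the bean definition is omitted, and the class and method names are made up); the services are injected by setter instead of being instantiated with new:
public class UserCreator {
    private PersonService personService;
    private AuthenticationService authenticationService;

    // Called by Spring when the bean is wired up
    public void setPersonService(PersonService personService) {
        this.personService = personService;
    }

    public void setAuthenticationService(AuthenticationService authenticationService) {
        this.authenticationService = authenticationService;
    }

    public void createUser(String userName, String password, String email) {
        if (!authenticationService.authenticationExists(userName)) {
            authenticationService.createAuthentication(userName, password.toCharArray());
            PropertyMap properties = new PropertyMap(4);
            properties.put(ContentModel.PROP_USERNAME, userName);
            properties.put(ContentModel.PROP_FIRSTNAME, "firstName");
            properties.put(ContentModel.PROP_LASTNAME, "lastName");
            properties.put(ContentModel.PROP_EMAIL, email);
            personService.createPerson(properties);
        }
    }
}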
If you want to create users remotely, you should consider using the Alfresco REST API instead. Your runnable class can then use an HTTP client to make REST calls that create people and groups.
Thank you for everything! Thanks to you and some research, I found out how to do it! For others who are wondering how, I'll post what I did and the sites I used to understand it.
You just need to manipulate JSON with Java, because the Alfresco people page (127.0.0.1:8080/alfresco/service/api/people) returns a JSON object, and with it you'll be able to create, delete, search... users! Thanks again!
Sites:
https://api-explorer.alfresco.com/api-explorer/#/people
http://crunchify.com/json-manipulation-in-java-examples/
The code :
This is for creating a user:
public User createUser(String firstN, String lastN, String email, String pass, String authTicket) throws Exception{
try{
String url = "http://127.0.0.1:8080/alfresco/service/api/people?alf_ticket="+authTicket;
HttpClient httpclient = new HttpClient();
PostMethod mPost = new PostMethod(url);
//JSONObject obj = new JSONObject();
//JSONArray people = obj.getJSONArray("people");
JSONObject newUser = new JSONObject();
newUser.put("userName", firstN.toLowerCase().charAt(0)+lastN.toLowerCase());
newUser.put("enabled",true);
newUser.put("firstName",firstN);
newUser.put("lastName", lastN);
newUser.put("email", email);
newUser.put("quota",-1);
newUser.put("emailFreedDisable",false);
newUser.put("isDeleted",false);
newUser.put("isAdminAuthority",false);
newUser.put("password", pass);
//people.put(newUser);
//Response response = PostRequest(newUser.toString()));
StringRequestEntity requestEntity = new StringRequestEntity(
newUser.toString(),
"application/json",
"UTF-8");
mPost.setRequestEntity(requestEntity);
int statusCode2 = httpclient.executeMethod(mPost);
mPost.releaseConnection();
}catch(Exception e){
System.err.println("[ERROR] "+e);
}
return new User(firstN, lastN);
}
And if you want to get all the users you have in Alfresco:
public ArrayList<User> getAllUsers(String authTicket)
{
ArrayList<User> allUsers = new ArrayList<>();
String lastName, firstName;
try{
String url = "http://127.0.0.1:8080/alfresco/service/api/people?alf_ticket="+authTicket;
HttpClient httpclient = new HttpClient();
GetMethod mPost = new GetMethod(url);
int statusCode1 = httpclient.executeMethod(mPost);
System.out.println("statusLine >>> "+statusCode1+"....."
+"\n status line \n"
+mPost.getStatusLine()+"\nbody \n"+mPost.getResponseBodyAsString());
JSONObject obj = new JSONObject(mPost.getResponseBodyAsString());
JSONArray people = obj.getJSONArray("people");
int n = people.length();
for(int i =0 ; i < n ; i++)
{
JSONObject peoples = people.getJSONObject(i);
User u = new User(peoples.getString("firstName"), peoples.getString("lastName"));
if (!allUsers.contains(u)){
allUsers.add(u);
}
}
}catch(Exception e){
System.err.println("[ERROR] "+e);
}
return(allUsers);
}
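Both methods above expect an alf_ticket. As a complement, here is a minimal sketch of how one could obtain it with the same commons-httpclient and org.json classes used above, by calling Alfresco's login web script (the URL and the credentials handling are placeholders for your own installation):
public String getAuthTicket(String user, String password) throws Exception {
    String url = "http://127.0.0.1:8080/alfresco/service/api/login";
    HttpClient httpclient = new HttpClient();
    PostMethod mPost = new PostMethod(url);
    // The login web script expects the credentials as a JSON body
    JSONObject credentials = new JSONObject();
    credentials.put("username", user);
    credentials.put("password", password);
    mPost.setRequestEntity(new StringRequestEntity(credentials.toString(), "application/json", "UTF-8"));
    httpclient.executeMethod(mPost);
    // The response looks like {"data":{"ticket":"TICKET_..."}}
    JSONObject response = new JSONObject(mPost.getResponseBodyAsString());
    mPost.releaseConnection();
    return response.getJSONObject("data").getString("ticket");
}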

Colons in Apache Spark application path

I'm submitting an Apache Spark application to YARN programmatically:
package application.RestApplication;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
public class App {
public static void main(String[] args1) {
String[] args = new String[] {
"--class", "org.apache.spark.examples.JavaWordCount",
"--jar", "/opt/spark/examples/jars/spark-examples_2.11-2.0.0.jar",
"--arg", "hdfs://hadoop-master:9000/input/file.txt"
};
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
ClientArguments cArgs = new ClientArguments(args);
Client client = new Client(cArgs, config, sparkConf);
client.run();
}
}
I have a problem with the line "--arg", "hdfs://hadoop-master:9000/input/file.txt" - more specifically with the colons:
16/08/29 09:54:16 ERROR yarn.ApplicationMaster: Uncaught exception:
java.lang.NumberFormatException: For input string: "9000/input/plik2.txt"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:935)
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:547)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:405)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:247)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:749)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:747)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:774)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
How do I write (as an argument) a path to a file that contains colons? I have tried various combinations with slashes, backslashes, %3a, etc.
According to Utils#parseHostPort, which gets invoked during that call, Spark treats all the text after the last ':' as a port:
def parseHostPort(hostPort: String): (String, Int) = {
// Check cache first.
val cached = hostPortParseResults.get(hostPort)
if (cached != null) {
return cached
}
val indx: Int = hostPort.lastIndexOf(':')
// This is potentially broken - when dealing with ipv6 addresses for example, sigh ...
// but then hadoop does not support ipv6 right now.
// For now, we assume that if port exists, then it is valid - not check if it is an int > 0
if (-1 == indx) {
val retval = (hostPort, 0)
hostPortParseResults.put(hostPort, retval)
return retval
}
val retval = (hostPort.substring(0, indx).trim(), hostPort.substring(indx + 1).trim().toInt)
hostPortParseResults.putIfAbsent(hostPort, retval)
hostPortParseResults.get(hostPort)
}
As a consequence, the whole string 9000/input/file.txt is expected to be a single port number, which suggests you are not supposed to refer to your input file on HDFS this way. I guess someone more skilled in Apache Spark can give you better advice.
I changed the program to follow this example: https://github.com/mahmoudparsian/data-algorithms-book/blob/master/src/main/java/org/dataalgorithms/chapB13/client/SubmitSparkPiToYARNFromJavaCode.java
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
import org.apache.log4j.Logger;
public class SubmitSparkAppToYARNFromJavaCode {
public static void main(String[] args) throws Exception {
run();
}
static void run() throws Exception {
String sparkExamplesJar = "/opt/spark/examples/jars/spark-examples_2.11-2.0.0.jar";
final String[] args = new String[]{
"--jar",
sparkExamplesJar,
"--class",
"org.apache.spark.examples.JavaWordCount",
"--arg",
"hdfs://hadoop-master:9000/input/file.txt"
};
Configuration config = ConfigurationManager.createConfiguration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
sparkConf.setSparkHome(SPARK_HOME);
sparkConf.setMaster("yarn");
sparkConf.setAppName("spark-yarn");
sparkConf.set("master", "yarn");
sparkConf.set("spark.submit.deployMode", "cluster");
ClientArguments clientArguments = new ClientArguments(args);
Client client = new Client(clientArguments, config, sparkConf);
client.run();
}
}
and now it works!
