Using FileInputFormat.addInputPaths to recursively add HDFS path - java

I've got an HDFS structure something like:
a/b/file1.gz
a/b/file2.gz
a/c/file3.gz
a/c/file4.gz
I'm using the classic pattern of
FileInputFormat.addInputPaths(conf, args[0]);
to set my input path for a Java MapReduce job.
This works fine if I specify args[0] as a/b, but it fails if I specify just a (my intention being to process all 4 files),
the error being:
Exception in thread "main" java.io.IOException: Not a file: hdfs://host:9000/user/hadoop/a
How do I recursively add everything under a?
I must be missing something simple...

As Eitan Illuz mentioned here, in Hadoop 2.4.0 a mapreduce.input.fileinputformat.input.dir.recursive configuration property was introduced that, when set to true, instructs the input format to include files recursively.
In Java code it looks like this:
// Enable recursive traversal of input directories
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
Job job = Job.getInstance(conf);
// etc.
I've been using this new property and find that it works well.
EDIT: Better yet, use this new method on FileInputFormat that achieves the same result:
Job job = Job.getInstance();
FileInputFormat.setInputDirRecursive(job, true);
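Putting the two pieces together for the original question, here is a minimal driver sketch (assuming Hadoop 2.4.0 or later; the class name, job name, and the rest of the job setup are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "recursive-input-example");
        // Add the top-level directory (e.g. "a") and let the input format
        // descend into a/b and a/c on its own.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileInputFormat.setInputDirRecursive(job, true);
        // ... set mapper, reducer, output format/path, etc., then submit the job.
    }
}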

This is a bug in the current version of Hadoop; here is the JIRA for it. It is still open. Either make the changes in the code and build the binaries yourself, or wait for it to be fixed in a coming release. Recursive processing of the files can be turned on or off; check the patch attached to the JIRA for more details.
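Until the fix lands, one manual workaround (a sketch only, not the patch from the JIRA; the helper name addInputPathsRecursively is made up here) is to walk the directory tree yourself and add each file as an input path:

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInputs {
    // Walk the tree rooted at 'root' and add every plain file as an input path.
    public static void addInputPathsRecursively(Job job, Path root) throws IOException {
        FileSystem fs = root.getFileSystem(job.getConfiguration());
        for (FileStatus status : fs.listStatus(root)) {
            if (status.isDirectory()) {
                addInputPathsRecursively(job, status.getPath());
            } else {
                FileInputFormat.addInputPath(job, status.getPath());
            }
        }
    }
}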

Related

How to run a GPR file with the GAMS Java API and run a GAMS model

I have a model with a GMS extension. When I run that model with GAMS Studio, it runs perfectly and I obtain the expected results.
I have tried to run the GMS model with the GAMS IDE, but I get a lot of errors, so I have tried something different: I opened a file with a GPR extension, imported the GMS model after that, and everything works perfectly when I run the project.
I think I need to do the same thing using the GAMS Java API, but I don't know how to import a GPR file into my workspace.
At the moment I just have the following code:
GAMSWorkspace workspace = new GAMSWorkspace();
workspace.setDebugLevel(DebugLevel.KEEP_FILES);
GAMSJob jobGams = workspace.addJobFromFile("fileModelGms");
jobGams.run();
When I run that code, I obtain an error:
GAMS process returns unsuccessfully with return code : 2 [there was a compilation error]. Check _gams_java_gjo1.lst for more details.
The GPR file has a format that is only understood by the GAMS IDE; you cannot pass it to any API. If you get errors calling your model from the API but not from the GAMS IDE, you have probably set certain options using the IDE that you should now set through the API as well. Without seeing the exact error, though, it is hard to give further hints.
I have solved the problem with Lutz's help. I needed to include the directory with the input data that the model uses.
This is my code, commented line by line, to show how the GAMS API works. I used a specific working directory because the API otherwise creates a folder under the temp directory every time you run a new job, and I also used a GDX database to run my model.
// Specific workspace information is created, for example: C:/Desktop/Workspace
GAMSWorkspaceInfo workspaceInfo = new GAMSWorkspaceInfo();
workspaceInfo.setWorkingDirectory("specificPathWorkspace");
// A new workspace is created from the workspace information.
GAMSWorkspace workspace = new GAMSWorkspace(workspaceInfo);
workspace.setDebugLevel(DebugLevel.KEEP_FILES);
// Options object used to point the job at the input data.
GAMSOptions options = workspace.addOptions();
// Set the path to the input data, for example: C:/Desktop/InputData
options.IDir.add("PathWithInputData");
// Load the GDX database holding the data to be processed, for example: db.gdx
GAMSDatabase gdxdb = workspace.addDatabaseFromGDX("db.gdx");
// Create a job to execute the model.
GAMSJob jobGams = workspace.addJobFromFile(entradasModeloGamsDTO.getPathModeloGams());
// Run the model with the options and the GDX database.
jobGams.run(options, gdxdb);

Google Cloud Dataflow: Submitted job is executing but using old code

I'm writing a Dataflow pipeline that should do 3 things:
Reading .csv files from GCP Storage
Parsing the data to BigQuery-compatible TableRows
Writing the data to a BigQuery table
Up until now this all worked like a charm. And it still does, but when I change the source and destination variables nothing changes. The job that actually runs is an old one, not the recently changed (and committed) code. Somehow when I run the code from Eclipse using the BlockingDataflowPipelineRunner the code itself is not uploaded but an older version is used.
There is normally nothing wrong with the code, but to be as complete as possible:
public class BatchPipeline {
    public static void main(String[] args) {
        String source = "gs://sourcebucket/*.csv";
        String destination = "projectID:datasetID.testing1";
        // Creation of the pipeline with default arguments
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
        PCollection<String> line = p.apply(TextIO.Read.named("ReadFromCloudStorage")
                .from(source));
        @SuppressWarnings("serial")
        PCollection<TableRow> tablerows = line.apply(ParDo.named("ParsingCSVLines").of(new DoFn<String, TableRow>() {
            @Override
            public void processElement(ProcessContext c) {
                // processing code goes here
            }
        }));
        // Defining the BigQuery table schema
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("datetime").setType("TIMESTAMP").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("consumption").setType("FLOAT").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("meterID").setType("STRING").setMode("REQUIRED"));
        TableSchema schema = new TableSchema().setFields(fields);
        String table = destination;
        tablerows.apply(BigQueryIO.Write
                .named("BigQueryWrite")
                .to(table)
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withoutValidation());
        // Runs the pipeline
        p.run();
    }
}
This problem arose because I've just changed laptops and had to reconfigure everything. I'm working on a clean Ubuntu 16.04 LTS OS with all the dependencies for GCP development installed. Everything seems to be configured correctly, since I'm able to start a job (which shouldn't be possible if my config were wrong, right?). I'm using Eclipse Neon, by the way.
So where could the problem lie? It seems to me that there is a problem uploading the code, but I've made sure that my cloud git repo is up-to-date and the staging bucket has been cleaned up ...
**** UPDATE ****
I never found out exactly what was going wrong, but when I checked the creation dates of the files inside my deployed jar, I saw that they had never really been updated. The jar file itself, however, had a recent timestamp, which made me overlook that problem completely (rookie mistake).
I eventually got it all working again by simply creating a new Dataflow project in Eclipse and copying my .java files from the broken project into the new one. Everything worked like a charm from then on.
Once you submit a Dataflow job, you can check which artifacts were part of the job specification by inspecting the list of staged files, which is available via DataflowPipelineWorkerPoolOptions#getFilesToStage. The code snippet below gives a small sample of how to get this information.
PipelineOptions myOptions = ...
myOptions.setRunner(DataflowPipelineRunner.class);
Pipeline p = Pipeline.create(myOptions);
// Build up your pipeline and run it.
p.apply(...)
p.run();
// At this point in time, the files which were staged by the
// DataflowPipelineRunner will have been populated into the
// DataflowPipelineWorkerPoolOptions#getFilesToStage
List<String> stagedFiles = myOptions.as(DataflowPipelineWorkerPoolOptions.class).getFilesToStage();
for (String stagedFile : stagedFiles) {
    System.out.println(stagedFile);
}
The above code should print out something like:
/my/path/to/file/dataflow.jar
/another/path/to/file/myapplication.jar
/a/path/to/file/alibrary.jar
It is likely that the resources that are part of the job you're uploading are out of date in some way and contain your old code. Look through all the directories and jars in the staging list, find all instances of BatchPipeline, and verify their age. Jar files can be extracted using the jar tool or any zip file reader. Alternatively, use javap or any other class file inspector to validate that the BatchPipeline class file lines up with the expected changes you have made.
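For example, here is a small sketch (the jar path is illustrative) that lists the BatchPipeline entries in a staged jar together with their timestamps, using java.util.jar:

import java.util.Date;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class InspectStagedJar {
    public static void main(String[] args) throws Exception {
        // Path to one of the jars printed by getFilesToStage (illustrative).
        try (JarFile jar = new JarFile("/my/path/to/file/dataflow.jar")) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                // Print each BatchPipeline class file with its last-modified time.
                if (entry.getName().contains("BatchPipeline")) {
                    System.out.println(entry.getName() + " -> " + new Date(entry.getTime()));
                }
            }
        }
    }
}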

No list of internal splits provided! No Rules URL Provided?

I am setting up GATE to run on a text document. I want to use the DefaultTokeniser and POSTagger, but I am getting an error while initializing the ANNIE controller.
Exception in thread "main" gate.creole.ResourceInstantiationException: No URL provided for the rules!
at gate.creole.tokeniser.SimpleTokeniser.init(SimpleTokeniser.java:131)
at gate.Factory.createResource(Factory.java:302)
at gate.Factory.createResource(Factory.java:117)
at gate.creole.tokeniser.DefaultTokeniser.init(DefaultTokeniser.java:55)
at gate.Factory.createResource(Factory.java:302)
at gate.Factory.createResource(Factory.java:97)
Can you please help?
Could you please share how you created your application pipeline?
From the error description I would guess that you have a wrong path in your tokeniser; maybe you accidentally changed something in the default rules path. A basic pipeline looks like this:
ProcessingResource tokeniser = (ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser",Factory.newFeatureMap());
SerialAnalyserController pipeline = (SerialAnalyserController) Factory.createResource("gate.creole.SerialAnalyserController");
pipeline.add(tokeniser);
I think the issue was with the GATE home path, so I just removed the old version, reinstalled the latest version of GATE, set the path accordingly, and it worked.
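For reference, a minimal sketch of setting the GATE home explicitly from code before creating any resources (the install path /opt/gate is illustrative, and this assumes a pre-8.5 GATE layout with an ANNIE plugin directory):

import java.io.File;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class GateHomeExample {
    public static void main(String[] args) throws Exception {
        // Point GATE at its installation and plugins directories before init.
        Gate.setGateHome(new File("/opt/gate"));
        Gate.setPluginsHome(new File("/opt/gate/plugins"));
        Gate.init();
        // Register the ANNIE plugin so DefaultTokeniser can find its rules.
        Gate.getCreoleRegister().registerDirectories(
                new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

        ProcessingResource tokeniser = (ProcessingResource) Factory.createResource(
                "gate.creole.tokeniser.DefaultTokeniser", Factory.newFeatureMap());
        SerialAnalyserController pipeline = (SerialAnalyserController) Factory.createResource(
                "gate.creole.SerialAnalyserController");
        pipeline.add(tokeniser);
    }
}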

Hadoop: Job shows up in the job browser but unable to access the JobStatus via api

I have run an example hadoop job, and when I look at the Jobs area of the Hue web app I can see the details for my job. I would like to access this info programmatically... I wrote the following code as a test:
JobClient jobClient = new JobClient(new Configuration());
JobStatus[] jobStatuses = jobClient.getAllJobs();
System.out.println("Found " + jobStatuses.length + " job statuses.");
for (JobStatus jobStatus : jobStatuses) {
    System.out.println(jobStatus.getJobID());
}
jobClient.close();
Output is: "Found 0 job statuses."
Other details: I am testing this using the CDH4 standalone VM, with the conf files from /etc/hadoop/conf/conf.cloudera.yarn1 (using the ones in /etc/hadoop/conf did not work).
The question here seems related but is unanswered as well...
What are some areas that I could investigate to sort this out?
Thanks!
After some additional research I determined that I was using the MR2 (YARN) compatible jars instead of the MR1 compatible jars. I changed my pom.xml appropriately and the problems magically went away.
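If you are unsure which Hadoop jar your client code is actually picking up, one quick sanity check (a sketch, not part of the original post) is to print where the JobClient class was loaded from:

import org.apache.hadoop.mapred.JobClient;

public class WhichHadoopJar {
    public static void main(String[] args) {
        // Prints the jar that JobClient was loaded from, which makes it easy to
        // spot an MR2/YARN artifact on the classpath when you expected MR1.
        System.out.println(
                JobClient.class.getProtectionDomain().getCodeSource().getLocation());
    }
}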

java.lang.IllegalArgumentException: no JSON input found while trying google calendar Api in java

I downloaded the Google Calendar API sample from http://code.google.com/p/google-api-java-client/source/browse/calendar-cmdline-sample/?repo=samples and created a project in Eclipse.
Now when I try to run the project I am getting java.lang.IllegalArgumentException: no JSON input found at this line:
FileCredentialStore credentialStore = new FileCredentialStore(
new File(System.getProperty("user.home"), ".credentials/calendar.json"), JSON_FACTORY);
Have any of you tried this example? What is wrong here?
This error can be resolved by providing input to the .credentials/calendar.json file. If you manually provide the following entry in calendar.json, it will work:
{
  "installed": {
    "client_id": "client_id",
    "client_secret": "client_secret"
  }
}
This seems to be a Windows problem where writable permissions cannot be set on the calendar.json file. The method setWritable(boolean, boolean) returns false, which is the cause of the problem. Providing the JSON input manually is not a perfect solution, but your application will work.
That may happen when your application has run before and created an empty .credentials/calendar.json file in your home dir. That can happen if you're running your application on Windows, because FileCredentialStore tries to do:
file.setReadable(false, false)
and fails.
To solve it, just remove calendar.json. You might then get another error, [unable to set file permissions], which I don't know how to solve yet.
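A minimal sketch of that cleanup before building the store (the empty-file check and paths are illustrative, assuming the sample's default store location and the Jackson JSON factory):

import java.io.File;
import com.google.api.client.extensions.java6.auth.oauth2.FileCredentialStore;
import com.google.api.client.json.jackson2.JacksonFactory;

public class CredentialStoreSetup {
    public static void main(String[] args) throws Exception {
        File credentialFile =
                new File(System.getProperty("user.home"), ".credentials/calendar.json");
        // Remove a stale, empty store left over from a previous failed run, so that
        // FileCredentialStore does not fail with "no JSON input found".
        if (credentialFile.exists() && credentialFile.length() == 0) {
            credentialFile.delete();
        }
        FileCredentialStore credentialStore =
                new FileCredentialStore(credentialFile, new JacksonFactory());
    }
}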
Does that project have the calendar.json resource file? Please share the complete exception stack trace.
It seems some required configuration is missing from the calendar.json file.
