I'm writing a Dataflow pipeline that should do 3 things:
Read .csv files from GCP Storage
Parse the data into BigQuery-compatible TableRows
Write the data to a BigQuery table
Up until now this all worked like a charm, and it still does, but when I change the source and destination variables nothing changes. The job that actually runs is an old one, not the recently changed (and committed) code. Somehow, when I run the code from Eclipse using the BlockingDataflowPipelineRunner, the code itself is not uploaded; an older version is used instead.
There is normally nothing wrong with the code, but to be as complete as possible:
import java.util.ArrayList;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class BatchPipeline {

    public static void main(String[] args) {
        String source = "gs://sourcebucket/*.csv";
        String destination = "projectID:datasetID.testing1";

        // Creation of the pipeline with default arguments
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

        PCollection<String> line = p.apply(TextIO.Read.named("ReadFromCloudStorage")
                .from(source));

        @SuppressWarnings("serial")
        PCollection<TableRow> tablerows = line.apply(ParDo.named("ParsingCSVLines").of(new DoFn<String, TableRow>() {
            @Override
            public void processElement(ProcessContext c) {
                // processing code goes here
            }
        }));

        // Defining the BigQuery table schema
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("datetime").setType("TIMESTAMP").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("consumption").setType("FLOAT").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("meterID").setType("STRING").setMode("REQUIRED"));
        TableSchema schema = new TableSchema().setFields(fields);
        String table = destination;

        tablerows.apply(BigQueryIO.Write
                .named("BigQueryWrite")
                .to(table)
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withoutValidation());

        // Runs the pipeline
        p.run();
    }
}
This problem arose because I've just changed laptops and had to reconfigure everything. I'm working on a clean Ubuntu 16.04 LTS install with all the dependencies for GCP development installed (as far as I can tell). Everything seems to be configured correctly, since I'm able to start a job (which shouldn't be possible if my configuration were broken, right?). I'm using Eclipse Neon, by the way.
So where could the problem lie? It seems to me that there is a problem uploading the code, but I've made sure that my cloud git repo is up-to-date and the staging bucket has been cleaned up ...
**** UPDATE ****
I never found out exactly what was going wrong, but when I checked the creation dates of the files inside my deployed jar, I saw that they had never actually been updated. The jar file itself, however, had a recent timestamp, which made me overlook that problem completely (rookie mistake).
I eventually got it all working again by simply creating a new Dataflow project in Eclipse and copying my .java files from the broken project into the new one. Everything worked like a charm from then on.
Once you submit a Dataflow job, you can check which artifacts were part of the job specification by inspecting the list of staged files, which is available via DataflowPipelineWorkerPoolOptions#getFilesToStage. The code snippet below gives a small sample of how to get this information.
PipelineOptions myOptions = ...
myOptions.setRunner(DataflowPipelineRunner.class);
Pipeline p = Pipeline.create(myOptions);
// Build up your pipeline and run it.
p.apply(...)
p.run();
// At this point in time, the files which were staged by the
// DataflowPipelineRunner will have been populated into the
// DataflowPipelineWorkerPoolOptions#getFilesToStage
List<String> stagedFiles = myOptions.as(DataflowPipelineWorkerPoolOptions.class).getFilesToStage();
for (String stagedFile : stagedFiles) {
    System.out.println(stagedFile);
}
The above code should print out something like:
/my/path/to/file/dataflow.jar
/another/path/to/file/myapplication.jar
/a/path/to/file/alibrary.jar
It is likely that the resources that are part of the job you're uploading are out of date in some way and still contain your old code. Look through all the directories and jars in the staging list, find all instances of BatchPipeline, and verify their age. Jar files can be extracted using the jar tool or any zip file reader. Alternatively, use javap or any other class file inspector to validate that the BatchPipeline class file lines up with the expected changes you have made.
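For example, if you want to check the timestamps inside one of the staged jars without leaving Java, a small standalone check along these lines works (a sketch; the jar path and the BatchPipeline name are just the examples used above):

import java.io.IOException;
import java.util.Date;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class StagedJarInspector {
    public static void main(String[] args) throws IOException {
        // Path to one of the staged jars printed by getFilesToStage() -- adjust as needed.
        try (JarFile jar = new JarFile("/my/path/to/file/myapplication.jar")) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                // Print the last-modified time recorded for every entry mentioning BatchPipeline.
                if (entry.getName().contains("BatchPipeline")) {
                    System.out.println(entry.getName() + " -> " + new Date(entry.getTime()));
                }
            }
        }
    }
}

If the timestamps printed here are older than your latest commit, the stale jar is the culprit.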
Related
I'm trying to execute a GMS model in GAMS through the Java API. When I execute that model directly from GAMS Studio, it works perfectly, but when I run the model through the API I get a lot of errors.
I have searched and I think I need to add input data to the job or the workspace (I don't know which of them needs to know about the input data). I have a folder with a lot of data files, and those files are processed by GAMS Studio when I run the GMS model.
I believe I need to add these files through the Java API too, but I don't know how to add a folder, or whether I need to add the files one by one and how to do that.
My code is simple:
GAMSWorkspace workspace = new GAMSWorkspace();
workspace.setDebugLevel(DebugLevel.KEEP_FILES);
GAMSJob jobGams = workspace.addJobFromFile("fileModelGms");
jobGams.run();
I have solved the problem with the help of the user Lutz. I needed to include a directory with the input data that the model uses.
Below is my code, commented line by line, to show how the GAMS API works. I used a specific working directory because the API otherwise creates a folder in the temp files every time you run a new job. I also used a GDX database to run my model.
// A specific workspace directory is used, for example: C:/Desktop/Workspace
GAMSWorkspaceInfo workspaceInfo = new GAMSWorkspaceInfo();
workspaceInfo.setWorkingDirectory("specificPathWorkspace");
// A new workspace is created from workspaceInfo.
GAMSWorkspace workspace = new GAMSWorkspace(workspaceInfo);
workspace.setDebugLevel(DebugLevel.KEEP_FILES);
// Options object on which the input data directory is set.
GAMSOptions options = workspace.addOptions();
// Set the path with the input data, for example: C:/Desktop/InputData
options.IDir.add("PathWithInputData");
// Use a GDX database containing the data to be processed, for example: db.gdx
GAMSDatabase gdxdb = workspace.addDatabaseFromGDX("db.gdx");
// Create a job to execute the model.
GAMSJob jobGams = workspace.addJobFromFile(entradasModeloGamsDTO.getPathModeloGams());
// Run the model.
jobGams.run(options, gdxdb);
I have a model with the GMS extension. When I run that model with GAMS Studio, it runs perfectly and I obtain the expected results.
I have tried to run the GMS model with the GAMS IDE but I get a lot of errors, so I tried something different: I opened a file with the GPR extension, imported the GMS model after that, and everything works perfectly when I run the project.
I think I need to do the same thing using the GAMS Java API, but I don't know how to import a GPR file into my workspace.
At the moment I just have the following code:
GAMSWorkspace workspace = new GAMSWorkspace();
workspace.setDebugLevel(DebugLevel.KEEP_FILES);
GAMSJob jobGams = workspace.addJobFromFile("fileModelGms");
jobGams.run();
When I run that code, I obtain an error:
GAMS process returns unsuccessfully with return code : 2 [there was a compilation error]. Check _gams_java_gjo1.lst for more details.
The gpr file has a format that is only understood by the GAMS IDE. You cannot pass it to any API. If you get errors calling your model from the API but not from the GAMS IDE, you have probably set certain options in the IDE which you should now set through the API as well. Though, without seeing the exact error, it is hard to give further hints.
I have solved the problem with Lutz's help, in the same way as in my answer above: I added the directory with the model's input data via GAMSOptions.IDir (see the commented code in that answer).
I want to write a Gradle plugin which can inspect an Eclipse workspace directory, iterate over the open projects within the workspace, and determine the location of each.
Something like
Workspace workspace = EclipseUtils.parseWorkspace("c:/myEclipseWorkspace");
Collection<Project> projects = workspace.getProjects();
for (Project project : projects) {
    System.out.println(String.format("name=%s, location=%s, open=%s",
            project.getName(), project.getLocation(), project.isOpen()));
}
I've looked at my workspace and can see some .location files under c:\myEclipseWorkspace\.metadata\.plugins\org.eclipse.core.resources\.projects\
But these files are in a custom binary format.
Is there an Eclipse API that I can invoke to parse them? Or some other solution to iterate over the open projects in a workspace?
Please note that I want to do this externally to Eclipse and NOT within an Eclipse plugin.
Reading the Private Description to Obtain Location
Since you are writing in Java, you can reuse the Eclipse code from your external location.
That is, pull out some of the key code from org.eclipse.core.resources.ResourcesPlugin. Start with the implementation of org.eclipse.core.resources.ResourcesPlugin.getWorkspace() and then work your way to org.eclipse.core.resources.IWorkspaceRoot.getProjects().
The above code reads the project description here: org.eclipse.core.internal.resources.LocalMetaArea.readPrivateDescription(IProject, ProjectDescription) and that is called from org.eclipse.core.internal.localstore.FileSystemResourceManager.read(IProject, boolean) which has some logic about default locations.
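For orientation, here is roughly what that getWorkspace()/getProjects() chain looks like when it runs inside Eclipse. This snippet only works in a running Eclipse/OSGi environment, so for your external use case it is merely a map of where to lift the equivalent logic from, not something to call directly:

import org.eclipse.core.resources.IProject;
import org.eclipse.core.resources.IWorkspace;
import org.eclipse.core.resources.ResourcesPlugin;

public class ProjectLister {
    // Only usable inside a running Eclipse/OSGi environment; shown purely to
    // illustrate the call chain referred to above.
    public static void listProjects() {
        IWorkspace workspace = ResourcesPlugin.getWorkspace();
        for (IProject project : workspace.getRoot().getProjects()) {
            System.out.println(project.getName()
                    + " open=" + project.isOpen()
                    + " location=" + project.getLocation());
        }
    }
}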
This is the joy of the EPL: as long as your new program/feature is EPL-licensed, you can reuse Eclipse's core code to do new and wonderful things.
Reading Workspace State to Obtain Open/Close State
When reading workspace state, you are moving into the ElementTree data structures. Reading this without using the ElementTree classes is probably unrealistic. Using the ElementTree classes without full OSGi is probably unrealistic. I provide the following notes to help you on your way.
Working backwards:
ICoreConstants.M_OPEN is the flag value indicating whether a project is open or closed (set for open, clear for closed)
M_OPEN is tested when Project.isOpen() is called
The flags at runtime are in ResourceInfo.flags
The flags are loaded by ResourceInfo.readFrom() called from SaveManager.readElement()
The DataInput input passed to readElement is from the Element Tree stored in the workspace meta directory in .metadata/.plugins/org.eclipse.core.resources/.root/<id>.tree. The specific version (id) of the file to use is recorded in the safe table .metadata/.plugins/org.eclipse.core.resources/.safetable/org.eclipse.core.resources
The safe table is part of the SaveManager's internal state stored in a MasterTable
I've managed to parse the .location file using this as a reference:
protected Location parseLocation(File locationFile) throws IOException {
    // The project name is the name of the directory containing the .location file.
    String name = locationFile.getParentFile().getName();
    DataInputStream in = new DataInputStream(new FileInputStream(locationFile));
    try {
        // Skip the BEGIN_CHUNK marker.
        in.skip(ILocalStoreConstants.BEGIN_CHUNK.length);
        // The project location as written by Eclipse.
        String path = in.readUTF();
        // Referenced project names.
        int numRefs = in.readInt();
        String[] refNames = new String[numRefs];
        for (int i = 0; i < numRefs; ++i) {
            refNames[i] = in.readUTF();
        }
        in.skipBytes(ILocalStoreConstants.END_CHUNK.length);
        return new Location(name, path, refNames);
    } finally {
        in.close();
    }
}
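For completeness, a rough driver for the parser above might walk the per-project metadata directories like this. The workspace path is just an example, and the handling of projects without a .location file is my assumption based on the default-location logic mentioned earlier, not official Eclipse API behaviour:

import java.io.File;

public class WorkspaceScanner {
    public static void main(String[] args) {
        // Example workspace path; adjust to your own workspace.
        File projectsDir = new File(
                "c:/myEclipseWorkspace/.metadata/.plugins/org.eclipse.core.resources/.projects");
        File[] projectDirs = projectsDir.listFiles(File::isDirectory);
        if (projectDirs == null) {
            System.err.println("Not a workspace metadata directory: " + projectsDir);
            return;
        }
        for (File projectDir : projectDirs) {
            File locationFile = new File(projectDir, ".location");
            if (locationFile.isFile()) {
                // Feed this file into parseLocation(...) from the snippet above.
                System.out.println(projectDir.getName() + " -> " + locationFile);
            } else {
                // No .location file presumably means the project sits at the
                // default location directly under the workspace directory.
                System.out.println(projectDir.getName() + " -> default location");
            }
        }
    }
}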
Unfortunately this doesn't seem to be able to detect whether a project is closed or not. Any pointers on getting the closed flag would be much appreciated.
All of the introductory tutorials and docs that I can find on Hadoop have simple/contrived (word count-style) examples, where each of them is submitted to MR by:
SSHing into the JobTracker node
Making sure that a JAR file containing the MR job is on HDFS
Running a command of the form bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs> that actually runs Hadoop/MR
Either reading the MR result from the command-line or opening a text file containing the result
Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I guess I am sort of expecting that:
Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client
So, something like this (Groovy pseudo-code):
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(convertToHadoopInputs(args), new MapReduceCallback() {
            @Override
            void onResult() {
                // Now that you know the meaning of life, do nothing.
            }
        })

        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
So I ask: surely the simple examples in all the introductory tutorials, where you SSH into nodes, run Hadoop from the CLI, and open text files to view the results... surely that can't be the way Big Data companies actually integrate with Hadoop. Surely something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?
In a word, kicking off an MR job can be done using the Oozie scheduler. But before that, you write a MapReduce job. It has a driver class, which is the starting point of the job. In the driver class you give all the information needed for the job to run: the map input, the mapper class, any partitioners, configuration details, and the reducer details.
Once these are packaged in a jar file and you start a job as above (hadoop jar) using the CLI (in reality Oozie does this for you), the rest is taken care of by the Hadoop ecosystem. Hope I answered your question.
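To make the driver-class idea concrete, here is a minimal sketch of such a driver built on the org.apache.hadoop.mapreduce.Job API. The mapper and reducer are trivial placeholders, and note that Job.submit() returns immediately, so you poll for completion (or call waitForCompletion(true)) rather than receiving a push-style callback:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {

    // Trivial placeholder mapper: emits one count per input line.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private static final Text LINES = new Text("lines");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(LINES, ONE);
        }
    }

    // Trivial placeholder reducer: sums the counts.
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Cluster settings (fs.defaultFS, resource manager address, ...) are normally
        // picked up from the *-site.xml files on the classpath.
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "my-computation");
        job.setJarByClass(Driver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Asynchronous submission: submit() returns immediately.
        job.submit();

        // The core API has no push-style callback, so poll (or use waitForCompletion(true)).
        while (!job.isComplete()) {
            Thread.sleep(5000);
        }
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}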
I've got a HDFS structure something like
a/b/file1.gz
a/b/file2.gz
a/c/file3.gz
a/c/file4.gz
I'm using the classic pattern of
FileInputFormat.addInputPaths(conf, args[0]);
to set my input path for a java map reduce job.
This works fine if I specify args[0] as a/b but it fails if I specify just a (my intention being to process all 4 files)
the error being
Exception in thread "main" java.io.IOException: Not a file: hdfs://host:9000/user/hadoop/a
How do I recursively add everything under a?
I must be missing something simple...
As Eitan Illuz mentioned here, in Hadoop 2.4.0 a mapreduce.input.fileinputformat.input.dir.recursive configuration property was introduced that, when set to true, instructs the input format to include files recursively.
In Java code it looks like this:
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
Job job = Job.getInstance(conf);
// etc.
I've been using this new property and find that it works well.
EDIT: Better yet, use this new method on FileInputFormat that achieves the same result:
Job job = Job.getInstance();
FileInputFormat.setInputDirRecursive(job, true);
This is a bug in the current version of Hadoop; here is the JIRA for it. It's still in the open state. Either make the changes in the code and build the binaries yourself, or wait for it to be fixed in a coming release. Recursive processing of files can be turned on or off; check the patch attached to the JIRA for more details.