I will begin with an example. Suppose the input data is something like
User1,product1,time1
User1,product2,time2
User1,product3,time3
User2,product2,time2
User2,product4,time6
The expected output is that I insert the data into a database (Aerospike, a key-value store, in my case), formatted as
User1, [ [product1,time1],[product2,time2],[product3,time3] ]
User2, [ [product2,time2],[product4,time6] ]
So in the Mapper I output the following:
UserID, [productid,timestamp]
Please do not assume that [x,y] means I am emitting a list; I may send the data from the mapper in any form, for example wrapped in a custom object.
So on the reducer side I have the data in the format
User1, [ [product1,time1],[product2,time2],[product3,time3] ]
User2, [ [product2,time2],[product4,time6] ]
Now I can do one of two things:
a) I can write the logic that pushes this data into the database inside the reducer itself (I don't want to do this).
b) What I want instead is for the data to be written to the database when Context.write() is called.
Please help me understand how this could be done and, if possible, attach a code snippet or pseudo-code.
PS: What does Context.write() do? Where does it write to? What steps and phases does it go through?
As far as my understanding goes, invoking context.write() involves a certain number of steps.
In the driver we have to specify the output format. Let's first see what happens when we want to write to a file.
For writing to a text file we specify something like
job.setOutputFormatClass(TextOutputFormat.class);
If we look at the implementation, TextOutputFormat extends the abstract class FileOutputFormat, which in turn implements the OutputFormat contract. OutputFormat provides two key methods:
1) getRecordWriter
2) checkOutputSpecs
The output format class only declares what kind of records are written; how they are written is decided by the record writer it returns. The record writer receives just a key and a value (where the value may be a single object or a list), and its implementation contains the actual logic for how each record should be written.
Coming back to the original question: how should the record be written to a database, in my case Aerospike?
I created a custom OutputFormat, say:
public class AerospikeOutputFormat extends OutputFormat {

    // Return a new instance of the record writer
    @Override
    public RecordWriter getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        return new AerospikeRecordWriter(context.getConfiguration(), new Progressable() {
            @Override
            public void progress() {
            }
        });
    }
}
Now we have to define a custom record writer which would get a key and a value and would write the data to the database
public class RSRVRecordWriter<KK, VV> extends RecordWriter<KK, VV> {

    @Override
    public void write(KK key, VV value) throws IOException {
        // Here we can obtain an AerospikeClient instance from a singleton class and call client.put()
    }

    // close() and the rest of the class are omitted for brevity
}
The code above is just a snippet; a proper design strategy must still be applied.
PS: Aerospike provides a record writer that can be extended to match your needs at this link.
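For illustration, here is a minimal sketch of a record writer body wired to the Aerospike Java client (a simplified variant without the Progressable shown above; the namespace "test", the set "user_products", the bin name "products" and the AerospikeClientHolder singleton are all assumptions), with the format registered in the driver via job.setOutputFormatClass(AerospikeOutputFormat.class):
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class AerospikeRecordWriter extends RecordWriter<Text, Text> {

    // Hypothetical singleton holder; connection details live there.
    private final AerospikeClient client = AerospikeClientHolder.getInstance();

    @Override
    public void write(Text userId, Text productsWithTimestamps) throws IOException {
        // "test", "user_products" and "products" are placeholder names.
        Key key = new Key("test", "user_products", userId.toString());
        Bin bin = new Bin("products", productsWithTimestamps.toString());
        client.put(null, key, bin); // null = default write policy
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        // The shared client is usually closed once per JVM, not per task.
    }
}
The shared client would typically be created once per JVM and reused across tasks, since opening a connection per record would be far too slow.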
Related
I have a text file.
Now I am trying to read it into a two-dimensional array.
Does anyone have example code, or a link to a question that was already answered?
Consider the file as being divided in the middle into two records of the same format. You need to design a class that contains the fields you want to extract from this file. After that you read all the lines:
List<String> fileLines = Files.readAllLines(pathToYourFile, charset);
and parse them with the help of regular expressions. To simplify the task you can read the lines first and then apply a regexp per line.
class UnstructuredFile {
    private List<String> rawLines;

    public UnstructuredFile(List<String> rawLines) {
        this.rawLines = rawLines;
    }

    public List<FileRecord> readAllRecords() {
        // determine where each record starts and stops, e.g. list.subList(0, 5),
        // or split the lines into a List<List<String>> and call readOneRecord for each chunk
    }

    private FileRecord readOneRecord(List<String> record) {
        // parse one record from its lines
    }
}
In this class we first detect the start and end of every record, then pass each chunk to the method that parses one FileRecord from its lines.
Maybe you need to decouple your task even more. Suppose one record looks like this:
------
data 1
data 2
data 3
------
Then we create classes RecordRowOne, RecordRowTwo, etc. Every class holds the regex that knows how to parse its particular row of the record and returns the extracted values, for example:
class RecordRowOne {
    // fields

    public RecordRowOne(String regex, String dataToParse) {
        // code
    }

    int getDataOne() {
        // parse dataToParse with the regex and return the value
    }
}
Another row class in the example would have methods like
getDataTwo();
After you have created all these row classes, pass them to a FileRecord class that gathers the data from all the rows and represents one record of your file:
class FileRecord {
    // fields

    public FileRecord(RecordRowOne one, RecordRowTwo two) {
        // get all data from the rows and set it on the fields
    }

    // getters for all fields
}
That is the basic idea.
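As a minimal, self-contained sketch of the same idea (the file name, the "------" record delimiter and the "data <number>" line format are assumptions; adjust the regular expressions to your real layout):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnstructuredFileDemo {

    // Assumed line format: "data <number>"; the record delimiter is a line of dashes.
    private static final Pattern DATA_LINE = Pattern.compile("data\\s+(\\d+)");

    public static void main(String[] args) throws IOException {
        List<String> rawLines = Files.readAllLines(Paths.get("records.txt"), StandardCharsets.UTF_8);

        List<List<Integer>> records = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (String line : rawLines) {
            if (line.startsWith("---")) {            // delimiter: close the current record
                if (!current.isEmpty()) {
                    records.add(current);
                    current = new ArrayList<>();
                }
            } else {
                Matcher m = DATA_LINE.matcher(line); // parse one row of the record
                if (m.find()) {
                    current.add(Integer.parseInt(m.group(1)));
                }
            }
        }
        if (!current.isEmpty()) {
            records.add(current);
        }
        System.out.println(records); // e.g. [[1, 2, 3], [4, 5, 6]]
    }
}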
I want to know how to process and manage a tabular data stream in Java.
Consider a table of records with the scheme (name, age, zip-code, disease), where the records are read and processed tuple by tuple over time as a stream. I want to save the processed tuples with the scheme (age, zip-code, disease); the name attribute is supposed to be dropped.
For example:
read Tuple 1 (han, 25, 12548, flue) at time t1
publish Tuple 1* (25, 12548, flue)
read Tuple 2 (alex, 27, 12544, cancer) at time t2
output Tuple 2* (27, 12544, cancer)
... and so on. Can anyone help me?
Here are some suggestions for a framework you can base your final application on.
First, make classes to represent your input and output records. We'll call them InRecord and OutRecord for the sake of discussion, but you can give them whatever names make sense for you. Give them private fields to hold the necessary data and public getter/setter methods to access the data.
Second, define an interface for an input supplier; let's call it InputSupplier for this discussion. It will need setup (open()) and tear-down (close()) methods to be called at the start and end of processing, and a getNext() method that returns the next available InRecord. You'll also need to decide how it indicates end-of-input: either define that getNext() returns null when there are no more input records, or provide a hasNext() method that returns true or false to indicate whether another input record is available.
Third, define an interface for an output consumer (OutputConsumer). You'll want to have open() and close() methods, as well as an accept(OutRecord) method.
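For concreteness, a minimal sketch of these types could look like the following (the fields follow the scheme above; everything else, including the class names, is just one possible shape, and each type would normally live in its own file):
interface InputSupplier {
    void open();
    void close();
    InRecord getNext();   // returns null when no more input is available
}

interface OutputConsumer {
    void open();
    void close();
    void accept(OutRecord record);
}

class InRecord {
    private final String name;
    private final int age;
    private final String zipCode;
    private final String disease;

    public InRecord(String name, int age, String zipCode, String disease) {
        this.name = name;
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }

    public int getAge() { return age; }
    public String getZipCode() { return zipCode; }
    public String getDisease() { return disease; }
}

class OutRecord {
    private final int age;
    private final String zipCode;
    private final String disease;

    public OutRecord(int age, String zipCode, String disease) {
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }

    @Override
    public String toString() {
        return "(" + age + ", " + zipCode + ", " + disease + ")";
    }
}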
With this infrastructure in place, you can write your processing method:
public void process(InputSupplier in, OutputConsumer out) {
    in.open();
    out.open();
    InRecord inrec;
    while ((inrec = in.getNext()) != null) {
        // copy only the fields we keep; the name attribute is dropped here
        OutRecord outrec = new OutRecord(inrec.getAge(), inrec.getZipCode(), inrec.getDisease());
        out.accept(outrec);
    }
    out.close();
    in.close();
}
Finally, write some "dummy" I/O classes, one that implements InputSupplier and another that implements OutputConsumer. For test purposes, your input supplier can just return a few hand-created records and your output consumer could just print on the console the output records you send it.
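For example, the test doubles could be as simple as this (the sample data is invented for the test):
// Hands the processing loop a couple of hand-created records, then signals end-of-input with null.
class TestInput implements InputSupplier {
    private final InRecord[] records = {
        new InRecord("han", 25, "12548", "flue"),
        new InRecord("alex", 27, "12544", "cancer")
    };
    private int next = 0;

    public void open() { }
    public void close() { }

    public InRecord getNext() {
        return next < records.length ? records[next++] : null;
    }
}

// Prints every output record to the console.
class TestOutput implements OutputConsumer {
    public void open() { }
    public void close() { }

    public void accept(OutRecord record) {
        System.out.println(record);
    }
}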
Then all you need is a main method to tie it all together:
public static void main(String[] args) {
    InputSupplier in = new TestInput();    // our "dummy" input supplier class
    OutputConsumer out = new TestOutput(); // our "dummy" output consumer
    process(in, out);                      // assumes process(...) is reachable here, e.g. declared static
}
For the "real" application you'd write a "real" input supplier class, still implementing the InputSupplier interface, that can read from from a database or an Excel file or whatever input source, and an new output consumer class, still implementing the OutputConsumer interface, that can take output records and store them into whatever appropriate format. Your processing logic won't have to change, because you coded it in terms of InputSupplier and OutputConsumer interfaces. Now just tweak main a bit and you've got your final app:
public static void main(String[] args) {
    InputSupplier in = new RealInput();    // our "real" input supplier class
    OutputConsumer out = new RealOutput(); // our "real" output consumer
    process(in, out);
}
This question already has answers here: Spring Batch - more than one writer based on field value (2 answers). Closed 5 years ago.
I'm working with an application where I'm required to translate various CSV files to an SQL database. Different CSV files may contain a variable number of columns but the first three will always be consistent between files.
The files are being read in just fine, and the new database tables are created before the CSV is picked up, since I know the combination of possible values. However, when it comes to the writer (I'm using a JdbcBatchItemWriter), I need to reference the newly created table names based on the values in these columns to determine which table the data corresponds to; I won't know which row belongs to which table until I look at each row.
Some code to illustrate this:
public JdbcBatchItemWriter<Form> writer(@Value("#{jobParameters}") Map<String, Object> jobParameters) {
    JdbcBatchItemWriter<Form> writer = new JdbcBatchItemWriter<Form>();
    ...
    String parameters = getParameters();
    writer.setSql(String.format("INSERT INTO tableName VALUES (%s)", parameters));
Basically, 'tableName' needs updating for each row that is processed.
It seems that ItemSqlParameterSourceProvider and ItemPreparedStatementSetter are designed for populating values in SQL query strings, but there isn't anything I can find to get the table name in as a parameter. Because I don't have access to each item at the level of the writer definition, I can't replace the value before the prepared statement is put together.
I've considered filtering items before they get there, but it's too messy for an unknown number of tables that might need to be entered into from the CSV. Any ideas?
Write your own writer that keeps a map of writers. Every time a new table name appears, you can instantiate a new writer and store it in this map.
Instantiating a JdbcBatchItemWriter on the fly is no big deal (it does not have to be a Spring bean).
public static <T> ItemWriter<T> createDbItemWriter(DataSource ds, String sql, ItemPreparedStatementSetter<T> psSetter) {
    JdbcBatchItemWriter<T> writer = new JdbcBatchItemWriter<>();
    writer.setDataSource(ds);
    writer.setSql(sql);
    writer.setItemPreparedStatementSetter(psSetter);
    writer.afterPropertiesSet();
    return writer;
}
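A call site could then build the SQL from the table name, along the lines of the following (the column names and the MyDto getters are assumptions, and a dataSource is assumed to be at hand):
// Hypothetical: one INSERT statement per dynamically created table.
String sql = String.format("INSERT INTO %s (col_a, col_b, col_c) VALUES (?, ?, ?)", tableName);
ItemWriter<MyDto> writer = createDbItemWriter(dataSource, sql, (item, ps) -> {
    ps.setString(1, item.getColA()); // getters are placeholders for your real mapping
    ps.setString(2, item.getColB());
    ps.setString(3, item.getColC());
});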
Your writer will have to look something like this (note: this code is not tested, it is just here to give you an idea; the grouping assumes MyDto can report its target table name, e.g. via a hypothetical getTargetTableName()):
public class MyWriter implements ItemWriter<MyDto> {

    private Map<String, JdbcBatchItemWriter<MyDto>> writersMap = new HashMap<>();

    private JdbcBatchItemWriter<MyDto> getDbWriter(String tableName) {
        // computeIfAbsent creates and caches a writer the first time a table name is seen
        return writersMap.computeIfAbsent(tableName, this::createJdbcWriter);
    }

    private JdbcBatchItemWriter<MyDto> createJdbcWriter(String tableName) {
        JdbcBatchItemWriter<MyDto> writer = new JdbcBatchItemWriter<>();
        // do your configuration (data source, SQL built from tableName, statement setter)
        writer.afterPropertiesSet();
        return writer;
    }

    @Override
    public void write(List<? extends MyDto> items) throws Exception {
        // build a list for every target table name and put it in a map
        Map<String, List<MyDto>> groupedItems = new HashMap<>();
        for (MyDto item : items) {
            groupedItems.computeIfAbsent(item.getTargetTableName(), k -> new ArrayList<>()).add(item);
        }
        for (Map.Entry<String, List<MyDto>> entry : groupedItems.entrySet()) {
            getDbWriter(entry.getKey()).write(entry.getValue());
        }
    }
}
I have a custom item reader that transforms lines from a text file into my entity:
public class EntityItemReader extends AbstractItemStreamItemReader<MyEntity> {

    @Override
    public MyEntity read() {
        String line = delegate.read(); // delegate is a FlatFileItemReader-style line reader
        // analyze the line and skip it by condition
        // line.split(...)
        // create the entity from the line values
    }
}
This is similar to the FlatFileItemReader.
The MyEntity that is read will then be persisted to a DB by a JDBC item writer.
Problem: sometimes I have lines that contain values that should be skipped.
BUT when I just return null inside the read() method of the reader, then not only is this item skipped, the reading terminates completely and all further lines are skipped, because a null element is the "signal" to all Spring readers that the input is exhausted.
So: what can I do to skip specific lines by condition inside the reader if I cannot return null? Because by nature of the reader I'm forced to return an object here.
I think the good practice for filtering some lines is to do it not in the reader but in a processor (in which you can return null when you want to filter out a line).
Please see http://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html :
6.3.2 Filtering Records
One typical use for an item processor is to filter out records before they are passed to the ItemWriter. Filtering is an action distinct from skipping; skipping indicates that a record is invalid whereas filtering simply indicates that a record should not be written.
For example, consider a batch job that reads a file containing three different types of records: records to insert, records to update, and records to delete. If record deletion is not supported by the system, then we would not want to send any "delete" records to the ItemWriter. But, since these records are not actually bad records, we would want to filter them out, rather than skip. As a result, the ItemWriter would receive only "insert" and "update" records.
To filter a record, one simply returns "null" from the ItemProcessor. The framework will detect that the result is "null" and avoid adding that item to the list of records delivered to the ItemWriter. As usual, an exception thrown from the ItemProcessor will result in a skip.
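Applied to your case, a filtering processor could look roughly like this (the skip condition is a placeholder for your own logic), registered on the step with .processor(...) between the reader and the writer:
public class MyEntityFilterProcessor implements ItemProcessor<MyEntity, MyEntity> {

    @Override
    public MyEntity process(MyEntity item) {
        if (shouldSkip(item)) {  // your condition, e.g. based on a field of the entity
            return null;         // null means "filter this item"; it never reaches the writer
        }
        return item;
    }

    private boolean shouldSkip(MyEntity item) {
        return false; // placeholder for the real condition
    }
}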
I've had a similar problem in a more general case, where I was using a custom reader backed by an iterator over one object type that returns a new item (of a different type) for each object read. The problem is that some of those objects don't map to anything, so I'd like to return something that marks that.
Eventually I decided to define an INVALID_ITEM constant and return that. Another approach could be to advance the iterator inside the read() method until the next valid item, returning null if hasNext() becomes false, but that is more cumbersome.
Initially I also tried to throw a custom exception and tell Spring to skip the item on it, but it seemed to be ignored, so I gave up (and if there are too many invalid items, that isn't performant anyway).
I do not think you can have your cake and eat it too in this case (especially after reading all the comments).
My best suggestion would be (as already proposed) to throw a custom exception and skip on it.
You can maybe optimize your entity creation or other processing elsewhere so you don't lose too much performance.
Good luck.
We can handle it via a custom dummy object.
public class MyClass {

    private static MyClass DUMMY_OBJECT;

    private MyClass() {
        // create a blank object
    }

    public static MyClass getDummyMyClassObject() {
        if (DUMMY_OBJECT == null) {
            DUMMY_OBJECT = new MyClass();
        }
        return DUMMY_OBJECT;
    }
}
Then simply return the dummy whenever the reader needs to skip a record:
return MyClass.getDummyMyClassObject();
The same object can then be ignored in the processor, either by checking whether it is the blank instance or according to whatever logic the private default constructor encodes.
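Roughly, the reader and the processor would then cooperate like this (delegate, shouldSkip and parse are placeholders for your own reading and parsing logic):
// In the reader: return the shared dummy instead of null for lines that should be skipped.
public MyClass read() {
    String line = delegate.read();
    if (line == null) {
        return null;                          // real end of input
    }
    if (shouldSkip(line)) {                   // your skip condition
        return MyClass.getDummyMyClassObject();
    }
    return parse(line);
}

// In the processor: drop the dummy so it never reaches the writer.
public MyClass process(MyClass item) {
    return item == MyClass.getDummyMyClassObject() ? null : item;
}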
To skip lines you can throw an exception whenever you want a line skipped, like below.
My Spring Batch step:
@Bean
Step processStep() {
    return stepBuilderFactory.get("job step")
            .<String, String>chunk(1000)
            .reader(ItemReader)
            .writer(DataWriter)
            .faultTolerant()             // allow Spring Batch to skip lines
            .skipLimit(1000)             // skip line limit
            .skip(CustomException.class) // skip lines when this exception is thrown
            .build();
}
My Item reader
#Bean(name = "reader")
public FlatFileItemReader<String> fileItemReader() throws Exception {
FlatFileItemReader<String> reader = new FlatFileItemReader<String>();
reader.setResource(resourceLoader.getResource("c://file_location/file.txt"));
CustomLineMapper lineMapper = new CustomLineMapper();
reader.setLineMapper(lineMapper);
return reader;
}
My custom line mapper
public class CustomLineMapper implements LineMapper<String> {

    @Override
    public String mapLine(String s, int i) throws Exception {
        if (condition) // put your condition here when you want to skip lines
            throw new CustomException();
        return s;
    }
}
I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone please provide a clear idea or a Java code snippet showing how to do it? My mapper works fine, and after the shuffle the keys and their corresponding values are obtained as expected. Thanks!
What I am trying to do is output only a few records from the input file to a new file.
Thus the new output file shall contain only the required records, ignoring the rest of the irrelevant records.
This would work fine even if I don't use MultipleTextOutputFormat.
The logic I implemented in the mapper is as follows:
public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {

    StringBuilder emitValue = null;
    StringBuilder emitKey = null;
    Text kword = new Text();
    Text vword = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts;
        String line = value.toString();
        parts = line.split(" ");
        kword.set(parts[4].toString());
        vword.set(line.toString());
        context.write(kword, vword);
    }
}
Input to reduce is like this:
[key1]--> [value1, value2, ...]
[key2]--> [value1, value2, ...]
[key3]--> [value1, value2, ...] & so on
My interest is only in [key2] --> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me out with the reducer.
Using MultipleOutputs lets you emit records to multiple files, but only to a pre-defined number and type of files, not to an arbitrary number of files whose names are decided on the fly from the key/value.
You can create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Your OutputFormat class can then decide the output file name, as well as the folder, based on the key/value emitted by the reducer. This can be achieved as follows:
package oddjob.hadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {

    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * When actually writing the data, discard the key since it is already in
     * the file path.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
For more info read here.
PS: You will need to use the old mapred API to achieve this, since the newer API does not support MultipleTextOutputFormat yet. Refer this.
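For completeness, a driver wired up with the old mapred API might look roughly like this (the job name, paths and mapper/reducer registration are placeholders):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

import oddjob.hadoop.MultipleTextOutputFormatByKey;

public class MultiOutputDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultiOutputDriver.class);
        conf.setJobName("split-by-key");

        // register your mapper/reducer here, e.g. conf.setMapperClass(...) and conf.setReducerClass(...)

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // the custom format decides the output sub-directory from the key
        conf.setOutputFormat(MultipleTextOutputFormatByKey.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}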