I plan to use a custom field- and time-based partitioner to partition my data in S3 as follows:
My Partitioner works fine, everything is as expected in my S3 bucket.
The problem is the performance of the sink.
My input topic receives 400 kB/s per broker (about 1.2 MB/s in total), yet the sink works in spikes and commits only a small number of records at a time.
If I use the classic TimeBasedPartitioner, I do not see this behaviour (screenshot not preserved).
So my problem seems to be in my custom partitioner. Here is the code:
package test;
import ...;
public final class FieldAndTimeBasedPartitioner<T> extends TimeBasedPartitioner<T> {
private static final Logger log = LoggerFactory.getLogger(FieldAndTimeBasedPartitioner.class);
private static final String FIELD_SUFFIX = "part_";
private static final String FIELD_SEP = "=";
private long partitionDurationMs;
private DateTimeFormatter formatter;
private TimestampExtractor timestampExtractor;
private PartitionFieldExtractor partitionFieldExtractor;
protected void init(long partitionDurationMs, String pathFormat, Locale locale, DateTimeZone timeZone, Map<String, Object> config) {
this.delim = (String)config.get("directory.delim");
this.partitionDurationMs = partitionDurationMs;
try {
this.formatter = getDateTimeFormatter(pathFormat, timeZone).withLocale(locale);
this.timestampExtractor = this.newTimestampExtractor((String)config.get("timestamp.extractor"));
this.partitionFieldExtractor = new PartitionFieldExtractor((String)config.get("partition.field"));
} catch (IllegalArgumentException e) {
ConfigException ce = new ConfigException("path.format", pathFormat, e.getMessage());
throw ce;
private static DateTimeFormatter getDateTimeFormatter(String str, DateTimeZone timeZone) {
return DateTimeFormat.forPattern(str).withZone(timeZone);
public static long getPartition(long timeGranularityMs, long timestamp, DateTimeZone timeZone) {
long adjustedTimestamp = timeZone.convertUTCToLocal(timestamp);
long partitionedTime = adjustedTimestamp / timeGranularityMs * timeGranularityMs;
return timeZone.convertLocalToUTC(partitionedTime, false);
public String encodePartition(SinkRecord sinkRecord, long nowInMillis) {
final Long timestamp = this.timestampExtractor.extract(sinkRecord, nowInMillis);
final String partitionField = this.partitionFieldExtractor.extract(sinkRecord);
return this.encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionField);
public String encodePartition(SinkRecord sinkRecord) {
final Long timestamp = this.timestampExtractor.extract(sinkRecord);
final String partitionFieldValue = this.partitionFieldExtractor.extract(sinkRecord);
return encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionFieldValue);
private String encodedPartitionForFieldAndTime(SinkRecord sinkRecord, Long timestamp, String partitionField) {
if (timestamp == null) {
String msg = "Unable to determine timestamp using timestamp.extractor " + this.timestampExtractor.getClass().getName() + " for record: " + sinkRecord;
throw new ConnectException(msg);
} else if (partitionField == null) {
String msg = "Unable to determine partition field using partition.field '" + partitionField + "' for record: " + sinkRecord;
throw new ConnectException(msg);
} else {
DateTime recordTime = new DateTime(getPartition(this.partitionDurationMs, timestamp.longValue(), this.formatter.getZone()));
return this.FIELD_SUFFIX
+ config.get("partition.field")
+ this.FIELD_SEP
+ partitionField
+ this.delim
+ recordTime.toString(this.formatter);
static class PartitionFieldExtractor {
private final String fieldName;
PartitionFieldExtractor(String fieldName) {
this.fieldName = fieldName;
String extract(ConnectRecord<?> record) {
Object value = record.value();
if (value instanceof Struct) {
Struct struct = (Struct)value;
return (String) struct.get(fieldName);
} else {
FieldAndTimeBasedPartitioner.log.error("Value is not of Struct !");
throw new PartitionException("Error encoding partition.");
public long getPartitionDurationMs() {
return partitionDurationMs;
public TimestampExtractor getTimestampExtractor() {
return timestampExtractor;
It's more or less a merge of FieldPartitioner and TimeBasedPartitioner.
Any clue as to why I get such bad performance while sinking messages?
When partitioning by a field in the record, can deserializing the message and extracting data from it cause this issue?
As I have around 80 different field values, can it be a memory issue, since the sink will maintain 80 times more buffers in the heap?
Thanks for your help.
FYI, the problem was the partitioner itself. My partitioner needed to decode the entire message and get the info.
As I have a lot of messages, it takes time to handle all these events.
We have a gRPC server that inserts data into CockroachDB; the data comes from a Spring Boot microservice.
This is my code to persist to the CockroachDB database:
@Transactional(propagation = Propagation.REQUIRED, rollbackFor = Exception.class)
public class CockroachPersister {
private static final String X_AMZN_REQUESTID = "x-amzn-RequestId";
private static final String X_AMZN_RESPONSE = "x-amzn-Response";
private static final String PUTITEM = "PutItem";
private static final String GETITEM = "GetItem";
private static final String DELETEITEM = "DeleteItem";
private static final String UPDATEITEM = "UpdateItem";
public <T extends Message> T save(final String requestBody, final String action, final String tableName) {
T t = null;
try {
List<GRPCMapper> lGRPCMapper = ServiceMapper.getServices(action,tableName);
for (GRPCMapper grpcMapper : lGRPCMapper) {
System.out.println("grpcMapper.getClassName() ==> "+grpcMapper.getClassName());
Class<?> className = Class.forName(grpcMapper.getClassName());
Class<?> implementedClassType = Class.forName(grpcMapper.getImplementedClass());
Method userMethod = implementedClassType.getDeclaredMethod(grpcMapper.getServiceName(), className);
System.out.println("userMethod\t" + userMethod.getName());
t = (T) userMethod.invoke(null, ProtoUtil.getInstance(requestBody, grpcMapper.getProtoType()));
System.out.printf("Service => %s row(s) Inserted \n", t.getAllFields().toString());
} catch (Exception e) {
return t;
If the initial insertion fails, I would like to retry at least 3 times before logging the error. How do I implement that?
A solution that uses a message queue would also be acceptable.
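One way to get the "try 3 times, then log" behaviour is a small retry wrapper around the call to save(...). The following is only a minimal sketch under assumptions: the names RetryDemo and withRetry and the linear back-off are made up for illustration and are not part of the code above. The failing insert is simulated by a lambda. Spring Retry offers a similar facility through its @Retryable annotation, if adding that dependency is an option.

```java
import java.util.concurrent.Callable;

final class RetryDemo {

    /**
     * Runs the action up to maxAttempts times and returns the first successful result.
     * If every attempt fails, the last exception is rethrown so the caller can log it once.
     */
    static <T> T withRetry(int maxAttempts, long backoffMillis, Callable<T> action) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Simple linear back-off between attempts (an assumption, tune as needed).
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated insert: fails twice, succeeds on the third attempt.
        String result = withRetry(3, 10, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure " + calls[0]);
            }
            return "inserted";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that for the retry to work, the exception has to reach the wrapper: the catch block in save(...) above would need to rethrow instead of returning. With @Transactional in play, the retry should also wrap the transactional call from the outside, so that each attempt runs in a fresh transaction.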
This is the service class. I am creating an XML file by reading values from the database. The code uses three more POJO classes: MT700, Header and SwiftDetails. MT700 is the main class and holds the Header and the SwiftDetails. The problem is that only one record is ever stored: no matter how many rows of data I have, the generated file contains only one header and one swift details element. How can I make this work in a loop? I think I have to use a list, but I am not sure how to use it here.
public void generateEliteExtracts(int trdCustomerKy, Date lastRunDate, Date currentDate) throws TradeException {
FileOutputStream fout = null;
try {
MT700 mt700 = getMT700(trdCustomerKy,lastRunDate,currentDate);
if (null != mt700){
StringBuffer fileName = new StringBuffer(1024);
smLog.debug("Generated Extract for BankRef" + fileName.toString());
mTracer.log("Generated Extract for BankRef" + fileName.toString());
File xmlFile = new File(fileName.toString());
fout = new FileOutputStream(xmlFile);
JAXBContext jaxbContext = JAXBContext.newInstance(MT700.class);
Marshaller marshaller = jaxbContext.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_ENCODING, ENCODING_ASCII);
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
marshaller.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.FALSE);
marshaller.setProperty("com.sun.xml.internal.bind.xmlDeclaration", Boolean.FALSE);
marshaller.marshal(mt700, fout);
} catch (Exception ex) {
smLog.error("Caught unexpected error while creating extracts. ", ex);
throw new TradeException("Caught unexpected error while creating extracts.", ex);
} finally
private MT700 getMT700(int trdCustomerKy, Date lastRunDate, Date currentDate) throws TradeException {
MT700 mt700 = new MT700();
AbInBevEliteExtractDAO dao = new AbInBevEliteExtractDAO(mConnection);
CompanyCodesHelper ccHelper = new CompanyCodesHelper(mConnection);
String cifCodes = ccHelper.getDescription(trdCustomerKy, "CIF Codes", "CIF Codes");
if (false == TradeUtil.isStringNull(cifCodes)) {
mTracer.log("Fetching records for CIFs: " + StringUtils.replace(cifCodes, PIPE, COMMA));
String[] codes = StringUtils.split(cifCodes, PIPE);
List<ExportAdvicesData> exportList = dao.getExportAdvices(trdCustomerKy, lastRunDate, currentDate, codes);
for (int i = 0; i < exportList.size(); i++) {
ExportAdvicesData exportData = exportList.get(i);
if ("XXLC".equalsIgnoreCase(exportData.getDocAcronym())) {
Header header = new Header();
header.setDocumentDate(DateUtil.formatDate(new Date(), DATE_FORMAT_YYYY_MM_DD_HHMMSS));
header.setBankId("BOA" + StringUtils.substring(exportData.getCustRef(), 0, 4));
SwiftDetails swiftTest = new SwiftDetails();
SwiftParserBankDocs parser = new SwiftParserBankDocs(exportData.getDocumentContent());
String bankRef = parser.getTagValue("21");
String custRef = parser.getTagValue("20");
if (TradeUtil.isStringNull(bankRef)) {
} else {
String issueDate = parser.getTagValue("31C");
swiftTest.setTAG_40E("UCP LATEST VERSION");
String datePlaceOfExpiry = parser.getTagValue("31D");
if (false == TradeUtil.isStringNull(exportData.getPositiveTolerance())) {
exportData.getPositiveTolerance() + "/" + exportData.getPositiveTolerance());
} else {
String tag42A = parser.getTagValue("42A");
if (TradeUtil.isStringNull(tag42A)) {
if (!(TradeUtil.isStringNull(parser.getTagValue("44A")))) {
if (!(TradeUtil.isStringNull(parser.getTagValue("44B")))) {
if (!(TradeUtil.isStringNull(parser.getTagValue("44E")))) {
if (!(TradeUtil.isStringNull(parser.getTagValue("44F")))) {
Date latestShipDate = exportData.getLatestShipDate();
if (null != latestShipDate) {
swiftTest.setTAG_44C(DateUtil.formatDate(latestShipDate, DATE_FORMAT_YYMMDD));
} else {
swiftTest.setTAG_45A(parser.getTagValue("45") + BLANK_STRING + parser.getTagValue("45A")
+ BLANK_STRING + parser.getTagValue("45B"));
swiftTest.setTAG_46A(parser.getTagValue("46") + BLANK_STRING + parser.getTagValue("46A")
+ BLANK_STRING + parser.getTagValue("46B"));
swiftTest.setTAG_47A(parser.getTagValue("47") + BLANK_STRING + parser.getTagValue("47A")
+ BLANK_STRING + parser.getTagValue("47B"));
String issuingBank = parser.getAddress(SwiftParserBankDocs.ISSUING_BANK);
if (TradeUtil.isStringNull(issuingBank)) {
String errorMsg = "Issuing Bank address not found in bankdoc text, SWIFT content is possibly invalid, skipped processed record: "
+ exportData.getCustRef();
mTracer.log("ERROR: " + errorMsg);
issuingBank = StringUtils.replace(issuingBank, CRLF, BLANK_STRING + CRLF);
if (parser.is710Advice()) {
} else if ("XAMD".equalsIgnoreCase(exportData.getDocAcronym())) {
Header header = new Header();
header.setDocumentDate(DateUtil.formatDate(new Date(), DATE_FORMAT_YYYY_MM_DD_HHMMSS));
header.setBankId("BOA" + StringUtils.substring(exportData.getCustRef(), 0, 4));
SwiftDetails swift = new SwiftDetails();
SwiftParserBankDocs parser = new SwiftParserBankDocs(exportData.getDocumentContent());
String custRef = parser.getTagValue("20");
String bankRef = parser.getTagValue("23");
if (TradeUtil.isStringNull(bankRef)) {
} else {
String issuingBank = parser.getAddress(SwiftParserBankDocs.ISSUING_BANK);
if (TradeUtil.isStringNull(issuingBank)) {
String errorMsg = "Issuing Bank address not found in bankdoc text, SWIFT content is possibly invalid, skipped processed record: "
+ exportData.getCustRef();
mTracer.log("ERROR: " + errorMsg);
} else {
issuingBank = StringUtils.replace(issuingBank, CRLF, BLANK_STRING + CRLF);
return mt700;
This is the MT700 POJO class. It references the Header and SwiftDetails POJO classes.
@XmlRootElement(name = "MT700")
public class MT700 implements Serializable
/** serialVersionUID */
private static final long serialVersionUID = 1L;
private Header header;
private SwiftDetails swift700;
private String version = "1.0";
public Header getHeader()
return header;
@XmlElement(name = "Header")
public void setHeader(Header header)
this.header = header;
/** @return the swift700 */
public SwiftDetails getSwift700()
return swift700;
@XmlElement(name = "Swift_Details_700")
public void setSwift700(SwiftDetails swift700)
this.swift700 = swift700;
public String getVersion()
return version;
@XmlAttribute(name = "Version")
public void setVersion(String version)
this.version = version;
This is the Header class. There is a similar class that holds the tags, which is SwiftDetails.
@XmlRootElement(name = "Header")
@XmlType(propOrder = { "documentType", "messageType", "versionNo",
"revisionNo", "documentDate", "bankId", "custId", "custRefNo",
"bankRefNo" })
public class Header implements Serializable
private static final long serialVersionUID = 1L;
private String documentType;
private String messageType;
private String versionNo;
private String revisionNo;
private String documentDate;
private String bankId;
private String custId;
private String custRefNo;
private String bankRefNo;
I am not adding the getters and setters for this class, to keep the post simple.
You are creating one MT700 instance and then in this loop, you are reassigning the header and swift fields each time through the loop:
MT700 mt700 = new MT700();
for (int i = 0; i < exportList.size(); i++) {
This means that the document you are outputting contains just the last header/swift returned from the database query.
You need to make one or more of these three into a list of some sort. Either your MT700 contains a list of headers and swifts, or more likely you want to have a list of MT700s each with one header and one swift.
In other words, you want to have a fourth type that will be the actual root of your XML document. That element will contain one MT700 element for each row found by the query. Each MT700 element will have a header element and a swift element.
So, more specifically, here is what you want to do:
class MT700s {
@XmlElement(name = "MT700")
private List<MT700> mt700s = new ArrayList<>();
public List<MT700> getMT700s() { return mt700s; }
// Etc.
MT700s mt700s = new MT700s();
for (int i = 0; i < exportList.size(); i++) {
MT700 mt700 = new MT700();
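To make the difference concrete, here is a minimal sketch without any JAXB annotations. Record and Records are hypothetical, simplified stand-ins for MT700 and the new root type, and the row values are made up. It only shows why reusing one instance keeps the last row, while creating one instance per row and adding it to a list keeps all of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OneObjectPerRowDemo {

    /** Simplified stand-in for MT700: one header and one swift value. */
    static final class Record {
        String header;
        String swift;
    }

    /** Simplified stand-in for the new root element holding many records. */
    static final class Records {
        final List<Record> items = new ArrayList<>();
    }

    /** The pattern from the question: a single instance is overwritten on every iteration. */
    static Record reuseSingleInstance(List<String> rows) {
        Record record = new Record();
        for (String row : rows) {
            record.header = "H-" + row;
            record.swift = "S-" + row;
        }
        return record; // only the values of the last row survive
    }

    /** The suggested pattern: a new instance per row, collected in a list. */
    static Records onePerRow(List<String> rows) {
        Records all = new Records();
        for (String row : rows) {
            Record record = new Record();
            record.header = "H-" + row;
            record.swift = "S-" + row;
            all.items.add(record);
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c");
        System.out.println("reused instance keeps: " + reuseSingleInstance(rows).header);
        System.out.println("list keeps " + onePerRow(rows).items.size() + " records");
    }
}
```

In the real code, the wrapper instance (not a single MT700) would then be what gets passed to the marshaller.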
In my play-framework-based web application users can download all the rows of different database tables in csv or json format. Tables are relatively large (100k+ rows) and I am trying to stream back the result using chunking in Play 2.2.
However, although println statements show that the rows are written to the Chunks.Out object, they do not show up on the client side! If I limit the number of rows sent back it works, but there is a long delay at the beginning, which grows if I try to send back all the rows and ends in a time-out or in the server running out of memory.
I use Ebean ORM and the tables are indexed and querying from psql doesn't take much time. Does anyone have any idea what might be the problem?
I appreciate your help a lot!
Here is the code for one of the controllers:
public static Result showEpex() {
User user = getUser();
if(user == null || user.getRole() == null)
return ok(views.html.profile.render(user, Application.NOT_CONFIRMED_MSG));
DynamicForm form = DynamicForm.form().bindFromRequest();
final UserRequest req = UserRequest.getRequest(form);
if(req.getFormat().equalsIgnoreCase("html")) {
Page<EpexEntry> page = EpexEntry.page(req.getStart(), req.getFinish(), req.getPage());
return ok(views.html.epex.render(page, req));
// otherwise chunk result and send back
final ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
Chunks<String> chunks = new StringChunks() {
public void onReady(play.mvc.Results.Chunks.Out<String> out) {
Page<EpexEntry> page = EpexEntry.page(req.getStart(), req.getFinish(), 0);
ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
streamer.stream(out, page, req);
return ok(chunks).as("text/plain");
And the streamer:
public class ResultStreamer<T extends Entry> {
private static ALogger logger = Logger.of(ResultStreamer.class);
public void stream(Out<String> out, Page<T> page, UserRequest req) {
if(req.getFormat().equalsIgnoreCase("json")) {
JsonContext context = Ebean.createJsonContext();
for(T e: page.getList())
out.write(context.toJsonString(e) + ", ");
while(page.hasNext()) {
page = page.next();
for(T e: page.getList())
out.write(context.toJsonString(e) + ", ");
} else if(req.getFormat().equalsIgnoreCase("csv")) {
for(T e: page.getList())
out.write(e.toCsv(CSV_SEPARATOR) + "\n");
while(page.hasNext()) {
page = page.next();
for(T e: page.getList())
out.write(e.toCsv(CSV_SEPARATOR) + "\n");
}else {
out.write("Invalid format! Only CSV, JSON and HTML can be generated!");
public static final String CSV_SEPARATOR = ";";
And the model:
public class EpexEntry extends Model implements Entry {
@Column(columnDefinition = "pg-uuid")
private UUID id;
private DateTime start;
private DateTime finish;
private String contract;
private String market;
private Double low;
private Double high;
private Double last;
private Double weightAverage;
private Double index;
private Double buyVol;
private Double sellVol;
private static final String START_COL = "start";
private static final String FINISH_COL = "finish";
private static final String CONTRACT_COL = "contract";
private static final String MARKET_COL = "market";
private static final String ORDER_BY = MARKET_COL + "," + CONTRACT_COL + "," + START_COL;
public static final int PAGE_SIZE = 100;
public static final String HOURLY_CONTRACT = "hourly";
public static final String MIN15_CONTRACT = "15min";
public static final String FRANCE_MARKET = "france";
public static final String GER_AUS_MARKET = "germany/austria";
public static final String SWISS_MARKET = "switzerland";
public static Finder<UUID, EpexEntry> find =
new Finder(UUID.class, EpexEntry.class);
public EpexEntry() {
public EpexEntry(UUID id, DateTime start, DateTime finish, String contract,
String market, Double low, Double high, Double last,
Double weightAverage, Double index, Double buyVol, Double sellVol) {
this.id = id;
this.start = start;
this.finish = finish;
this.contract = contract;
this.market = market;
this.low = low;
this.high = high;
this.last = last;
this.weightAverage = weightAverage;
this.index = index;
this.buyVol = buyVol;
this.sellVol = sellVol;
public static Page<EpexEntry> page(DateTime from, DateTime to, int page) {
if(from == null && to == null)
return find.order(ORDER_BY).findPagingList(PAGE_SIZE).getPage(page);
ExpressionList<EpexEntry> exp = find.where();
if(from != null)
exp = exp.ge(START_COL, from);
if(to != null)
exp = exp.le(FINISH_COL, to.plusHours(24));
return exp.order(ORDER_BY).findPagingList(PAGE_SIZE).getPage(page);
public String toCsv(String s) {
return id + s + start + s + finish + s + contract +
s + market + s + low + s + high + s +
last + s + weightAverage + s +
index + s + buyVol + s + sellVol;
1. Most browsers wait for 1-5 kB of data before showing any results. You can check whether Play Framework actually sends data with the command curl http://localhost:9000.
2. You create the streamer twice; remove the first final ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
3. You use the Page class to retrieve a large data set, which is incorrect. You actually issue one big initial request and then one more request per iteration. This is SLOW. Use a simple findIterate() instead.
Add this to EpexEntry (feel free to change it as you need):
public static QueryIterator<EpexEntry> all() {
return find.order(ORDER_BY).findIterate();
Your new stream method implementation:
public void stream(Out<String> out, QueryIterator<T> iterator, UserRequest req) {
if(req.getFormat().equalsIgnoreCase("json")) {
JsonContext context = Ebean.createJsonContext();
while (iterator.hasNext()) {
out.write(context.toJsonString(iterator.next()) + ", ");
iterator.close(); // it's important to close the iterator
} else // csv implementation here
And your onReady method:
QueryIterator<EpexEntry> iterator = EpexEntry.all();
ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
streamer.stream(new BuffOut(out, 10000), iterator, req); // notice buffering here
4. Another problem is that you call Out<String>.write() too often. A call to write() means that the server has to send a new chunk of data to the client immediately, and every call to Out<String>.write() has significant overhead.
The overhead appears because the server has to wrap each message in the chunked response format, which costs 6-7 bytes per message. Since you send small messages, the overhead is significant.
Also, the server has to wrap your reply in a TCP packet whose size will be far smaller than optimal.
And the server has to perform some internal work to send a chunk, which also requires resources. As a result, download bandwidth will be far from optimal.
Here is a simple test: send 10000 lines of text, TEST0 to TEST9999, in chunks. This takes 3 s on average on my computer, but with buffering it takes 65 ms. The download sizes are 136 kB and 87.5 kB respectively.
Example with buffering:
public class Application extends Controller {
public static Result showEpex() {
Chunks<String> chunks = new StringChunks() {
public void onReady(play.mvc.Results.Chunks.Out<String> out) {
new ResultStreamer().stream(out);
return ok(chunks).as("text/plain");
The new BuffOut class (it's dumb, I know):
public class BuffOut {
private StringBuilder sb;
private Out<String> dst;
public BuffOut(Out<String> dst, int bufSize) {
this.dst = dst;
this.sb = new StringBuilder(bufSize);
public void write(String data) {
if ((sb.length() + data.length()) > sb.capacity()) {
public void close() {
if (sb.length() > 0)
This implementation has a 3-second download time and a 136 kB download size:
public class ResultStreamer {
public void stream(Out<String> out) {
for (int i = 0; i < 10000; i++) {
out.write("TEST" + i + "\n");
This implementation has a 65 ms download time and an 87.5 kB download size:
public class ResultStreamer {
public void stream(Out<String> out) {
BuffOut out2 = new BuffOut(out, 1000);
for (int i = 0; i < 10000; i++) {
out2.write("TEST" + i + "\n");
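The size difference in the test above can be reproduced without Play. The sketch below is an illustration under assumptions, not the measured setup: BufferedOut is a stand-alone buffer in the spirit of BuffOut, a plain Consumer<String> stands in for Out<String>, and chunkedSize counts only the HTTP/1.1 chunk framing (hex length, CRLF, payload, CRLF), ignoring headers and the terminating chunk. For lines as short as these the length field is a single hex digit, so the framing costs 5 bytes per chunk.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class BufferedChunkDemo {

    /** Collects small writes and forwards them to the sink in larger pieces. */
    static final class BufferedOut {
        private final Consumer<String> sink; // hypothetical stand-in for Out<String>
        private final int limit;
        private final StringBuilder sb = new StringBuilder();

        BufferedOut(Consumer<String> sink, int limit) {
            this.sink = sink;
            this.limit = limit;
        }

        void write(String data) {
            // Flush first if appending would push the buffer past the limit.
            if (sb.length() > 0 && sb.length() + data.length() > limit) {
                flush();
            }
            sb.append(data);
        }

        void close() {
            flush(); // send whatever is still buffered
        }

        private void flush() {
            if (sb.length() > 0) {
                sink.accept(sb.toString());
                sb.setLength(0);
            }
        }
    }

    /** Bytes on the wire for one chunk: hex length + CRLF + payload + CRLF. */
    static long chunkedSize(String payload) {
        int n = payload.getBytes(StandardCharsets.UTF_8).length;
        return Integer.toHexString(n).length() + 2 + n + 2;
    }

    /** Returns {number of chunks, bytes on the wire}; bufferLimit <= 0 disables buffering. */
    static long[] run(int bufferLimit) {
        long[] stats = new long[2];
        Consumer<String> wire = s -> {
            stats[0]++;
            stats[1] += chunkedSize(s);
        };
        BufferedOut buffered = bufferLimit > 0 ? new BufferedOut(wire, bufferLimit) : null;
        for (int i = 0; i < 10000; i++) {
            String line = "TEST" + i + "\n";
            if (buffered != null) {
                buffered.write(line);
            } else {
                wire.accept(line);
            }
        }
        if (buffered != null) {
            buffered.close();
        }
        return stats;
    }

    public static void main(String[] args) {
        long[] direct = run(0);
        long[] buffered = run(1000);
        System.out.println("unbuffered: " + direct[0] + " chunks, " + direct[1] + " bytes");
        System.out.println("buffered:   " + buffered[0] + " chunks, " + buffered[1] + " bytes");
    }
}
```

The payload itself is 88890 bytes, so the unbuffered run comes to 138890 bytes, which matches the 136 kB figure above; the buffered run adds only a few hundred bytes of framing on top of the payload.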
I have been working with Hadoop for a short time and am trying to implement a join in Java. It doesn't matter whether it is map-side or reduce-side; I chose a reduce-side join since it was supposed to be easier to implement. I know that Java is not the best choice for joins, aggregations etc. and that Hive or Pig would be a better pick, which I have already used. However, I'm working on a research project and have to use all three languages in order to deliver a comparison.
Anyway, I have two input files with different structures. One is key|value and the other one is key|value1;value2;value3;value4. One record from each input file looks like the following:
Input1: 1;2010-01-10T00:00:01
Input2: 1;23;Blue;2010-01-11T00:00:01;9999-12-31T23:59:59
I followed the example in the book Hadoop: The Definitive Guide, but it didn't work for me. I'm posting my Java files here, so you can see what's wrong.
public class LookupReducer extends Reducer<TextPair,Text,Text,Text> {
private String result = "";
private String msisdn;
private String attribute, product;
private long trans_dt_long, start_dt_long, end_dt_long;
private String trans_dt, start_dt, end_dt;
public void reduce(TextPair key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//value without key to remember
Iterator<Text> iter = values.iterator();
for (Text val : values) {
Text recordNoKey = val; //new Text(iter.next());
String valSplitted[] = recordNoKey.toString().split(";");
//if the input is coming from CDR set corresponding values
trans_dt = recordNoKey.toString();
trans_dt_long = dateToLong(recordNoKey.toString());
//if the input is coming from Attributes set corresponding values
else if(key.getSecond().toString().equals(Attribute.ATT_TAG))
attribute = valSplitted[0];
product = valSplitted[1];
start_dt = valSplitted[2];
start_dt_long = dateToLong(valSplitted[2]);
end_dt = valSplitted[3];
end_dt_long = dateToLong(valSplitted[3]);;
Text record = val; //iter.next();
//System.out.println("RECORD: " + record);
Text outValue = new Text(recordNoKey.toString() + ";" + record.toString());
if(start_dt_long < trans_dt_long && trans_dt_long < end_dt_long)
//concat output columns
result = attribute + ";" + product + ";" + trans_dt;
context.write(key.getFirst(), new Text(result));
System.out.println("KEY: " + key);
private static long dateToLong(String date){
DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); // the 'T' literal matches input such as 2010-01-10T00:00:01
Date parsedDate = null;
try {
parsedDate = formatter.parse(date);
} catch (ParseException e) {
// TODO Auto-generated catch block
long dateInLong = parsedDate.getTime();
return dateInLong;
public static class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair(){
set(new Text(), new Text());
public TextPair(String first, String second){
set(new Text(first), new Text(second));
public TextPair(Text first, Text second){
set(first, second);
public void set(Text first, Text second){
this.first = first;
this.second = second;
public Text getFirst() {
return first;
public void setFirst(Text first) {
this.first = first;
public Text getSecond() {
return second;
public void setSecond(Text second) {
this.second = second;
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
public int hashCode(){
return first.hashCode() * 163 + second.hashCode();
public boolean equals(Object o){
if(o instanceof TextPair)
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
return false;
public String toString(){
return first + ";" + second;
public int compareTo(TextPair tp) {
// TODO Auto-generated method stub
int cmp = first.compareTo(tp.first);
if(cmp != 0)
return cmp;
return second.compareTo(tp.second);
public static class FirstComparator extends WritableComparator {
protected FirstComparator(){
super(TextPair.class, true);
public int compare(WritableComparable comp1, WritableComparable comp2){
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
int cmp = pair1.getFirst().compareTo(pair2.getFirst());
if(cmp != 0)
return cmp;
return -pair1.getSecond().compareTo(pair2.getSecond());
public static class GroupComparator extends WritableComparator {
protected GroupComparator()
super(TextPair.class, true);
public int compare(WritableComparable comp1, WritableComparable comp2)
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
return pair1.compareTo(pair2);
public class Joiner extends Configured implements Tool {
public static final String DATA_SEPERATOR =";"; //Define the symbol for seperating the output data
public static final String NUMBER_OF_REDUCER = "1"; //Define the number of the used reducer jobs
public static final String COMPRESS_MAP_OUTPUT = "true"; //if the output from the mapping process should be compressed, set COMPRESS_MAP_OUTPUT = "true" (if not set it to "false")
public static final String
USED_COMPRESSION_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"; //set the used codec for the data compression
public static final boolean JOB_RUNNING_LOCAL = true; //if you run the Hadoop job on your local machine, you have to set JOB_RUNNING_LOCAL = true
//if you run the Hadoop job on the Telefonica Cloud, you have to set JOB_RUNNING_LOCAL = false
public static final String OUTPUT_PATH = "/home/hduser"; //set the folder, where the output is saved. Only needed, if JOB_RUNNING_LOCAL = false
public static class KeyPartitioner extends Partitioner<TextPair, Text> {
public int getPartition(/*[*/TextPair key/*]*/, Text value, int numPartitions) {
System.out.println("numPartitions: " + numPartitions);
return (/*[*/key.getFirst().hashCode()/*]*/ & Integer.MAX_VALUE) % numPartitions;
private static Configuration hadoopconfig() {
Configuration conf = new Configuration();
conf.set("mapred.textoutputformat.separator", DATA_SEPERATOR);
conf.set("mapred.compress.map.output", COMPRESS_MAP_OUTPUT);
//conf.set("mapred.map.output.compression.codec", USED_COMPRESSION_CODEC);
conf.set("mapred.reduce.tasks", NUMBER_OF_REDUCER);
return conf;
public int run(String[] args) throws Exception {
// TODO Auto-generated method stub
if ((args.length != 3) && (JOB_RUNNING_LOCAL)) {
System.err.println("Usage: Lookup <CDR-inputPath> <Attribute-inputPath> <outputPath>");
//starting the Hadoop job
Configuration conf = hadoopconfig();
Job job = new Job(conf, "Join cdrs and attributes");
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CDRMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, AttributeMapper.class);
//FileInputFormat.addInputPath(job, new Path(otherArgs[0])); //expecting a folder instead of a file
FileOutputFormat.setOutputPath(job, new Path(args[2]));
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
return job.waitForCompletion(true) ? 0 : 1;
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Joiner(), args);
public class Attribute {
public static final String ATT_TAG = "1";
public static class AttributeMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
//private Object output = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] attributes = value.toString().split(";");
String valuesInString = "";
if(attributes.length != 5)
System.err.println("Input column number not correct. Expected 5, provided " + attributes.length
+ "\n" + "Check the input file");
if(attributes.length == 5)
//setting the values with the input values read above
valuesInString = attributes[1] + ";" + attributes[2] + ";" + attributes[3] + ";" + attributes[4];
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(attributes[0])), new Text(ATT_TAG)), values);
public class CDR {
public static final String CDR_TAG = "0";
public static class CDRMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
private Object output = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] cdr = value.toString().split(";");
//setting the values with the input values read above
//output = CDR_TAG + cdr[1];
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(cdr[0])), new Text(CDR_TAG)), values);
I would be glad if anyone could at least post a link for a tutorial or a simple example where such a join functionality is implemented. I searched a lot, but either the code was not complete or there was not enough explanation.
To be honest, I have no idea what your code is trying to do, but that's probably because I'd do it in a different way and am not familiar with the APIs you're using.
I would start from scratch as follows:
Create a mapper to read one of the files. For each line read, write a key value pair to the context. The key is a Text created from the key and the value is another Text created by concatenating a "1" with the entire input record.
Create another mapper for the other file. This mapper acts just like the first mapper, but the value is a Text created by concatenating a "2" with the entire input record.
Write a reducer to do the join. The reduce() method will get all records written for a specific key. You can tell which input file (and therefore the data format for the record) by looking to see whether the value starts with a "1" or a "2". Once you know whether or not you have one, the other or both record types, you can write whatever logic you need to merge the data from the one or two records.
By the way, you use the MultipleInputs class to configure more than one mapper in your job/driver class.
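The tagging approach described above can be tried out in plain Java before wiring it into Hadoop. The following is a minimal sketch under assumptions: there are no Hadoop classes involved, a TreeMap stands in for the shuffle phase, the names TaggedJoinDemo, emit, reduce and join are made up, every line is assumed to contain at least one semicolon, and the reducer performs an inner join that separates the two records with a pipe character. The sample records are the ones from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TaggedJoinDemo {

    /** Mapper step: key is the first field, value is the tag followed by the whole record. */
    static void emit(Map<String, List<String>> shuffle, String tag, String line) {
        String key = line.substring(0, line.indexOf(';'));
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(tag + line);
    }

    /** Reducer step: split the values of one key by tag, then pair them up (inner join). */
    static List<String> reduce(List<String> values) {
        List<String> fromFirst = new ArrayList<>();
        List<String> fromSecond = new ArrayList<>();
        for (String value : values) {
            String record = value.substring(1); // strip the one-character tag
            if (value.charAt(0) == '1') {
                fromFirst.add(record);
            } else {
                fromSecond.add(record);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String left : fromFirst) {
            for (String right : fromSecond) {
                joined.add(left + "|" + right);
            }
        }
        return joined;
    }

    /** Simulates the whole job: two mappers, a grouping by key, one reducer call per key. */
    static List<String> join(List<String> file1, List<String> file2) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String line : file1) {
            emit(shuffle, "1", line);
        }
        for (String line : file2) {
            emit(shuffle, "2", line);
        }
        List<String> output = new ArrayList<>();
        for (List<String> values : shuffle.values()) {
            output.addAll(reduce(values));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> cdrs = Arrays.asList("1;2010-01-10T00:00:01");
        List<String> attributes = Arrays.asList("1;23;Blue;2010-01-11T00:00:01;9999-12-31T23:59:59");
        join(cdrs, attributes).forEach(System.out::println);
    }
}
```

The reducer here buffers all values for a key in memory, which is fine for a sketch but may need rethinking for keys with very many records. Any date-range filtering would go inside the nested loop.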
I'm using java.util.ResourceBundle to format my JSTL messages and this works fine:
I use the MessageFormat class. Now I want to encapsulate this in a method that is just getParametrizedMessage(String key, String[] parameters), but I'm not sure how to do it. At the moment it takes quite a lot of work to display just one or two messages with parameters:
UserMessage um = null;
ResourceBundle messages = ResourceBundle.getBundle("messages");
String str = messages.getString("PF1");
Object[] messageArguments = new String[]{nyreg.getNummer()};
MessageFormat formatter = new MessageFormat("");
String outputPI14 = formatter.format(messageArguments);
String outputPI15 = formatter.format(messageArguments)
if(checkIfPCTExistInDB && nyreg.isExistInDB()) {
//um = new ExtendedUserMessage(MessageHandler.getParameterizedMessage("PI15", new String[]{nyreg.getNummer()}) , UserMessage.TYPE_INFORMATION, "Info");
um = new ExtendedUserMessage(outputPI15 , UserMessage.TYPE_INFORMATION, "Info");
…and so on. Can I move this logic to a static method MessageHandler.getParameterizedMessage? It currently does not work and looks like this:
private final static String dictionaryFileName="messages.properties";
public static String getParameterizedMessage(String key, String [] params){
if (dictionary==null){
return getParameterizedMessage(dictionary,key,params);
private static void loadDictionary(){
String fileName = dictionaryFileName;
try {
dictionary=new Properties();
InputStream fileInput = MessageHandler.class.getClassLoader().getResourceAsStream(fileName);
catch(Exception e) {
System.err.println("Exception reading propertiesfile in init "+e);
How can I make using my parameterized messages as easy as calling a method with a key and parameters?
Thanks for any help
The logic comes from an inherited method in the abstract class that this class extends. The method looks like this:
protected static String getParameterizedMessage(Properties dictionary,String key,String []params){
if (dictionary==null){
return "ERROR";
String msg = dictionary.getProperty(key);
if (msg==null){
return "?!Meddelande " +key + " saknas!?";
if (params==null){
return msg;
StringBuffer buff = new StringBuffer(msg);
for (int i=0;i<params.length;i++){
String placeHolder = "<<"+(i+1)+">>";
if (buff.indexOf(placeHolder)!=-1){
else {
return buff.toString();
I think I must rewrite the above method in order to make it work like a resourcebundle rather than just a dictionary.
Update 2
The code that seems to work is here
public static String getParameterizedMessage(String key, Object [] params){
ResourceBundle messages = ResourceBundle.getBundle("messages");
MessageFormat formatter = new MessageFormat("");
return formatter.format(params);
I'm not really sure what you're trying to achieve, but here's what I did in the past:
public static final String localize(final Locale locale, final String key, final Object... param) {
final String name = "message";
final ResourceBundle rb;
/* Resource bundles are cached internally,
never saw a need to implement another caching level */
try {
rb = ResourceBundle.getBundle(name, locale, Thread.currentThread().getContextClassLoader());
} catch (MissingResourceException e) {
throw new RuntimeException("Bundle not found:" + name);
String keyValue = null;
try {
keyValue = rb.getString(key);
} catch (MissingResourceException e) {
// LOG.severe("Key not found: " + key);
keyValue = "???" + key + "???";
/* Message formating is expensive, try to avoid it */
if (param != null && param.length > 0) {
return MessageFormat.format(keyValue, param);
} else {
return keyValue;
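For reference, here is a self-contained sketch of the same idea that runs without a properties file. Everything specific in it is an assumption for illustration: DemoBundle is an in-memory ListResourceBundle standing in for messages.properties, and the message text for the key PI15 is made up. In a real application the bundle would come from ResourceBundle.getBundle("messages") instead.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

final class MessagesDemo {

    /** In-memory bundle so the sketch runs without a messages.properties file. */
    static final class DemoBundle extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"PI15", "Number {0} already exists."} // hypothetical message text
            };
        }
    }

    /** Looks up the pattern for the key and fills in the {0}, {1}, ... placeholders. */
    static String getParameterizedMessage(ResourceBundle bundle, String key, Object... params) {
        String pattern;
        try {
            pattern = bundle.getString(key);
        } catch (MissingResourceException e) {
            return "???" + key + "???"; // make missing keys visible instead of failing
        }
        if (params == null || params.length == 0) {
            return pattern; // nothing to substitute, skip formatting
        }
        return new MessageFormat(pattern, Locale.ROOT).format(params);
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new DemoBundle();
        System.out.println(getParameterizedMessage(bundle, "PI15", "42"));
        System.out.println(getParameterizedMessage(bundle, "missing"));
    }
}
```

A call site then shrinks to a single line with the key and its parameters. Keep in mind that MessageFormat treats single quotes in patterns specially, so apostrophes in message texts need to be doubled.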