I'm trying to create a topology that has one spout that emits tweets and two bolts:
a TweetParserBolt that collects tweets
and a UserParserBolt that collects the tweeters' usernames.
Suppose I've created a third bolt that anchors the TweetParserBolt and the UserParserBolt so that it can map each tweeter's username to a list of tweets that he or she has already posted. The problem I've encountered is that this bolt returns a null list of tweets.
Can anyone please help me understand what's wrong with the code?
Below is my code for the topology and the three bolts:
public class TwitterTopology {
private static String consumerKey = "*********************";
private static String consumerSecret = "*****************";
private static String accessToken = "********************";
private static String accessTokenSecret = "****************";
public static void main(String [] args) throws Exception{
/*** SETUP ***/
String remoteClusterTopologyName = null;
if (args!=null) {
if (args.length==1) {
remoteClusterTopologyName = args[0];
}
// If credentials are provided as commandline arguments
else if (args.length==4) {
accessToken =args[0];
accessTokenSecret =args[1];
consumerKey =args[2];
consumerSecret =args[3];
}
}
/**************** ****************/
TopologyBuilder builder = new TopologyBuilder();
FilterQuery filterQuery = new FilterQuery();
filterQuery.track(new String[]{"#cloudcomputing"});
filterQuery.language(new String[]{"en"});
TwitterSpout spout = new TwitterSpout( accessToken, accessTokenSecret,consumerKey, consumerSecret, filterQuery);
builder.setSpout("TwitterSpout",spout,1);
builder.setBolt("TweetParserBolt",new TweetParserBolt(),4).shuffleGrouping("TwitterSpout");
builder.setBolt("UserMapperBolt",new UserParserBolt()).shuffleGrouping("TwitterSpout");
builder.setBolt("UserAndTweetsMapperBolt", new UserAndTweetsMapperBolt()).fieldsGrouping("TweetParserBolt", new Fields("username","tweet","bolt"))
.fieldsGrouping("UserMapperBolt", new Fields("username","tweet","bolt"));
Config conf = new Config();
conf.setDebug(true);
if (remoteClusterTopologyName!=null) {
conf.setNumWorkers(4);
StormSubmitter.submitTopology(remoteClusterTopologyName, conf, builder.createTopology());
}
else {
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Thread.sleep(460000);
cluster.shutdown();
}
}
}
public class TweetParserBolt extends BaseRichBolt {
private OutputCollector collector;
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer){
declarer.declare(new Fields("username","tweet","bolt"));
}
@Override
public void prepare(Map map,TopologyContext context,OutputCollector collector){
this.collector=collector;
}
@Override
public void execute(Tuple tuple){
Status tweet=(Status)tuple.getValue(0);
String username=tweet.getUser().getScreenName();
collector.emit(tuple,new Values(username,tweet,"tweet_parser_bolt"));
}
}
public class UserParserBolt extends BaseRichBolt{
private OutputCollector collector;
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer){
declarer.declare(new Fields("username","tweet"));
}
@Override
public void prepare(Map map,TopologyContext context,OutputCollector collector){
this.collector=collector;
}
@Override
public void execute(Tuple tuple){
Status tweet=(Status)tuple.getValue(0);
String username=tweet.getUser().getScreenName();
collector.emit(tuple,new Values(username,tweet,"user_parser_bolt"));
}
}
public class UserAndTweetsMapperBolt extends BaseRichBolt {
private OutputCollector collector;
List<Tuple>listOfTuples;
Map<String,Status>tempTweetsMap;
Map<String,List<Status>>UserAndTweetsMap;
List<Status>tweets;
List<String>tempUsers;
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer){
declarer.declare(new Fields("username","tweets"));
}
@Override
public void prepare(Map map,TopologyContext context,OutputCollector collector){
this.collector=collector;
this.listOfTuples=new ArrayList<Tuple>();
this.tempTweetsMap=new HashMap<String, Status>();
this.UserAndTweetsMap=new HashMap<String, List<Status>>();
this.tempUsers=new ArrayList<String>();
this.tweets=new ArrayList<Status>();
}
@Override
public void execute(Tuple tuple){
//String username=tuple.getStringByField("username");
//Status status=(Status)tuple.getValueByField("tweet");
String username=tuple.getValue(0).toString();
String sourceComponent=tuple.getSourceComponent();
if(sourceComponent.equals("TwitterParserBolt")){
String tempUser1=tuple.getValue(0).toString();
Status tempStatus1=(Status)tuple.getValue(1);
tempTweetsMap.put(tempUser1,tempStatus1);
}else if(sourceComponent.equals("UserParserBolt")){
String tempUser2=tuple.getValue(0).toString();
Status tempStatus2=(Status)tuple.getValue(1);
tempUsers.add(tempUser2);
}
for(int i=0;i<tempUsers.size();i++){
for(int j=0;j<tempTweetsMap.size();j++){
if(tempUsers.get(i).equals(tempTweetsMap.get(j).getUser().getScreenName())){
tweets.add(tempTweetsMap.get(j));
}
}
}
collector.emit(new Values(username,tweets));
}
}
You need to do a fields grouping on just the username in the bolt that combines them. If you group by all the fields, as you're doing now, you may or may not get all the tweets for the same user in the same task. Also, your map will only capture the last status for any given user; if you want them all, you need to make the value a list of statuses.
builder.setBolt("UserAndTweetsMapperBolt", new UserAndTweetsMapperBolt())
.fieldsGrouping("TweetParserBolt", new Fields("username"))
.fieldsGrouping("UserMapperBolt", new Fields("username"));
I'm using Apache Storm to create a topology that initially reads a "stream" of tuples from a file, then splits the tuples and stores them in MongoDB.
I have a cluster on Atlas with a shared replica set. I've already developed the topology, and the solution works properly if I use a single thread.
public static StormTopology build() {
return buildWithSpout();
}
public static StormTopology buildWithSpout() {
Config config = new Config();
TopologyBuilder builder = new TopologyBuilder();
CsvSpout datasetSpout = new CsvSpout("file.txt");
SplitterBolt splitterBolt = new SplitterBolt(",");
PartitionMongoInsertBolt insertPartitionBolt = new PartitionMongoInsertBolt();
builder.setSpout(DATA_SPOUT_ID, datasetSpout, 1);
builder.setBolt(DEPENDENCY_SPLITTER_ID, splitterBolt, 1).shuffleGrouping(DATA_SPOUT_ID);
builder.setBolt(UPDATER_COUNTER_ID, insertPartitionBolt, 1).shuffleGrouping(DEPENDENCY_SPLITTER_ID);
return builder.createTopology();
}
However, when I use parallel processes, my persistor bolt doesn't save all tuples in MongoDB, even though the tuples are correctly emitted by the previous bolt.
builder.setSpout(DATA_SPOUT_ID, datasetSpout, 1);
builder.setBolt(DEPENDENCY_SPLITTER_ID, splitterBolt, 3).shuffleGrouping(DATA_SPOUT_ID);
builder.setBolt(UPDATER_COUNTER_ID, insertPartitionBolt, 3).shuffleGrouping(DEPENDENCY_SPLITTER_ID);
This is my first bolt:
public class SplitterBolt extends BaseBasicBolt {
private String del;
private MongoConnector db = null;
public SplitterBolt(String del) {
this.del = del;
}
public void prepare(Map stormConf, TopologyContext context) {
db = MongoConnector.getInstance();
}
public void execute(Tuple input, BasicOutputCollector collector) {
String tuple = input.getStringByField("tuple");
int idTuple = Integer.parseInt(input.getStringByField("id"));
String opString = "";
String[] data = tuple.split(this.del);
for(int i=0; i < data.length; i++) {
OpenBitSet attrs = new OpenBitSet();
attrs.fastSet(i);
opString = Utility.toStringOpenBitSet(attrs, 5);
collector.emit(new Values(idTuple, opString, data[i]));
}
db.incrementCount();
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("idtuple","binaryattr","value"));
}
}
And this is my persistor bolt that stores all tuples in Mongo:
public class PartitionMongoInsertBolt extends BaseBasicBolt {
private MongoConnector mongodb = null;
public void prepare(Map stormConf, TopologyContext context) {
//Singleton Instance
mongodb = MongoConnector.getInstance();
}
public void execute(Tuple input, BasicOutputCollector collector) {
mongodb.insertUpdateTuple(input);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}
My only doubt is that I used a singleton pattern for the connection class to mongo. Can this be a problem?
UPDATE
This is my MongoConnector class:
public class MongoConnector {
private MongoClient mongoClient = null;
private MongoDatabase database = null;
private MongoCollection<Document> partitionCollection = null;
private static MongoConnector mongoInstance = null;
public MongoConnector() {
MongoClientURI uri = new MongoClientURI("connection string");
this.mongoClient = new MongoClient(uri);
this.database = mongoClient.getDatabase("db.database");
this.partitionCollection = database.getCollection("db.collection");
}
public static MongoConnector getInstance() {
if (mongoInstance == null)
mongoInstance = new MongoConnector();
return mongoInstance;
}
public void insertUpdateTuple(Tuple tuple) {
int idTuple = (Integer) tuple.getValue(0);
String attrs = (String) tuple.getValue(1);
String value = (String) tuple.getValue(2);
value = value.replace('.', ',');
Bson query = Filters.eq("_id", attrs);
Document docIterator = this.partitionCollection.find(query).first();
if (docIterator != null) {
Bson newValue = new Document(value, idTuple);
Bson updateDocument = new Document("$push", newValue);
this.partitionCollection.updateOne(docIterator, updateDocument);
} else {
Document document = new Document();
document.put("_id", attrs);
ArrayList<Integer> partition = new ArrayList<Integer>();
partition.add(idTuple);
document.put(value, partition);
this.partitionCollection.insertOne(document);
}
}
}
SOLUTION UPDATE
I've solved the problem by changing this line:
this.partitionCollection.updateOne(docIterator, updateDocument);
to
this.partitionCollection.findOneAndUpdate(query, updateDocument);
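For context on why that change matters: with several PartitionMongoInsertBolt executors writing in parallel, the document fetched with find(query).first() can already be stale by the time updateOne runs, and because updateOne(docIterator, ...) uses the whole fetched document as the filter, a concurrent modification makes the filter match nothing and the update silently becomes a no-op. findOneAndUpdate(query, ...) filters on _id alone and applies the update atomically on the server. A minimal sketch of the write method after the fix, assuming the 3.x Java driver; the upsert option is an extra simplification that replaces the separate insert branch:
// Sketch only. Requires com.mongodb.client.model.Filters, Updates and FindOneAndUpdateOptions.
public void insertUpdateTuple(Tuple tuple) {
    int idTuple = (Integer) tuple.getValue(0);
    String attrs = (String) tuple.getValue(1);
    String value = ((String) tuple.getValue(2)).replace('.', ',');
    Bson query = Filters.eq("_id", attrs);
    // $push the tuple id onto the array stored under the value field.
    Bson update = Updates.push(value, idTuple);
    // Atomic on the server: no executor can overwrite another's update,
    // and upsert(true) creates the document with _id = attrs if it does not exist yet.
    this.partitionCollection.findOneAndUpdate(query, update,
            new FindOneAndUpdateOptions().upsert(true));
}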
AdminSOAPRunner:
@Component
public class AdminSOAPRunner {
private static final Logger LOGGER = LoggerFactory.getLogger(AdminSOAPRunner.class);
private String userId;
public String getUserId() {
return userId;
}
public void setUserId(String userId) {
this.userId = userId;
}
@Autowired
private AdminAuth adminAuthenticator;
@Autowired
private AdminBean adminBean;
private AccountService accountService;
private void setBindingProviderByAccountService() {
WSBindingProvider bindingProvider = (WSBindingProvider) this.accountService;
bindingProvider.getRequestContext().put(BindingProvider.ENDPOINT_ADDRESS_PROPERTY, adminBean.getAccountUrl());
LOGGER.info("Endpoint {}", adminBean.getAccountUrl());
}
private RequestInfo getRequestInfo() {
RequestInfo requestInfo = new RequestInfo();
requestInfo.setAppName(adminBean.getAppName());
requestInfo.setUserId(this.getUserId());
requestInfo.setTrace(UUID.randomUUID().toString());
return requestInfo;
}
public List<ApplyAccountResult> getAccounts(ApplyAccountRequest request) {
AccountService_Service service = null;
URL serviceWSDL = AccountService_Service.class.getResource("/Account-service/Account-service.wsdl");
service = new AccountService_Service(serviceWSDL);
SOAPHandlerResolver soapHandlerResolver = new SOAPHandlerResolver();
soapHandlerResolver.getHandlerList().add(new SOAPHandler(this.adminAuthenticator));
service.setHandlerResolver(soapHandlerResolver);
if (accountService == null) {
accountService = service.getAccountService();
}
setBindingProviderByAccountService();
ApplyAccountAccountResponse response = null;
LOGGER.info("Making a SOAP request.");
response = accountService.applyAccount(request, getRequestInfo(), new Holder<ResponseInfo>());
LOGGER.info("SOAP request completed.");
return response.getApplyAccountResults();
}
SOAPHandlerResolver:
public class SOAPHandlerResolver implements HandlerResolver {
@SuppressWarnings("rawtypes")
private List<Handler> handlerList;
public SOAPHandlerResolver() {
this.handlerList = null;
}
@SuppressWarnings("rawtypes")
public List<Handler> getHandlerList() {
if (this.handlerList == null) {
this.handlerList = new ArrayList<>();
}
return this.handlerList;
}
@SuppressWarnings("rawtypes")
@Override
public List<Handler> getHandlerChain(PortInfo portInfo) {
List<Handler> handlerChain = new ArrayList<>();
if (this.handlerList == null || this.handlerList.isEmpty()) {
this.handlerList = new ArrayList<>();
this.handlerList.add(new SOAPHandler(null));
}
handlerChain.addAll(this.handlerList);
return handlerChain;
}
}
SOAPHandler:
public class SOAPHandler implements SOAPHandler<SOAPMessageContext> {
private AdminAuth adminAuth;
private static final Logger LOGGER = LoggerFactory.getLogger(SOAPHandler.class);
public SOAPHandler(AdminAuth adminAuth) {
if (adminAuth == null) {
adminAuth = new AdminAuth();
LOGGER.info("AdminAuth found null. Creating new adminAuth instance.");
}
this.adminAuth = adminAuth;
}
@Override
public boolean handleMessage(SOAPMessageContext context) {
Boolean outboundProperty = (Boolean) context.get(MessageContext.MESSAGE_OUTBOUND_PROPERTY);
if (outboundProperty) {
@SuppressWarnings("unchecked")
Map<String, List<String>> headers = (Map<String, List<String>>) context.get(MessageContext.HTTP_REQUEST_HEADERS);
if (headers == null) {
headers = new HashMap<>();
context.put(MessageContext.HTTP_REQUEST_HEADERS, headers);
}
List<String> cookie = headers.get("Cookie");
if (cookie == null) {
cookie = new ArrayList<>();
headers.put("Cookie", cookie);
}
cookie.add(this.adminAuth.getToken());
}
return true;
}
@Override
public boolean handleFault(SOAPMessageContext context) {
return false;
}
@Override
public void close(MessageContext context) {
}
@Override
public Set<QName> getHeaders() {
return null;
}
}
AdminAuth:
@Component
public class AdminAuth {
@Autowired
private AdminBean adminBean;
private static final Logger LOG = LoggerFactory.getLogger(AdminAuth.class);
private String token;
private void generateToken() {
try {
AdminTokenHelper adminTokenHelper = new AdminTokenHelper(adminBean.getAutheticationServerURL(), adminBean.getLicense());
token = adminTokenHelper.getToken(adminBean.getUsername(), adminBean.getPassword().toCharArray());
LOG.info("Token generation successful");
} catch (Exception ex) {
ex.printStackTrace();
LOG.error("Token generation failed");
LOG.error(ex.getMessage());
throw new RuntimeException("Token generation failed", ex);
}
}
@Cacheable(value = "tokenCache")
public String getToken() {
LOG.warn("Token not available. Generating a new token.");
generateToken();
return token;
}
}
ehcache.xml
<cache name="tokenCache" maxEntriesLocalHeap="1" eternal="false" timeToIdleSeconds="895" timeToLiveSeconds="895" memoryStoreEvictionPolicy="LRU"/>
Application
@EnableCaching
@SpringBootApplication
public class Application extends SpringBootServletInitializer {
public static void main(final String[] args) {
SpringApplication.run(Application.class, args);
}
@Override
protected SpringApplicationBuilder configure(final SpringApplicationBuilder application) {
return application.sources(Application.class).profiles(determineEnvironmentProfile());
}
}
In AdminAuth, a functional user is used to generate the token. The token generated for authentication expires in 15 minutes, so my purpose was to cache it so that all the calls from the UI can use the same token regardless of the actual user. That's why I set the cache time to 14:55 (895 seconds), so a new token is generated just before expiry. The problem comes after 15 minutes: the cache doesn't evict the old token, so the call uses the old, expired token and fails.
I tried different eviction policies like LRU, LFU and FIFO, but nothing is working. The calls come from the UI through the Tomcat container in Spring Boot 1.3.
Why is this not getting evicted? What am I missing? Any help is appreciated.
Replace @Cacheable(value = "tokenCache") with @Cacheable("tokenCache")
From the comments:
The dependency on spring-boot-starter-cache was missing. This prevented Spring Boot from automatically configuring the CacheManager. Once this dependency was added, the cache configuration worked.
See http://docs.spring.io/spring-boot/docs/1.3.x/reference/html/boot-features-caching.html
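One quick way to verify that the starter is on the classpath and that Spring Boot actually created a CacheManager backed by the ehcache.xml above is to log the configured caches at startup. A small sketch, assuming a Spring Boot 1.3 application; the class name here is only illustrative:
@Component
public class CacheStartupLogger implements CommandLineRunner {
    private static final Logger LOGGER = LoggerFactory.getLogger(CacheStartupLogger.class);

    @Autowired
    private CacheManager cacheManager;

    @Override
    public void run(String... args) {
        // With spring-boot-starter-cache and ehcache present, this should print an
        // EhCache-backed manager and include "tokenCache"; otherwise the dependency is missing.
        LOGGER.info("CacheManager in use: {}", cacheManager.getClass().getName());
        LOGGER.info("Configured caches: {}", cacheManager.getCacheNames());
    }
}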
I am trying to cache some data in a Storm bolt, but I'm not sure if this is the right way to do it. In the class below, employee id and employee name are cached in a hash map. For this, a database call is made to the Employee table to select all employees and populate the hash map in the prepare method (is this the right place to initialize the map?).
After some logging it turns out (while running the Storm topology) that the topology is making multiple database connections and initializing the map multiple times. Of course I want to avoid this; that is why I want to cache the result so that the bolt does not go to the database every time. Please help?
public class TestBolt extends BaseRichBolt {
private static final long serialVersionUID = 2946379346389650348L;
private OutputCollector collector;
private Map<String, String> employeeIdToNameMap;
private static final Logger LOG = Logger.getLogger(TestBolt.class);
@Override
public void execute(Tuple tuple) {
String employeeId = tuple.getStringByField("employeeId");
String employeeName = employeeIdToNameMap.get(employeeId);
collector.emit(tuple, new Values(employeeId, employeeName));
collector.ack(tuple);
}
@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
// TODO Auto-generated method stub
this.collector = collector;
try {
employeeIdToNameMap = createEmployeIdToNameMap();
} catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields(/*some fields*/));
}
private Map<String, String> createEmployeIdToNameMap() throws SQLException {
final Map<String, String> employeeIdToNameMap = new HashMap<>();
final DatabaseManager dbm = new PostgresManager();
final String query = "select id, name from employee;";
final Connection conn = dbm.createDefaultConnection();
final ResultSet result = dbm.executeSelectQuery(conn, query);
while(result.next()) {
String employeId = result.getString("id");
String name = result.getString("name");
employeeIdToNameMap.put(employeId, name);
}
conn.close();
return employeeIdToNameMap;
}
}
SOLUTION
I created a synchronized map and it's working fine for me:
private static Map<String, String> employeeIdToNameMap = Collections
.synchronizedMap(new HashMap<String, String>());
Since you have multiple bolt tasks, you can mark employeeIdToNameMap static and volatile. Initialize the map in prepare like this -
try {
synchronized(TestBolt.class) {
if (null == employeeIdToNameMap) {
employeeIdToNameMap = createEmployeIdToNameMap();
}
}
} catch (SQLException e) {
...
}
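For completeness, a minimal sketch of how the field declaration and the prepare method could look together, assuming the same createEmployeIdToNameMap() helper from the question; only the first task per worker JVM then hits the database:
private static volatile Map<String, String> employeeIdToNameMap;

@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    try {
        synchronized (TestBolt.class) {
            if (employeeIdToNameMap == null) {
                // First task to get here loads the cache; the rest reuse it.
                employeeIdToNameMap = createEmployeIdToNameMap();
            }
        }
    } catch (SQLException e) {
        throw new RuntimeException("Could not load the employee cache", e);
    }
}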
I'm trying to count the number of a user's original tweets after I've stored all of the tweets I've downloaded in a MongoDB database using Storm. Anyway, whenever I count the number of the author's original tweets using the following code, it keeps reading (and counting) the same tweet.
Bolt:
public class CalculateTheMetrics extends BaseBasicBolt {
Map<String,Double>OT1=new HashMap<String, Double>();
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("USERNAME","OT1"));
}
@Override
public void execute(Tuple input,BasicOutputCollector collector) {
String author=input.getString(0);
String tweet=input.getString(2);
Double OT1=this.OT1.get(author);
if(OT1==null){
OT1=0.0;
}
if(author!=null && tweet!=null ){
if(!tweet.startsWith("#") || !tweet.startsWith("RT")){
OT1+=1;
}
this.OT1.put(author,OT1);
System.out.println(author+" +OT1);
collector.emit(new Values(author,OT1))
}
}
Topology:
public class TheAuthorsAndTheirTweetData {
public static void main(String[]args) throws Exception{
TopologyBuilder topologyBuilder=new TopologyBuilder();
topologyBuilder.setSpout("READ_TWEET_DATA_FROM_MONGODB", new ReadLinesFromTextFile("tweets.txt"));
topologyBuilder.setBolt("TWEET_DATA_FROM_MONGODB_TO_FURTHER_PROCESSING",new FromMongoDBToProcessing()).shuffleGrouping("READ_TWEET_DATA_FROM_MONGODB");
topologyBuilder.setSpout("READ_THE_AUTHORS_FROM_TEXT_FILE",new ReadLastLineFromTextFile("authors.txt"));
topologyBuilder.setBolt("FROM_THE_AUTHORS_TEXT_FILE_TO_FURTHER_PROCESSING", new FromTheAuthorsTextFileToFurtherProcessing()).shuffleGrouping("READ_THE_AUTHORS_FROM_TEXT_FILE");
topologyBuilder.setBolt("SEARCH_FOR_THE_AUTHORS_TWEET_DATA",new SearchForTheAuthorsTweetData(),16).fieldsGrouping("TWEET_DATA_FROM_MONGODB_TO_FURTHER_PROCESSING",new Fields("USERNAME","ID")).fieldsGrouping("FROM_THE_AUTHORS_TEXT_FILE_TO_FURTHER_PROCESSING",new Fields("USERNAME","ID"));
topologyBuilder.setBolt("CALCULATE_THE_METRICS",new CalculateTheMetrics(),64).fieldsGrouping("SEARCH_FOR_THE_AUTHORS_TWEET_DATA",new Fields("USERNAME"));
Config config=new Config();
if(args!=null && args.length>0){
config.setNumWorkers(10);
config.setNumAckers(5);
config.setMaxSpoutPending(100);
StormSubmitter.submitTopology(args[0], config, topologyBuilder.createTopology());
}else{
LocalCluster localCluster=new LocalCluster();
localCluster.submitTopology("Test",config,topologyBuilder.createTopology());
Utils.sleep(1*60*60*1000);
localCluster.killTopology("Test");
localCluster.shutdown();
}
}
}
What I want is for it to stop repeatedly reading and counting the same tweet. Please help.
Something like this?
public class Calculate1Metric extends BaseRichBolt {
private OutputCollector collector;
Map<String ,Integer>OT1;
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("username","OT1"));
}
@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
this.OT1=new HashMap<String, Integer>();
}
@Override
public void execute(Tuple input) {
final String sourceComponent = input.getSourceComponent();
String author = input.getString(0);
String tweet = input.getString(2);
if (author != null && tweet != null) {
Integer OT1 = this.OT1.get(author);
if (OT1 == null) {
OT1 = 0;
}
if (!tweet.startsWith("#") || !tweet.contains("RT ") || !tweet.startsWith("RT")) {
OT1 += 1;
}
if(!this.OT1.containsKey(author)) {
this.OT1.put(author, OT1);
}else{
collector.emit(new Values(author,OT1));
System.out.println(author + " " + OT1);
this.OT1.remove(author);
}
}else{
collector.fail(input);
return;
}
collector.ack(input);
}
}
Below is the code for my implementation of a simple MapReduce job using a custom WritableComparable.
public class MapReduceKMeans {
public static class MapReduceKMeansMapper extends
Mapper<Object, Text, SongDataPoint, Text> {
public void map(Object key, Text value, Context context)
throws InterruptedException, IOException {
String str = value.toString();
// Reading Line one by one from the input CSV.
String split[] = str.split(",");
String trackId = split[0];
String title = split[1];
String artistName = split[2];
SongDataPoint songDataPoint =
new SongDataPoint(new Text(trackId), new Text(title),
new Text(artistName));
context.write(songDataPoint, new Text());
}
}
public static class MapReduceKMeansReducer extends
Reducer<SongDataPoint, Text, Text, NullWritable> {
public void reduce(SongDataPoint key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
StringBuilder sb = new StringBuilder();
sb.append(key.getTrackId()).append("\t").
append(key.getTitle()).append("\t")
.append(key.getArtistName()).append("\t");
String write = sb.toString();
context.write(new Text(write), NullWritable.get());
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err
.println("Usage:<CsV Out Path> <Final Out Path>");
System.exit(2);
}
Job job = new Job(conf, "Song Data Trial");
job.setJarByClass(MapReduceKMeans.class);
job.setMapperClass(MapReduceKMeansMapper.class);
job.setReducerClass(MapReduceKMeansReducer.class);
job.setOutputKeyClass(SongDataPoint.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I debug, my code reads all the rows in the CSV file, but it never enters the reduce phase at all.
I have also made use of SongDataPoint as my custom Writable.
Its code is below.
public class SongDataPoint implements WritableComparable<SongDataPoint> {
Text trackId;
Text title;
Text artistName;
public SongDataPoint() {
this.trackId = new Text();
this.title = new Text();
this.artistName = new Text();
}
public SongDataPoint(Text trackId, Text title, Text artistName) {
this.trackId = trackId;
this.title = title;
this.artistName = artistName;
}
@Override
public void readFields(DataInput in) throws IOException {
this.trackId.readFields(in);
this.title.readFields(in);
this.artistName.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
}
public Text getTrackId() {
return trackId;
}
public void setTrackId(Text trackId) {
this.trackId = trackId;
}
public Text getTitle() {
return title;
}
public void setTitle(Text title) {
this.title = title;
}
public Text getArtistName() {
return artistName;
}
public void setArtistName(Text artistName) {
this.artistName = artistName;
}
@Override
public int compareTo(SongDataPoint o) {
// TODO Auto-generated method stub
int compare = getTrackId().compareTo(o.getTrackId());
return compare;
}
}
Any help is appreciated. Thanks.
Your output key class, as set in the driver, is SongDataPoint.class and the output value class is Text.class, but you are actually writing Text as the key and NullWritable as the value in the reducer.
You should also specify the mapper output key and value classes as follows:
job.setMapOutputKeyClass(SongDataPoint.class);
job.setMapOutputValueClass(Text.class);
My write method in my custom Writable class was left blank by mistake. Writing the proper code in it solved the problem:
public void write(DataOutput out) throws IOException {
this.trackId.write(out);
this.title.write(out);
this.artistName.write(out);
}