How do I remove redundant tuples in microarray data using Java?
In WEKA (a data mining software package), how can I remove redundant tuples from an existing microarray data set? The code to remove the redundancy should be in Java.
For example, the data set contains rows such as:
H,A,X,1,3,1,1,1,1,1,0,0,0
D,R,O,1,3,1,1,2,1,1,0,0,0
H,A,X,1,3,1,1,1,1,1,0,0,0
C,S,O,1,3,1,1,2,1,1,0,0,0
H,A,X,1,3,1,1,1,1,1,0,0,0
Here tuples 1, 3 and 5 are identical, so rows 3 and 5 are redundant.
The code should return the following redundancy-removed data set:
H,A,X,1,3,1,1,1,1,1,0,0,0
D,R,O,1,3,1,1,2,1,1,0,0,0
C,S,O,1,3,1,1,2,1,1,0,0,0
You could use one of the classes that implement the Set interface, such as java.util.HashSet (or java.util.LinkedHashSet if you want to keep the original row order).
You can load your data set into the Set and then extract the unique tuples either by converting to an array via the Set.toArray() method or by iterating over the set. Note that your Tuple class must override equals() and hashCode() so that identical rows are recognized as duplicates.
// Tuple must implement equals() and hashCode() for duplicate detection to work.
// Use a LinkedHashSet instead if the original row order should be preserved.
Set<Tuple> tupleSet = new HashSet<Tuple>();
for (Tuple tuple : tupleList) {
    tupleSet.add(tuple);
}

// now all of your tuples are unique
for (Tuple tuple : tupleSet) {
    System.out.println("tuple: " + tuple);
}
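Since the rows in the question are plain comma-separated lines, the same idea also works directly on Strings without a Tuple class. Here is a minimal sketch, assuming the data set is available as a plain text/CSV file; the file names are hypothetical:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RemoveDuplicateRows {
    public static void main(String[] args) throws IOException {
        // "microarray.csv" is a hypothetical file containing one tuple per line.
        List<String> rows = Files.readAllLines(Paths.get("microarray.csv"));

        // LinkedHashSet drops exact duplicate lines while keeping first-seen order.
        Set<String> unique = new LinkedHashSet<String>(rows);

        Files.write(Paths.get("microarray_dedup.csv"), unique);
    }
}

If you would rather stay inside WEKA, newer releases also ship an unsupervised instance filter for removing duplicate instances (RemoveDuplicates), which may be worth checking in your WEKA version before writing custom code.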
Related
Spring Elasticsearch - bulk save multiple indices in one line?
I have multiple documents with different index names that I bulk-save in Elasticsearch:

public void bulkCreateOrUpdate(List personUpdateList, List addressUpdateList, List positionUpdateList) {
    this.operations.bulkUpdate(personUpdateList, Person.class);
    this.operations.bulkUpdate(addressUpdateList, Address.class);
    this.operations.bulkUpdate(positionUpdateList, Position.class);
}

However, is it still possible to optimize this into a single call that saves multiple lists of different index types?
Tl;dr: the bulk API certainly allows for it. This is a valid call:

POST _bulk
{"index":{"_index":"index_1"}}
{"data":"data"}
{"index":{"_index":"index_2"}}
{"data":"data"}

How your Java client deals with it, I am not sure.

Solution

Java Client - Bulk

This could be done:

BulkRequest.Builder br = new BulkRequest.Builder();
br.operations(op -> op
    .index(idx -> idx
        .index("index_1")
        .id("1")
        .document(document)
    )
);
br.operations(op -> op
    .index(idx -> idx
        .index("index_2")
        .id("1")
        .document(document)
    )
);

Java REST Client - Bulk

This could be done this way:

BulkRequest request = new BulkRequest();
request.add(new IndexRequest("index_1").id("1")
    .source(XContentType.JSON, "data", "data"));
request.add(new IndexRequest("index_2").id("1")
    .source(XContentType.JSON, "data", "data"));
For Spring Data Elasticsearch: the ElasticsearchOperations.bulkXXX() methods take a List<IndexQuery> as their first parameter. You can set an index name on each of these objects to specify in which index the data should be written or updated. The index name taken from the last parameter (either the entity class or an IndexCoordinates object) is only used when no index name is set in the IndexQuery.
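To make that concrete, here is a hedged sketch, assuming Spring Data Elasticsearch 4.x: build one List<IndexQuery> whose entries each carry their own index name, then issue a single bulk call. The Person and Address types come from the question; the index names, the getId() getters and the setIndexName() setter name are assumptions on my part.

import java.util.ArrayList;
import java.util.List;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.mapping.IndexCoordinates;
import org.springframework.data.elasticsearch.core.query.IndexQuery;

// 'operations' is the injected ElasticsearchOperations from the question.
public void bulkCreateOrUpdate(List<Person> personUpdateList, List<Address> addressUpdateList) {
    List<IndexQuery> queries = new ArrayList<>();

    for (Person p : personUpdateList) {
        IndexQuery q = new IndexQuery();
        q.setId(p.getId());          // hypothetical getter
        q.setObject(p);
        q.setIndexName("person");    // per-query index name, as described in the answer above
        queries.add(q);
    }
    for (Address a : addressUpdateList) {
        IndexQuery q = new IndexQuery();
        q.setId(a.getId());          // hypothetical getter
        q.setObject(a);
        q.setIndexName("address");
        queries.add(q);
    }

    // Single bulk round trip; the IndexCoordinates argument is only a fallback
    // for queries that carry no index name of their own.
    this.operations.bulkIndex(queries, IndexCoordinates.of("person"));
}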
Dataflow CoGroupByKey is very slow for more than 10000 elements per key
I have two PCollection<KV<String, TableRow>>: one has ~7 million rows and the other ~1 million rows. I want to apply a left outer join between these two PCollections and, on a successful join, put all the data of the right TableRow into the left TableRow and return the results. I have tried CoGroupByKey in Apache Beam SDK 2.10.0 for Java, but I get many hot keys, so fetching results after CoGroupByKey becomes slow with the warning 'More than 10000 elements per key, need to reiterate'. I have also tried the shuffle service, but it did not help.

PCollection<TableRow> finalResultCollection = coGbkResultCollection.apply(ParDo.of(
    new DoFn<KV<String, CoGbkResult>, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        KV<String, CoGbkResult> e = c.element();
        // Get all collection 1 values
        Iterable<TableRow> pt1Vals = e.getValue().getAll(t1);
        Iterable<TableRow> pt2Vals = e.getValue().getAll(t2);
        for (TableRow tr : pt1Vals) {
          TableRow out = tr.clone();
          if (pt2Vals.iterator().hasNext()) {
            for (TableRow tr1 : pt2Vals) {
              out.putAll(tr1);
              c.output(out);
            }
          } else {
            c.output(out);
          }
        }
      }
    }));

What is the way to perform this type of join in Dataflow?
I have done some research and found information that may help. The data Dataflow transfers between PCollections (serializable objects) may not live on a single machine, and a transformation like GroupByKey/CoGroupByKey requires all values for a key to be collected in one place before the result can be produced. You could redistribute your keys, use fewer workers with more memory, or try Combine.perKey. You can also try this workaround, or read this article for more information that can help you.
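The answer above focuses on tuning CoGroupByKey. Since the right-hand PCollection here is only about 1 million rows, another commonly used option is a side-input ("broadcast") join, which avoids grouping on the hot keys entirely. This is a rough sketch, assuming the smaller collection fits in worker memory; the variable names rightRows and leftRows are placeholders, not from the question:

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import com.google.api.services.bigquery.model.TableRow;

// rightRows: the ~1M-row PCollection<KV<String, TableRow>>, materialized as a side input.
final PCollectionView<Map<String, Iterable<TableRow>>> rightSide =
    rightRows.apply(View.<String, TableRow>asMultimap());

PCollection<TableRow> joined = leftRows.apply(ParDo.of(
    new DoFn<KV<String, TableRow>, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, Iterable<TableRow>> right = c.sideInput(rightSide);
        TableRow left = c.element().getValue();
        Iterable<TableRow> matches = right.get(c.element().getKey());
        if (matches != null) {
          for (TableRow r : matches) {
            TableRow out = left.clone();
            out.putAll(r);
            c.output(out);
          }
        } else {
          // Left outer join: emit the left row unchanged when there is no match.
          c.output(left);
        }
      }
    }).withSideInputs(rightSide));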
Cannot modify value in JavaRDD
I have a question about how to update JavaRDD values. I have a JavaRDD<CostedEventMessage> with message objects containing information about which partition of a Kafka topic they should be written to. I am trying to change the partitionId field of such objects using the following code:

rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));

where the repartitionEvent logic is:

costedEventMessage.setPartitionId(1);
return costedEventMessage;

But the modification does not happen. Could you please advise why, and how to correctly modify values in a JavaRDD?
Spark is lazy, so from the code you pasted above it is not clear whether you actually performed any action on the JavaRDD (like collect or foreach), or how you came to the conclusion that the data was not changed. For example, if you assumed that after running the following code:

List<CostedEventMessage> messagesLst = ...;
JavaRDD<CostedEventMessage> rddToKafka = javaSparkContext.parallelize(messagesLst);
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));

each element in messagesLst would have its partition set to 1, you are wrong. That would only hold true if you added, for example:

messagesLst = rddToKafka.collect();

For more details refer to the documentation.
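A minimal end-to-end sketch of the pattern the answer describes; the Message class and its field are simplified stand-ins for CostedEventMessage, not the asker's real class:

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RepartitionExample {
    // Simplified stand-in for CostedEventMessage.
    static class Message implements Serializable {
        int partitionId;
        Message(int partitionId) { this.partitionId = partitionId; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "repartition-example");

        List<Message> input = Arrays.asList(new Message(0), new Message(3));
        JavaRDD<Message> rdd = sc.parallelize(input);

        // map() produces new RDD elements; nothing runs until an action is called.
        JavaRDD<Message> updated = rdd.map(m -> new Message(1));

        // collect() is the action that triggers execution and returns the results.
        List<Message> result = updated.collect();
        result.forEach(m -> System.out.println(m.partitionId)); // prints 1 for each element

        sc.stop();
    }
}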
Java 8 Stream API Filtering
I have a collection of objects of the following type:

{
    String action_name;   // add or delete
    long action_time;
    String action_target;
}

Need to get the latest merged operation on each action_target.

Sample input data: [add|1001|item1, add|1002|item2, delete|1003|item1, add|1004|item1]
Expected result: [add|1002|item2, add|1004|item1]

Sample input data: [add|1001|item1, add|1002|item2, delete|1003|item1]
Expected result: [add|1002|item2]

Sample input data: [delete|1001|item1, add|1002|item2, add|1003|item1]
Expected result: [add|1002|item2, add|1003|item1]

Is this approachable using the Java 8 stream APIs? Thanks.
You want to group by one criterion (the action_target) combined with reducing each group to the element with the maximum action_time:

Map<String, Item> map = items.stream().collect(
    Collectors.groupingBy(item -> item.action_target,
        Collectors.collectingAndThen(
            Collectors.maxBy(Comparator.comparing(item -> item.action_time)),
            Optional::get)));

This returns a Map<String, Item> but, of course, you may call values() on it to get a collection of items. Beautified with static imports, the code looks like:

Map<String, Item> map = items.stream().collect(groupingBy(item -> item.action_target,
    collectingAndThen(maxBy(comparing(item -> item.action_time)), Optional::get)));

Your additional request of taking care of "add" and follow-up "delete" actions can be simplified to "remove items whose last action is delete", which can be implemented simply by doing that after collecting, using a mutable map:

HashMap<String, Item> map = items.stream().collect(groupingBy(
    item -> item.action_target, HashMap::new,
    collectingAndThen(maxBy(comparing(item -> item.action_time)), Optional::get)));
map.values().removeIf(item -> item.action_name.equals("delete"));
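For completeness, here is a runnable sketch that wires the above into a small program; the Item class and the pipe-separated toString are my own scaffolding to mirror the question's samples:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Optional;

import static java.util.Comparator.comparingLong;
import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.maxBy;

public class LatestActionPerTarget {
    static class Item {
        final String action_name;   // "add" or "delete"
        final long action_time;
        final String action_target;

        Item(String name, long time, String target) {
            this.action_name = name;
            this.action_time = time;
            this.action_target = target;
        }

        @Override
        public String toString() {
            return action_name + "|" + action_time + "|" + action_target;
        }
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
            new Item("add", 1001, "item1"),
            new Item("add", 1002, "item2"),
            new Item("delete", 1003, "item1"),
            new Item("add", 1004, "item1"));

        // Keep only the latest action per target, then drop targets whose last action is "delete".
        HashMap<String, Item> map = items.stream().collect(groupingBy(
            item -> item.action_target, HashMap::new,
            collectingAndThen(maxBy(comparingLong((Item item) -> item.action_time)), Optional::get)));
        map.values().removeIf(item -> item.action_name.equals("delete"));

        System.out.println(map.values()); // [add|1004|item1, add|1002|item2] (order may vary)
    }
}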
Retrieving values from a nested HashMap
I have an XML file with many copies of a table node structure like the one below:

<databasetable TblID="123" TblName="Department1_mailbox">
  <SelectColumns>
    <Slno>dept1_slno</Slno>
    <To>dept1_to</To>
    <From>dept1_from</From>
    <Subject>dept1_sub</Subject>
    <Body>dept1_body</Body>
    <BCC>dept1_BCC</BCC>
    <CC>dept1_CC</CC>
  </SelectColumns>
  <WhereCondition>MailSentStatus='New'</WhereCondition>
  <UpdateSuccess>
    <MailSentStatus>'Yes'</MailSentStatus>
    <MailSentFailedReason>'Mail Sent Successfully'</MailSentFailedReason>
  </UpdateSuccess>
  <UpdateFailure>
    <MailSentStatus>'No'</MailSentStatus>
    <MailSentFailedReason>'Mail Sending Failed'</MailSentFailedReason>
  </UpdateFailure>
</databasetable>

Since it is not efficient to traverse the file every time I need the details of a node, I used nested HashMaps to store the details while traversing the XML file once. The structure I used is:

mapMaster
  "123" -> mapDetails
    "TblName"        -> "Department1_mailbox"
    "SelectColumns"  -> mapSelect
      "Slno"    -> "dept1_slno"
      "To"      -> "dept1_to"
      "From"    -> "dept1_from"
      "Subject" -> "dept1_sub"
      "Body"    -> "dept1_body"
      "BCC"     -> "dept1_BCC"
      "CC"      -> "dept1_CC"
    "WhereCondition" -> "MailSentStatus='New'"
    "UpdateSuccess"  -> mapUS
      "MailSentStatus"       -> "'Yes'"
      "MailSentFailedReason" -> "'Mail Sent Successfully'"
    "UpdateFailure"  -> mapUF
      "MailSentStatus"       -> "'No'"
      "MailSentFailedReason" -> "'Mail Sending Failed'"

The problem I am facing now is retrieving a value using the nested keys. For example, if I need the value of the Slno key, I have to chain the lookups and casts like:

String slno = (String) ((HashMap) ((HashMap) mapMaster.get("123")).get("SelectColumns")).get("Slno");

This is inconvenient to use in a program. Please suggest a solution, but don't tell me that iterators are available, as I have to fetch individual values from the map according to the needs of my program.

EDIT: my program has to fetch the IDs of the departments that have the privilege to send mails, and those IDs are then compared with the IDs in the XML file. Only the information for IDs that match is fetched from the XML. That is all my program does. Please help.

Thanks in advance, Vishu
Never cast to a specific Map implementation. It is better to cast to the Map interface, i.e.

((Map) one.get("foo")).get("bar")

Better still, do not use casting at all in your case. You can define the collections using generics, so the compiler does the work for you:

Map<String, Map<String, Integer>> one = new HashMap<String, Map<String, Integer>>();
Map<String, Integer> two = new HashMap<String, Integer>();

Now you can say:

int n = one.get("foo").get("bar");

No casting, no problems. But the better solution is not to use nested maps at all. Create custom classes like SelectColumns, WhereCondition etc. Each class should have appropriate private fields, getters and setters. Parse your XML into instances of these classes and then use the getters to traverse the data structure. BTW, if you wish to use JAXB you hardly have to do anything at all. Something like the following:

Unmarshaller u = JAXBContext.newInstance(SelectColumns.class, WhereCondition.class).createUnmarshaller();
SelectColumns[] columns = (SelectColumns[]) u.unmarshal(in);
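If you go the custom-class route suggested above, a hedged JAXB sketch could look like this. The element and attribute names follow the question's XML; the class, field and file names are my own, and a real file containing many <databasetable> nodes would additionally need a wrapper root class, which is omitted here:

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical mapping of the <databasetable> element from the question.
@XmlRootElement(name = "databasetable")
@XmlAccessorType(XmlAccessType.FIELD)
class DatabaseTable {
    @XmlAttribute(name = "TblID")
    String tblId;

    @XmlAttribute(name = "TblName")
    String tblName;

    @XmlElement(name = "WhereCondition")
    String whereCondition;

    @XmlElement(name = "SelectColumns")
    SelectColumns selectColumns;
}

@XmlAccessorType(XmlAccessType.FIELD)
class SelectColumns {
    @XmlElement(name = "Slno")
    String slno;

    @XmlElement(name = "To")
    String to;
    // remaining columns omitted for brevity
}

class Demo {
    public static void main(String[] args) throws Exception {
        Unmarshaller u = JAXBContext.newInstance(DatabaseTable.class).createUnmarshaller();
        DatabaseTable table = (DatabaseTable) u.unmarshal(new File("tables.xml"));
        System.out.println(table.tblName + " -> " + table.selectColumns.slno);
    }
}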
One approach would be to generate fully qualified keys that contain the XML path to the element or attribute. These keys would be unique, stored in a single HashMap, and would get you to the element quickly. Your code would simply have to generate a unique textual representation of the path, then store and retrieve the XML element by that key.
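A small sketch of that idea; the path separator and key layout are arbitrary choices on my part:

import java.util.HashMap;
import java.util.Map;

public class FlatPathLookup {
    public static void main(String[] args) {
        // One flat map; each key encodes the full path: TblID / section / element.
        Map<String, String> config = new HashMap<String, String>();
        config.put("123/TblName", "Department1_mailbox");
        config.put("123/SelectColumns/Slno", "dept1_slno");
        config.put("123/SelectColumns/To", "dept1_to");
        config.put("123/WhereCondition", "MailSentStatus='New'");
        config.put("123/UpdateSuccess/MailSentStatus", "'Yes'");

        // Retrieval is a single get, with no nested casts.
        String slno = config.get("123/SelectColumns/Slno");
        System.out.println(slno); // dept1_slno
    }
}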