Output matrix correctly in Spark Java

I would like to know how to get the correct output; I want the output to have the same format as the input. I'm just not quite sure how to map a RowMatrix to produce it.
Input File
0,0,0.0
0,1,1.0
0,2,2.0
0,3,3.0
0,4,4.0
1,0,5.0
1,1,6.0
1,2,7.0
1,3,8.0
1,4,9.0
Code
String inputPathA = "data/At.txt";
SparkConf conf = new SparkConf().setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> fileA = sc.textFile(inputPathA);
JavaRDD<MatrixEntry> matrixA = fileA.map(new Function<String, MatrixEntry>() {
    public MatrixEntry call(String x) {
        String[] indeceValue = x.split(",");
        long i = Long.parseLong(indeceValue[0]);
        long j = Long.parseLong(indeceValue[1]);
        double value = Double.parseDouble(indeceValue[2]);
        return new MatrixEntry(i, j, value);
    }
});
CoordinateMatrix cooMatrixA = new CoordinateMatrix(matrixA.rdd());
BlockMatrix matA = cooMatrixA.toBlockMatrix();
BlockMatrix ata = matA.transpose().multiply(matA);
IndexedRowMatrix id = ata.toIndexedRowMatrix();
RowMatrix rm = id.toRowMatrix();
RDD<Vector> result = rm.rows();
result.saveAsTextFile("data/output1");
The output I get:
(5,[0,1,2,3,4],[45.0,58.0,71.0,84.0,97.0])
(5,[0,1,2,3,4],[25.0,30.0,35.0,40.0,45.0])
(5,[0,1,2,3,4],[30.0,37.0,44.0,51.0,58.0])
(5,[0,1,2,3,4],[40.0,51.0,62.0,73.0,84.0])
(5,[0,1,2,3,4],[35.0,44.0,53.0,62.0,71.0])
How do I map that correctly in Spark (Java) so it matches my input format?

A RowMatrix has no meaningful row indices, so it cannot be converted back to the same shape as the input. Instead, simply convert the BlockMatrix back to a CoordinateMatrix and prepare a JavaRDD<String> that can be saved:
JavaRDD<MatrixEntry> entries = ata.toCoordinateMatrix().entries().toJavaRDD();
JavaRDD<String> output = entries.map(new Function<MatrixEntry, String>() {
    public String call(MatrixEntry e) {
        return String.format("%d,%d,%s", e.i(), e.j(), e.value());
    }
});
output.saveAsTextFile("data/output1");
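If the saved lines should also come out in row-major order like the input, a small follow-up sketch (assuming Java 8 lambdas; the single partition is only sensible for a result this small) is to sort the entries before formatting them:

// Sort by (row, column) so the saved lines follow the input's ordering;
// capture numCols as a plain long to keep the closure serializable.
long nCols = ata.numCols();
JavaRDD<String> sorted = entries
        .sortBy(e -> e.i() * nCols + e.j(), true, 1)
        .map(e -> String.format("%d,%d,%s", e.i(), e.j(), e.value()));
sorted.saveAsTextFile("data/output1");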

Related

How to add a new column with values from a list in Apache Spark Java?

I've run into an issue with my code. I'm attempting to read a CSV file into a DataFrame and then add a new column with values from an ArrayList. However, I cannot seem to use either the ArrayList or an array without an error; it wants me to enter the values for the new column manually. How can I get around this, please?
Exception in thread "main" org.apache.spark.SparkRuntimeException: The feature is not supported: literal for '[[153.41, [153.41, ...' of class java.util.ArrayList.
at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError
I've marked the problem line with a comment.
public static void dataframe() {
    SparkSession spark = SparkSession.builder().appName("RDD or DataFrame").getOrCreate();
    String path = "C:\\Users\\Paolo Agyei\\Desktop\\Computer Science\\Java\\SparkSimpleApp\\data.csv";
    Dataset<Row> csvDataset = spark.read().format("csv").option("header", "true").load(path);
    // Filtering columns by value
    Dataset<Row> result = csvDataset.filter(col("status").equalTo("authorized"));
    result = result.filter(col("card_present_flag").equalTo("0"));
    // Collecting columns to be split
    List<Row> long_lat = csvDataset.select("long_lat").collectAsList();
    List<Row> merchant_long_lat = csvDataset.select("merchant_long_lat").collectAsList();
    // Lists to hold result of long_lat
    ArrayList<String> longing = new ArrayList<String>();
    ArrayList<String> lat = new ArrayList<String>();
    // Lists to hold result of merchant_long_lat
    ArrayList<String> merch_long = new ArrayList<String>();
    ArrayList<String> merch_lat = new ArrayList<String>();
    String[] convert;
    for (Row row : long_lat) {
        convert = row.toString().split(" -", 2);
        longing.add(convert[0]);
        lat.add(convert[1]);
    }
    for (Row row : merchant_long_lat) {
        convert = row.toString().split("-", 2);
        merch_long.add(convert[0]);
        if (convert.length > 1)
            merch_lat.add(convert[1]);
        else
            merch_lat.add("null");
    }
    // Adding new columns
    result = result.withColumn("long", lit(longing)); // Issue
    /*
    result = result.withColumn("lat", null);
    result = result.withColumn("merch_long", null);
    result = result.withColumn("merch_lat", null);
    result = result.drop("long_lat", "merchant_long_lat");
    result.show();
    */
    System.out.println("Hello World!");
}
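A common way around the unsupported-literal error, sketched here under the assumption that the goal is simply to split the coordinate strings into their own columns, is to let Spark derive them with its built-in split function instead of collecting driver-side lists:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

// Derive the columns from the existing string columns; no ArrayList
// literal (and no collect to the driver) is needed.
Dataset<Row> withCoords = result
        .withColumn("long", split(col("long_lat"), " -").getItem(0))
        .withColumn("lat", split(col("long_lat"), " -").getItem(1))
        .withColumn("merch_long", split(col("merchant_long_lat"), "-").getItem(0))
        .withColumn("merch_lat", split(col("merchant_long_lat"), "-").getItem(1))
        .drop("long_lat", "merchant_long_lat");
withCoords.show();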

Java converting object array to string array

So I've been trying to solve this issue for hours but can't seem to find an answer that works.
I have an object array which stores flight information, and I had to remove flights with Valstybe: "Maldyvai",
so I made a new object array without them, but when I try to print it I get a memory location.
How do I convert the object array to a string array,
even though I have a toString method in my Java class?
package com.company;

import java.util.*;
import com.company.Isvestine.OroUostasKeleivis;

public class Main {
    public static void main(String[] args) {
        OroUostasKeleivis Keleiviai1 = new OroUostasKeleivis("Skrydis", "Washington", "JAV", "Tomas", "tomaitis", "Washington", 5465);
        OroUostasKeleivis Keleiviai2 = new OroUostasKeleivis("Skrydis", "Washington", "Maldyvai", "Tomas", "tomaitis", "Maldyvai", 5466);
        OroUostasKeleivis Keleiviai3 = new OroUostasKeleivis("Skrydis", "Washington", "JAV", "Tomas", "tomaitis", "Washington", 5467);
        OroUostasKeleivis Keleiviai4 = new OroUostasKeleivis("Skrydis", "Washington", "Maldyvai", "Tomas", "tomaitis", "Maldyvai", 5468);
        OroUostasKeleivis Keleiviai5 = new OroUostasKeleivis("Skrydis", "Washington", "JAV", "Tomas", "tomaitis", "Washington", 5469);
        OroUostasKeleivis Keleiviai6 = new OroUostasKeleivis("Skrydis", "Washington", "Maldyvai", "Tomas", "tomaitis", "Maldyvai", 5470);
        OroUostasKeleivis Keleiviai7 = new OroUostasKeleivis("Skrydis", "Washington", "JAV", "Tomas", "tomaitis", "Washington", 5475);
        OroUostasKeleivis Keleiviai8 = new OroUostasKeleivis("Skrydis", "Washington", "Maldyvai", "Tomas", "tomaitis", "Maldyvai", 5476);
        OroUostasKeleivis Keleiviai9 = new OroUostasKeleivis("Skrydis", "Washington", "JAV", "Tomas", "tomaitis", "Washington", 5477);
        OroUostasKeleivis Keleiviai10 = new OroUostasKeleivis("Skrydis", "Washington", "JAV", "Tomas", "tomaitis", "Washington", 5488);
        OroUostasKeleivis[] keleiviai = new OroUostasKeleivis[10];
        keleiviai[0] = Keleiviai1;
        keleiviai[1] = Keleiviai2;
        keleiviai[2] = Keleiviai3;
        keleiviai[3] = Keleiviai4;
        keleiviai[4] = Keleiviai5;
        keleiviai[5] = Keleiviai6;
        keleiviai[6] = Keleiviai7;
        keleiviai[7] = Keleiviai8;
        keleiviai[8] = Keleiviai9;
        keleiviai[9] = Keleiviai10;
        for (OroUostasKeleivis keleiveliai : keleiviai) {
            System.out.println(keleiveliai);
        }
        System.out.println("test debug");
        OroUostasKeleivis[] keleiviaibemaldyvu = new OroUostasKeleivis[10];
        for (int i = 0; i < 10; i++) {
        }
        System.out.println(IsstrintiMaldyvus(keleiviai));
        String convertedStringObject = IsstrintiMaldyvus(keleiviai).toString();
        System.out.println(convertedStringObject);
    }

    static Object[] IsstrintiMaldyvus(OroUostasKeleivis[] keleiviai) {
        OroUostasKeleivis[] keleiviaiBeMaldyvu = new OroUostasKeleivis[10];
        int pozicija = 0;
        for (OroUostasKeleivis keleiveliai : keleiviai) {
            if (keleiveliai.getValstybe() != "Maldyvai") {
                keleiviaiBeMaldyvu[pozicija] = keleiveliai;
                pozicija++;
            }
        }
        return keleiviaiBeMaldyvu;
    }
}
"but when I try to print it I get a memory location"
Yes, you will not get the result you expect when calling toString() on an array. See the documentation of java.lang.Object.toString() for details.
So how can we solve the problem?
First, override the toString() method in OroUostasKeleivis like this:
class OroUostasKeleivis {
    @Override
    public String toString() {
        // your implementation here
        return null; // TODO: change here
    }
}
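A minimal sketch of such an override (getValstybe() is the only accessor visible in the question, so anything else is up to the real class):

@Override
public String toString() {
    // getValstybe() is the only getter shown in the question;
    // append the class's other fields as needed.
    return "OroUostasKeleivis{valstybe=" + getValstybe() + "}";
}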
Second, you can go either way:
If you're just interested in printing, System.out.println(keleiveliai) in a for-each loop works, as you already do.
If you're interested in converting OroUostasKeleivis[] to String[], you can:
// this requires Java 8 or later
String[] converted = Arrays.stream(keleiviai)
        .map(OroUostasKeleivis::toString)
        .toArray(String[]::new);
// then use `converted`
Use System.out.println(Arrays.toString(IsstrintiMaldyvus(keleiviai)));
https://www.geeksforgeeks.org/arrays-tostring-in-java-with-examples/
It will print the array contents much like an ArrayList with the same content would be printed.
Think of it as:
[ obj1.toString(), obj2.toString(), ... ]
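To make the contrast concrete, a tiny demo (the hash after the @ will differ per run):

import java.util.Arrays;

public class ToStringDemo {
    public static void main(String[] args) {
        String[] arr = {"obj1", "obj2"};
        System.out.println(arr);                  // e.g. [Ljava.lang.String;@1b6d3586
        System.out.println(Arrays.toString(arr)); // [obj1, obj2]
    }
}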
Use java.util.Arrays#stream(T[]) to filter and convert the object array to a string array, then java.util.Arrays#toString(java.lang.Object[]) to turn the array into a readable string.
final String[] oroUostasKeleivis = Arrays.stream(keleiviai)
        .filter(k -> !"Maldyvai".equals(k.getValstybe())) // compare strings with equals, not !=
        // or other convert code
        .map(OroUostasKeleivis::toString)
        .toArray(String[]::new);
System.out.println(Arrays.toString(oroUostasKeleivis));

Find duplicates in first column and take average based on third column

My issue here is that I need to compute the average time for each id.
Sample data
T1,2020-01-16,11:16pm,start
T2,2020-01-16,11:18pm,start
T1,2020-01-16,11:20pm,end
T2,2020-01-16,11:23pm,end
I have written code that keeps the first and third columns in a map, something like
T1, 11:16pm
but I could not compute values after putting them in the map. I also tried keeping them in a string array and splitting line by line, but I'm facing the same issue with that approach.
public class AverageTimeGenerate {
    public static void main(String[] args) throws IOException {
        File file = new File("/abc.txt");
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    break;
                }
                ArrayList<String> list = new ArrayList<>();
                String[] tokens = line.split(",");
                for (String s : tokens) {
                    list.add(s);
                }
                Map<String, String> map = new HashMap<>();
                String[] data = line.split(",");
                String ids = data[0];
                String dates = data[1];
                String transactionTime = data[2];
                String transactionStartAndEndTime = data[3];
                String[] transactionIds = ids.split("\n");
                String[] timeOfEachTransaction = transactionTime.split("\n");
                for (String id : transactionIds) {
                    for (String time : timeOfEachTransaction) {
                        map.put(id, time);
                    }
                }
            }
        }
    }
}
Can anyone suggest whether it is possible to find duplicates in a map and compute values there, or is there any other way I can do this so that the output looks like:
T1 2:00
T2 5:00
I don't know what your logic is for computing the average time, but you can save the data in a map per transaction. The map structure can look like this: the transaction id is the key, and all the times for that id go into a list.
Map<String,List<String>> map = new HashMap<String,List<String>>();
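A minimal sketch of filling that structure (assuming the file's rows are already read into a List<String> named lines):

Map<String, List<String>> map = new HashMap<>();
for (String line : lines) {
    String[] fields = line.split(",");
    // fields[0] is the transaction id, fields[2] the time
    map.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[2]);
}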
You can do it like this:
Map<String, String> result = Files.lines(Paths.get("abc.txt"))
        .map(line -> line.split(","))
        .map(arr -> {
            try {
                // note: the pattern must match times like "11:16pm",
                // so "h:mma" rather than "HH:mm"
                return new AbstractMap.SimpleEntry<>(arr[0],
                        new SimpleDateFormat("h:mma").parse(arr[2]));
            } catch (ParseException e) {
                return null;
            }
        })
        .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.collectingAndThen(
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList()),
                        list -> toStringTime.apply(convert.apply(list)))));
To simplify, I've declared two functions:
Function<List<Date>, Long> convert = list -> (list.get(1).getTime() - list.get(0).getTime()) / 2;
Function<Long, String> toStringTime = l -> String.format("%d:%02d", l / 60000, l % 60000 / 1000);
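For what it's worth, a java.time-based sketch of the same grouping (imports from java.time, java.nio.file and java.util.stream assumed; the case-insensitive builder matters because the sample uses lowercase "pm"):

DateTimeFormatter fmt = new DateTimeFormatterBuilder()
        .parseCaseInsensitive() // accept "pm" as well as "PM"
        .appendPattern("h:mma")
        .toFormatter();

Map<String, List<LocalTime>> byId = Files.lines(Paths.get("abc.txt"))
        .map(line -> line.split(","))
        .collect(Collectors.groupingBy(arr -> arr[0],
                Collectors.mapping(arr -> LocalTime.parse(arr[2], fmt),
                        Collectors.toList())));

byId.forEach((id, times) -> {
    // same "half of the start-to-end span" logic as the functions above
    Duration half = Duration.between(times.get(0), times.get(1)).dividedBy(2);
    System.out.printf("%s %d:%02d%n", id, half.toMinutes(), half.getSeconds() % 60);
});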

How to extract key phrases from a given text with OpenNLP?

I'm using Apache OpenNLP and I'd like to extract the key phrases of a given text. I'm already gathering entities, but I would like to have key phrases.
The problem I have is that I can't use TF-IDF, because I don't have models for that and I only have a single text (not multiple documents).
Here is some code (prototyped, not so clean):
public List<KeywordsModel> extractKeywords(String text, NLPProvider pipeline) {
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(pipeline.getSentencedetecto("en"));
    TokenizerME tokenizer = new TokenizerME(pipeline.getTokenizer("en"));
    POSTaggerME posTagger = new POSTaggerME(pipeline.getPosmodel("en"));
    ChunkerME chunker = new ChunkerME(pipeline.getChunker("en"));
    ArrayList<String> stopwords = pipeline.getStopwords("en");

    Span[] sentSpans = sentenceDetector.sentPosDetect(text);
    Map<String, Float> results = new LinkedHashMap<>();
    SortedMap<String, Float> sortedData = new TreeMap(new MapSort.FloatValueComparer(results));

    float sentenceCounter = sentSpans.length;
    float prominenceVal = 0;
    int sentences = sentSpans.length;
    for (Span sentSpan : sentSpans) {
        prominenceVal = sentenceCounter / sentences;
        sentenceCounter--;
        String sentence = sentSpan.getCoveredText(text).toString();
        int start = sentSpan.getStart();
        Span[] tokSpans = tokenizer.tokenizePos(sentence);
        String[] tokens = new String[tokSpans.length];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
        }
        String[] tags = posTagger.tag(tokens);
        Span[] chunks = chunker.chunkAsSpans(tokens, tags);
        for (Span chunk : chunks) {
            if ("NP".equals(chunk.getType())) {
                int npstart = start + tokSpans[chunk.getStart()].getStart();
                int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
                String potentialKey = text.substring(npstart, npend);
                if (!results.containsKey(potentialKey)) {
                    boolean hasStopWord = false;
                    String[] pKeys = potentialKey.split("\\s+");
                    if (pKeys.length < 3) {
                        for (String pKey : pKeys) {
                            for (String stopword : stopwords) {
                                if (pKey.toLowerCase().matches(stopword)) {
                                    hasStopWord = true;
                                    break;
                                }
                            }
                            if (hasStopWord) {
                                break;
                            }
                        }
                    } else {
                        hasStopWord = true;
                    }
                    if (!hasStopWord) {
                        int count = StringUtils.countMatches(text, potentialKey);
                        results.put(potentialKey, (float) (Math.log(count) / 100) + (float) (prominenceVal / 5));
                    }
                }
            }
        }
    }
    sortedData.putAll(results);
    System.out.println(sortedData);
    return null;
}
What it basically does is give me the nouns back, sorted by prominence value (where they appear in the text) and counts.
But honestly, this doesn't work so well.
I also tried it with the Lucene analyzer, but the results were not good either.
So, how can I achieve what I want to do? I already know of KEA/Maui-indexer etc. (but I'm afraid I can't use them because of the GPL).
Also interesting: which other algorithms can I use instead of TF-IDF?
Example:
This text: http://techcrunch.com/2015/09/04/etsys-pulling-the-plug-on-grand-st-at-the-end-of-this-month/
Good output in my opinion: Etsy, Grand St., solar chargers, maker marketplace, tech hardware
Finally, I found something:
https://github.com/srijiths/jtopia
It uses the POS taggers from OpenNLP/Stanford NLP and has an ASL 2.0 license. I haven't measured precision and recall yet, but in my opinion it delivers great results.
Here is my code:
Configuration.setTaggerType("openNLP");
Configuration.setSingleStrength(6);
Configuration.setNoLimitStrength(5);
// if tagger type is "openNLP" then give the openNLP POS tagger path
//Configuration.setModelFileLocation("model/openNLP/en-pos-maxent.bin");
// if tagger type is "default" then give the default POS lexicon file
//Configuration.setModelFileLocation("model/default/english-lexicon.txt");
// if tagger type is "stanford "
Configuration.setModelFileLocation("Dont need that here");
Configuration.setPipeline(pipeline);
TermsExtractor termExtractor = new TermsExtractor();
TermDocument topiaDoc = new TermDocument();
topiaDoc = termExtractor.extractTerms(text);
//logger.info("Extracted terms : " + topiaDoc.getExtractedTerms());
Map<String, ArrayList<Integer>> finalFilteredTerms = topiaDoc.getFinalFilteredTerms();
List<KeywordsModel> keywords = new ArrayList<>();
for (Map.Entry<String, ArrayList<Integer>> e : finalFilteredTerms.entrySet()) {
KeywordsModel keyword = new KeywordsModel();
keyword.setLabel(e.getKey());
keywords.add(keyword);
}
I modified the Configuration file a bit so that the POSModel is loaded from the pipeline instance.

Aggregate data in CSV file using Java

I have a big CSV file, thousands of rows, and I want to aggregate some columns using Java code.
The file is in the form:
1,2012,T1
2,2015,T2
3,2013,T1
4,2012,T1
The results should be:
T, Year, Count
T1,2012, 2
T1,2013, 1
T2,2015, 1
Put your data into a Map-like structure, adding +1 to the stored value each time a key (in your case "" + T + year) is found.
You can use a map like:
Map<String, Integer> rowMap = new HashMap<>();
rowMap.put("T1" + "2012", 1);
rowMap.put("T2" + "2015", 1);
or you can define your own class with T and year fields, overriding hashCode and equals. Then you can use:
Map<YourClass, Integer> map = new HashMap<>();
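A sketch of such a key class (the name and fields are illustrative):

final class TYear {
    final String t;
    final String year;

    TYear(String t, String year) {
        this.t = t;
        this.year = year;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TYear)) return false;
        TYear other = (TYear) o;
        return t.equals(other.t) && year.equals(other.year);
    }

    @Override
    public int hashCode() {
        return 31 * t.hashCode() + year.hashCode();
    }
}

Counting then becomes map.merge(new TYear(fields[2], fields[1]), 1, Integer::sum); (Java 8).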
String csv =
        "1,2012,T1\n"
      + "2,2015,T2\n"
      + "3,2013,T1\n"
      + "4,2012,T1\n";

Map<String, Integer> map = new TreeMap<>();
BufferedReader reader = new BufferedReader(new StringReader(csv));
String line;
while ((line = reader.readLine()) != null) {
    String[] fields = line.split(",");
    String key = fields[2] + "," + fields[1];
    Integer value = map.get(key);
    if (value == null)
        value = 0;
    map.put(key, value + 1);
}
System.out.println(map);
// -> {T1,2012=2, T1,2013=1, T2,2015=1}
Use uniVocity-parsers for the best performance. It should take about 1 second to process 1 million rows.
CsvParserSettings settings = new CsvParserSettings();
settings.selectIndexes(1, 2); // select the columns we are going to read
final Map<List<String>, Integer> results = new LinkedHashMap<List<String>, Integer>(); // stores the results here

// Use a custom implementation of RowProcessor
settings.setRowProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // converts the input array to a List - lists implement hashCode and equals
        // based on their values, so they can be used as keys on your map
        List<String> key = Arrays.asList(row);
        Integer count = results.get(key);
        if (count == null) {
            count = 0;
        }
        results.put(key, count + 1);
    }
});

// creates a parser with the above configuration and RowProcessor
CsvParser parser = new CsvParser(settings);

String input = "1,2012,T1"
        + "\n2,2015,T2"
        + "\n3,2013,T1"
        + "\n4,2012,T1";

// the parse() method will parse and submit all rows to your RowProcessor
// - use a FileReader to read a file instead of the String used as example here
parser.parse(new StringReader(input));

// Here are the results:
for (Entry<List<String>, Integer> entry : results.entrySet()) {
    System.out.println(entry.getKey() + " -> " + entry.getValue());
}
Output:
[2012, T1] -> 2
[2015, T2] -> 1
[2013, T1] -> 1
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
