Finding bigrams in Spark with Java 8

I have tokenized the sentences into a word RDD, and now I need bigrams.
e.g. This is my test => (This is), (is my), (my test)
I have searched around and found the .sliding operator for this purpose, but I'm not getting this option in my Eclipse (maybe it is only available in a newer version of Spark).
So how can I make this happen without .sliding?
Adding code to get started:
public static void biGram (JavaRDD<String> in)
{
JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
//get bigram from sentence w/o sliding - CODE HERE
}

You can simply use the n-gram transformation feature (the NGram transformer) in Spark ML.
public static void biGram (JavaRDD<String> in)
{
//Converting string into row
JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));
StructType schema = new StructType(new StructField[] {
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
//Creating dataframe
DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);
//Tokenizing sentence into words
RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
.setMinTokenLength(4) //note: this drops tokens shorter than 4 characters; lower it if short words like "is" should be kept
.setPattern("\\s+");
DataFrame rtDF = rt.transform(dataFrame);
//Creating bigrams
NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2); //Here setN(2) means bigram
DataFrame bigramDF = bigram.transform(rtDF);
System.out.println("Result :: "+bigramDF.select("bigram").collectAsList());
}

.sliding is indeed the way to go for n-grams. The thing is, sliding works on plain Scala collections and iterators, so just split each sentence and slide over the resulting array. I am adding Scala code:
val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))
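For completeness, here is a minimal Java 8 sketch without .sliding: split each sentence and pair each word with its right neighbour inside a flatMap. It assumes the Spark 1.x flatMap signature (the function returns an Iterable); on Spark 2.x return bigrams.iterator() instead.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public static JavaRDD<String> biGram(JavaRDD<String> in) {
    return in.map(s -> s.toLowerCase())
             .flatMap(s -> {
                 String[] words = s.split("\\s+");
                 List<String> bigrams = new ArrayList<>();
                 for (int i = 0; i < words.length - 1; i++) {
                     //pair each word with its right neighbour: "this is", "is my", "my test"
                     bigrams.add(words[i] + " " + words[i + 1]);
                 }
                 return bigrams; //Spark 2.x: return bigrams.iterator()
             });
}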

Related

Iterating over multiple CSVs and joining with Spark SQL

I have several CSV files with the same headers and the same IDs. I am attempting to iterate and merge all files up to the one indexed '31'. In my while loop, I'm trying to initialise the merged dataset so it can be used for the remainder of the loop. On the last line, I'm told that the 'local variable merged may not have been initialised'. How should I be doing this instead?
SparkSession spark = SparkSession.builder().appName("testSql")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:tmp")
.getOrCreate();
Dataset<Row> first = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<Row> second = spark.read().option("header", true).csv("mypath/02.csv");
IntStream.range(3, 31)
.forEach(i -> {
while(i==3) {
Dataset<Row> merged = first.join(second, first.col("customer_id").equalTo(second.col("customer_id")));
}
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
Dataset<Row> merged = merged.join(next, merged.col("customer_id").equalTo(next.col("customer_id")));
EDITED based on feedback in the comments.
Following your pattern, something like this would work:
Dataset<Row> ds1 = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<?>[] result = {ds1};
IntStream.range(2, 31)
.forEach(i -> {
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
result[0] = result[0].join(next, "customer_id");
});
We're wrapping the Dataset in a one-element array to work around the restriction that variables captured by a lambda expression must be effectively final.
The more straightforward way, for this particular case, is to simply use a for-loop rather than stream.forEach:
Dataset<Row> result = spark.read().option("header", true).csv("mypath/01.csv");
for (int i = 2; i < 31; i++) {
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
result = result.join(next, "customer_id");
}
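Separately, note that "mypath/"+i+".csv" produces mypath/2.csv, which will not match files named with a leading zero such as 02.csv. If your files really are zero-padded (an assumption based on the 01.csv/02.csv paths above), pad the index, for example:
Dataset<Row> result = spark.read().option("header", true).csv("mypath/01.csv");
for (int i = 2; i < 31; i++) { //adjust the upper bound to your actual file count
    //zero-pad the index so it matches file names like 02.csv ... 30.csv
    String path = String.format("mypath/%02d.csv", i);
    Dataset<Row> next = spark.read().option("header", true).csv(path);
    result = result.join(next, "customer_id");
}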

Spark Sorting with JavaRDD<String>

Let's say I have a file with lines of strings and I import it into a JavaRDD. If I want to sort the strings and export them as a new file, how should I do it? The code below is my attempt, and it is not working.
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
JavaRDD<String> lines = sparkContext.textFile(args[0]);
JavaRDD<String> sorted = lines.sortBy(i->i, true,1);
sorted.saveAsTextFile(args[1]);
What I mean by "not working" is that the output file is not sorted. I think the issue is with my "i -> i" code; I am not sure how to make it sort with the compare method of strings, since each "i" will be a string (and I'm also not sure how to make it compare different "i"s).
EDIT
I have modified the code as per the comments; I suspect the file was being read as one giant string.
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
long start = System.currentTimeMillis();
List<String> array = buildArrayList(args[0]);
JavaRDD<String> lines = sparkContext.parallelize(array);
JavaRDD<String> sorted = lines.sortBy(i->i, true, 1);
sorted.saveAsTextFile(args[1]);
Still not sorting it :(
I did a little research. Your code is correct. Here are the samples that I tested:
Spark initialization
SparkSession spark = SparkSession.builder().appName("test")
.config("spark.debug.maxToStringFields", 10000)
.config("spark.sql.tungsten.enabled", true)
.enableHiveSupport().getOrCreate();
JavaSparkContext jSpark = new JavaSparkContext(spark.sparkContext());
Example for RDD
//RDD
JavaRDD<String> rdd = jSpark.parallelize(Arrays.asList("z", "b", "c", "a"));
JavaRDD<String> sorted = rdd.sortBy(i -> i, true, 1);
List<String> result = sorted.collect();
result.stream().forEach(i -> System.out.println(i));
The output is
a
b
c
z
You can also use the Dataset API
//Dataset
Dataset<String> stringDataset = spark.createDataset(Arrays.asList("z", "b", "c", "a"), Encoders.STRING());
Dataset<String> sortedDataset = stringDataset.sort(stringDataset.col(stringDataset.columns()[0]).desc()); //sort is ascending by default; .desc() makes it descending
result = sortedDataset.collectAsList();
result.stream().forEach(i -> System.out.println(i));
The output is
z
c
b
a
I think your problem is that your text file has a specific line separator. If so, you can use the flatMap function to split your giant text string into line strings.
Here is an example with a Dataset:
//flatMap example
Dataset<String> singleLineDS= spark.createDataset(Arrays.asList("z:%b:%c:%a"), Encoders.STRING());
Dataset<String> splitedDS = singleLineDS.flatMap(i->Arrays.asList(i.split(":%")).iterator(),Encoders.STRING());
Dataset<String> sortedSplitedDs = splitedDS.sort(splitedDS.col(splitedDS.columns()[0]).desc());
result = sortedSplitedDs.collectAsList();
result.stream().forEach(i -> System.out.println(i));
So you should find which separator your text file uses and adapt the code above for your task.
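If you would rather stay with the JavaRDD API from your original code, the same idea looks roughly like this (a sketch assuming Spark 2.x, where flatMap returns an iterator; ":%" is only a stand-in for whatever separator your file actually uses):
//JavaRDD variant of the flatMap-then-sort example above
JavaRDD<String> giant = sparkContext.parallelize(Arrays.asList("z:%b:%c:%a"));
JavaRDD<String> split = giant.flatMap(s -> Arrays.asList(s.split(":%")).iterator());
JavaRDD<String> sortedLines = split.sortBy(s -> s, true, 1);
sortedLines.collect().forEach(System.out::println); //prints a, b, c, z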

Duplicate column when I create an IndexedRowMatrix using Spark

I need to calculate the pairwise similarity between several documents. For that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
String s = r.getAs(0);
int index = new Integer(s.replace(s.substring(0,24),"").replace(s.substring(s.indexOf(".txt")),""));
SparseVector sparse = (SparseVector) r.getAs(5);
org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (as if the first one were duplicated), even though my dataset contains 3 documents.
3**
3---
1106---4
Can you help me find the cause of this duplication?
Most likely there is no duplication at all, and your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. Therefore numRows is max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // a single row, at index 5
)).numRows
// res4: Long = 6
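If the filename-derived indices are the culprit, one possible fix (a sketch, not tested against your data) is to ignore them and let zipWithIndex assign consecutive, zero-based row indices; the column position 5 and the SparseVector cast are taken from your own map function:
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().zipWithIndex().map(t -> {
    Row r = t._1();
    long index = t._2(); //0, 1, 2, ... in document order
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd()); //numRows() now equals the number of documents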

Getting original text after using the Stanford NLP parser

Hello people of the internet,
We're having the following problem with the Stanford NLP API:
We have a String that we want to transform into a list of sentences.
First, we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text because of the tokenization. Now we are trying to use listToOriginalTextString in the following way:
private static List<String> getSentences(String text) {
Reader reader = new StringReader(text);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
String sentenceString = Sentence.listToOriginalTextString(sentence);
sentenceList.add(sentenceString.toString());
}
return sentenceList;
}
This does not work. Apparently we have to set an attribute "invertible" to true, but we don't know how. How can we do this?
In general, how do you use listToOriginalTextString properly? What preparation does it need?
sincerely,
Khayet
If I understand correctly, you want to get the mapping of tokens to the original input text after tokenization. You can do it like this:
//split via PTBTokenizer (PTBLexer)
List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();
//do the processing using stanford sentence splitter (WordToSentenceProcessor)
WordToSentenceProcessor processor = new WordToSentenceProcessor();
List<List<CoreLabel>> splitSentences = processor.process(tokens);
//for each sentence
for (List<CoreLabel> s : splitSentences) {
//for each word
for (CoreLabel token : s) {
//here you can get the token value and position like;
//token.value(), token.beginPosition(), token.endPosition()
}
}
String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);
This gives you the original text. An example from the JSONOutputter.java file:
l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));

LDA in Spark 1.3.1. Converting raw data into Term Document Matrix?

I'm trying out LDA with Spark 1.3.1 in Java and got this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:
put weight find difficult pull ups push ups now
blindness diseases everything eyes work perfectly except ability take light use light form images
role model kid
Dear recall saddest memory childhood
This is the code:
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;
public class JavaLDA {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("LDA Example");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data
String path = "/tutorial/input/askreddit20150801.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
new Function<String, Vector>() {
public Vector call(String s) {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++)
values[i] = Double.parseDouble(sarray[i]);
return Vectors.dense(values);
}
}
);
// Index documents with unique IDs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
return doc_id.swap();
}
}
));
corpus.cache();
// Cluster the documents into three topics using LDA
LDAModel ldaModel = new LDA().setK(100).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 100; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
ldaModel.save(sc.sc(), "myLDAModel");
}
}
Does anyone know why this happened? I'm just trying out Spark's LDA for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to convert each word of your text file into a Double?
That is what causes your error:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
Your code expects the input file to be a bunch of lines of whitespace-separated text that looks like numbers. Assuming your text is words instead:
Get a list of every word that appears in your corpus:
JavaRDD<String> words =
data.flatMap((FlatMapFunction<String, String>) s -> {
s = s.replaceAll("[^a-zA-Z ]", "");
s = s.toLowerCase();
return Arrays.asList(s.split(" "));
});
Make a map giving each word an integer associated with it:
Map<String, Long> vocab = words.distinct().zipWithIndex().collectAsMap(); //distinct() keeps the indices in the range 0..vocab.size()-1
Then instead of your parsedData doing what it's doing up there, make it look up each word, find the associated number, go to that location in an array, and add 1 to the count for that word.
JavaRDD<Vector> tokens = data.map(
(Function<String, Vector>) s -> {
//apply the same cleaning that was used to build the vocabulary,
//otherwise vocab.get(val) can return null for unseen raw tokens
s = s.replaceAll("[^a-zA-Z ]", "").toLowerCase();
String[] vals = s.split("\\s");
double[] idx = new double[vocab.size() + 1];
for (String val : vals) {
idx[vocab.get(val).intValue()] += 1.0;
}
return Vectors.dense(idx);
}
);
This results in an RDD of vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that vocab word appeared in the line.
I modified this code slightly from what I'm currently using and didn't test it, so there could be errors in it. Good luck!
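From there, tokens can take the place of parsedData in the original program; roughly (untested, following the Spark 1.3.1 style of the question):
//Index the count vectors with document IDs and run LDA, as in the original code
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(
    tokens.zipWithIndex().map((Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>) t -> t.swap()));
corpus.cache();
LDAModel ldaModel = new LDA().setK(100).run(corpus);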
