I need to calculate the pairwise similarity between several documents. For that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
    // Derive a numeric index from the file path by stripping the first 24 characters
    // and everything from ".txt" onward.
    String s = r.getAs(0);
    int index = Integer.parseInt(s.replace(s.substring(0, 24), "").replace(s.substring(s.indexOf(".txt")), ""));
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (one of them apparently duplicated) even though my dataset contains only 3 documents.
3**
3---
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all and your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. Therefore numRows is max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // Only one non-empty row
)).numRows
// res4: Long = 6
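If the goal is simply one matrix row per document, an easy fix is to let Spark assign consecutive, zero-based indices instead of parsing them out of the file name. A minimal Java sketch, assuming the rescaledData dataset from the question (the column position 5 is taken from the original code; names are illustrative):
JavaRDD<IndexedRow> indexedRows = rescaledData.toJavaRDD()
        .zipWithIndex()
        .map((Tuple2<Row, Long> t) -> {
            Row r = t._1();
            long index = t._2();   // consecutive 0, 1, 2, ...
            SparseVector sparse = (SparseVector) r.getAs(5);
            org.apache.spark.mllib.linalg.Vector vec =
                    org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
            return new IndexedRow(index, vec);
        });

IndexedRowMatrix mat = new IndexedRowMatrix(indexedRows.rdd());
// numRows() now equals the number of documents, because the indices are 0..n-1.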
I have a dataset with the following schema:
{"user":"A10T7BS07XCWQ1","recommendations":[{"iID":34142,"rating":22.998692},{"iID":24963,"rating":22.678337},{"iID":47761,"rating":22.31455},{"iID":28694,"rating":21.269365},{"iID":36890,"rating":21.143366},{"iID":48522,"rating":20.678747},{"iID":20032,"rating":20.330639},{"iID":57315,"rating":20.099955},{"iID":18148,"rating":20.07064},{"iID":7321,"rating":19.754635}]}
I try to flatMap my dataset in the following way:
StructType struc = new StructType();
struc.add("user", DataTypes.StringType, false);
struc.add("item", DataTypes.IntegerType, false);
struc.add("relevance", DataTypes.DoubleType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);
Dataset<Row> recomenderResult = userRecs.flatMap((FlatMapFunction<Row, Row>) row -> {
String user = row.getString(0);
List<Row> recsWithIntItemID = row.getList(1);
Integer item;
Double relevance;
List<Row> rows = new ArrayList<>();
for (Row rec : recsWithIntItemID) {
item = rec.getInt(0);
relevance = (double) rec.getFloat(1);
System.out.println(user + " : " + item + " : " + relevance);
Row newRow = RowFactory.create(user, item, relevance);
rows.add(newRow);
}
System.out.println("++++++++++++++++++++++++++++++++");
return rows.iterator();
}, encoder);
recomenderResult.write().json("temp2");
recomenderResult.show();
The system output is the following:
...
A1049B0RS95K7B : 24708 : 17.146669387817383
A1049B0RS95K7B : 2825 : 16.809375762939453
A1049B0RS95K7B : 36503 : 16.758258819580078
++++++++++++++++++++++++++++++++
...
But the resulting rows are empty; the show() method gives this output:
++
||
++
||
||
I have no idea why my result dataset is empty. I have already looked through all the topics on this site relevant to my problem and used Google, but I have not found a solution. Could somebody help me?
It was a very silly bug :( Simple answer: the mistake was here. StructType.add returns a new StructType instead of modifying the one it is called on, so the result has to be reassigned:
StructType struc = new StructType();
struc = struc.add("user", DataTypes.StringType, false);
struc = struc.add("item", DataTypes.IntegerType, false);
struc = struc.add("relevance", DataTypes.DoubleType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);
It cost me two days and one night...
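For reference, since StructType is immutable, the same schema can also be built in a single expression so there is no intermediate result to accidentally discard. A small sketch using the standard DataTypes helpers (java.util.Arrays import assumed):
StructType struc = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("user", DataTypes.StringType, false),
        DataTypes.createStructField("item", DataTypes.IntegerType, false),
        DataTypes.createStructField("relevance", DataTypes.DoubleType, false)));
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);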
Using Spark with Java, I have created a DataFrame on a comma-delimited source file. If the last column in the source file contains a blank value, it throws an ArrayIndexOutOfBoundsException. Below are sample data and code. Is there any way I can handle this error? There is a good chance of getting blank values in the last column. In the sample data below, the 4th row causes the issue.
Sample Data:
1,viv,chn,34
2,man,gnt,56
3,anu,pun,22
4,raj,bang,*
Code:
JavaRDD<String> dataQualityRDD = spark.sparkContext().textFile(inputFile, 1).toJavaRDD();
String schemaString = schemaColumns;
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    Object[] items = record.split(fileSplit);
    return RowFactory.create(items);
});
I used Spark 2.0 and was able to read the CSV without any exception:
SparkSession spark = SparkSession.builder().config("spark.master", "local").getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<Row> csvRows = spark.read().csv("resources/csvwithnulls.csv").toJavaRDD();
StructType schema = DataTypes.createStructType(
new StructField[] { new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("fname", DataTypes.StringType, false, Metadata.empty()),
new StructField("lname", DataTypes.StringType, false, Metadata.empty()),
new StructField("age", DataTypes.StringType, false, Metadata.empty()) });
Dataset<Row> newCsvRows = spark.createDataFrame(csvRows, schema);
newCsvRows.show();
I used exactly the rows you have and it worked fine.
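If you would rather keep the RDD-based parsing from the question, note that String.split(regex) drops trailing empty strings by default, which is most likely why a blank last column produces a row with fewer fields than the schema expects. Passing a negative limit keeps the trailing empties; a sketch, assuming fileSplit is the comma delimiter from the original code:
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // The negative limit keeps trailing empty strings, so every record
    // yields the same number of fields even when the last column is blank.
    Object[] items = record.split(fileSplit, -1);
    return RowFactory.create(items);
});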
We need to merge two datasets that have different column names; there are no common columns across the datasets.
We have tried a couple of approaches, but neither of them yields the expected result. Kindly let us know how to combine the two datasets using Apache Spark with Java.
Input data set 1
"405-048011-62815", "CRC Industries",
"630-0746","Dixon value",
"4444-444","3M INdustries",
"555-55","Dixon coupling valve"
Input dataset 2
"222-2222-5555", "Tata",
"7777-88886","WestSide",
"22222-22224","Reliance",
"33333-3333","V industries"
Expected output is
+------------------+---------------------+---------------+----------+
|            label1|            sentence1|         label2| sentence2|
+------------------+---------------------+---------------+----------+
|  405-048011-62815|       CRC Industries|  222-2222-5555|      Tata|
|          630-0746|          Dixon value|     7777-88886|  WestSide|
+------------------+---------------------+---------------+----------+
List<Row> data = Arrays.asList(
RowFactory.create("405-048011-62815", "CRC Industries"),
RowFactory.create("630-0746","Dixon value"),
RowFactory.create("4444-444","3M INdustries"),
RowFactory.create("555-55","Dixon coupling valve"));
StructType schema = new StructType(new StructField[] {new StructField("label1", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence1", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<String> listStrings = new ArrayList<String>();
listStrings.add("405-048011-62815");
listStrings.add("630-0746");
Dataset<Row> matchFound1=sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();
listStrings.clear();
listStrings.add("222-2222-5555");
listStrings.add("7777-88886");
List<Row> data2 = Arrays.asList(
RowFactory.create("222-2222-5555", "Tata"),
RowFactory.create("7777-88886","WestSide"),
RowFactory.create("22222-22224","Reliance"),
RowFactory.create("33333-3333","V industries"));
StructType schema2 = new StructType(new StructField[] {new StructField("label2", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
Dataset<Row> matchFound2=sentenceDataFrame2.filter(col("label2").isin(listStrings.stream().toArray(String[]::new)));
matchFound2.show();
//Approach 1
Dataset<Row> matchFound3=matchFound1.select(matchFound1.col("label1"),matchFound1.col("sentence1"),matchFound2.col("label2"),
matchFound2.col("sentence2"));
System.out.println("After concat");
matchFound3.show();
//Approach 2
Dataset<Row> matchFound4=matchFound1.filter(concat((col("label1")),matchFound1.col("sentence1"),matchFound2.col("label2"),
matchFound2.col("sentence2")));
System.out.println("After concat 2");
matchFound4.show();
The errors for each of the approaches are as follows.
Approach 1 error
----------
org.apache.spark.sql.AnalysisException: resolved attribute(s) label2#10,sentence2#11 missing from label1#0,sentence1#1 in operator !Project [label1#0, sentence1#1, label2#10, sentence2#11];;
!Project [label1#0, sentence1#1, label2#10, sentence2#11]
+- Filter label1#0 IN (405-048011-62815,630-0746)
+- LocalRelation [label1#0, sentence1#1]
----------
Approach 2 error
org.apache.spark.sql.AnalysisException: filter expression 'concat(`label1`, `sentence1`, `label2`, `sentence2`)' of type string is not a boolean.;;
!Filter concat(label1#0, sentence1#1, label2#10, sentence2#11)
+- Filter label1#0 IN (405-048011-62815,630-0746)
+- LocalRelation [label1#0, sentence1#1]
Hope this works for you.
DF
val pre: Array[String] = Array("CRC Industries", "Dixon value" ,"3M INdustries" ,"Dixon coupling valve")
val rea: Array[String] = Array("405048011-62815", "630-0746", "4444-444", "555-55")
val df1 = sc.parallelize( rea zip pre).toDF("label1","sentence1")
val preasons2: Array[String] = Array("Tata", "WestSide","Reliance", "V industries")
val reasonsI2: Array[String] = Array( "222-2222-5555", "7777-88886", "22222-22224", "33333-3333")
val df2 = sc.parallelize( reasonsI2 zip preasons2 ).toDF("label2","sentence2")
String Indexer
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("label1")
.setOutputCol("label1Index")
val indexed = indexer.fit(df1).transform(df1)
indexed.show()
val indexer1 = new StringIndexer()
.setInputCol("label2")
.setOutputCol("label2Index")
val indexed1 = indexer1.fit(df2).transform(df2)
indexed1.show()
Join
val rnd_reslt12 = indexed.join(indexed1 , indexed.col("label1Index")===indexed1.col("label2Index")).drop(indexed.col("label1Index")).drop(indexed1.col("label2Index"))
rnd_reslt12.show()
+---------------+--------------------+-------------+------------+
| label1| sentence1| label2| sentence2|
+---------------+--------------------+-------------+------------+
| 630-0746| Dixon value|222-2222-5555| Tata|
| 4444-444| 3M INdustries| 22222-22224| Reliance|
| 555-55|Dixon coupling valve| 33333-3333|V industries|
|405048011-62815| CRC Industries| 7777-88886| WestSide|
+---------------+--------------------+-------------+------------+
I have done the same with StringIndexer in Java; this will work.
public class StringIndexer11 {
public static void main(String[] args) {
Dataset<Row> csvDataSet=null;
try{
System.setProperty("hadoop.home.dir", "D:\\AI matching\\winutil");
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
List<Row> data = Arrays.asList(
RowFactory.create("405-048011-62815", "CRC Industries"),
RowFactory.create("630-0746","Dixon value"),
RowFactory.create("4444-444","3M INdustries"),
RowFactory.create("555-55","Dixon coupling valve"));
StructType schema = new StructType(new StructField[] {new StructField("label1", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence1", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<String> listStrings = new ArrayList<String>();
listStrings.add("405-048011-62815");
listStrings.add("630-0746");
Dataset<Row> matchFound1=sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();
listStrings.clear();
listStrings.add("222-2222-5555");
listStrings.add("7777-88886");
StringIndexer indexer = new StringIndexer()
.setInputCol("label1")
.setOutputCol("label1Index");
Dataset<Row> Dataset1 = indexer.fit(matchFound1).transform(matchFound1);
//Dataset1.show();
List<Row> data2 = Arrays.asList(
RowFactory.create("222-2222-5555", "Tata"),
RowFactory.create("7777-88886","WestSide"),
RowFactory.create("22222-22224","Reliance"),
RowFactory.create("33333-3333","V industries"));
StructType schema2 = new StructType(new StructField[] {new StructField("label2", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
Dataset<Row> matchFound2=sentenceDataFrame2.filter(col("label2").isin(listStrings.stream().toArray(String[]::new)));
matchFound2.show();
StringIndexer indexer1 = new StringIndexer()
.setInputCol("label2")
.setOutputCol("label2Index");
Dataset<Row> Dataset2 = indexer1.fit(matchFound2).transform(matchFound2);
//Dataset2.show();
Dataset<Row> Finalresult = Dataset1.join(Dataset2 , Dataset1.col("label1Index").equalTo(Dataset2.col("label2Index"))).drop(Dataset1.col("label1Index")).drop(Dataset2.col("label2Index"));
Finalresult.show();
}catch(Exception e)
{
e.printStackTrace();
}
}
}
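One caveat with the StringIndexer approach is that it orders labels by frequency, so with all-distinct labels the pairing of rows is essentially arbitrary. If the intention is simply to glue the two datasets together row by row, here is a sketch of an alternative that pairs rows by position using zipWithIndex and a join on that index (it assumes both filtered datasets have the same number of rows; scala.Tuple2 and JavaPairRDD imports are assumed, and names are illustrative):
JavaPairRDD<Long, Row> left = matchFound1.toJavaRDD().zipWithIndex()
        .mapToPair(t -> new Tuple2<Long, Row>(t._2(), t._1()));
JavaPairRDD<Long, Row> right = matchFound2.toJavaRDD().zipWithIndex()
        .mapToPair(t -> new Tuple2<Long, Row>(t._2(), t._1()));

// Join on the positional index and rebuild a 4-column row from both sides.
JavaRDD<Row> merged = left.join(right).values()
        .map(p -> RowFactory.create(
                p._1().getString(0), p._1().getString(1),
                p._2().getString(0), p._2().getString(1)));

StructType mergedSchema = new StructType(new StructField[] {
        new StructField("label1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("sentence1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
        new StructField("sentence2", DataTypes.StringType, false, Metadata.empty()) });
spark.createDataFrame(merged, mergedSchema).show();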
I have tokenized the sentences into a word RDD, so now I need bigrams.
ex. This is my test => (This is), (is my), (my test)
I have searched around and found the .sliding operator for this purpose, but I'm not getting this option in my Eclipse (maybe it is available in a newer version of Spark).
So how can I make this happen without .sliding?
Adding code to get started:
public static void biGram (JavaRDD<String> in)
{
JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
//get bigram from sentence w/o sliding - CODE HERE
}
You can simply use the NGram transformation feature in Spark ML.
public static void biGram (JavaRDD<String> in)
{
//Converting each string into a Row
JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));
StructType schema = new StructType(new StructField[] {
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
//Creating dataframe
DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);
//Tokenizing sentence into words
RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
.setMinTokenLength(4)
.setPattern("\\s+");
DataFrame rtDF = rt.transform(dataFrame);
//Creating bigrams
NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2); //Here setN(2) means bigram
DataFrame bigramDF = bigram.transform(rtDF);
System.out.println("Result :: "+bigramDF.select("bigram").collectAsList());
}
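The code above uses the Spark 1.x DataFrame/SQLContext API; on Spark 2.x the same transformers work on Dataset<Row> created from a SparkSession. A minimal sketch of the equivalent calls (assuming spark, sentence, schema and rt as defined above):
Dataset<Row> dataFrame = spark.createDataFrame(sentence, schema);
Dataset<Row> rtDF = rt.transform(dataFrame);
Dataset<Row> bigramDF = new NGram()
        .setInputCol(rt.getOutputCol())
        .setOutputCol("bigram")
        .setN(2)                 // 2 means bigrams
        .transform(rtDF);
bigramDF.select("bigram").show(false);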
sliding is indeed the way to go with n-grams. The thing is, sliding works on iterators, so just split your sentence and slide over the resulting array. I am adding Scala code.
val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))
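For the Java side of the original question, the same windowing can be written by hand without .sliding. A small sketch (java.util.List and java.util.ArrayList imports assumed):
JavaRDD<List<String>> biGrams = in.map(s -> {
    String[] words = s.toLowerCase().split(" ");
    List<String> grams = new ArrayList<>();
    // Pair each word with its successor: words[i] + " " + words[i + 1]
    for (int i = 0; i + 1 < words.length; i++) {
        grams.add(words[i] + " " + words[i + 1]);
    }
    return grams;
});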
I am trying to add a column to my DataFrame that serves as a unique ROW_ID for each row. So, it would be something like this:
1, user1
2, user2
3, user3
...
I could have done this easily using a HashMap with an incrementing integer, but I can't do this in Spark using the map function on a DataFrame, since I can't have an integer increasing inside the map function. Is there any way I can do this by appending a column to my existing DataFrame, or any other way?
PS: I know there is a very similar post, but that one is for Scala and not Java.
Thanks in advance
I did it by adding a new column containing UUIDs to the DataFrame.
StructType objStructType = inputDataFrame.schema();
StructField []arrStructField=objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List <StructField> listFields = Arrays.asList(arrStructField);
// leftCol holds the name of the new UUID column that is appended to the schema
StructField a = DataTypes.createStructField(leftCol, DataTypes.StringType, true);
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);
final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
private static final long serialVersionUID = 3280804931696581264L;
public Row call(Row tblRow) throws Exception {
Object[] newRow = new Object[size+1];
int rowSize= tblRow.length();
for (int itr = 0; itr < rowSize; itr++)
{
if(tblRow.apply(itr)!=null)
{
newRow[itr] = tblRow.apply(itr);
}
}
newRow[size] = UUID.randomUUID().toString();
return RowFactory.create(newRow);
}
});
inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
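If the IDs only need to be unique rather than consecutive, Spark also ships a built-in function that avoids rebuilding the schema by hand. A sketch, assuming Spark 2.x and a static import of org.apache.spark.sql.functions.monotonically_increasing_id:
// Adds a unique (but not necessarily consecutive or ordered) 64-bit ID per row.
Dataset<Row> withId = inputDataFrame.withColumn("ROW_ID", monotonically_increasing_id());
withId.show();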
OK, I found the solution to this problem and I'm posting it in case someone has the same problem.
The way to do this is with zipWithIndex() from JavaRDD:
df.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, Long> v1) throws Exception {
        return RowFactory.create(v1._1().getString(0), v1._2());
    }
});
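For completeness, a fuller sketch that keeps every original column, appends the index, and rebuilds the DataFrame with an extended schema (assumes Spark 2.x and a SparkSession named spark; df is the original DataFrame):
StructType newSchema = df.schema().add("ROW_ID", DataTypes.LongType, false);

JavaRDD<Row> withIndex = df.javaRDD().zipWithIndex().map((Tuple2<Row, Long> t) -> {
    Row row = t._1();
    Object[] values = new Object[row.length() + 1];
    for (int i = 0; i < row.length(); i++) {
        values[i] = row.get(i);
    }
    values[row.length()] = t._2();   // the zero-based row index
    return RowFactory.create(values);
});

Dataset<Row> dfWithId = spark.createDataFrame(withIndex, newSchema);
dfWithId.show();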