I have a dataset with the following schema:
{"user":"A10T7BS07XCWQ1","recommendations":[{"iID":34142,"rating":22.998692},{"iID":24963,"rating":22.678337},{"iID":47761,"rating":22.31455},{"iID":28694,"rating":21.269365},{"iID":36890,"rating":21.143366},{"iID":48522,"rating":20.678747},{"iID":20032,"rating":20.330639},{"iID":57315,"rating":20.099955},{"iID":18148,"rating":20.07064},{"iID":7321,"rating":19.754635}]}
I am trying to flatMap my dataset like this:
StructType struc = new StructType();
struc.add("user", DataTypes.StringType, false);
struc.add("item", DataTypes.IntegerType, false);
struc.add("relevance", DataTypes.DoubleType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);
Dataset<Row> recomenderResult = userRecs.flatMap((FlatMapFunction<Row, Row>) row -> {
    String user = row.getString(0);
    List<Row> recsWithIntItemID = row.getList(1);
    Integer item;
    Double relevance;
    List<Row> rows = new ArrayList<>();
    for (Row rec : recsWithIntItemID) {
        item = rec.getInt(0);
        relevance = (double) rec.getFloat(1);
        System.out.println(user + " : " + item + " : " + relevance);
        Row newRow = RowFactory.create(user, item, relevance);
        rows.add(newRow);
    }
    System.out.println("++++++++++++++++++++++++++++++++");
    return rows.iterator();
}, encoder);
recomenderResult.write().json("temp2");
recomenderResult.show();
The system output is the following:
...
A1049B0RS95K7B : 24708 : 17.146669387817383
A1049B0RS95K7B : 2825 : 16.809375762939453
A1049B0RS95K7B : 36503 : 16.758258819580078
++++++++++++++++++++++++++++++++
...
But the Row instances are empty; the show() method gives this output:
++
||
++
||
||
I have no idea why my result dataset is empty. I have already read all the topics on this site relevant to my problem and searched Google, but I have not found a solution. Could somebody help me?
It was a very silly bug :( Simple answer: the mistake was here:
StructType struc = new StructType();
struc = struc.add("user", DataTypes.StringType, false);
struc = struc.add("item", DataTypes.IntegerType, false);
struc = struc.add("relevance", DataTypes.DoubleType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);
It cost me two days and one night...
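For anyone who hits the same thing: StructType.add() returns a new StructType instead of mutating the one it is called on, which is why the unassigned calls above produced an empty schema. A variant that sidesteps the reassignment pitfall (a minimal sketch, assuming the same field names and imports already used above) is to build the schema in one call with DataTypes.createStructType:
StructType struc = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("user", DataTypes.StringType, false),
        DataTypes.createStructField("item", DataTypes.IntegerType, false),
        DataTypes.createStructField("relevance", DataTypes.DoubleType, false)));
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);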
Related
JavaRDD<Row> rawText = raw
        .filter(
                (FilterFunction<Row>) f -> {
                    if (f.getAs("content") == null)
                        return false;
                    return true;
                }
        ).toJavaRDD().mapPartitions(
                partition -> {
                    List<Row> rows = new ArrayList<>();
                    while (partition.hasNext()) {
                        Row row = partition.next();
                        String content = row.getAs("content");
                        String category = row.getAs("category");
                        if (content != null) {
                            List<String> words = Arrays.asList("a", "b");
                            rows.add(RowFactory.create(category, words));
                        }
                    }
                    return rows.iterator();
                }
        );
List<StructField> structFields = new ArrayList<>();
structFields.add(DataTypes.createStructField("category", DataTypes.StringType, true));
structFields.add(DataTypes.createStructField("content", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty()));
StructType structType = DataTypes.createStructType(structFields);
Dataset<Row> dataset = sparkSession.createDataFrame(rawText, structType);
dataset.show();
I'm stuck converting dataString to a Dataset with an ArrayList of Strings. Can you help me with this problem? My error is:
Two non-abstract methods "public scala.collection.Iterator scala.collection.IterableOnceOps.toIterator()" have the same parameter types, declaring type and return type
Using Spark with Java, I have created a DataFrame from a comma-delimited source file. If the last column in the source file contains a blank value, an ArrayIndexOutOfBoundsException is thrown. Below are sample data and code. Is there any way I can handle this error? There is a good chance of getting blank values in the last column. In the sample data below, the 4th row causes the issue.
Sample Data:
1,viv,chn,34
2,man,gnt,56
3,anu,pun,22
4,raj,bang,*
Code:
JavaRDD<String> dataQualityRDD = spark.sparkContext().textFile(inputFile, 1).toJavaRDD();
String schemaString = schemaColumns;
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
    StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
    fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // String[] attributes = record.split(attributes[0], attributes[1].trim());
    Object[] items = record.split(fileSplit);
    // return RowFactory.create(attributes[0], attributes[1].trim());
    return RowFactory.create(items);
});
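One way to avoid the out-of-bounds access with this split-based approach (a sketch; fileSplit is the comma delimiter from the code above) is to pass a negative limit to split() so that trailing empty fields are kept:
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // split(regex, -1) keeps trailing empty strings, so "4,raj,bang," yields
    // ["4", "raj", "bang", ""] instead of a 3-element array.
    Object[] items = record.split(fileSplit, -1);
    return RowFactory.create(items);
});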
I used Spark 2.0 and was able to read the CSV without any exception:
SparkSession spark = SparkSession.builder().config("spark.master", "local").getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<Row> csvRows = spark.read().csv("resources/csvwithnulls.csv").toJavaRDD();
StructType schema = DataTypes.createStructType(new StructField[] {
        new StructField("id", DataTypes.StringType, false, Metadata.empty()),
        new StructField("fname", DataTypes.StringType, false, Metadata.empty()),
        new StructField("lname", DataTypes.StringType, false, Metadata.empty()),
        new StructField("age", DataTypes.StringType, false, Metadata.empty()) });
Dataset<Row> newCsvRows = spark.createDataFrame(csvRows, schema);
newCsvRows.show();
I used exactly the rows you have and it worked fine.
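A slightly shorter variant (a sketch; it reuses the schema and file path from the snippet above) is to hand the schema straight to the CSV reader, which fills the missing trailing values with null:
// Pass the user-defined schema directly instead of wrapping the RDD afterwards.
Dataset<Row> newCsvRows = spark.read()
        .schema(schema)
        .csv("resources/csvwithnulls.csv");
newCsvRows.show();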
I need to calculate the pairwise similarity between several documents. To do that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
    return RowFactory.create(t._1, t._2.replaceAll("[^\\w\\s]+", "").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
    String s = r.getAs(0);
    int index = new Integer(s.replace(s.substring(0, 24), "").replace(s.substring(s.indexOf(".txt")), ""));
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (one of them apparently duplicated), even though my dataset contains only 3 documents.
3**
3--
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all; your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. Therefore numRows is max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // Only one non-empty row
)).numRows
// res4: Long = 6
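For completeness, in Java the same consecutive, zero-based indices can be obtained from zipWithIndex instead of parsing them out of the file name (a sketch against rescaledData from the question; column position 5 is assumed to still hold the features vector):
// zipWithIndex assigns consecutive indices 0..n-1 in RDD order, so numRows
// ends up equal to the number of documents.
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().zipWithIndex().map(t -> {
    Row r = t._1();
    long index = t._2();
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());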
I had done client-side pagination using DataTables, but now the requirement has changed because of the large number of records (around 100K).
I need to implement server-side pagination.
For that I used the code below.
GSP
$('#data-grid-table').dataTable( {
"processing" : true,
"sAjaxSource": dataUrl,
"serverSide" : true,
"sPaginationType": "full",
"iDisplayLength": 25,
"aLengthMenu": [[25, 50, 100, -1], [25, 50, 100, "All"]],
"scrollX": true,
"bFilter": false,
"columnDefs": [ {
searchable: false,
"orderable": false,
className: "view-cell",
targets: 0
}],
"aaSorting": [[1,'asc']],
"fnDrawCallback": function( oSettings ) {
var callBackFlag = $("#hidden-view-flag").val()
if(callBackFlag=="1"){
$("#hidden-view-flag").val("2")
}
if(callBackFlag == "2"){
$("#hidden-view-flag").val("3")
}
if(hideViewColumn){
$(".view-cell").hide();
}
$('.datasetTable, tbody').find('tr').each(function(){
$(this).find('th:nth-child(1)').removeClass("sorting_asc");
});
}
});
Controller
dbObjArray = new BasicDBObject[2]
dbObjArray[0]= cruxLevel
dbObjArray[1] = project
List<DBObject> pipeline = Arrays.asList(dbObjArray)
if (!datasetObject?.isFlat && jsonFor != 'collection-grid') {
output= dataSetCollection.aggregate(pipeline)
}else{
//def skipRecords = params.iDisplayStart
//def limitRecords = params.iDisplayLength
//println 'params.iDisplayStart' + params.iDisplayStart
//println 'params.iDisplayLength' + params.iDisplayLength
println 'else-====================='
DBObject limit = new BasicDBObject('$limit':10);
DBObject skip = new BasicDBObject('$skip':5);
isFlatOutput = true;
dbObjArray = new BasicDBObject[3]
dbObjArray[0]= project
dbObjArray[1]= skip
dbObjArray[2]= limit
List<DBObject> pipeline1 = Arrays.asList(dbObjArray)
AggregationOptions aggregationOptions = AggregationOptions.builder()
.batchSize(100)
.outputMode(AggregationOptions.OutputMode.CURSOR)
.allowDiskUse(true)
.build();
SimpleDateFormat sdfDate = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
Date now = new Date();
println 'Start time to fetch -------------------------------------' + sdfDate.format(now)
output = dataSetCollection.aggregate(pipeline1,aggregationOptions)
Date now1 = new Date();
println 'End time to fetch-------------------------------' + sdfDate.format(now1)
}
if(isFlatOutput){
SimpleDateFormat sdfDate = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
Date now2 = new Date();
println 'Start time to retrieve-------------------------------' + sdfDate.format(now2)
while(output.hasNext()) {
dataList.add(output.next());
}
Date now3 = new Date();
println 'End time to retrieve-------------------------------' + sdfDate.format(now3)
}
I was not sure how to pass the limit and skip values, so I have hard-coded them.
Actual result: 10 results are displayed, but the Next button is disabled.
Expected result: 10 results should be displayed, and the next 10 records should be fetched on clicking the Next button.
Kindly tell me where I am going wrong.
def skipRecords
def limitRecords
if (params.iDisplayStart == null) {
    skipRecords = 0;
} else {
    skipRecords = params.int('iDisplayStart');
}
if (params.iDisplayLength == null) {
    limitRecords = 25;
} else {
    limitRecords = params.int('iDisplayLength');
}
def dbObjArrayTotal = new BasicDBObject[1]
dbObjArrayTotal[0]= project
List<DBObject> pipelineTotal = Arrays.asList(dbObjArrayTotal)
AggregationOptions aggregationOptions = AggregationOptions.builder()
.batchSize(100)
.outputMode(AggregationOptions.OutputMode.CURSOR)
.allowDiskUse(true)
.build();
def totalCount = dataSetCollection.aggregate(pipelineTotal,aggregationOptions)
totalCount = totalCount.size()
if(limitRecords == -1){
limitRecords = totalCount
}
DBObject limit = new BasicDBObject('$limit':limitRecords);
DBObject skip = new BasicDBObject('$skip':skipRecords);
dbObjArray = new BasicDBObject[3]
dbObjArray[0] = project
dbObjArray[1]= skip
dbObjArray[2]= limit
List<DBObject> flatPipeline = Arrays.asList(dbObjArray)
output = dataSetCollection.aggregate(flatPipeline,aggregationOptions)
def serverData = [
"iTotalRecords" : totalCount,
"iTotalDisplayRecords" : totalCount,
"aaData": yourResult]
return serverData;
And the GSP above is correct; use it as is.
I am trying to add a column to my DataFrame that serves as a unique ROW_ID for each row. So, it would be something like this:
1, user1
2, user2
3, user3
...
I could have done this easily using a HashMap with an integer that I increment, but I can't do that in Spark using the map function on the DataFrame, since I can't have an integer increasing inside the map function. Is there any way I can do this by appending a column to my existing DataFrame, or any other way?
PS: I know there is a very similar question, but that one is for Scala and not Java.
Thanks in advance
I did it by adding a new column containing UUIDs to the DataFrame:
StructType objStructType = inputDataFrame.schema();
StructField[] arrStructField = objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List<StructField> listFields = Arrays.asList(arrStructField);
StructField a = DataTypes.createStructField(leftCol, DataTypes.StringType, true);
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);
final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
    private static final long serialVersionUID = 3280804931696581264L;

    public Row call(Row tblRow) throws Exception {
        Object[] newRow = new Object[size + 1];
        int rowSize = tblRow.length();
        for (int itr = 0; itr < rowSize; itr++) {
            if (tblRow.apply(itr) != null) {
                newRow[itr] = tblRow.apply(itr);
            }
        }
        newRow[size] = UUID.randomUUID().toString();
        return RowFactory.create(newRow);
    }
});
inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
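If the IDs only need to be unique rather than consecutive, a simpler alternative in Spark 2.x (a sketch; inputDataFrame is assumed to be the DataFrame from the snippet above, and the ROW_ID column name is arbitrary) is the built-in monotonically_increasing_id function:
import static org.apache.spark.sql.functions.monotonically_increasing_id;

// Adds a 64-bit id that is unique and increasing per row, but not consecutive.
Dataset<Row> withId = inputDataFrame.withColumn("ROW_ID", monotonically_increasing_id());
withId.show();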
OK, I found the solution to this problem and I'm posting it in case someone else has the same problem:
The way to do this is zipWithIndex() on the JavaRDD:
df.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, Long> v1) throws Exception {
        return RowFactory.create(v1._1().getString(0), v1._2());
    }
});
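To get back to a DataFrame, the resulting JavaRDD<Row> can be wrapped with a schema that includes the extra id column (a sketch; spark, df, and the column names are assumptions, and only the first string column of df is kept, exactly as in the snippet above):
// Schema for the two columns produced by the map above.
StructType withIdSchema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("user", DataTypes.StringType, false),
        DataTypes.createStructField("ROW_ID", DataTypes.LongType, false)));

JavaRDD<Row> withIdRdd = df.javaRDD().zipWithIndex().map(
        (Tuple2<Row, Long> v1) -> RowFactory.create(v1._1().getString(0), v1._2()));

Dataset<Row> withId = spark.createDataFrame(withIdRdd, withIdSchema);
withId.show();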