I am using a 3-node standalone Spark cluster with 1 master and 2 workers, along with a 2-node Cassandra ring. Here is a sample of the code I am trying to run:
SparkConf conf = new SparkConf(true);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
String query = "Select address from " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY + " where ras_ = '01'";
CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
DataFrame resultsFrame = sqlContext.sql(query);
JavaRDD<Row> resultsRDD = resultsFrame.javaRDD();
JavaRDD<String> dataRDD = resultsRDD.map(row -> row.getString(0));
dataRDD.saveAsTextFile("output");
From a System.out.println, I know my query returns some data, but in the output directory under my project home the only files I am getting are _SUCCESS and ._SUCCESS.crc, and none of the part-* files. Is this expected behavior? If not, where am I going wrong?
Looks like we have the same situation here: since we both use more than one node, the output is not guaranteed to be saved on any particular node.
In my case, it was not saved on the master where I ran the script, but on one of the workers.
Hope this helps.
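To make sure all the part-* files end up in one place regardless of which worker runs each task, write to a shared filesystem such as HDFS, or collect small results back to the driver. A minimal sketch continuing from the dataRDD above; the HDFS URL is an assumption and needs to match your cluster:
// Sketch: save to a shared filesystem so every executor writes to the same location.
dataRDD.saveAsTextFile("hdfs://namenode:8020/user/spark/output"); // placeholder URL
// Or, if the result is small enough to fit on the driver, collect it and
// write a single local file from the driver process instead.
List<String> lines = dataRDD.collect();
java.nio.file.Files.write(java.nio.file.Paths.get("output.txt"), lines);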
I'm on a heatmap project for my university; we have to get some data (212 GB) from a txt file (coordinates, height), then put it in HBase to retrieve it on a web client with Express.
I practiced using a 144 MB file, and this is working:
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
for (String s : data.collect()) {
    String[] tmp = s.split(",");
    put.addImmutable(FAMILY,
        Bytes.toBytes(tmp[2]),
        Bytes.toBytes(tmp[0] + "," + tmp[1]));
}
table.put(put);
But now that I am using the 212 GB file, I get memory errors. I guess the collect method gathers all the data in driver memory, so 212 GB is too much.
So now I'm trying this :
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
data.foreach(line -> {
    String[] tmp = line.split(",");
    put.addImmutable(FAMILY,
        Bytes.toBytes(tmp[2]),
        Bytes.toBytes(tmp[0] + "," + tmp[1]));
});
table.put(put);
And now I'm getting "org.apache.spark.SparkException: Task not serializable". I searched for it and tried some fixes, without success, based on what I read here: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
Actually, I don't understand everything in that topic; I'm just a student. Maybe the answer to my problem is obvious, maybe not. Anyway, thanks in advance!
As a rule of thumb, serializing database connections (of any type) doesn't make sense. They are not designed to be serialized and deserialized, Spark or not.
Create a connection for each partition instead:
data.foreachPartition(partition -> {
    Connection co = ConnectionFactory.createConnection(getConf());
    ... // All required setup
    Table table = co.getTable(TableName.valueOf(TABLE_NAME));
    Put put = new Put(Bytes.toBytes("KEY"));
    while (partition.hasNext()) {
        String line = partition.next();
        String[] tmp = line.split(",");
        put.addImmutable(FAMILY,
            Bytes.toBytes(tmp[2]),
            Bytes.toBytes(tmp[0] + "," + tmp[1]));
    }
    table.put(put); // don't forget to actually write the accumulated Put
    ... // Clean up connections
});
I also recommend reading Design Patterns for using foreachRDD from the official Spark Streaming programming guide.
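For completeness, here is a fuller sketch of that per-partition pattern, with the connection and table closed via try-with-resources and one Put per input line, written in batches. It is not the exact code from above: the column qualifier "coords", the batch size, and the choice of tmp[2] as the row key are assumptions you would adapt.
data.foreachPartition(partition -> {
    // One connection per partition, cleaned up automatically at the end.
    try (Connection co = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = co.getTable(TableName.valueOf(TABLE_NAME))) {
        List<Put> puts = new ArrayList<>();
        while (partition.hasNext()) {
            String[] tmp = partition.next().split(",");
            Put put = new Put(Bytes.toBytes(tmp[2]));      // one Put per row (row key is an assumption)
            put.addColumn(FAMILY, Bytes.toBytes("coords"), // qualifier "coords" is an assumption
                    Bytes.toBytes(tmp[0] + "," + tmp[1]));
            puts.add(put);
            if (puts.size() >= 10000) {                    // flush in batches to limit memory use
                table.put(puts);
                puts.clear();
            }
        }
        if (!puts.isEmpty()) {
            table.put(puts);
        }
    }
});
Creating the Configuration inside the lambda (rather than calling an instance method like getConf()) also avoids capturing the enclosing object, which is another common cause of the "Task not serializable" error.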
If I have a row with ID = "my_row_id", how can I use "my_row_id" to get the region where this row lives, using HBase's Java API? (The code I am writing will run on each region server.)
I found examples like this:
for (JVMClusterUtil.RegionServerThread t : util.getMiniHBaseCluster().getLiveRegionServerThreads()) {
    for (HRegionInfo r : t.getRegionServer().getOnlineRegions()) {
        if (!Arrays.equals(r.getTableDesc().getName(), Bytes.toBytes("my_table"))) {
            continue;
        }
        HRegion region = t.getRegionServer().getOnlineRegion(r.getRegionName());
    }
}
But util.getMiniHBaseCluster() seems to be from a test utility. How would I get a "real" instance of JVMClusterUtil.RegionServerThread on a region server?
You don't need to run code on every region server to find where a row is located.
An HBase table has splits, which partition the data across the nodes in the cluster. This is meta information used by the HBase master, and it's available through the API too:
HTable table = new HTable(configuration, tableName);
RegionLocator regionLocator = table.getRegionLocator();
HRegionLocation location = regionLocator.getRegionLocation(myRow);
String host = location.getHostname();
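If you are on HBase 1.0 or newer, the same lookup can also be done through the Connection API instead of constructing an HTable directly. A sketch; the table name and row id below are placeholders:
try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
     RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
    HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("my_row_id"));
    System.out.println("Row is in region " + location.getRegionInfo().getRegionNameAsString()
            + " on host " + location.getHostname());
}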
Using the test cases I was able to see how ELKI can be used directly from Java, but now I want to read my data from MongoDB and then use ELKI to cluster geographic (long, lat) data.
So far I can only cluster data from a CSV file using ELKI. Is it possible to connect de.lmu.ifi.dbs.elki.database.Database with MongoDB? I can see from the Java debugger that there is a databaseconnection field in de.lmu.ifi.dbs.elki.database.Database.
I query MongoDB, creating a POJO for each row, and now I want to cluster these objects using ELKI.
It is possible to read the data from MongoDB, write it to a CSV file, and then have ELKI read that CSV file, but I would like to know if there is a simpler solution.
---------FINDINGS_1:
From ELKI - Use List<String> of objects to populate the Database I found that I need to implement de.lmu.ifi.dbs.elki.datasource.DatabaseConnection and specifically override the loadData() method, which returns an instance of MultipleObjectsBundle.
So I think I should wrap a list of POJOs in a MultipleObjectsBundle. Looking at MultipleObjectsBundle, it seems the data is held in columns. Why is the columns field a list of lists (one list per column)? Shouldn't it just be a single list of the items you want to cluster?
I'm a little confused. How is ELKI going to know that it should look at the long and lat of the POJO? Where do I tell ELKI to do this? Using de.lmu.ifi.dbs.elki.data.type.SimpleTypeInformation?
---------FINDINGS_2:
I have tried to use ArrayAdapterDatabaseConnection and I have tried implementing DatabaseConnection. Sorry, I need things explained in very simple terms for me to understand.
This is my code for clustering:
int minPts=3;
double eps=0.08;
double[][] data1 = {{-0.197574246, 51.49960695}, {-0.084605692, 51.52128377}, {-0.120973687, 51.53005939}, {-0.156876, 51.49313},
{-0.144228881, 51.51811784}, {-0.1680743, 51.53430039}, {-0.170134484,51.52834133}, { -0.096440751, 51.5073853},
{-0.092754157, 51.50597426}, {-0.122502346, 51.52395143}, {-0.136039674, 51.51991453}, {-0.123616824, 51.52994371},
{-0.127854211, 51.51772703}, {-0.125979294, 51.52635795}, {-0.109006325, 51.5216612}, {-0.12221963, 51.51477076}, {-0.131161087, 51.52505093} };
// ArrayAdapterDatabaseConnection dbcon = new ArrayAdapterDatabaseConnection(data1);
DatabaseConnection dbcon = new MyDBConnection();
ListParameterization params = new ListParameterization();
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPts);
params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, eps);
params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 1000);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);
Relation<DoubleVector> rel = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<ExternalID> relID = db.getRelation(TypeUtil.EXTERNALID);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
Clustering<Model> result = dbscan.run(db);
int i = 0;
for (Cluster<Model> clu : result.getAllClusters()) {
    System.out.println("#" + i + ": " + clu.getNameAutomatic());
    System.out.println("Size: " + clu.size());
    System.out.print("Objects: ");
    for (DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
        DoubleVector v = rel.get(it);
        ExternalID exID = relID.get(it);
        System.out.print("DoubleVec: [" + v + "]");
        System.out.print("ExID: [" + exID + "]");
        final int offset = ids.getOffset(it);
        System.out.print(" " + offset);
    }
    System.out.println();
    ++i;
}
The ArrayAdapterDatabaseConnection produces two clusters; I just had to play around with the value of epsilon. When I set epsilon=0.008, DBSCAN started creating clusters; when I set epsilon=0.04, all the items ended up in one cluster.
I have also tried to implement DatabaseConnection:
@Override
public MultipleObjectsBundle loadData() {
    MultipleObjectsBundle bundle = new MultipleObjectsBundle();
    List<Station> stations = getStations();
    List<DoubleVector> vecs = new ArrayList<DoubleVector>();
    List<ExternalID> ids = new ArrayList<ExternalID>();
    for (Station s : stations) {
        String strID = Integer.toString(s.getId());
        ExternalID i = new ExternalID(strID);
        ids.add(i);
        double[] st = { s.getLongitude(), s.getLatitude() };
        DoubleVector dv = new DoubleVector(st);
        vecs.add(dv);
    }
    SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());
    bundle.appendColumn(type, vecs);
    bundle.appendColumn(TypeUtil.EXTERNALID, ids);
    return bundle;
}
These long/lat values are associated with an ID, and I need to link the clustering results back to that ID. Is using the ID offset (as in the code above) the only way to do that? I have tried to add an ExternalID column, but I don't know how to retrieve the ExternalID for a particular NumberVector.
Also, after seeing Using ELKI's Distance Function, I tried to use ELKI's longLatDistance, but it doesn't work and I could not find any examples of how to use it.
The interface for data sources is called DatabaseConnection.
JavaDoc of DatabaseConnection
You can implement a MongoDB-based DatabaseConnection to load the data.
It is not a complicated interface; it has a single method.
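For example, here is a minimal sketch of a MongoDB-backed implementation, following the same pattern as the loadData() method in the question. The driver calls use the plain synchronous MongoDB Java driver, and the database, collection, and field names ("geo", "stations", "stationId", "longitude", "latitude") are assumptions to be adapted:
public class MongoDBConnection implements DatabaseConnection {
    @Override
    public MultipleObjectsBundle loadData() {
        List<DoubleVector> vecs = new ArrayList<>();
        List<ExternalID> ids = new ArrayList<>();
        MongoClient client = new MongoClient("localhost", 27017); // connection details are placeholders
        try {
            MongoCollection<Document> coll = client.getDatabase("geo").getCollection("stations");
            for (Document doc : coll.find()) {
                ids.add(new ExternalID(doc.get("stationId").toString()));
                vecs.add(new DoubleVector(new double[] { doc.getDouble("longitude"), doc.getDouble("latitude") }));
            }
        } finally {
            client.close();
        }
        MultipleObjectsBundle bundle = new MultipleObjectsBundle();
        bundle.appendColumn(new VectorFieldTypeInformation<>(DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer()), vecs);
        bundle.appendColumn(TypeUtil.EXTERNALID, ids);
        return bundle;
    }
}
You can then pass an instance of this class as the dbcon in the parameterization code above, exactly as with MyDBConnection.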
I am running a Pig script using the Java API (PigServer.registerScript).
I need to find out the number of records processed and the number of output records using the Java API. How can I implement this?
I have a simple Pig script as follows:
A = LOAD '$input_file_path' USING PigStorage('$') as (id:double,name:chararray,code:int);
Dump A;
B = FOREACH A GENERATE PigUdf2(id),name;
Dump B;
and the Java code is:
PigStats ps;
HashMap<String, String> m = new HashMap<>();
Path p = new Path("/home/shweta/Desktop/debugging_pig_udf/pig_in");
m.put("input_file_path", p.toString());
PigServer pigServer = new PigServer(ExecType.LOCAL);
pigServer.registerScript("/home/shweta/Desktop/debugging_pig_udf/pig_script.pig", m);
Again, I need to find out the input and output record counts using the Java API.
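I have not run this against your script, but record counts are generally exposed through org.apache.pig.tools.pigstats. A sketch, assuming the script has already been executed via registerScript (the Dump statements trigger execution); double-check the exact accessor names against the Pig version you are using:
PigStats stats = PigStats.get(); // statistics of the most recent run (verify against your Pig version)
long inputRecords = 0;
for (InputStats in : stats.getInputStats()) {
    inputRecords += in.getNumberRecords();
}
long outputRecords = 0;
for (OutputStats out : stats.getOutputStats()) {
    outputRecords += out.getNumberRecords();
}
System.out.println("input records: " + inputRecords + ", output records: " + outputRecords);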
When I create a new H2 database via ORMLite, the database file gets created, but after I close my application all the data that was stored in the database is lost:
JdbcConnectionSource connection =
new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath() + ".h2.db");
TableUtils.createTable(connection, SomeClass.class);
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);
SomeClass sc = new SomeClass(id, ...);
dao.create(sc);
SomeClass retrieved = dao.queryForId(id);
System.out.println("" + retrieved);
This code will produce good results. It will print the object that I stored.
But when I start the application again, this time without creating the table and without storing a new object, I get an exception telling me that the required table does not exist:
JdbcConnectionSource connection =
new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath() + ".h2.db");
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);
SomeClass retrieved = dao.queryForId(id); // will produce an exception..
System.out.println("" + retrieved);
The following worked fine for me when I ran it once and then a second time with the createTable call turned off. The second insert gave me a primary key violation, of course, but that was expected. It created the file with (as @Thomas mentioned) a ".h2.db.h2.db" suffix.
Some questions:
After you run your application the first time, can you see the database file being created at path?
Is it on permanent storage, and not in some temporary location cleared by the OS?
Any chance some other part of your application is clearing it before the database code runs?
Hope this helps.
@Test
public void testStuff() throws Exception {
    File path = new File("/tmp/x");
    JdbcConnectionSource connection = new JdbcConnectionSource("jdbc:h2:file:"
            + path.getAbsolutePath() + ".h2.db");
    // TableUtils.createTable(connection, SomeClass.class);
    Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);
    int id = 131233;
    SomeClass sc = new SomeClass(id, "fopewjfew");
    dao.create(sc);
    SomeClass retrieved = dao.queryForId(id);
    System.out.println("" + retrieved);
    connection.close();
}
I can see Russia from my house:
> ls -l /tmp/
...
-rw-r--r-- 1 graywatson wheel 14336 Aug 31 08:47 x.h2.db.h2.db
Did you close the database? It is closed automatically but it's better to close it manually (so recovery is faster).
In many cases the database URL is the problem. Are you sure the same path is used in both cases? Otherwise you end up with two databases. By the way, ".h2.db" is added automatically, you don't need to add it manually.
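In other words, if the file on disk is meant to end up as path + ".h2.db", the URL should contain only the base path. A sketch of the corrected connection string:
// H2 appends ".h2.db" to the file name itself, so don't repeat it in the URL.
JdbcConnectionSource connection =
        new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath());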
To better analyze the problem, you could append ;TRACE_LEVEL_FILE=2 to the database URL, and then check in the *.trace.db file what SQL statements were executed against the database.