I am really new to big data and I need to know: can HBase be embedded in a Java application?
Since it is developed in Java, can HBase be added as a library and used to perform operations?
If so, can anyone point me to a simple tutorial or sample code?
HBase does not run embedded; it runs on top of Hadoop and is aimed at big data and a large number of servers.
It does have a Java API which you can use, e.g. as shown in Charles Menguy's reply.
You can definitely write a Java application that uses HBase; a Java API is provided in the main HBase distribution.
You should take the latest build from the official website and get the hbase-0.xx.x.jar, which you can use to build your application. If you want to look at the classpath dependencies for HBase, once you have installed it you can just run hbase classpath and it will print the list of jars you need.
You can probably find a lot of examples of Java apps doing HBase operations on Google, but here is an example for the usual operations:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// get the hbase config
Configuration config = HBaseConfiguration.create();
// specify which table you want to use
HTable table = new HTable(config, "mytable");
// add a row to your table
Put p = new Put(Bytes.toBytes("myrow"));
// specify the column family, column qualifier and column value
p.add(Bytes.toBytes("myfamily"), Bytes.toBytes("myqualifier"), Bytes.toBytes("myvalue"));
// commit to your table
table.put(p);
// define which row you want to get
Get g = new Get(Bytes.toBytes("myrow"));
// get your row
Result r = table.get(g);
// choose what you want to extract from your row
byte[] value = r.getValue(Bytes.toBytes("myfamily"), Bytes.toBytes("myqualifier"));
// convert to a string
System.out.println("GET: " + Bytes.toString(value));
// do a scan operation
Scan s = new Scan();
s.addColumn(Bytes.toBytes("myfamily"), Bytes.toBytes("myqualifier"));
ResultScanner scanner = table.getScanner(s);
// iterate over the scan results and release resources when done
for (Result rr : scanner) {
    System.out.println("SCAN: " + rr);
}
scanner.close();
table.close();
Related
I just started using Apache Spark with Java. There are many documents saved in a collection, and I want to find a document on the basis of some key and update it.
Simply put, I want to do a find-and-update in Apache Spark with Java.
This is the code to read a document from Mongo:
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", "bb_playing22");
ReadConfig readConfig = ReadConfig.create(createJavaSparkContext()).withOptions(readOverrides);
JavaMongoRDD<Document> customRdd = MongoSpark.load(createJavaSparkContext(), readConfig);
JavaRDD<Document> rdd = customRdd.filter(t1 -> {
    return t1.getLong("playing22_id") == 3;
});
But I am not able to update this document.
Spark does lazy evaluation of your transformations, so unless you call an action on an RDD, it won't execute any of the transformations you code.
See common actions here.
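For example, with the rdd from the question, nothing is executed until an action such as count() or collect() is called (minimal illustrative lines):
long matches = rdd.count();                    // triggers the filter transformation above
java.util.List<Document> docs = rdd.collect(); // materializes the matching documents on the driver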
Also, this does not seem like a use case where you'd want to use Spark.
If your actual requirement is just to update a document based on a value, then you should look into indexing in Mongo and just update the document directly through a Mongo driver in Java, as sketched below.
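As a rough sketch, assuming a reasonably recent mongo-java-driver (3.7+) and the collection name from the question; the connection string, database name, field name, and new value are placeholders:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

MongoClient client = MongoClients.create("mongodb://localhost:27017"); // assumed connection string
MongoCollection<Document> coll = client.getDatabase("mydb")            // "mydb" is a placeholder
        .getCollection("bb_playing22");
// find the document by its key and update a field in place
coll.updateOne(eq("playing22_id", 3L), set("some_field", "new_value")); // placeholder field/value
client.close();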
I am new to machine learning and will be working on a project that involves using a machine learning library to detect and alert about possible anomalies. I will be using Apache Spark, and I decided to use the KMeans method for the project.
The main project consists of analyzing daily files, detecting fluctuating changes in some of the records, and reporting them as possible anomalies (if the model considers them to be one). The files are generated at the end of a day, and my program needs to check them on the morning of the next day to see if there is an anomaly. However, I need to check for anomalies file vs. file, NOT within a file. This means that I have to compare the data of every file and see if it fits the model I would create following the specific algorithm. What I'm trying to say is that I have some valid data that I will apply the algorithm to in order to train my model. Then I have to apply this same model to other files of the same format but, obviously, different data. I'm not looking for a prediction column but rather for detecting anomalies in these other files. If there is an anomaly, the program should tell me which row/column has it, and then I have to program it to send an email saying that there is a possible anomaly in that specific file.
Like I said, I am new to machine learning. I want to know how I can use the KMeans algorithm to detect outliers/anomalies in a file.
So far I have created the model:
SparkConf conf = new SparkConf().setAppName("practice").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
.builder()
.appName("Anomaly Detection")
.getOrCreate();
String day1txt = "C:\\Users\\User\\Documents\\day1.txt";
String day2txt = "C:\\Users\\User\\Documents\\day2.txt";
Dataset<Row> day1 = spark.read()
        .option("header", "true")
        .option("delimiter", "\t")
        .option("inferSchema", "true")
        .csv(day1txt);
day1 = day1.withColumn("Size", day1.col("Size").cast("Integer"));
day1 = day1.withColumn("Records", day1.col("Records").cast("Integer"));
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"Size", "Records"})
.setOutputCol("features");
Dataset<Row> day1vector = assembler.transform(day1);
KMeans kmeans = new KMeans().setK(5).setSeed(1L);
KMeansModel model = kmeans.fit(day1vector);
I don't know what to do from this point on to detect outliers. I have several other .txt files that should have "normalized" data, and also I have a couple of files that have "tampered/not-normalized" data. Do I need to train my model with all the test data I have available, and if so, how can I train a model using different datasets? Or can I only train it with one dataset and test it with the others?
EDIT:
This is a sample of the file (day1.txt) I will be using (dummy data of course / top 10)
Name Size Records
File1 1000 104370
File2 990 101200
File3 1500 109123
File4 2170 113888
File5 2000 111974
File6 1820 110666
File7 1200 106771
File8 1500 108991
File9 1000 104007
File10 1300 107037
This is considered normal data, and I will have different files with the same format but different values around the same range. Then I have some files where I purposely added an outlier, like Size: 1000, Records: 50000.
How can I detect that with KMeans? Or, if KMeans is not the right model for this, which model should I use and how should I go about it?
There is a simple approach for this: create your clusters with KMeans, then for each cluster set a reasonable radius with respect to the center of that cluster; if a point lies outside that radius, it is an outlier. A sketch of this idea is shown below.
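For example, a rough sketch against the model from the question (here day2vector is assumed to be the second day's file assembled with the same VectorAssembler, and THRESHOLD is a radius you would have to tune yourself):
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

double THRESHOLD = 500.0;                           // assumed radius, tune it for your data
Vector[] centers = model.clusterCenters();          // centers learned from day1vector
Dataset<Row> scored = model.transform(day2vector);  // adds a "prediction" (cluster id) column

for (Row row : scored.collectAsList()) {            // fine for small daily files
    Vector features = row.getAs("features");
    int cluster = row.getInt(row.fieldIndex("prediction"));
    double distance = Math.sqrt(Vectors.sqdist(features, centers[cluster]));
    if (distance > THRESHOLD) {
        System.out.println("Possible anomaly: " + row);
    }
}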
Try looking at this: https://arxiv.org/pdf/1402.6859.pdf
There are also some dedicated outlier detection techniques, such as One-Class SVM or angle-based outlier detection (ABOD). Try looking at this: http://scikit-learn.org/stable/modules/outlier_detection.html
I am writing code in which I need to access a particular column of a spreadsheet already uploaded to my GDrive. I have found the Java API that is required for my project.
https://developers.google.com/google-apps/spreadsheets/#fetching_specific_rows_or_columns
I have already installed Mercurial, Maven, and the Google Plugin for Eclipse as well.
Now when I run my code, the following error appears: "The type com.google.gdata.client.GoogleService cannot be resolved. It is indirectly referenced from required .class files".
I think this error might be because of the path/URL given in the Google API to access a particular spreadsheet:
URL SPREADSHEET_FEED_URL = new URL(
"https://spreadsheets.google.com/feeds/spreadsheets/private/full");
PLEASE HELP!!
Do you have the required Google Data API client libraries downloaded? If not, make sure those libraries are downloaded and check this link.
It seems this is the dependency: http://mvnrepository.com/artifact/com.google.gdata/core/1.47.1
Hope that helps!
The Google Spreadsheets API documentation (Fetching specific rows or columns) suggests the following, where worksheet is the specific worksheet instance of your spreadsheet and, say, 6 is the column whose list of values you want to get:
URL cellFeedUrl = new URI(worksheet.getCellFeedUrl().toString() +
"?min-row=2&min-col=6&max-col=6").toURL();
CellFeed cell = service.getFeed(cellFeedUrl, CellFeed.class);
for(CellEntry eachCellEntry : cell.getEntries()){
// prints the value of each cell in column 6, skipping the title row (min-row=2)
log.info(eachCellEntry.getCell().getValue());
}
I don't like manually passing hard-coded arguments in a String for the URL when using CellFeed (min-row, min-col, max-col, and so on). I am trying to find a better way to do this, perhaps using a query.
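If it helps, the gdata client library also provides a CellQuery class that expresses the same constraints without hand-building the URL; a rough, untested sketch using the same worksheet and service objects as above:
import com.google.gdata.client.spreadsheet.CellQuery;
import com.google.gdata.data.spreadsheet.CellEntry;
import com.google.gdata.data.spreadsheet.CellFeed;

CellQuery query = new CellQuery(worksheet.getCellFeedUrl());
query.setMinimumRow(2);   // skip the title row
query.setMinimumCol(6);   // restrict to column 6
query.setMaximumCol(6);
CellFeed feed = service.query(query, CellFeed.class);
for (CellEntry entry : feed.getEntries()) {
    log.info(entry.getCell().getValue());
}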
Is it possible to get URLs into Nutch directly from a database, a service, etc.? I'm not interested in approaches where the data is taken from the database or service and written to seed.txt.
No. This cannot be done directly with the default Nutch codebase. You need to modify Injector.java to achieve it.
EDIT:
Try using DBInputFormat: an InputFormat that reads input data from a SQL table. You need to modify the Inject code here (line 3 in the snippet below):
JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject " + urlDir);
FileInputFormat.addInputPath(sortJob, urlDir);
sortJob.setMapperClass(InjectMapper.class);
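Purely as an illustration (a hedged sketch, not tested against any particular Nutch/Hadoop version), wiring in DBInputFormat could look roughly like this; UrlRecord is a hypothetical class you would write that implements Writable and DBWritable for your table, and InjectMapper would also need adapting to the new input types:
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject from db");
// JDBC driver class, connection URL, user and password are placeholders
DBConfiguration.configureDB(sortJob, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/nutchdb", "user", "password");
// read rows from a hypothetical "urls" table instead of a seed directory
DBInputFormat.setInput(sortJob, UrlRecord.class, "urls",
        null /* conditions */, "url" /* orderBy */, "url" /* field names */);
sortJob.setMapperClass(InjectMapper.class);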
Is there a simple Java library or approach that will take a SQL query and load data from a CSV file into an Oracle database? Please help.
You don't have to use Java to load a data file into a table unless it is absolutely necessary. Instead, I'd recommend Oracle's command-line SQL*Loader utility, which was designed specifically for this purpose.
For similar tasks I usually use Groovy scripts, as they are really easy and quick to write and, of course, run on the JVM.
...an example:
import groovy.sql.Sql

def file1 = new File(/C:\Documents and Settings\USER\Desktop\Book1.csv/)
// Sql.newInstance arguments: JDBC url, user, password, driver class
def sql = Sql.newInstance("jdbc:oracle:thin:@XXXXXX:XXXX:XXX", "USER", "PASSWORD",
        "oracle.jdbc.driver.OracleDriver")

file1.eachLine { line ->
    def fields = line.split(';')
    sql.executeInsert("insert into YOUR_TABLE values(${fields[0]},${fields[1]},${fields[2]})")
}
sql.close()
It's a basic example; if you have double quotes and semicolons in your CSV, you will probably want to use something like OpenCSV to handle that.
You could transform each line of the CSV into an insert query (with regular expressions or simple splitting) and then send it to Oracle with JDBC; a rough sketch follows.
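A minimal, hedged sketch of that idea in plain JDBC (the table name, column count, file path, and connection details are placeholders, and a real CSV parser such as OpenCSV would handle quoting better):
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

try (Connection conn = DriverManager.getConnection(
             "jdbc:oracle:thin:@host:1521:SID", "user", "password");  // placeholder connection details
     PreparedStatement ps = conn.prepareStatement(
             "insert into YOUR_TABLE values (?, ?, ?)");              // placeholder table, 3 columns assumed
     BufferedReader br = new BufferedReader(new FileReader("data.csv"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] fields = line.split(";");     // naive split; no quoted-field handling
        ps.setString(1, fields[0]);
        ps.setString(2, fields[1]);
        ps.setString(3, fields[2]);
        ps.addBatch();
    }
    ps.executeBatch();                         // send all inserts to Oracle in one batch
}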
I think this tool will help you with any type of database import/export problem.
http://www.dmc-fr.com/home_en.php
Do you have that CSV in a file on the database server, or can you store it there? If so, you can try having Oracle open it by declaring a DIRECTORY object for the path the file is in and then creating an EXTERNAL TABLE, which you can query in SQL afterwards. Oracle does the parsing of the file for you.
If you are open to Python, you can do a bulk load using SQL*Loader:
import os
from subprocess import Popen, PIPE
loadConf = ('sqlldr userid=%s DATA=%s control=%s LOG=%s.log BAD=%s.bad DISCARD=%s.dsc' % (userid, datafile, ctlfile, fn, fn, fn)).split(' ')
p = Popen(loadConf, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=False, env=os.environ)
output, err = p.communicate()
It will be much faster than row-by-row inserts.
I uploaded a basic working example here.