I am creating a prototype for my thesis using a model trained in Weka. My thesis is about emotion analysis on text. I now have a test set that I want to classify using the trained model.
This is my partial code that reads an ARFF file and applies a StringToWordVector filter:
Classify ct = new Classify("TextJ48.model"); // loads the trained model
string sample = getARFFile();
StringBuilder buffer = new StringBuilder(sample);
BufferedReader reader = new BufferedReader(new java.io.StringReader(buffer.ToString()));
weka.core.converters.ArffLoader.ArffReader arff = new weka.core.converters.ArffLoader.ArffReader(reader);
Instances dataRaw = arff.getData();
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataRaw);
Instances dataFiltered = Filter.useFilter(dataRaw, filter);
When I display dataFiltered, the string attributes have successfully been converted from words to numeric word vectors.
This is the Classify class:
public Classify(string filename)
{
try
{
classifier = (Classifier)weka.core.SerializationHelper.read(filename);
}
catch (java.lang.Exception ex)
{
lblProgress.Text = ex.getMessage();
}
loadAttributes();
this.fileName = filename;
}
I don't know what to do in loadAttributes(). My plan is to add all attributes to a FastVector. In the sources I have seen, attributes are added easily because there is a fixed number of them, but in my case the number of attributes varies because they are derived from the text.
How do I classify the text that I input using the model?
I have used Weka to build a Naive Bayes classifier through the Weka GUI, and I saved the model by following a tutorial. Now I want to load this model from Java code, but I cannot find any way to load a saved model using Weka.
My requirement is that I have to build the model separately and then use it in a separate program.
If anyone can guide me in this regard I will be thankful to you.
You can easily load a saved model in Java using this command:
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(pathToModel);
For a complete workflow in Java I wrote the following article in SO Documentation, now copied here:
Text Classification in Weka
Text Classification with LibLinear
Create training instances from .arff file
private static Instances getDataFromFile(String path) throws Exception{
DataSource source = new DataSource(path);
Instances data = source.getDataSet();
if (data.classIndex() == -1){
data.setClassIndex(data.numAttributes()-1);
//last attribute as class index
}
return data;
}
Instances trainingData = getDataFromFile(pathToArffFile);
Use StringToWordVector to transform your string attributes into a numeric representation:
Important features of this filter:
tf-idf representation
stemming
lowercase words
stopwords
n-gram representation
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if(useIdf){
filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL,StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer){
Stemmer s = new /*Iterated*/LovinsStemmer();
filter.setStemmer(s);
}
filter.setInputFormat(trainingData);
Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
Create the LibLinear Classifier
SVMType 0 below corresponds to the L2-regularized logistic regression
Set setProbabilityEstimates(true) to print the output probabilities
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
Save model
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
Create testing instances from .arff file
Instances testingData = getDataFromFile(pathToArffFile); // use the path to the testing .arff file here
Load classifier
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
Use the same StringToWordVector filter as above or create a new one for testingData, but remember to use the trainingData for this command: filter.setInputFormat(trainingData); This will make the training and testing instances compatible.
Alternatively you could use InputMappedClassifier
Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
Classify!
1. Get the class value for every instance in the testing set
for (int j = 0; j < testingData.numInstances(); j++) {
double res = myCls.classifyInstance(testingData.get(j));
}
res is a double value that corresponds to the index of the nominal class defined in the .arff file. To get the nominal class label use: testingData.classAttribute().value((int) res)
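Putting the two together, a minimal sketch that prints the predicted label for each test instance (reusing the myCls and testingData variables from above):
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
    String predictedLabel = testingData.classAttribute().value((int) res);
    System.out.println("Instance " + j + ": " + predictedLabel);
}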
2. Get the probability distribution for every instance
for (int j = 0; j < testingData.numInstances(); j++) {
double[] dist = myCls.distributionForInstance(testingData.get(j));
}
dist is a double array that contains the probability for every class defined in the .arff file.
Note: the classifier must support probability distributions, and they must be enabled with myClassifier.setProbabilityEstimates(true);
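For example, a short sketch (assuming probability estimates are enabled as noted) that pairs each probability with its class label:
for (int j = 0; j < testingData.numInstances(); j++) {
    double[] dist = myCls.distributionForInstance(testingData.get(j));
    for (int c = 0; c < dist.length; c++) {
        System.out.println(testingData.classAttribute().value(c) + " = " + dist[c]);
    }
}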
Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files, which is working fine:
ArrayList<String[]> lines = new ArrayList<String[]>();
...
for (FileStatus item: items) {
// ignoring files like _SUCCESS
if(item.getPath().getName().startsWith("_")) {
continue;
}
in = fs.open(item.getPath());
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String line;
line = br.readLine();
while (line != null) {
line = line.replaceAll("(\\r|\\n)", "");
lines.add(line.split("\t"));
line = br.readLine();
}
br.close();
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
As an added note, based on research from the discussions below:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading
I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, as the volume of data I was reading was in the range of 2-3 GB. I have not run into any issues to date, but we did use file splitting to handle the large data set. Below is the code you can use for reading and testing.
public class HDFSClientTesting {

    /**
     * @param args
     */
    public static void main(String[] args) {

        try {
            // Load the cluster configuration before opening the FileSystem
            Configuration conf = new Configuration();
            conf.addResource(new Path("core-site.xml"));
            FileSystem fs = FileSystem.get(conf);

            String fileName = "/dir/00000027";
            long byteOffset = 3185041;

            // Open the sequence file and seek straight to the record at the known byte offset
            SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(fileName), conf);
            Text key = new Text();
            Text value = new Text();

            rdr.seek(byteOffset);
            rdr.next(key, value);

            // The value is plain text holding a JSON document; pull out the "body" field
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");

            File file = new File("test.gz");
            file.createNewFile();
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
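If, as in the question, the goal is plain-text lines 101-120 rather than a record at a known byte offset, here is a minimal sketch (the path, class name and line numbers are placeholders) that skips ahead with a BufferedReader and stops as soon as the last wanted line has been read:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLineRange {

    // Returns lines firstLine..lastLine (1-based, inclusive) of a plain-text HDFS file.
    public static List<String> readLines(String path, int firstLine, int lastLine) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        List<String> result = new ArrayList<String>();

        FSDataInputStream in = fs.open(new Path(path));
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        try {
            String line;
            int lineNo = 0;
            while ((line = br.readLine()) != null && lineNo < lastLine) {
                lineNo++;
                if (lineNo >= firstLine) {
                    result.add(line);
                }
            }
        } finally {
            br.close(); // also closes the underlying HDFS stream
        }
        return result;
    }
}
This still streams the skipped lines from HDFS, but it stops reading right after the last requested line, so the whole file is never pulled across the wire. Seeking, as in the code above, only helps when you already know a byte offset, since line numbers do not map directly to byte positions.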
Looked at lots of examples for this, and so far no luck. I'd like to classify free text.
Configure a text classifier. (FilteredClassifier using StringToWordVector and LibSVM)
Train the classifier (add in lots of documents, train on filtered text)
Serialize the FilteredClassifier to disk, quit the app
Then later
Load up the serialized FilteredClassifier
Classify stuff!
It goes ok up to when I try to read from disk and classify things. All the documents and examples show the training list and testing list being built at the same time, and in my case, I'm trying to build a testing list after the fact.
A FilteredClassifier alone is not enough to create a testing Instance with the same "dictionary" as the original training set, so how do I save everything I need to classify at a later date?
http://weka.wikispaces.com/Use+WEKA+in+your+Java+code just says "Instances loaded from somewhere" and doesn't say anything about using a similar dictionary.
ClassifierFramework cf = new WekaSVM();
if (!cf.isTrained()) {
train(cf); // Train, save to disk
cf = new WekaSVM(); // reloads from file
}
cf.test("this is a test");
Ends up throwing
java.lang.ArrayIndexOutOfBoundsException: 2
at weka.core.DenseInstance.value(DenseInstance.java:332)
at weka.filters.unsupervised.attribute.StringToWordVector.convertInstancewoDocNorm(StringToWordVector.java:1587)
at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:688)
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:465)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:495)
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)
at ratchetclassify.lab.WekaSVM.test(WekaSVM.java:125)
Serialize the Instances header that holds the definition of the training data (the "dictionary" you are after) at the same time as you serialize your classifier:
Instances trainInstances = ...; // your training data
Instances trainHeader = new Instances(trainInstances, 0); // structure only, no instances
trainHeader.setClassIndex(trainInstances.classIndex());
OutputStream os = new FileOutputStream(fileName);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
if (trainHeader != null)
objectOutputStream.writeObject(trainHeader);
objectOutputStream.flush();
objectOutputStream.close();
To deserialize:
Classifier classifier = null;
Instances trainHeader = null;
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
try { // see if we can load the header
trainHeader = (Instances) objectInputStream.readObject();
} catch (Exception e) {
}
objectInputStream.close();
Use trainHeader to create a new Instance:
int numAttributes = trainHeader.numAttributes();
double[] vals = new double[numAttributes];
for (int i = 0; i < numAttributes - 1; i++) {
    Attribute attribute = trainHeader.attribute(i);
    double value;
    if (attribute.isNominal() || attribute.isString()) {
        // nominal or string attribute: store the index of the value
        value = attribute.indexOfValue(myStrVal); // get myStrVal from your source
    } else {
        // numeric attribute
        value = myNumericVal; // get myNumericVal from your source
    }
    vals[i] = value;
}
vals[numAttributes - 1] = Instance.missingValue(); // class value is unknown
Instance instance = new Instance(1.0, vals); // on Weka >= 3.7 use new DenseInstance(1.0, vals) and Utils.missingValue()
instance.setDataset(trainHeader);
return instance;
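With the instance built, classification against the deserialized model is then a couple of lines; a small sketch reusing the classifier and trainHeader variables from above:
double pred = classifier.classifyInstance(instance);
String predictedLabel = trainHeader.classAttribute().value((int) pred);
System.out.println("Predicted class: " + predictedLabel);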
Please tell me how to append data to a docx file using Java and docx4j.
What I am doing is using a template in docx format in which some fields are filled by Java at run time.
My problem is that for every group of data it creates a new file, and I just want to append each new file into one file. Simply concatenating the files with Java streams does not work.
String outputfilepath = "e:\\Practice/DOC/output/generatedLatterOUTPUT.docx";
String outputfilepath1 = "e:\\Practice/DOC/output/generatedLatterOUTPUT1.docx";
WordprocessingMLPackage wordMLPackage;
public void templetsubtitution(String name, String age, String gender, Document document)
throws Exception {
// input file name
String inputfilepath = "e:\\Practice/DOC/profile.docx";
// out put file name
// id of Xml file
String itemId1 = "{A5D3A327-5613-4B97-98A9-FF42A2BA0F74}".toLowerCase();
String itemId2 = "{A5D3A327-5613-4B97-98A9-FF42A2BA0F74}".toLowerCase();
String itemId3 = "{A5D3A327-5613-4B97-98A9-FF42A2BA0F74}".toLowerCase();
// Load the Package
if (inputfilepath.endsWith(".xml")) {
JAXBContext jc = Context.jcXmlPackage;
Unmarshaller u = jc.createUnmarshaller();
u.setEventHandler(new org.docx4j.jaxb.JaxbValidationEventHandler());
org.docx4j.xmlPackage.Package wmlPackageEl = (org.docx4j.xmlPackage.Package) ((JAXBElement) u
.unmarshal(new javax.xml.transform.stream.StreamSource(
new FileInputStream(inputfilepath)))).getValue();
org.docx4j.convert.in.FlatOpcXmlImporter xmlPackage = new org.docx4j.convert.in.FlatOpcXmlImporter(
wmlPackageEl);
wordMLPackage = (WordprocessingMLPackage) xmlPackage.get();
} else {
wordMLPackage = WordprocessingMLPackage
.load(new File(inputfilepath));
}
CustomXmlDataStoragePart customXmlDataStoragePart = wordMLPackage
.getCustomXmlDataStorageParts().get(itemId1);
// Get the contents
CustomXmlDataStorage customXmlDataStorage = customXmlDataStoragePart
.getData();
// Change its contents
((CustomXmlDataStorageImpl) customXmlDataStorage).setNodeValueAtXPath(
"/ns0:orderForm[1]/ns0:record[1]/ns0:name[1]", name,
"xmlns:ns0='EasyForm'");
customXmlDataStoragePart = wordMLPackage.getCustomXmlDataStorageParts()
.get(itemId2);
// Get the contents
customXmlDataStorage = customXmlDataStoragePart.getData();
// Change its contents
((CustomXmlDataStorageImpl) customXmlDataStorage).setNodeValueAtXPath(
"/ns0:orderForm[1]/ns0:record[1]/ns0:age[1]", age,
"xmlns:ns0='EasyForm'");
customXmlDataStoragePart = wordMLPackage.getCustomXmlDataStorageParts()
.get(itemId3);
// Get the contents
customXmlDataStorage = customXmlDataStoragePart.getData();
// Change its contents
((CustomXmlDataStorageImpl) customXmlDataStorage).setNodeValueAtXPath(
"/ns0:orderForm[1]/ns0:record[1]/ns0:gender[1]", gender,
"xmlns:ns0='EasyForm'");
// Apply the bindings
BindingHandler.applyBindings(wordMLPackage.getMainDocumentPart());
File f = new File(outputfilepath);
wordMLPackage.save(f);
FileInputStream fis = new FileInputStream(f);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
try {
for (int readNum; (readNum = fis.read(buf)) != -1;) {
bos.write(buf, 0, readNum);
}
// System.out.println( buf.length);
} catch (IOException ex) {
ex.printStackTrace();
} finally {
fis.close();
}
byte[] bytes = bos.toByteArray();
FileOutputStream file = new FileOutputStream(outputfilepath1, true);
DataOutputStream out = new DataOutputStream(file);
out.write(bytes);
out.flush();
out.close();
System.out.println("..done");
}
public static void main(String[] args) throws Exception {
utility u = new utility();
// "24" must be a String and "mohan" quoted; the Document parameter is unused above, so null is passed here
u.templetsubtitution("aditya", "24", "mohan", null);
}
Thanks in advance.
If I understand you correctly, you're essentially talking about merging documents. There are two very simple approaches that you can use, and their effectiveness really depends on the structure and onward use of your data:

PhilippeAuriach describes one approach in his answer, which entails appending all components within one MainDocumentPart instance to another. In terms of the final docx file, this means the content that appears in document.xml; it won't take into account headers and footers (for example), but that may be fine for you.

Alternatively, you can insert multiple documents into a single docx file by inserting them as AltChunk elements (see the docx4j documentation). This will bring everything from one Word file into another, headers and all. The downside is that your final document won't be a properly flowing Word file until you open it and save it in MS Word itself (the imported components remain as standalone files within the docx bundle). This will cause you issues if you want to generate 'merged' files and then do something with them, like render PDFs: the merged content will simply be ignored.
The more complete (and complex) approach is to perform a "deep merge". This updates and maintains all references held within a document. Imported content becomes part of the main "flow" of the document (i.e. it is not stored as separate references), so the end result is a properly-merged file which can be rendered to PDF or whatever.
The downside to this is you need a good knowledge of docx structure and the API, and you will be writing a fair amount of code (I would recommend buying a license to Plutext's MergeDocx instead).
I had to deal with similar things, and here is what I did (probably not the most efficient, but it works):
create a finalDoc by loading the template, then empty it (so you keep the styles in this doc)
for each data row, create a new doc by loading the template, then replace your fields with your values
use the function below to append each doc filled with the data to the finalDoc (a usage sketch follows the function):
public static void append(WordprocessingMLPackage docDest, WordprocessingMLPackage docSource) {
List<Object> objects = docSource.getMainDocumentPart().getContent();
for(Object o : objects){
docDest.getMainDocumentPart().getContent().add(o);
}
}
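For completeness, a rough sketch of the loop described above that uses this function (the template path, the DataRow type and the field-replacement step are placeholders for your own code):
// Build the merged document: load the template once and empty its body so only styles remain
WordprocessingMLPackage finalDoc = WordprocessingMLPackage.load(new File("e:/Practice/DOC/profile.docx"));
finalDoc.getMainDocumentPart().getContent().clear();

for (DataRow row : dataRows) { // DataRow/dataRows stand in for your own data
    // Fresh copy of the template for this row, with its fields filled in
    WordprocessingMLPackage filled = WordprocessingMLPackage.load(new File("e:/Practice/DOC/profile.docx"));
    // ... replace the bound fields in 'filled' with row's values, as in the question ...
    append(finalDoc, filled);
}

finalDoc.save(new File("e:/Practice/DOC/output/merged.docx"));
Note that, as mentioned in the other answer, this only merges the main document body; headers and footers from the appended documents are not carried over.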
Hope this helps.
I'm reading 2 csv files: store_inventory & new_acquisitions.
I want to be able to compare the store_inventory csv file with new_acquisitions.
1) If the item names match just update the quantity in store_inventory.
2) If new_acquisitions has a new item that does not exist in store_inventory, then add it to the store_inventory.
Here is what I have done so far, but it's not very good. I added comments where I need to add tasks 1 & 2.
Any advice or code to do the above tasks would be great! Thanks.
File new_acq = new File("/src/test/new_acquisitions.csv");
Scanner acq_scan = null;
try {
acq_scan = new Scanner(new_acq);
} catch (FileNotFoundException ex) {
Logger.getLogger(mainpage.class.getName()).log(Level.SEVERE, null, ex);
}
String itemName;
int quantity;
Double cost;
Double price;
File store_inv = new File("/src/test/store_inventory.csv");
Scanner invscan = null;
try {
invscan = new Scanner(store_inv);
} catch (FileNotFoundException ex) {
Logger.getLogger(mainpage.class.getName()).log(Level.SEVERE, null, ex);
}
String itemNameInv;
int quantityInv;
Double costInv;
Double priceInv;
while (acq_scan.hasNext()) {
String line = acq_scan.nextLine();
if (line.charAt(0) == '#') {
continue;
}
String[] split = line.split(",");
itemName = split[0];
quantity = Integer.parseInt(split[1]);
cost = Double.parseDouble(split[2]);
price = Double.parseDouble(split[3]);
while(invscan.hasNext()) {
String line2 = invscan.nextLine();
if (line2.charAt(0) == '#') {
continue;
}
String[] split2 = line2.split(",");
itemNameInv = split2[0];
quantityInv = Integer.parseInt(split2[1]);
costInv = Double.parseDouble(split2[2]);
priceInv = Double.parseDouble(split2[3]);
if(itemName.equals(itemNameInv)) { // compare strings with equals(), not ==
//update quantity
}
}
//add new entry into csv file
}
Thanks again for any help. =]
Suggest you use one of the existing CSV parsers such as Commons CSV or Super CSV instead of reinventing the wheel. It should make your life a lot easier.
Your implementation makes the common mistake of breaking the line on commas by using line.split(","). This does not work because the values themselves might have commas in them. If that happens, the value must be quoted, and you need to ignore commas within the quotes. The split method cannot do this; I see this mistake a lot.
Here is the source of an implementation that does it correctly:
http://agiletribe.purplehillsbooks.com/2012/11/23/the-only-class-you-need-for-csv-files/
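As a rough illustration with Commons CSV (mentioned above), here is a small sketch of a parser that honors quoting; the file path is taken from the question and the column order is an assumption:
import java.io.FileReader;
import java.io.Reader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedCsvExample {
    public static void main(String[] args) throws Exception {
        // A field like "Widget, large" stays in one column because the parser honors the quotes
        Reader in = new FileReader("/src/test/store_inventory.csv");
        CSVParser parser = CSVFormat.DEFAULT.parse(in);
        for (CSVRecord record : parser) {
            if (record.get(0).startsWith("#")) {
                continue; // skip the comment lines used in the question's files
            }
            String itemName = record.get(0);
            int quantity = Integer.parseInt(record.get(1));
            System.out.println(itemName + " -> " + quantity);
        }
        parser.close();
    }
}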
With the help of the open source library uniVocity-parsers, you could develop this with pretty clean code as follows:
private void processInventory() throws IOException {
/**
* ---------------------------------------------
* Read CSV rows into list of beans you defined
* ---------------------------------------------
*/
// 1st, config the CSV reader with row processor attaching the bean definition
CsvParserSettings settings = new CsvParserSettings();
settings.getFormat().setLineSeparator("\n");
BeanListProcessor<Inventory> rowProcessor = new BeanListProcessor<Inventory>(Inventory.class);
settings.setRowProcessor(rowProcessor);
settings.setHeaderExtractionEnabled(true);
// 2nd, parse all rows from the CSV file into the list of beans you defined
CsvParser parser = new CsvParser(settings);
parser.parse(new FileReader("/src/test/store_inventory.csv"));
List<Inventory> storeInvList = rowProcessor.getBeans();
parser.parse(new FileReader("/src/test/new_acquisitions.csv"));
List<Inventory> newAcqList = rowProcessor.getBeans();
// 3rd, process the beans with business logic
List<Inventory> newItems = new ArrayList<Inventory>();
for (Inventory newAcq : newAcqList) {
    boolean isItemIncluded = false;
    for (Inventory storeInv : storeInvList) {
        // 1) If the item names match, just update the quantity in store_inventory
        if (storeInv.getItemName().equalsIgnoreCase(newAcq.getItemName())) {
            storeInv.setQuantity(newAcq.getQuantity());
            isItemIncluded = true;
        }
    }
    // 2) If new_acquisitions has a new item that does not exist in store_inventory,
    //    add it to store_inventory (collected separately so the list is not modified
    //    while it is being iterated)
    if (!isItemIncluded) {
        newItems.add(newAcq);
    }
}
storeInvList.addAll(newItems);
}
Just follow this code sample I worked out according to your requirements. Note that the library provides a simplified API and significant performance for parsing CSV files.
The operation you are performing requires that, for each item in your new acquisitions, you search every item in your inventory for a match. This is not only inefficient, but the scanner you have set up for your inventory file would need to be reset after each item.
I would suggest that you add your new acquisitions and your inventory to collections, then iterate over the new acquisitions and look each item up in the inventory collection. If the item exists, update it; if it doesn't, add it to the inventory collection. For this it might be good to write a simple class to represent an inventory item, which can be used for both the new acquisitions and the inventory. For fast lookup, use a HashSet or HashMap for the inventory collection, as in the sketch below.
At the end of the process, don't forget to persist the changes to your inventory file.
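To make that concrete, here is a rough sketch under a few assumptions: a hypothetical Item class stands in for the CSV rows, "update the quantity" is taken to mean adding the acquired amount, and the CSV reading/writing itself is left to one of the parsers suggested in the other answers:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value class holding one CSV row
class Item {
    String name;
    int quantity;
    double cost;
    double price;

    Item(String name, int quantity, double cost, double price) {
        this.name = name;
        this.quantity = quantity;
        this.cost = cost;
        this.price = price;
    }
}

public class InventoryMerge {

    // Merges new acquisitions into the inventory, keyed by item name
    static Map<String, Item> merge(List<Item> inventory, List<Item> acquisitions) {
        Map<String, Item> byName = new HashMap<String, Item>();
        for (Item item : inventory) {
            byName.put(item.name, item);
        }
        for (Item acq : acquisitions) {
            Item existing = byName.get(acq.name);
            if (existing != null) {
                existing.quantity += acq.quantity; // item already stocked: update quantity
            } else {
                byName.put(acq.name, acq);         // new item: add it to the inventory
            }
        }
        return byName; // persist byName.values() back to store_inventory.csv afterwards
    }
}
Keying the inventory by item name makes each lookup O(1), so the merge is a single pass over the acquisitions instead of a nested scan of both files.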
As Java doesn't support parsing of CSV files natively, we have to rely on a third-party library. Opencsv is one of the best libraries available for this purpose. It's open source and is shipped with an Apache 2.0 licence, which makes it possible to use commercially.
Here, this link should help you and others in similar situations!
For writing to CSV (this example uses Apache Commons CSV's CSVPrinter):
// Delimiter used in CSV file
private static final String NEW_LINE_SEPARATOR = "\n";

// CSV file header
private static final Object[] FILE_HEADER = { "Employee Name", "Employee Code", "In Time", "Out Time", "Duration", "Is Working Day" };

public void writeCSV() {
    String fileName = "fileName.csv";
    // "EmployeeRecord" stands in for your own bean with getValue1()/getValue2()/getValue3()
    List<EmployeeRecord> objects = new ArrayList<EmployeeRecord>();
    FileWriter fileWriter = null;
    CSVPrinter csvFilePrinter = null;

    // Create the CSVFormat object with "\n" as a record delimiter
    CSVFormat csvFileFormat = CSVFormat.DEFAULT.withRecordSeparator(NEW_LINE_SEPARATOR);

    try {
        fileWriter = new FileWriter(fileName);
        csvFilePrinter = new CSVPrinter(fileWriter, csvFileFormat);
        csvFilePrinter.printRecord(FILE_HEADER);

        // Write the object list to the CSV file, one record per object
        for (EmployeeRecord object : objects) {
            List<String> record = new ArrayList<String>();
            record.add(object.getValue1().toString());
            record.add(object.getValue2().toString());
            record.add(object.getValue3().toString());
            csvFilePrinter.printRecord(record);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            fileWriter.flush();
            fileWriter.close();
            csvFilePrinter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
You can use the Apache Commons CSV API.
FYI, this answer: https://stackoverflow.com/a/42198895/6549532
Read / Write Example