I have a few arff files. I would like to read them sequentially and create a large dataset. Instances.add(Instance inst) doesn't add string values to the instances, hence the attempt to setDataset() ... but even this fails. Is there a way to accomplish the intuitively correct thing for strings?
ArffLoader arffLoader = new ArffLoader();
arffLoader.setFile(new File(fName));
Instances newData = arffLoader.getDataSet();
for (int i = 0; i < newData.numInstances(); i++) {
    Instance one = newData.instance(i);
    one.setDataset(data);
    data.add(one);
}
This is from the Weka mailing list; I saved it a while back:
How do I merge two data files, a.arff and b.arff, into one dataset?
Depends what merge you are talking about. Do you just want to append
the second file (both have the same attributes), or do you want to
merge the attributes (both have the same number of instances)?
In the first case ("append"):
java weka.core.Instances append filename1 filename2 > output-file
and the latter case ("merge"):
java weka.core.Instances merge filename1 filename2 > output-file
Here's the relevant Javadoc:
http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main(java.lang.String[])
Use mergeInstances to merge two datasets:
public static Instances mergeInstances(Instances first, Instances second)
For datasets with the same number of instances, your code would look something like this:
ArffLoader arffLoader = new ArffLoader();
arffLoader.setFile(new File(fName1));
Instances newData1 = arffLoader.getDataSet();
arffLoader.setFile(new File(fName2));
Instances newData2 = arffLoader.getDataSet();
Instances mergedData = Instances.mergeInstances(newData1, newData2);
For datasets with the same attributes (the append case), I do not see a public Java method in Weka. But if you read the source, the append logic looks like this:
// Instances.java
// public static void main(String[] args) {
// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
    DataSource source1 = new DataSource(args[1]);
    DataSource source2 = new DataSource(args[2]);
    String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
    if (msg != null)
        throw new Exception("The two datasets have different headers:\n" + msg);
    Instances structure = source1.getStructure();
    System.out.println(source1.getStructure());
    while (source1.hasMoreElements(structure))
        System.out.println(source1.nextElement(structure));
    structure = source2.getStructure();
    while (source2.hasMoreElements(structure))
        System.out.println(source2.nextElement(structure));
}
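If you want the append to happen in memory rather than on stdout, the same pattern can be adapted. The sketch below is only an illustration: the helper name appendArff is made up, header equality is assumed rather than checked, and string attributes may still need the explicit setValue workaround shown in the merge() function later on this page.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: load the first file fully, then stream the second file's rows into it.
public static Instances appendArff(String fName1, String fName2) throws Exception {
    DataSource source1 = new DataSource(fName1);
    DataSource source2 = new DataSource(fName2);
    Instances combined = source1.getDataSet();    // first file, fully loaded
    Instances structure = source2.getStructure(); // header of the second file
    while (source2.hasMoreElements(structure)) {
        combined.add(source2.nextElement(structure)); // string values may not survive this add
    }
    return combined;
}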
I am reading from ini files and passing them via data providers to test cases.
(The data provider reads these and returns an Ini.Section[][] array. If there are several sections, testng runs the test that many times.)
Let's imagine there is a section like this:
[sectionx]
key1=111
key2=222
key3=aaa,bbb,ccc
What I want, in the end, is to read this data and execute the test case three times, each time with a different value of key3, the other keys being the same.
One way would be to copy&paste the section as many times as needed... which is clearly not an ideal solution.
The way to go about it would seem to be to create further copies of the section, then change the key3 value to aaa, bbb, and ccc in turn. The data provider would return the new array and testng would do the rest.
However, I cannot seem to create a new instance of the section object. Ini.Section is actually an interface; the implementing class org.ini4j.BasicProfileSection is not visible. It does not appear to be possible to create a copy of the object, or to inherit from the class. I can only manipulate existing objects of this type, not create new ones. Is there any way around it?
It seems that it is not possible to create copies of sections or the ini files. I ended up using this workaround:
First create an 'empty' ini file, that will serve as a sort of a placeholder. It will look like this:
[env]
test1=1
test2=2
test3=3
[1]
[2]
[3]
...with a sufficiently large number of sections, equal to or greater than the number of sections in the other ini files.
Second, read the data in the data provider. When there is a key that contains several values, create a new Ini object for each value. The new Ini object must be created from a new file object. (You can read the placeholder file over and over, creating any number of Ini files.)
Finally, you have to copy the content of the actual ini file into the placeholder file.
The following code works for me:
public static Ini copyIniFile(Ini originalFile) {
    Set<Entry<String, Section>> entries = originalFile.entrySet();
    Ini emptyFile;
    try {
        // Re-reading the placeholder file yields a fresh, independent Ini instance.
        FileInputStream file = new FileInputStream(new File(EMPTY_DATA_FILE_NAME));
        emptyFile = new Ini(file);
        file.close();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
    for (Entry<String, Section> entry : entries) {
        copySection(entry.getKey(), entry.getValue(), emptyFile);
    }
    return emptyFile;
}

public static Ini.Section copySection(String key, Ini.Section origin, Ini destinationFile) {
    Ini.Section newSection = destinationFile.get(key);
    if (newSection == null) throw new IllegalArgumentException();
    for (Entry<String, String> entry : origin.entrySet()) {
        newSection.put(entry.getKey(), entry.getValue());
    }
    return newSection;
}
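For completeness, here is roughly what the expansion step in the data provider could look like, built on copyIniFile above. This is a sketch: expandSection is a made-up name, and the comma-splitting mirrors the key3=aaa,bbb,ccc example.

import java.util.ArrayList;
import java.util.List;
import org.ini4j.Ini;

// Sketch: produce one Ini copy per value of a comma-separated key.
public static List<Ini> expandSection(Ini original, String sectionName, String multiKey) {
    List<Ini> copies = new ArrayList<Ini>();
    String raw = original.get(sectionName).get(multiKey); // e.g. "aaa,bbb,ccc"
    for (String value : raw.split(",")) {
        Ini copy = copyIniFile(original);           // fresh Ini backed by the placeholder file
        copy.get(sectionName).put(multiKey, value); // overwrite with a single value
        copies.add(copy);
    }
    return copies;
}

Each Ini in the returned list can then be wrapped into the Ini.Section[][] array that the data provider hands to testng.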
I have two files:
Grader.getFileInfo("data\\studentSubmissionA.txt");
Grader.teacherFiles("data\\TeacherListA.txt");
Both contain a list of math problems, but the TeacherList is unsolved in order to check that the StudentSubmission was not altered from the original version.
studentSubmission is sent to the Grader class and the method currently looks like this:
public static void getFileInfo(String fileName)
        throws FileNotFoundException {
    Scanner in = new Scanner(new File(fileName));
    while (in.hasNext()) {
        String fileContent = in.nextLine();
    }
}
and the teacherFiles method looks like:
public static void teacherFiles(String teacherFiles)
        throws FileNotFoundException {
    Scanner in = new Scanner(new File(teacherFiles));
    while (in.hasNext()) {
        String teacherContent = in.nextLine();
        String line = teacherContent.substring(0, teacherContent.indexOf('='));
    }
}
I don't know how to get the contents read by these methods into another method in order to compare them, since they're coming from a file; I'd have to put something in the method signature to pass them, and that doesn't work.
I tried putting them in one method, but that was a bust as well.
I don't know where to go from here.
And unfortunately, I can't use try/catch or arrays.
Is it possible to send the .substring(0, .indexOf('=')) result through the methods?
Like line = teacherFiles(teacherContent.substring(0, teacherContent.indexOf('='))); Is it possible to do this?
Think in more general terms. Observe that your two methods, getFileInfo and teacherFiles, are nearly identical except for a few nuances. So why not find a good way of merging the two functionalities and handle the nuances outside of them?
It is also logical that you cannot use arrays here: you would need to know the number of elements before initializing the array, yet the array would have to be initialized before you read the file. An array is therefore either overkill (you allocate 1000 elements in memory and use only 10) or insufficient (you allocate 10 elements but need 1000). Since you do not know the number of rows in advance, you need a different data structure for this task.
So create the following method:
public static List<String> readFile(String filePath) throws FileNotFoundException {
    Scanner s = new Scanner(new File(filePath));
    List<String> list = new ArrayList<String>();
    while (s.hasNextLine()) {
        list.add(s.nextLine()); // one entry per line of the file
    }
    s.close();
    return list;
}
Then use the method to read the student file and the teacher file, store the results into two separate List<String> variables, then iterate through them and compare them as you like. Again, think in more general terms.
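For example, the comparison itself could then be just a few lines. This is a sketch: submissionMatches is a made-up name, and the '=' handling mirrors the teacherFiles method from the question.

// Sketch: check that each student line still starts with the unsolved problem text.
public static boolean submissionMatches(String studentFile, String teacherFile)
        throws FileNotFoundException {
    List<String> student = readFile(studentFile);
    List<String> teacher = readFile(teacherFile);
    if (student.size() != teacher.size()) {
        return false; // different number of problems
    }
    for (int i = 0; i < student.size(); i++) {
        String problem = teacher.get(i);
        String unsolved = problem.substring(0, problem.indexOf('='));
        if (!student.get(i).startsWith(unsolved)) {
            return false; // the problem text was altered
        }
    }
    return true;
}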
I'm thinking about using HBase as a source for one of my MapReduce jobs. I know that TableInputFormat specifies one input split (and thus one mapper) per Region. However, this seems inefficient. I'd really like to have multiple mappers working on a given Region at once. Can I achieve this by extending TableInputFormatBase? Can you please point me to an example? Furthermore, is this even a good idea?
Thanks for the help.
You need a custom input format that extends InputFormat. You can get an idea of how to do this from the answer to the question "I want to scan lots of data (range-based queries); what optimizations can I do while writing the data so that the scan becomes faster?". This is a good idea if the data processing time is much greater than the data retrieval time.
Not sure if you can specify multiple mappers for a given region, but consider the following:
If you think one mapper per region is inefficient (maybe your data nodes don't have enough resources, e.g. #cpus), you can perhaps specify a smaller region size in the file hbase-site.xml, as sketched below.
Here's the page for the default configuration options if you want to look into changing that:
http://hbase.apache.org/configuration.html#hbase_default_configurations
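For instance, hbase.hregion.max.filesize is the property that controls how big a region may grow before it splits; the 1 GB value below is purely illustrative, so tune it for your cluster:

<!-- hbase-site.xml: a smaller max region size means more, smaller regions -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 1 GB, for illustration only -->
  <value>1073741824</value>
</property>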
Please note that by making the region size small, you will increase the number of files in your DFS, and this can limit the capacity of your Hadoop DFS depending on the memory of your namenode: the namenode's memory usage is directly related to the number of files in your DFS. This may or may not be relevant to your situation, as I do not know how your cluster is being used. There is never a silver-bullet answer to these questions!
1. It's absolutely fine; just make sure the key sets are mutually exclusive between the mappers.
2. Make sure you aren't creating too many clients, as this may lead to a lot of GC, since HBase block cache churning happens during reads.
With this MultipleScanTableInputFormat, you can use the MultipleScanTableInputFormat.PARTITIONS_PER_REGION_SERVER configuration to control how many mappers should execute against a single regionserver. The class groups all the input splits by their location (regionserver), and the RecordReader properly iterates through all aggregated splits for the mapper.
Here is the example:
https://gist.github.com/bbeaudreault/9788499#file-multiplescantableinputformat-java-L90
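Wiring it into a job would look roughly like this; the class and the PARTITIONS_PER_REGION_SERVER constant come from the gist, not from core HBase, so treat this as a sketch:

// Sketch: ask for 4 mappers' worth of splits per regionserver (names from the gist).
Configuration conf = HBaseConfiguration.create();
conf.setInt(MultipleScanTableInputFormat.PARTITIONS_PER_REGION_SERVER, 4);
Job job = new Job(conf, "multi-scan-per-regionserver");
job.setInputFormatClass(MultipleScanTableInputFormat.class);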
This is how the multiple aggregated splits for a single mapper are created:
private List<InputSplit> getAggregatedSplits(JobContext context) throws IOException {
    final List<InputSplit> aggregatedSplits = new ArrayList<InputSplit>();
    final Scan scan = getScan();
    for (int i = 0; i < startRows.size(); i++) {
        scan.setStartRow(startRows.get(i));
        scan.setStopRow(stopRows.get(i));
        setScan(scan);
        aggregatedSplits.addAll(super.getSplits(context));
    }
    // set the state back to where it was..
    scan.setStopRow(null);
    scan.setStartRow(null);
    setScan(scan);
    return aggregatedSplits;
}
And this is how the splits are partitioned by regionserver:
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> source = getAggregatedSplits(context);
    if (!partitionByRegionServer) {
        return source;
    }
    // Partition by regionserver
    Multimap<String, TableSplit> partitioned = ArrayListMultimap.<String, TableSplit>create();
    for (InputSplit split : source) {
        TableSplit cast = (TableSplit) split;
        String rs = cast.getRegionLocation();
        partitioned.put(rs, cast);
    }
    // ... the grouped splits are then combined into one split per regionserver (see the full gist)
This would be useful if you want to scan large regions (hundreds of millions of rows) with a conditioned scan that finds only a few records. It will also prevent ScannerTimeoutException.
package org.apache.hadoop.hbase.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class RegionSplitTableInputFormat extends TableInputFormat {

    public static final String REGION_SPLIT = "region.split";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        int regionSplitCount = conf.getInt(REGION_SPLIT, 0);
        List<InputSplit> superSplits = super.getSplits(context);
        if (regionSplitCount <= 0) {
            return superSplits;
        }
        List<InputSplit> splits = new ArrayList<InputSplit>(superSplits.size() * regionSplitCount);
        for (InputSplit inputSplit : superSplits) {
            TableSplit tableSplit = (TableSplit) inputSplit;
            System.out.println("splitting by " + regionSplitCount + " " + tableSplit);
            byte[] startRow0 = tableSplit.getStartRow();
            byte[] endRow0 = tableSplit.getEndRow();
            boolean discardLastSplit = false;
            if (endRow0.length == 0) {
                // An empty end row means "to the end of the table"; substitute a
                // maximal key so Bytes.split has a concrete range to work with.
                endRow0 = new byte[startRow0.length];
                Arrays.fill(endRow0, (byte) 255);
                discardLastSplit = true;
            }
            byte[][] split = Bytes.split(startRow0, endRow0, regionSplitCount);
            if (discardLastSplit) {
                split[split.length - 1] = new byte[0];
            }
            for (int regionSplit = 0; regionSplit < split.length - 1; regionSplit++) {
                byte[] startRow = split[regionSplit];
                byte[] endRow = split[regionSplit + 1];
                TableSplit newSplit = new TableSplit(tableSplit.getTableName(), startRow, endRow,
                    tableSplit.getLocations()[0]);
                splits.add(newSplit);
            }
        }
        return splits;
    }
}
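Using it from a job driver is then the usual table-mapper setup plus one extra line to swap the input format; MyMapper and the table name are placeholders in this sketch:

// Sketch: run a scan with 4 sub-splits per region.
Configuration conf = HBaseConfiguration.create();
conf.setInt(RegionSplitTableInputFormat.REGION_SPLIT, 4);
Job job = new Job(conf, "region-split-scan");
TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
    MyMapper.class, ImmutableBytesWritable.class, Result.class, job);
job.setInputFormatClass(RegionSplitTableInputFormat.class); // override the default TableInputFormat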
Currently, I'm copying one instance at a time from one dataset to the other. Is there a way to do this so that string mappings remain intact? mergeInstances works horizontally; is there an equivalent vertical merge?
This is one step of a loop I use to read datasets of the same structure from multiple arff files into one large dataset. There has got to be a simpler way.
Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
for (int i = 0; i < iNew.numInstances(); i++) {
    Instance nInst = iNew.instance(i);
    inst.add(nInst); // inst is the accumulating dataset
}
If you want a fully automated method that also properly copies string and nominal attributes, you can use the following function:
public static Instances merge(Instances data1, Instances data2) throws Exception {
    // Check where the string and nominal attributes are
    int asize = data1.numAttributes();
    boolean[] strings_pos = new boolean[asize];
    for (int i = 0; i < asize; i++) {
        Attribute att = data1.attribute(i);
        strings_pos[i] = ((att.type() == Attribute.STRING) ||
                          (att.type() == Attribute.NOMINAL));
    }

    // Create a new dataset based on the first one
    Instances dest = new Instances(data1);
    dest.setRelationName(data1.relationName() + "+" + data2.relationName());

    DataSource source = new DataSource(data2);
    Instances instances = source.getStructure();
    Instance instance = null;
    while (source.hasMoreElements(instances)) {
        instance = source.nextElement(instances);
        dest.add(instance);
        // Copy string and nominal attribute values explicitly
        for (int i = 0; i < asize; i++) {
            if (strings_pos[i]) {
                dest.instance(dest.numInstances() - 1)
                    .setValue(i, instance.stringValue(i));
            }
        }
    }
    return dest;
}
Please note that the following conditions should hold (they are not checked in the function):
Datasets must have the same attributes structure (number of attributes, type of attributes)
Class index has to be the same
Nominal values have to exactly correspond
To modify on the fly the values of the nominal attributes of data2 to match the ones of data1, you can use:
data2.renameAttributeValue(
    data2.attribute("att_name_in_data2"),
    "att_value_in_data2",
    "att_value_in_data1");
Why not make a new ARFF file which has the data from both of the originals? A simple
cat 1.arff > tmp.arff
tail -n+20 2.arff >> tmp.arff
where 20 is replaced by however many lines long your arff header is. This would then produce a new arff file with all of the desired instances, and you could read this new file with your existing code:
Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
You could also invoke weka on the command line using this documentation: http://old.nabble.com/how-to-merge-two-data-file-a.arff-and-b.arff-into-one-data-list--td22890856.html
java weka.core.Instances append filename1 filename2 > output-file
However, there is no function in the documentation http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main%28java.lang.String which will allow you to append multiple arff files natively within your java code. As of Weka 3.7.6, the code that appends two arff files is this:
// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
    DataSource source1 = new DataSource(args[1]);
    DataSource source2 = new DataSource(args[2]);
    String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
    if (msg != null)
        throw new Exception("The two datasets have different headers:\n" + msg);
    Instances structure = source1.getStructure();
    System.out.println(source1.getStructure());
    while (source1.hasMoreElements(structure))
        System.out.println(source1.nextElement(structure));
    structure = source2.getStructure();
    while (source2.hasMoreElements(structure))
        System.out.println(source2.nextElement(structure));
}
Thus it looks like Weka itself simply iterates through all of the instances in a data set and prints them, the same process your code uses.
Another possible solution is to use addAll from java.util.AbstractCollection, since Instances implements it:
instances1.addAll(instances2);
I've just shared an extended weka.core.Instances class with methods like innerJoin, leftJoin, fullJoin, update and union.
table1.makeIndex(table1.attribute("Continent_ID"));
table2.makeIndex(table2.attribute("Continent_ID"));
Instances result = table1.leftJoin(table2);
Instances can have different numbers of attributes; levels of NOMINAL and STRING variables are merged together if necessary.
Sources and some examples are here on GitHub: weka.join.
Here's the situation: I have three objects, all with names ending in List, and a method with a String parameter:
gameList = new StringBuffer();
appsList = new StringBuffer();
movieList = new StringBuffer();
public void fetchData(String category) {
    URL url = null;
    BufferedReader input;
    gameList.delete(0, gameList.length());
Is there a way to do something like the following:
public void fetchData(String category) {
    URL url = null;
    BufferedReader input;
    "category"List.delete(0, gameList.length());
so that I can choose which of the lists is used based on the String parameter?
I suggest you create a HashMap<String, StringBuffer> and use that:
Map<String, StringBuffer> map = new HashMap<String, StringBuffer>();
map.put("game", new StringBuffer());
map.put("apps", new StringBuffer());
map.put("movie", new StringBuffer());
...
public void fetchData(String category) {
    StringBuffer buffer = map.get(category);
    if (buffer == null) {
        // No such category. Throw an exception?
    } else {
        // Do whatever you need to
    }
}
If the lists are fields of your object - yes, using reflection:
Field field = getClass().getDeclaredField(category + "List");
StringBuffer buffer = (StringBuffer) field.get(this);
But generally you should avoid reflection. And if your set of objects is fixed, i.e. it doesn't change, simply use an if-clause.
The logically simplest way taking your question as given would just be:
StringBuffer which;
if (category.equals("game"))
    which = gameList;
else if (category.equals("apps"))
    which = appsList;
else if (category.equals("movie"))
    which = movieList;
else
    ... some kind of error handling ...
which.delete(0, which.length());
As Jon Skeet noted, if the list is big or dynamic you probably want to use a map rather than an if/else/if.
That said, I'd encourage you to use an integer constant or an enum rather than a String. Like:
enum ListType { GAME, APP, MOVIE }

void deleteList(ListType category) {
    if (category == ListType.GAME)
        ... etc ...
}
In this simple example, if this is all you'd ever do with it, it wouldn't matter much. But I'm working on a system now that uses String tokens for this sort of thing all over the place, and it creates a lot of problems.
Suppose you call the function and by mistake you pass in "app" instead of "apps", or "Game" instead of "game". Or maybe you're thinking you added handling for "song" yesterday but in fact you went to lunch instead. This will successfully compile, and you won't have any clue that there's a problem until run-time. If the program does not throw an error on an invalid value but instead takes some default action, you could have a bug that's difficult to track down. But with an enum, if you mis-spell the name or try to use one that isn't defined, the compiler will immediately alert you to the error.
Suppose that some functions take special action for some of these options but not others. Like you find yourself writing
if (category.equals("app"))
getSpaceRequirements();
and that sort of thing. Then someone reading the program sees a reference to "app" here, a reference to "game" 20 lines later, etc. It could be difficult to determine what all the possible values are. Any given function might not explicitly reference them all. But with an enum, they're all neatly in one place.
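Combining the enum advice with the map idea from the earlier answer gives something like this sketch; the names mirror the question's fields, and EnumMap is simply the Map implementation specialized for enum keys:

import java.util.EnumMap;
import java.util.Map;

public class Fetcher {
    enum ListType { GAME, APPS, MOVIE }

    // One buffer per category; a misspelled category is now a compile error.
    private final Map<ListType, StringBuffer> lists =
        new EnumMap<ListType, StringBuffer>(ListType.class);

    public Fetcher() {
        for (ListType t : ListType.values()) {
            lists.put(t, new StringBuffer());
        }
    }

    public void fetchData(ListType category) {
        StringBuffer buffer = lists.get(category);
        buffer.delete(0, buffer.length()); // clear the chosen list
        // ... fetch and append data for this category ...
    }
}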
You could use a switch statement (switching on a String requires Java 7):
StringBuffer buffer = null;
switch (category) {
    case "game":  buffer = gameList;  break;
    case "apps":  buffer = appsList;  break;
    case "movie": buffer = movieList; break;
    default: return; // unknown category
}