Mahout : To read a custom input file - java

I was playing with Mahout and found that the FileDataModel accepts data in the format
userId,itemId,pref(long,long,Double).
I have some data which is of the format
String,long,double
What is the best/easiest method to work with this dataset on Mahout?

One way to do this is by creating an extension of FileDataModel. You'll need to override the readUserIDFromString(String value) method to use some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.
For example, assuming you have an initialized MemoryIDMigrator, you could do this:
@Override
protected long readUserIDFromString(String stringID) {
    // Convert the string ID to a long and store the mapping for reverse lookup
    long result = memoryIDMigrator.toLongID(stringID);
    memoryIDMigrator.storeMapping(result, stringID);
    return result;
}
This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
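If you want to see the hashing idea without Mahout on the classpath, here is a minimal sketch. StringIdMapper and its method names are hypothetical, not Mahout API. It assumes an MD5 digest folded into a long, which is in the spirit of AbstractIDMigrator but should not be read as its exact implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: maps string IDs to longs and back.
class StringIdMapper {
    private final Map<Long, String> longToString = new HashMap<>();

    // Hash the string with MD5 and fold the first 8 bytes into a long.
    static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long result = 0L;
            for (int i = 0; i < 8; i++) {
                result = (result << 8) | (digest[i] & 0xFF);
            }
            return result;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    // Convert a string ID to a long and remember it for reverse lookup.
    long toLongID(String stringID) {
        long id = hash(stringID);
        longToString.put(id, stringID);
        return id;
    }

    // Return the original string, or null if this long ID was never stored.
    String toStringID(long longID) {
        return longToString.get(longID);
    }
}
```

The same string always hashes to the same long, so the mapping is stable across runs; only the reverse lookup needs the in-memory map.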

userId and itemId can be strings; this is the CustomFileDataModel, which converts each string into an integer and keeps the (String, ID) map in memory. After generating recommendations, you can get the string back from the ID.

Assuming that your input fits in memory, loop through it. Track the ID for each string in a dictionary. If it does not fit in memory, use sort and then group by to accomplish the same idea.
In Python:
import sys
next_id = 0
str_to_id = {}
for line in sys.stdin:
    fields = line.strip().split(',')
    # Look up the numeric ID already assigned to this string, if any
    this_id = str_to_id.get(fields[0])
    if this_id is None:
        next_id += 1
        this_id = next_id
        str_to_id[fields[0]] = this_id
    fields[0] = str(this_id)
    print(','.join(fields))

Related

How to highlight first different line in a quite long String?

I have a human-readable file with several hundred rows.
Each row is quite short (~20-30 characters).
From time to time I need to compare that string with other strings using equals.
If the strings differ, I need to find the first row that differs. Sure, I can do it manually:
in a loop, find the first character that differs, then find the previous and following '\n', but this code is not beautiful from my point of view.
Is there any other way to achieve this using an external library?
There's no need for a library; what you ask is rather straightforward. It's also specific enough that no library is likely to have it, so just write it yourself.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
...
Optional<String> findFirstDifferentLine(Path file, Collection<String> rows) throws IOException {
    try (var fileStream = Files.lines(file)) { // Need to close
        var fileIt = fileStream.iterator();
        var rowIt = rows.iterator();
        // Walk both sequences in step and return the first file line that differs
        while (fileIt.hasNext() && rowIt.hasNext()) {
            var fileItem = fileIt.next();
            if (!Objects.equals(fileItem, rowIt.next())) {
                return Optional.of(fileItem);
            }
        }
        // If the file has extra lines, the first extra line is the difference
        return Optional.of(fileIt)
                .filter(Iterator::hasNext)
                .map(Iterator::next);
    }
}
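For the case described in the question, where both sides are multi-line strings rather than a file and a collection, a minimal sketch could look like the following. LineDiff and firstDifferentLine are made-up names for illustration, and the code assumes rows are separated by '\n'.

```java
// Hypothetical helper: finds the first row at which two multi-line strings differ.
class LineDiff {
    // Returns the zero-based index of the first differing row, or -1 if all rows match.
    static int firstDifferentLine(String left, String right) {
        // The -1 limit keeps trailing empty rows, so "a\n" and "a" count as different
        String[] leftRows = left.split("\n", -1);
        String[] rightRows = right.split("\n", -1);
        int common = Math.min(leftRows.length, rightRows.length);
        for (int i = 0; i < common; i++) {
            if (!leftRows[i].equals(rightRows[i])) {
                return i;
            }
        }
        // One string ran out of rows first: the difference starts right after the shared prefix
        return leftRows.length == rightRows.length ? -1 : common;
    }
}
```

Returning an index rather than the row text lets the caller highlight the row in either string.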

Is there a way to add a string at the end of a dot without manually typing it?

I want to make an OreBase class so that I don't have to make a new class for every new ore, because they should all do pretty much the same thing: 1. exist, 2. drop the appropriate item, which is named by the part of the ore name before the underscore (ruby_ore -> ruby). To return a ruby for a ruby_ore I need to return ModItems.RUBY. I can get the string "RUBY" from "ruby_ore", but I don't know how to properly add it after "ModItems.". Is this possible?
If that isn't possible, is it maybe possible to put "ModItems." and the item string, e.g. "RUBY", in a single string, e.g. "ModItems.RUBY", and run that string as code?
@Override
public Item getItemDropped(IBlockState state, Random rand, int fortune) {
    int a = ore_name.indexOf('_'); //ex. ore_name = ruby_ore
    String b = ore_name.substring(0,a); //ex. ruby
    String c = b.toUpperCase();//ex. RUBY
    return ModItems.b;//i want this to do ex. ModItems.RUBY
}
So if the ore_name is ex. biotite_ore the function should return ModItems.BIOTITE, for pyroxine_ore it should return ModItems.PYROXINE, etc.
There are at least 3 ways of doing this. Take your pick.
1. Make ModItems an Enum containing an Item object:
int a = ore_name.indexOf('_');
String b = ore_name.substring(0,a);
String c = b.toUpperCase();
return ModItems.valueOf(c).getItem();
Pros: Simple, no need to update a map if a new item is added
Cons: Throws an exception if the ModItem doesn't exist
2. Make a Map<String, ModItem> (preferred):
return oreMap.get(ore_name);
Pros: Simple, easy to implement
Cons: You have to update your map every time you add an item, and get returns null for unknown ores
3. Reflection:
int a = ore_name.indexOf('_');
String b = ore_name.substring(0,a);
String c = b.toUpperCase();
return (Item) ModItems.class.getDeclaredField(c).get(null); // get returns Object, so a cast is needed
Pros: No need to update a map for every new item
Cons: Overkill, throws ugly checked exceptions, and is generally frowned upon unless absolutely necessary.
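The exception in option 1 can be contained by catching it and returning an Optional instead. The sketch below is self-contained and uses a stand-in enum rather than real Minecraft or Forge classes; OreDrops, ModItem, and dropFor are hypothetical names.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch: resolves an ore name such as "ruby_ore" to an enum constant.
class OreDrops {
    // Stand-in for the mod's item registry; a real mod would hold Item objects here.
    enum ModItem { RUBY, BIOTITE, PYROXINE }

    // Returns the matching item, or an empty Optional if the name is malformed or unknown.
    static Optional<ModItem> dropFor(String oreName) {
        int underscore = oreName.indexOf('_');
        if (underscore <= 0) {
            return Optional.empty(); // no prefix before the underscore
        }
        String constant = oreName.substring(0, underscore).toUpperCase(Locale.ROOT);
        try {
            return Optional.of(ModItem.valueOf(constant));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // valueOf throws for names that are not constants
        }
    }
}
```

With this shape, adding a new ore only means adding one enum constant, and an unknown ore name yields an empty result rather than a crash.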

List of regex results instead of first result in Kotlin

Using the following code, I can set a couple variables to my matches. I want to do the same thing, but populate a map of all instances of these results. I'm struggling and could use help.
val (dice, level) = Regex("""([0-9]*d[0-9]*) at ([0-9]*)""").matchEntire(text)?.destructured!!
This code works for one instance, none of my attempts at matching multiple are working.
Your solution is short and readable. Here are a few options; which one you use is largely a matter of preference. You can get a Map directly by using the associate method as follows.
val diceLevels = levelMatches.associate { matched ->
    val (diceTwo, levelTwo) = matched.destructured
    (levelTwo to diceTwo)
}
Note: This returns a read-only Map. If you want a MutableMap, you can use associateTo.
If you want to be concise, you can skip destructuring into local variables and index the groups directly.
val diceLevels = levelMatches.associate {
    (it.groupValues[2] to it.groupValues[1])
}
Or, using let, you can also avoid declaring levelMatches as a local variable if it isn't used elsewhere:
val diceLevels = Regex("([0-9]+d[0-9]+) at ([0-9]+)")
    .findAll(text)
    .let { levelMatches ->
        levelMatches.associate {
            (it.groupValues[2] to it.groupValues[1])
        }
    }
I realized this was nowhere near as complicated as I was making it. Here is my solution. Is there something more elegant?
val diceLevels = mutableMapOf<String, String>() // level -> dice
val levelMatches = Regex("([0-9]+d[0-9]+) at ([0-9]+)").findAll(text)
levelMatches.forEach { matched ->
    val (diceTwo, levelTwo) = matched.destructured
    diceLevels[levelTwo] = diceTwo
}

Hadoop Custom Partitioner not behaving according to the logic

The example here works. I have tried the same approach on my dataset.
Sample Dataset:
OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;
Treating each line as a string, my Mapper output is:
key -> string[2], value -> string.
My Partitioner code:
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    String keyStr = key.toString();
    if(keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}
In my data set most IDs are 137176. Reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining IDs. I'm getting two output files, but the IDs are evenly distributed across both. What's going wrong in my program?
Explicitly set in the driver that you want to use your custom Partitioner, by calling job.setPartitionerClass(YourPartitioner.class);. If you don't, the default HashPartitioner is used.
Change the String comparison from == to .equals(), i.e., change if(keyStr == "137176") { to if(keyStr.equals("137176")) {.
To save some time, it may be faster to declare a Text variable at the beginning of the partitioner, like this: Text KEY = new Text("137176");, and then compare the input key with KEY directly (again using the equals() method) instead of converting it to a String every time. The two may well be equivalent, though. So, what I suggest is:
Text KEY = new Text("137176");
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
Another suggestion: if the network load is heavy, parse the map output key as a VIntWritable and change the Partitioner accordingly.
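The == pitfall can be reproduced without a Hadoop cluster. The sketch below uses plain String keys instead of Text, and PartitionDemo and partitionFor are hypothetical names; it only illustrates why the comparison has to go through equals().

```java
// Hypothetical stand-in for the partitioner logic, using String instead of Hadoop's Text.
class PartitionDemo {
    private static final String HOT_KEY = "137176";

    // Sends the hot key to partition 0 and every other key to partition 1,
    // falling back to 0 when only one reducer is defined.
    static int partitionFor(String key, int reducersDefined) {
        // equals() compares contents; == would compare object identity,
        // which is false for a string freshly built by key.toString()
        return HOT_KEY.equals(key) ? 0 : 1 % reducersDefined;
    }
}
```

Calling PartitionDemo.partitionFor(new String("137176"), 2) yields 0 even though the argument is a different object from the constant, which is exactly the case where the == version falls through to the else branch.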

What is the best way to perform index ranged searches in OrientDB from Java?

We are using OrientDB in embedded mode, and are hoping to access it directly with Java API calls (not using the SQL-ish language). We have an index, and need to perform a ranged search on it. Here is the only way I have found so far:
String startAt = createInternalOIndexSearchableKey(actualKey);
Index<Edge> index = graph.getIndex(indexName, Edge.class);
OrientIndex orientIndex = (OrientIndex) index;
OIndex oIndex = orientIndex.getUnderlying();
boolean INCLUSIVE = true;
boolean ASCENDING = true;
OIndexCursor cursor = oIndex.iterateEntriesMajor(startAt, INCLUSIVE, ASCENDING);
while(cursor.hasNext())
{
Entry<Object, OIdentifiable> entry = cursor.nextEntry();
...process the entry here
It feels uncomfortable to be deviating so far from the normal public API. Especially the implementation of createInternalOIndexSearchableKey:
private String createInternalOIndexSearchableKey(String actualKey)
{
    // NOTE: Keys passed to OIndex.iterateEntriesMajor must
    // be in the (undocumented) format: EdgeLabel!=!ActualKey
    return KEY_CAN_DOWNLOAD_PUBCODETIMESTAMP + "!=!" + actualKey;
}
Is there a better way to do this?
OIndex and OIndexCursor are part of the public API of the Document database, so there is no need to worry; you can use them.
However, the main aim of this API is to provide flexibility to the SQL engine and other internal components, so it is not very convenient.
I would recommend using SQL queries: they provide the same level of flexibility and are more compact, which makes them more convenient to use.
