When I run the LensKit demo program I get this error:
[main] ERROR org.grouplens.lenskit.data.dao.DelimitedTextRatingCursor - C:\Users\sean\Desktop\ml-100k\u - Copy.data:4: invalid input, skipping line
I reworked the ML 100k data set so that it only holds these lines, although I don't see how this would affect it:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244
Here is the code I am using:
public class HelloLenskit implements Runnable {
    public static void main(String[] args) {
        HelloLenskit hello = new HelloLenskit(args);
        try {
            hello.run();
        } catch (RuntimeException e) {
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }

    private String delimiter = "\t";
    private File inputFile = new File("C:\\Users\\sean\\Desktop\\ml-100k\\u - Copy.data");
    private List<Long> users;

    public HelloLenskit(String[] args) {
        int nextArg = 0;
        boolean done = false;
        while (!done && nextArg < args.length) {
            String arg = args[nextArg];
            if (arg.equals("-e")) {
                delimiter = args[nextArg + 1];
                nextArg += 2;
            } else if (arg.startsWith("-")) {
                throw new RuntimeException("unknown option: " + arg);
            } else {
                inputFile = new File(arg);
                nextArg += 1;
                done = true;
            }
        }
        users = new ArrayList<Long>(args.length - nextArg);
        for (; nextArg < args.length; nextArg++) {
            users.add(Long.parseLong(args[nextArg]));
        }
    }

    public void run() {
        // We first need to configure the data access.
        // We will use a simple delimited file; you can use something else like
        // a database (see JDBCRatingDAO).
        EventDAO base = new SimpleFileRatingDAO(inputFile, delimiter);
        // Reading directly from CSV files is slow, so we'll cache it in memory.
        // You can use SoftFactory here to allow ratings to be expunged and re-read
        // as memory limits demand. If you're using a database, just use it directly.
        EventDAO dao = new EventCollectionDAO(Cursors.makeList(base.streamEvents()));

        // The second step is to create the LensKit configuration...
        LenskitConfiguration config = new LenskitConfiguration();
        // ... configure the data source
        config.bind(EventDAO.class).to(dao);
        // ... and configure the item scorer. The bind and set methods
        // are what you use to do that. Here, we want an item-item scorer.
        config.bind(ItemScorer.class)
              .to(ItemItemScorer.class);

        // Let's use personalized mean rating as the baseline/fallback predictor.
        // Two-step process:
        // First, use the user mean rating as the baseline scorer.
        config.bind(BaselineScorer.class, ItemScorer.class)
              .to(UserMeanItemScorer.class);
        // Second, use the item mean rating as the base for user means.
        config.bind(UserMeanBaseline.class, ItemScorer.class)
              .to(ItemMeanRatingItemScorer.class);
        // ... and normalize ratings by the baseline prior to computing similarities.
        config.bind(UserVectorNormalizer.class)
              .to(BaselineSubtractingUserVectorNormalizer.class);

        // There are more parameters, roles, and components that can be set. See the
        // JavaDoc for each recommender algorithm for more information.

        // Now that we have a configuration, build a recommender from it and the
        // data source. This will compute the similarity matrix and return a
        // recommender that uses it.
        Recommender rec = null;
        try {
            rec = LenskitRecommender.build(config);
        } catch (RecommenderBuildException e) {
            throw new RuntimeException("recommender build failed", e);
        }

        // We want to recommend items...
        ItemRecommender irec = rec.getItemRecommender();
        assert irec != null; // not null because we configured one
        // ... for each user:
        for (long user : users) {
            // get 10 recommendations for the user
            List<ScoredId> recs = irec.recommend(user, 10);
            System.out.format("Recommendations for %d:\n", user);
            for (ScoredId item : recs) {
                System.out.format("\t%d\n", item.getId());
            }
        }
    }
}
I am really lost on this one and would appreciate any help. Thanks for your time.
The last line of your input file only contains one field. Each input file line needs to contain 3 or 4 fields.
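If you can't easily fix the file by hand, a defensive option is to filter out malformed lines before handing the file to SimpleFileRatingDAO. This is only a sketch of a hypothetical pre-processing helper (not part of LensKit), assuming the tab delimiter used above:

import java.io.*;

// Copies only the lines with 3 or 4 tab-separated fields into a cleaned
// temporary file, which can then be passed to SimpleFileRatingDAO.
public static File cleanRatingsFile(File input) throws IOException {
    File cleaned = File.createTempFile("ratings", ".data");
    try (BufferedReader in = new BufferedReader(new FileReader(input));
         PrintWriter out = new PrintWriter(new FileWriter(cleaned))) {
        String line;
        while ((line = in.readLine()) != null) {
            int fields = line.split("\t", -1).length;
            if (fields == 3 || fields == 4) {
                out.println(line);
            }
        }
    }
    return cleaned;
}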
I want to load some data using protocol buffers (JSON was way too slow on Android), but somehow my repeated field called company contains 6 copies of every element, although I am not storing any duplicates.
How do I know that it shouldn't contain duplicates?
I set a counter for every object I save, and it had the expected length.
This is my schema:
syntax = "proto3";
[...]
message CompanyProtoRepository {
    // THIS FIELD CONTAINS DUPLICATES!
    repeated CompanyProto company = 1;
}
How I store my data:
public void writeToFile(String fileName) {
    CompanyProtos.CompanyProtoRepository repo = loadRepository();
    // try-with-resources closes the stream even if writeTo() fails
    try (OutputStream outputStream = mContext.openFileOutput(fileName, Context.MODE_PRIVATE)) {
        repo.writeTo(outputStream);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
private CompanyProtos.CompanyProtoRepository loadRepository() {
    CompanyLoaderService jsonLoader = new JsonCompanyLoaderService(mContext.getResources());
    CompanyProtos.CompanyProtoRepository.Builder repo = CompanyProtos.CompanyProtoRepository.newBuilder();

    int counter = 0; // Will be 175, which is correct (every company exactly once)
    // Will contain every id exactly once -> correct!
    HashMap<Integer, Integer> map = new HashMap<>();

    for (Company company : jsonLoader.getCompanies()) {
        counter++;
        if (!map.containsKey(company.getId()))
            map.put(company.getId(), 1);
        else
            map.put(company.getId(), map.get(company.getId()) + 1);

        CompanyProtos.CompanyProto proto = toProto(company);
        if (!repo.getCompanyList().contains(proto))
            repo.addCompany(proto);
    }
    return repo.build();
}
And this is how I load my data:
private List<Company> loadCompanies() {
    CompanyProtos.CompanyProtoRepository repo = null;
    try {
        InputStream inputStream = mContext.getResources().openRawResource(R.raw.company_buffers);
        repo = CompanyProtos.CompanyProtoRepository.parseFrom(inputStream);

        ArrayList<Company> list = new ArrayList<>();
        for (CompanyProtos.CompanyProto companyProto : repo.getCompanyList()) {
            list.add(fromProto(companyProto));
        }
        // This list contains every company 6 times!
        return list;
    } catch (Exception ex) {
        ex.printStackTrace();
        return Collections.emptyList();
    }
}
Of course I expected to get each company only once, since I verified that I store each company only once inside my CompanyProtoRepository, not 6 times.
Oh my f**** goodness.
I just spent hours and hours trying to fix this bug.
It turns out I was reading from an old, corrupt data set, not from the file I was actually writing to: writeToFile() saves to the app's internal storage via openFileOutput(), but loadCompanies() reads the stale resource bundled at R.raw.company_buffers.
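For anyone landing here with the same symptom: since the write goes through openFileOutput, the read has to go through openFileInput with the same file name, not openRawResource. A minimal sketch of the corrected loader (readFromFile is a hypothetical helper; the file name is whatever was passed to writeToFile):

private CompanyProtos.CompanyProtoRepository readFromFile(String fileName) {
    // Read the file previously written by writeToFile(), not the stale
    // res/raw resource that shipped with the APK.
    try (InputStream inputStream = mContext.openFileInput(fileName)) {
        return CompanyProtos.CompanyProtoRepository.parseFrom(inputStream);
    } catch (Exception e) {
        e.printStackTrace();
        return CompanyProtos.CompanyProtoRepository.getDefaultInstance();
    }
}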
My problem: I have a model engine that takes a list of parameter configurations and evaluates a double value corresponding to the metric associated with each configuration. I have six parameters, and each of them can vary according to a list. I want to find, by brute force, the best parameter configuration, i.e. the combination that produces the highest value of the output metric. Since I'm learning Spark, I realized that with the cartesian product operation I can easily generate the combinations and split the RDD to be processed in parallel. So I came up with this driver program:
public static void main(String[] args) {
    String scriptName = "model.mry";
    String scriptStr = null;
    try {
        scriptStr = new String(Files.readAllBytes(Paths.get(scriptName)));
    } catch (IOException ex) {
        Logger.getLogger(BruteForceDriver.class.getName()).log(Level.SEVERE, null, ex);
        System.exit(1);
    }
    final String script = scriptStr;

    SparkConf conf = new SparkConf()
            .setAppName("wordCount")
            .setSparkHome("/home/danilo/bin/spark-2.2.0-bin-hadoop2.7")
            .setJars(new String[]{"/home/danilo/NetBeansProjects/SparkHello1/target/SparkHello1-1.0.jar",
                "/home/danilo/.m2/repository/org/modcs/mercury/4.7/mercury-4.7.jar"})
            .setMaster("spark://danilo-desktop:7077");
    String baseDir = "/home/danilo/NetBeansProjects/SimulationOptimization/workspace/";
    JavaSparkContext sc = new JavaSparkContext(conf);

    final int NUM_SERVICES = 6;
    final int QTD = 3;

    JavaRDD<Service>[] providers = new JavaRDD[NUM_SERVICES];
    for (int i = 1; i <= NUM_SERVICES; i++) {
        providers[i - 1] = sc.textFile(baseDir + "provider" + i + ".mat")
                .filter((t1) -> !t1.contains("#") && !t1.trim().isEmpty())
                .map(Service.createParser("" + i))
                .zipWithIndex().filter((t1) -> {
                    return t1._2 < QTD;
                }).keys();
    }

    JavaPairRDD c = null;
    JavaRDD<Service> p = providers[0];
    for (int i = 1; i < NUM_SERVICES; i++) {
        if (c == null) {
            c = p.cartesian(providers[i]);
        } else {
            c = c.cartesian(providers[i]);
        }
    }
    JavaRDD<List<Service>> cartesian = c.map(new FlattenTuple<>());

    final Broadcast<ModelEvaluator> model = sc.broadcast(new ModelEvaluator(script));

    JavaPairRDD<Double, List<Service>> results = cartesian.mapToPair(
            (t) -> {
                try {
                    double val = model.value().evaluateModel(t);
                    System.out.println(val);
                    return new Tuple2<>(val, t);
                } catch (Exception ex) {
                    return null;
                }
            }
    );

    results.sortByKey().collect().forEach((t) -> {
        System.out.println(t._1 + ", " + t._2);
    });

    sc.close();
}
The "QTD" variable allows me to control the size the interval which each parameter will vary. For QTD = 3, I'll have 3^6 = 729 combinations. The problem is that it is taking so long to compute all those combinations. I wrote a implementations using only normal Java threads, and the runtime is about 40 seconds. Using my Spark driver program, the runtime more than 6 minutes. Why is my Spark program so slow compared to the plain Java multi-thread program?
Edit:
I put:
results = results.cache();
before sorting the results, and now the runtime is 2.5 minutes. (Without the cache, sortByKey first runs a sampling job to build its range partitioner and then collect recomputes the whole mapToPair a second time, so every model evaluation ran twice.)
Edit 2:
I created an RDD with the cartesian product of the parameters by hand, instead of using the operation provided by the framework. Now my runtime is 1'25''. It makes sense now, since there is some overhead in starting the driver and shipping the jars to the workers.
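For reference, building the combinations "by hand" can be as simple as enumerating them on the driver and calling parallelize, so only the model evaluations run on the cluster. A sketch, where providerLists (plain local lists of Service, one per parameter) is an assumption on my part:

// Enumerate all parameter combinations locally on the driver.
List<List<Service>> combos = new ArrayList<>();
combos.add(new ArrayList<>());                    // start with the empty combination
for (List<Service> provider : providerLists) {    // one list of values per parameter
    List<List<Service>> extended = new ArrayList<>();
    for (List<Service> partial : combos) {
        for (Service s : provider) {
            List<Service> next = new ArrayList<>(partial);
            next.add(s);
            extended.add(next);
        }
    }
    combos = extended;
}
// One flat RDD of 3^6 = 729 combinations; no nested cartesian() tuples to flatten.
JavaRDD<List<Service>> cartesian = sc.parallelize(combos);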
I have written an application that retrieves Active Directory groups and flattens them, i.e. recursively includes members of subgroups in the top parent group.
It works fine for small groups, but with larger groups I am facing a problem.
If the number of members does not exceed 1500, they are listed in the member attribute. If there are more, then this attribute is empty and an attribute named member;range=0-1499 appears, containing the first 1500 members.
My problem is that I don't know how to get the rest of the member set beyond 1500.
We have groups with 8-12 thousand members. Do I need to run another query?
On the Microsoft site I have seen a C# code snippet on a similar matter, but couldn't make much sense of it, as it showed how to specify a range but not how to plug it into the query. If someone knows how to do it in Java, I'd appreciate a tip.
This will obviously give you the next ones:
String[] returnedAtts = { "member;range=1500-2999" };
You need to fetch the users chunk by chunk (in chunks of 1500): just keep a counter and update your search to retrieve the next range, until you have all of them.
With your help, I now have fully working code:
// Initialize
LdapContext ldapContext = null;
NamingEnumeration<SearchResult> results = null;
NamingEnumeration<?> members = null;
try {
    // Initialize properties
    Properties properties = new Properties();
    properties.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    properties.put(Context.PROVIDER_URL, "ldap://" + ldapUrl);
    properties.put(Context.SECURITY_PRINCIPAL, adminLoginADOnPremise);
    properties.put(Context.SECURITY_CREDENTIALS, adminPasswordADOnPremise);

    // Initialize ldap context
    ldapContext = new InitialLdapContext(properties, null);

    int range = 0;
    boolean finish = false;
    while (!finish) {
        // Set search controls
        SearchControls searchCtls = new SearchControls();
        searchCtls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        searchCtls.setReturningAttributes(generateRangeArray(range));

        // Get results
        results = ldapContext.search(ldapBaseDn, String.format("(samAccountName=%s)", groupName), searchCtls);
        if (results.hasMoreElements()) {
            SearchResult result = results.next();
            try {
                members = result.getAttributes().get(generateRangeString(range)).getAll();
                while (members.hasMore()) {
                    String distinguishedName = (String) members.next();
                    logger.debug(distinguishedName);
                }
                range++;
            } catch (Exception e) {
                // A failure here means there are no more results
                finish = true;
            }
        }
    }
} catch (NamingException e) {
    logger.error(e.getMessage());
    throw new Exception(e.getMessage());
} finally {
    if (ldapContext != null) {
        ldapContext.close();
    }
    if (results != null) {
        results.close();
    }
}
Two functions are missing from the working code example by @Nicolas; I guess they would be something like:
public static String[] generateRangeArray(int i) {
    String range = "member;range=" + i * 1500 + "-" + ((i + 1) * 1500 - 1);
    String[] returnedAtts = { range };
    return returnedAtts;
}

public static String generateRangeString(int i) {
    String range = "member;range=" + i * 1500 + "-" + ((i + 1) * 1500 - 1);
    return range;
}
Note that the code does not handle the case where the AD group is small enough that the member attribute does not need to be "chunked" at all, i.e. when a plain "member" attribute is present instead.
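To cover that case, the loop body could first look for the plain member attribute and only fall back to ranged retrieval when it is missing or empty. A sketch, reusing result, logger, and finish from the code above:

Attribute memberAttr = result.getAttributes().get("member");
if (memberAttr != null && memberAttr.size() > 0) {
    // Small group: every member fits into the plain attribute.
    NamingEnumeration<?> all = memberAttr.getAll();
    while (all.hasMore()) {
        logger.debug((String) all.next());
    }
    finish = true; // nothing more to fetch
} else {
    // Large group: fall back to the member;range=... chunked retrieval above.
}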
I'm getting results from a PHP script and parsing them into a String array.
ParseResults[0] is the ID returned from the database.
What I'm trying to do is show a message box for each ID only once (until the application is restarted, of course).
My code looks something like this, but I can't figure out what stops it from working properly.
public void ShowNotification() {
    try {
        ArrayList<String> SearchGNArray = OverblikIntetSvar(Main.BrugerID);
        // SearchGNArray = gets undecoded rows of information from the DB
        for (int i = 0; i < SearchGNArray.size(); i++) {
            String[] ParseTilArray = ParseResultater(SearchGNArray.get(i));
            // ParseTilArray = parse results and decode to usable values
            // ParseTilArray[0] = the index containing the ID we'd like
            // to keep track of, if a popup about it was already shown
            if (SearchPopUpArray.size() == 0) {
                // No IDs yet in SearchPopUpArray
                // SearchPopUpArray = where we store the already-shown IDs
                SearchPopUpArray.add(ParseTilArray[0]);
                // Create Messagebox
            }
            boolean match = false;
            for (int ii = 0; ii < SearchPopUpArray.size(); ii++) {
                try {
                    match = SearchPopUpArray.get(ii).equals(ParseTilArray[0]);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                if (match) {
                    // There is a match
                    break; // Break so we don't create a popup
                } else {
                    // No match in SearchPopUpArray
                    SearchPopUpArray.add(ParseTilArray[0]);
                    // Create a Messagebox
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
As of now I have 2 rows, so there should be two IDs, 101 and 102. It shows 102 once, and then it just keeps creating message boxes about 101.
You are not incrementing the right variable in the second for-loop:

for (int ii = 0; ii < SearchPopUpArray.size(); i++) {
/*                                             ^
                                               |
                                      should be ii++ */
}
It might help to use more descriptive variable names, like indexGN and indexPopup, to avoid this sort of issue.
I've deleted the second for loop:
for (int ii = 0; ii < SearchPopUpArray.size(); ii++) {
    try {
        match = SearchPopUpArray.get(ii).equals(ParseTilArray[0]);
    } catch (Exception e) {
        e.printStackTrace();
    }
    if (match) {
        // There is a match
    } else {
        // No match in SearchPopUpArray
        SearchPopUpArray.add(ParseTilArray[0]);
        // Create a Messagebox
    }
}
and replaced it with:

if (SearchPopUpArray.contains(ParseTilArray[0])) {
    // Match
} else {
    // No match in SearchPopUpArray
    SearchPopUpArray.add(ParseTilArray[0]);
    // Create a Messagebox
}
Much simpler.
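Going one step further: since Set.add returns false when the element is already present, a HashSet would make the duplicate check O(1) and collapse the test and the insert into a single call. A sketch, assuming SearchPopUpArray can be switched to a Set:

// java.util.Set / java.util.HashSet
Set<String> shownIds = new HashSet<>();

// add() returns false if the ID was already present, so the
// membership test and the insert happen in one call
if (shownIds.add(ParseTilArray[0])) {
    // First time this ID is seen: create a Messagebox
}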
I'm trying to download photos posted with a specific tag in real time. I found the real-time API pretty useless, so I'm using a long-polling strategy. Below is pseudocode, with comments pointing out the subtle bugs in it:
newMediaCount = getMediaCount();
delta = newMediaCount - mediaCount;
if (delta > 0) {
    // Bug 1: if mediaCount has changed by now, realDelta > delta, so realDelta - delta
    // photos won't be grabbed; and on the next poll, if mediaCount didn't change again,
    // realDelta - delta photos will be duplicated.
    // Bug 2: if a photo was posted from a private account, the last photo will be
    // duplicated, since the counter changes but nothing is added to the recent feed.
    recentMedia = getRecentMedia(delta);
    // persist recentMedia
    mediaCount = newMediaCount;
}
The second issue can be addressed with a Set of some sort, I guess. But the first really bothers me. I've moved the two calls to the Instagram API as close together as possible, but is that enough?
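For the record, the Set idea for the duplicate issue is straightforward: media IDs are stable, so remembering which IDs have already been persisted filters out re-fetched items. A sketch, where persistMedia and the MediaFeedData element type are placeholders:

Set<String> seenIds = new HashSet<>();

for (MediaFeedData media : recentMedia) {
    // add() returns false for an ID we've already stored,
    // so overlapping fetches are skipped instead of duplicated
    if (seenIds.add(media.getId())) {
        persistMedia(media); // placeholder for the persistence code
    }
}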
Edit
As Amir suggested, I've rewritten the code to use min/max_tag_ids. But it still skips photos. I couldn't find a better way to test this than to save images to disk for some time and compare the result to instagram.com/explore/tags/.
public class LousyInstagramApiTest {

    @Test
    public void testFeedContinuity() throws Exception {
        Instagram instagram = new Instagram(Settings.getClientId());
        final String TAG_NAME = "portrait";
        String id = instagram.getRecentMediaTags(TAG_NAME).getPagination().getMinTagId();
        HashtagEndpoint endpoint = new HashtagEndpoint(instagram, TAG_NAME, id);

        for (int i = 0; i < 10; i++) {
            Thread.sleep(3000);
            endpoint.recentFeed().forEach(d -> {
                try {
                    URL url = new URL(d.getImages().getLowResolution().getImageUrl());
                    BufferedImage img = ImageIO.read(url);
                    ImageIO.write(img, "png", new File("D:\\tmp\\" + d.getId() + ".png"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }
}
class HashtagEndpoint {

    private final Instagram instagram;
    private final String hashtag;
    private String minTagId;

    public HashtagEndpoint(Instagram instagram, String hashtag, String minTagId) {
        this.instagram = instagram;
        this.hashtag = hashtag;
        this.minTagId = minTagId;
    }

    public List<MediaFeedData> recentFeed() throws InstagramException {
        TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null);
        List<MediaFeedData> dataList = feed.getData();
        if (dataList.size() == 0) return Collections.emptyList();

        String maxTagId = feed.getPagination().getNextMaxTagId();
        if (maxTagId != null && maxTagId.compareTo(minTagId) > 0) dataList.addAll(paginateFeed(maxTagId));
        Collections.reverse(dataList);
        // dataList.removeIf(d -> d.getId().compareTo(minTagId) < 0);

        minTagId = feed.getPagination().getMinTagId();
        return dataList;
    }

    private Collection<? extends MediaFeedData> paginateFeed(String maxTagId) throws InstagramException {
        System.out.println("pagination required");
        List<MediaFeedData> dataList = new ArrayList<>();
        do {
            TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, null, maxTagId);
            maxTagId = feed.getPagination().getNextMaxTagId();
            dataList.addAll(feed.getData());
        } while (maxTagId.compareTo(minTagId) > 0);
        return dataList;
    }
}
Using the Tag endpoints to get the recent media with a desired tag returns a min_tag_id in its pagination info, which is tied to the most recently tagged media at the time of your call. Since the API also accepts a min_tag_id parameter, you can pass the value from your last query to receive only those media that were tagged after it.
So, based on whatever polling mechanism you have, you just call the API with the last received min_tag_id to get any new recent media.
You will also need to pass a large count parameter and follow the pagination of the response, so you don't lose data when the tagging rate is faster than your polling rate.
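In sketch form, one poll under that scheme looks roughly like this (method signatures follow the jInstagram-style wrapper used in your code; the count of 100 and the sleep interval are arbitrary):

// Seed the cursor from the newest tagged media at startup.
String minTagId = instagram.getRecentMediaTags(TAG_NAME).getPagination().getMinTagId();

while (polling) {
    TagMediaFeed feed = instagram.getRecentMediaTags(TAG_NAME, minTagId, null, 100);
    if (!feed.getData().isEmpty()) {
        for (MediaFeedData media : feed.getData()) {
            // persist media ...
        }
        // Advance the cursor so the next poll only returns newer media.
        minTagId = feed.getPagination().getMinTagId();
    }
    Thread.sleep(3000); // enclosing method is assumed to handle InterruptedException
}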
Update:
Based on your updated code:
public List<MediaFeedData> recentFeed() throws InstagramException {
    TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null, 100000);
    List<MediaFeedData> dataList = feed.getData();
    if (dataList.size() == 0) return Collections.emptyList();

    // follow the pagination
    MediaFeed recentMediaNextPage = instagram.getRecentMediaNextPage(feed.getPagination());
    while (recentMediaNextPage.getPagination() != null) {
        dataList.addAll(recentMediaNextPage.getData());
        recentMediaNextPage = instagram.getRecentMediaNextPage(recentMediaNextPage.getPagination());
    }

    Collections.reverse(dataList);
    minTagId = feed.getPagination().getMinTagId();
    return dataList;
}