Data integration problem - How to integrate similar entities

Data integration problem - How to integrate similar entities - java

I have a database which has very similar rows within the same table. Those rows are similar because they have nearly equal column values. I need to integrate those corresponding rows into one single row.
For example, those two users (u1 and u2) should be integrated:
u1 = User(name = "William Henry Gates III",
age = 55,
nationality = "american",
alma_mater = "Harvard Univesity")
u2 = User(name: "William Henry 'Bill' Gates III",
age: 55,
nationality: "America",
alma_mater: "Harvard U.")
I am thinking of using some edit distance and stemming techniques. Other algorithms and techniques suggestions? Any helpful libraries to use (preferably in Python or Java)?

Considered something like Refine?

Related

Computing preference values in Apache Mahout

I am trying to learn Apache mahout, very new to this topic. I want to implement user-based recommender. For this, after exploring on the internet I have found some samples like below,
public static void main(String[] args) {
try {
int userId = 2;
DataModel model = new FileDataModel(new File("data/mydataset.csv"), ";");
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
List<RecommendedItem> recommendations = recommender.recommend(userId, 3);
for (RecommendedItem recommendation : recommendations) {
logger.log(Level.INFO, "Item Id recommended : " + recommendation.getItemID() + " Ratings : "
+ recommendation.getValue() + " For UserId : " + userId);
}
} catch (Exception e) {
logger.log(Level.SEVERE, "Exception in main() ::", e);
}
I am using following dataset which contains userid, itemid, preference value respectively,
1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0
1,14,5.0
1,15,4.0
1,16,5.0
1,17,1.0
1,18,5.0
2,10,1.0
2,11,2.0
2,15,5.0
2,16,4.5
2,17,1.0
2,18,5.0
3,11,2.5
3,12,4.5
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
In this case, it works fine, but my main question is I have the different set of data which don't have preference values, which contains some data based on that I am thinking to compute preference values. Following is my new dataset,
userid itemid likes shares comments
1 4 1 20 3
2 6 18 20 12
3 12 10 2 20
4 7 0 20 13
5 9 0 2 1
6 5 5 3 2
7 3 9 7 0
8 1 15 0 0
My question is how can I compute preference value for a particular record based on some other columns such as likes, shares, comments etc. Is there anyway to compute this in mahout?

Yes- I think your snippet is from an older version of Mahout, but what you want to use is the Correlated Co Occurrence recommender. The CCO Recommender is multi-modal (allows user to have various inputs).
There are CLI Drivers, but I'm guessing you want to code, there is a Scala tutorial here
In the tutorial I think it recommends 'friends' based on genres tagged and artists 'liked', as well as your current friends.

As #rawkintrevo says, Mahout has moved on from the older "taste" recommenders and they will be deprecated from Mahout soon.
You can build you own system from the CCO algorithm in Mahout here. It allows you to use data from different user behavior like "likes, shares, comments". So we call it multi-modal.
Or in another project we have created a full featured recommendation server based on Mahout, called the Universal Recommender. It is build on Apache PredicitonIO where the UR is a plugin called a Template. Together they deliver a nearly turnkey server that takes input and responds to queries. To get started easily try the AWS AMI that has the whole system working. Some other methods to install are shown here.
This is all Apache licensed OSS, but Mahout no longer can really provide a production ready environment, Mahout does algorithms but you need a system around it. Build your own or try the PredictionIO based one. Since everything is OSS you can tweak things if needed.

How to count distinct values of a reference collection in mongo

Having a list of books that points to a list of authors, I want to display a tree, having in each node the author name and the number of books he wrote. Initially, I have embedded the authors[] array directly into books collection, and this worked like a charm, using the magic of aggregation framework. However, later on, I realise that it would be nice to have some additional information attached to each author (e.g. it's picture, biographical data, birth date, etc). For the first solution, this is bad because:
it duplicates the data (not a big deal, and yes, I know that mongo's purpose is to encapsulate full objects, but let's ignore that for now);
whenever an additional property is created or updated on the old records won't benefit from this change, unless I specifically query for some unique old property and update all the book authors with the new/updated values.
The next thing was to use the second collection, called authors, and each books document is referencing a list of author ids, like this:
{
"_id" : ObjectId("58ed2a254374473fced950c1"),
"authors" : [
"58ed2a254d74s73fced950c1",
"58ed2a234374473fce3950c1"
],
"title" : "Book title"
....
}
For getting the author details, I have two options:
make an additional query to get the data from the author collection;
use DBRefs.
Questions:
Using DBRefs automatically loads the authors data into the book object, similar to what JPA #MannyToOne does for instance?
Is it possible to get the number of written books for each author, without having to query for each author's book count? When the authors were embedded, I was able to aggregate the distinct author name's and also the number of book documents that he was present on. Is such query possible between two collections?
What would be your recommendation for implementing this behaviour? (I am using Spring Data)

You can try the below query in the spring mongo application.
UnwindOperation unwindAuthorIds = Aggregation.unwind("authorsIds", true);
LookupOperation lookupAuthor = Aggregation.lookup("authors_collection", "authorsIds", "_id", "ref");
UnwindOperation unwindRefs = Aggregation.unwind("ref", true);
GroupOperation groupByAuthor = Aggregation.group("ref.authorName").count().as("count");
Aggregation aggregation = Aggregation.newAggregation(unwindAuthorIds, lookupAuthor, unwindRefs, groupByAuthor);
List<BasicDBObject> results = mongoOperations.aggregate(aggregation, "book_collection", BasicDBObject.class).getMappedResults();

Following #Veeram's suggestion, I was able to write this query:
db.book_collection.aggregate([
{
$unwind: "$authorsIds"
},
{
$lookup: {
from: "authors_collection",
localField: "authorsIds",
foreignField: "_id",
as: "ref"
}
},
{$group: {_id: "$ref.authorName", count: {$sum: 1}}}
])
which returns something like this:
{
"_id" : [
"Paulo Coelho"
],
"count" : 1
}
/* 2 */
{
"_id" : [
"Jules Verne"
],
"count" : 2
}
This is exactly what I needed, and it sounds about right. I only need to do an additional query now to get the books with no author set.

FRED Economic Series IDs

I am trying to use the FRED API to get the economic data. The examples provided on the FRED website uses series Id to get the observations/values for a particular series. I am not able to find where I can find the series IDs. For example, in the below example it is asking for series id to get the data-
https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=*your_key*
Any help is appreciated.

From the documentation on FRED API, fred/category/series, "Get the series in a category."; an example:
https://api.stlouisfed.org/fred/category/series?category_id=125&api_key=abcdefghijklmnopqrstuvwxyz123456

Just go here Federal Reserve Economic Data | FRED | St. Louis Fed and search for the series you're interested in. Alternately, you can browse by several attributes, including Category from there. Once you've searched, if you find it, click one of the results. A search for S&P 500 yielded a link to S&P 500 | FRED | St. Louis Fed, which has a chart of it. You'll see some letters and/or numbers in parentheses right beside the page's title. In this case, it is SP500. THAT is your series ID.
Your complete link will be https://api.stlouisfed.org/fred/series/observations?series_id=SP500&file_type=json&api_key=YourAPIKey. More parameters are here: St. Louis Fed Web Services: fred/series/observations

Perhaps this will help you:
GET https://api.stlouisfed.org/fred/series/search?search_text=**your+search+query**&api_key=**abc123**&file_type=json&observation_start=2016-03-01&observation_end=2016-03-15
This will return JSON data with the series id you are looking for....
{
"realtime_start": "2016-04-06",
"realtime_end": "2016-04-06",
"order_by": "search_rank",
"sort_order": "desc",
"count": 164,
"offset": 0,
"limit": 1000,
"series**s**": [
{
"id": "**FEDFUNDS**",
"realtime_start": "2016-04-06",
"realtime_end": "2016-04-06",
"title": "Effective Federal Funds Rate",
------------------ json result intentionally cut off -------------
There are different ways to get the series id's and category id's found here:
https://research.stlouisfed.org/docs/api/fred/

MongoDB find has the same speed with and without index

I use MongoTemplate from Spring to access a MongoDB.
final Query query = new Query(Criteria.where("_id").exists(true));
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
if (count > 0) {
query.limit(count);
}
query.skip(start);
query.fields().include("FIRSTNAME");
query.fields().include("LASTNAME");
query.fields().include("EMAIL");
return mongoTemplate.find(query, User.class, "users");
I generated 400.000 records in my MongoDB.
When asking for the first 25 Users without using the above written sort line, I get the result within less then 50 milliseconds.
With sort it lasts over 4 seconds.
I then created indexes for FIRSTNAME, LASTNAME, EMAIL. Single indexes, not combined ones
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
After creating these indexes the query again lasts over 4 seconds.
What was my mistake?
-- edit
MongoDB writes this on the console...
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query: { query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } } ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2 locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms

You have to create a compound index for FIRSTNAME, LASTNAME, and EMAIL, in this order and all of them using ascending order.

Thu Jul 04 10:10:11.442 [conn50] query mydb.users query:
{ query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } }
ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2
locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
Possible bad signs:
Your scanAndOrder is coming true (scanAndOrder=1), correct me if I am wrong.
It has to return (ntoreturn:25) means 25 documents but it is scanning (nscanned:382424) 382424 documents.
indexed queries, nscanned is the number of index keys in the range that Mongo scanned, and nscannedObjects is the number of documents it looked at to get to the final result. nscannedObjects includes at least all the documents returned, even if Mongo could tell just by looking at the index that the document was definitely a match. Thus, you can see that nscanned >= nscannedObjects >= n always.
Context of Question:
Case 1: When asking for the first 25 Users without using the above written sort line, I get the result within less then 50 milliseconds.
Case 2: With sort it lasts over 4 seconds.
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
As in this case there is no index: so it is doing as mentioned here:
This means MongoDB had to batch up all the results in memory, sort them, and then return them. Infelicities abound. First, it costs RAM and CPU on the server. Also, instead of streaming my results in batches, Mongo just dumps them all onto the network at once, taxing the RAM on my app servers. And finally, Mongo enforces a 32MB limit on data it will sort in memory.
Case 3: created indexes for FIRSTNAME, LASTNAME, EMAIL. Single indexes, not combined ones
I guess it is still not fetching data from index. You have to tune your indexes according to Sorting order
Sort Fields (ascending / descending only matters if there are multiple sort fields)
Add sort fields to the index in the same order and direction as your query's sort
For more details, check this
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
Possible Answer:
In the query orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } } order of sort is different than the order you have specified in :
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
I guess Spring API might not be retaining order:
https://jira.springsource.org/browse/DATAMONGO-177
When I try to sort on multiple fields the order of the fields is not maintained. The Sort class is using a HashMap instead of a LinkedHashMap so the order they are returned is not guaranteed.
Could you mention spring jar version?
Hope this answers your question.
Correct me where you feel I might be wrong, as I am little rusty.

Java String formatting solution

I have a string description of a company, which is nasty written by different users (hand-typed). Here is a example (focus on the dots, spaces, first letters etc..):
XXXX is a Global menagement consulting,Technology services and
outsourcing company, with 257000people serving clients in more than
120 countries.. combining unparalleled experience, comprehensive
capabilities across all industries and business functions,and
extensive research on the worlds most successfull companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments., the company generated net revenues of
US$27.9 Billion for the fiscal year ended 31.07.2012..
Now what i want is to format the string to a bit nicer version like this:
XXXX is a global management consulting, technology services and
outsourcing company, with 257,000 people serving clients in more than
120 countries. Combining unparalleled experience, comprehensive
capabilities across all industries and business functions, and
extensive research on the world’s most successful companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments. The company generated net revenues of
US$27.9 billion for the fiscal year ended Aug. 31, 2012.
My question is: Is there any library with already defined methods which could do all the spelling corrections, unneeded space removal, etc .. ?
So far, I do it be replacing stuff like " ," with ", " and toUpperCase() if the is a "///." in front etc..
desc = desc.replace(" ", " ");
desc = desc.replace("..", ".");
desc = desc.replace(" .", ".");
desc = desc.replace(" ,", ", ");
desc = desc.replace(".,", ".");
desc = desc.replace(",.", ".");
desc = desc.replace(", .", ".");
desc = desc.replace("*", "");
I'm sure there is a cleaner and better version to do this. Using regex maybe??
Any solution would be appreciated.

If I were trying to solve your problem, I would probably read the text 1 char at a time, and format it as you go. For example, in psuedocode...
while (has more chars){
char letter = readChar();
if (letter == ','){
// checking for the ',.' combination
letter = readChar();
if (readChar == '.'){
// write out a '.' only
out.print('.');
}
else {
// it wasn't the ',.' combination, so you need to output both characters, whatever they are
out.print(',');
out.print(letter);
}
}
else if (another letter you want to filter){
// etc.
}
else {
// doesn't match any of the filters, so just output the letter
out.print(letter);
}
}
Basically if you read the text 1 char at a time, you can detect any of your chosen formatting problems as you go, and correct them immediately. This provides a performance improvement, as you're only reading over the text string once (not 8 times, like you are currently doing), and allows you to add as many different/complex formatting changes as you want. The downside, however, is that you need to write the logic yourself rather than relying on in-built functions.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.