updating an excel file with apache metamodel - java

I'm trying to incorporate Apache MetaModel into a project and keep running into a weird problem. I update an Excel spreadsheet row in code. The code finds the right row, deletes it, then appends the row (with my update) to the bottom of the spreadsheet. I'd like the update to happen in-place, with the same data staying in the same row. I thought it was something I was doing wrong, then set up a stupid simple project to duplicate the behavior. Unfortunately, the problem remains.
Here's the xlsx file:
Name Address City State Zip
Bob 123 Main St. Norman OK 11111
Fred 989 Elm Street Chicago IL 22222
Mary 555 First Street San Francisco CA 33333
Now, I want to update Bob's Zip to "None".
package MMTest;

import java.io.File;

import org.apache.metamodel.UpdateableDataContext;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Table;
import org.apache.metamodel.update.Update;

public class MMTest {

    public static void main(String[] args) {
        UpdateableDataContext excel = new ExcelDataContext(new File("C:/test/test.xlsx"));
        Schema schema = excel.getDefaultSchema();
        Table[] tables = schema.getTables();
        assert tables.length == 1;

        Table table = schema.getTables()[0];
        Column Name = table.getColumnByName("Name");
        Column Zip = table.getColumnByName("Zip");

        excel.executeUpdate(new Update(table).where(Name).eq("Bob").value(Zip, "None"));
    }
}
Pretty simple right? Nope.
This is the result:
Name Address City State Zip
<blank line>
Fred 989 Elm Street Chicago IL 22222
Mary 555 First Street San Francisco CA 33333
Bob 123 Main St. Norman OK None
Am I missing something simple? The documentation is pretty sparse, but I've read everything the internet has to offer on this package. I appreciate your time.

Late to the party, but I've recently bumped into this issue and haven't spotted an answer elsewhere yet. The actual deleting takes place in ExcelDeleteBuilder.java.
If you aren't concerned about maintaining row order, you could change
for (Row row : rowsToDelete) {
    sheet.removeRow(row);
}
to
for (Row row : rowsToDelete) {
    int rowNum = row.getRowNum() + 1;
    sheet.removeRow(row);
    // Shift every row below the removed one up by one so no blank row is left behind.
    sheet.shiftRows(rowNum, sheet.getLastRowNum(), -1);
}
See Apache POI docs for a better understanding of shiftRows().
As Adi pointed out, you'll still end up with the "updated" row being moved to the bottom, but in my use case the empty row is successfully removed.
N.B. I'm working from Apache MetaModel 4.5.4.

You are not missing anything. The ExcelDataContext does not provide its own update behavior; it defaults to Apache MetaModel's store-agnostic implementation for updating data. That implementation of UpdateCallback uses a DeleteAndInsertCallback, which is what causes the behavior you are observing: it picks the row to be updated, applies the new value in memory, deletes the original row, and inserts the updated row (which ends up at the bottom, since that is how ExcelDataContext appends).
You can open an issue at https://issues.apache.org/jira/browse/METAMODEL and attach your sample code and data. Best of all would be a failing unit test against
https://git-wip-us.apache.org/repos/asf/metamodel.git
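In the meantime, if the row really must stay in place, one workaround (my own sketch, not a MetaModel API) is to bypass MetaModel for that single update and overwrite the cell directly with Apache POI, which MetaModel already uses under the hood for xlsx files. The column indexes 0 (Name) and 4 (Zip) match the layout from the question; error handling is omitted:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Load the workbook into memory, closing the input stream before writing back to the same file.
Workbook workbook;
try (FileInputStream in = new FileInputStream("C:/test/test.xlsx")) {
    workbook = WorkbookFactory.create(in);
}
Sheet sheet = workbook.getSheetAt(0);
for (Row row : sheet) {
    Cell name = row.getCell(0);
    if (name != null && "Bob".equals(name.getStringCellValue())) {
        Cell zip = row.getCell(4);
        if (zip == null) {
            zip = row.createCell(4);
        }
        zip.setCellValue("None"); // overwrite in place; row order is untouched
    }
}
try (FileOutputStream out = new FileOutputStream("C:/test/test.xlsx")) {
    workbook.write(out);
}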

Related

Deleting LDAP record with 0x0A in CN (Java)

I'm trying to delete ADLDS user records created by Microsoft's conflict resolution model.
Microsoft describes the creation of the new records as
The new RDN will be <Old RDN>\0ACNF:<objectGUID>
These are the records I'm trying to delete from my environment.
My search for uid=baduser will return two CNs:
cn=John R. Doe 123456
and
cn=John R. Doe 123456 CNF:123e4567-e89b-12d3-a456-426614174000
The second record has the \0A in the cn.
Executing a ctx.destroySubcontext(cn) on it errors out like this:
cn=John R. Doe 123456CNF:123e4567-e89b-12d3-a456-426614174000,c=US: [LDAP: error code 34 - 0000208F: NameErr: DSID-0310022D, problem 2006 (BAD_NAME), data 8349
What am I missing to be able to delete a record with a cn that contains a line feed character?
note: I also can't seem to read/modify this \0A record using JXplorer. Clicking on the record after a search results in the same BAD_NAME error.
String commonName = attr.get("cn").get().toString().replace("\n", "\\\\0A");
A simple replacement of the \n character worked for me.
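For context, here is a minimal sketch of how that replacement fits around the failing destroySubcontext call. The ctx variable, the SearchResult, and the container DN are assumptions of mine, not details from the original post:

// Assumes ctx is an existing DirContext and result is the SearchResult for the conflict record.
Attributes attrs = result.getAttributes();
// Escape the line feed in the CN (backslash doubled so it survives JNDI's own name parsing).
String commonName = attrs.get("cn").get().toString().replace("\n", "\\\\0A");
ctx.destroySubcontext("cn=" + commonName + ",cn=Users,dc=example,dc=com");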

Dataflow GCS to BigQuery - How to output multiple rows per input?

Currently I am using the Google-provided gcs-text-to-bigquery template and feeding in a transform function to transform my JSONL file. The JSONL is pretty nested, and I want to be able to output multiple rows per row of the newline-delimited JSON by doing some transforms.
For example:
{'state': 'FL', 'metropolitan_counties':[{'name': 'miami dade', 'population':100000}, {'name': 'county2', 'population':100000}…], 'rural_counties':[{'name': 'county1', 'population':100000}, {'name': 'county2', 'population':100000}… {}], 'total_state_pop':10000000, …}
There will obviously be more counties than 2, and each state will have one of these lines. The output my boss wants is one row per county, along the lines of State, County_Type, County_Name, County_Pop, State_Pop.
When I do the gcs-to-bq text transform, I end up getting only one line per state (so I'll get miami dade county from FL, and then whatever the first county is in my transform for the next state). I read a little bit, and I think this is because the mapping in the template expects one output per JSON line. It seems I can do a ParDo (a DoFn? not sure what that is), or there is a similar option with beam.Map in Python. There is some business logic in the transforms (right now it's about 25 lines of code, as the JSON has more columns than I showed, but those are pretty simple).
Any suggestions on this? Data is coming in tonight/tomorrow, and there will be hundreds of thousands of rows in a BQ table.
The template I am using is currently in Java, but I can translate it to Python pretty easily, as there are a lot of examples online in Python. I know Python better, and I think it's easier given the different types (sometimes a field can be null); it also seems less daunting given that the examples I saw look simpler. However, I'm open to either.
Solving that in Python is somewhat straightforward. Here's one possibility (not fully tested):
from __future__ import absolute_import

import ast
import os

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service_account.json'

pipeline_args = [
    '--job_name=test'
]
pipeline_options = PipelineOptions(pipeline_args)


def jsonify(element):
    return ast.literal_eval(element)


def unnest(element):
    state = element.get('state')
    state_pop = element.get('total_state_pop')
    if state is None or state_pop is None:
        return
    for type_ in ['metropolitan_counties', 'rural_counties']:
        for e in element.get(type_, []):
            name = e.get('name')
            pop = e.get('population')
            county_type = (
                'Metropolitan' if type_ == 'metropolitan_counties' else 'Rural'
            )
            if name is None or pop is None:
                continue
            yield {
                'State': state,
                'County_Type': county_type,
                'County_Name': name,
                'County_Pop': pop,
                'State_Pop': state_pop
            }


with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')
    schema = 'State:STRING,County_Type:STRING,County_Name:STRING,County_Pop:INTEGER,State_Pop:INTEGER'
    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam.FlatMap(unnest)
        | 'Write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            'project_id:dataset_id.table_name', schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )
    )
This will only succeed if you are working with batch data. If you have streaming data, then just change beam.io.Write(beam.io.BigQuerySink(...)) to beam.io.WriteToBigQuery.

Computing preference values in Apache Mahout

I am trying to learn Apache Mahout; I'm very new to this topic. I want to implement a user-based recommender. For this, after exploring on the internet, I have found some samples like the one below:
public static void main(String[] args) {
    try {
        int userId = 2;
        DataModel model = new FileDataModel(new File("data/mydataset.csv"), ";");
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(userId, 3);
        for (RecommendedItem recommendation : recommendations) {
            logger.log(Level.INFO, "Item Id recommended : " + recommendation.getItemID() + " Ratings : "
                    + recommendation.getValue() + " For UserId : " + userId);
        }
    } catch (Exception e) {
        logger.log(Level.SEVERE, "Exception in main() ::", e);
    }
}
I am using the following dataset, which contains userid, itemid, and preference value respectively:
1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0
1,14,5.0
1,15,4.0
1,16,5.0
1,17,1.0
1,18,5.0
2,10,1.0
2,11,2.0
2,15,5.0
2,16,4.5
2,17,1.0
2,18,5.0
3,11,2.5
3,12,4.5
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
In this case it works fine, but my main question is that I have a different set of data which doesn't have preference values; instead it contains other columns from which I am thinking of computing preference values. The following is my new dataset:
userid itemid likes shares comments
1 4 1 20 3
2 6 18 20 12
3 12 10 2 20
4 7 0 20 13
5 9 0 2 1
6 5 5 3 2
7 3 9 7 0
8 1 15 0 0
My question is: how can I compute a preference value for a particular record based on the other columns such as likes, shares, comments, etc.? Is there any way to compute this in Mahout?
Yes - I think your snippet is from an older version of Mahout, but what you want to use is the Correlated Cross-Occurrence (CCO) recommender. The CCO recommender is multi-modal (it allows the user to have various kinds of input).
There are CLI drivers, but I'm guessing you want to code; there is a Scala tutorial here.
In the tutorial I think it recommends 'friends' based on genres tagged and artists 'liked', as well as your current friends.
As @rawkintrevo says, Mahout has moved on from the older "taste" recommenders, and they will be deprecated from Mahout soon.
You can build your own system from the CCO algorithm in Mahout here. It allows you to use data from different user behavior like "likes, shares, comments", so we call it multi-modal.
Or, in another project, we have created a full-featured recommendation server based on Mahout, called the Universal Recommender. It is built on Apache PredictionIO, where the UR is a plugin called a Template. Together they deliver a nearly turnkey server that takes input and responds to queries. To get started easily, try the AWS AMI that has the whole system working. Some other methods to install are shown here.
This is all Apache-licensed OSS, but Mahout on its own can no longer really provide a production-ready environment; Mahout does the algorithms, but you need a system around them. Build your own or try the PredictionIO-based one. Since everything is OSS, you can tweak things if needed.
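If you just need a quick way to collapse likes, shares, and comments into the single preference value that the taste-style code above expects, a weighted sum is a common starting point. This is only a sketch: the weights and the scaling cap below are my own assumptions to illustrate the idea, not anything Mahout prescribes, and should be tuned for your data:

// Hypothetical weighting of interaction counts into the 0.0-5.0 rating scale used above.
static double toPreference(int likes, int shares, int comments) {
    double raw = 1.0 * likes + 2.0 * shares + 0.5 * comments; // shares weighted highest (assumption)
    return Math.min(5.0, raw / 12.0);                         // 12.0 is an arbitrary scaling divisor
}

You would then write one userid,itemid,preference line per row of the new dataset and feed that CSV to FileDataModel exactly as before. The CCO approach above avoids this manual weighting entirely, which is why it is usually the better fit for multi-modal data.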

BigQuery WORM work-around for updated data

Using Google's "electric meter" example from a few years back, we would have:
MeterID (Datastore Key) | MeterDate (Date) | ReceivedDate (Date) | Reading (double)
Presuming we received updated info (say, an out-of-calibration or busted meter, etc.) and put in a new row with the same MeterID and MeterDate, using a window function to grab the newest ReceivedDate for each ID+MeterDate pair would only cost more if there are multiple records for that pair, right?
Sadly, we are flying without a SQL expert, but it seems like the query should look like:
SELECT
meterDate,
NTH_VALUE(reading, 1) OVER (PARTITION BY meterDate ORDER BY receivedDate DESC) AS reading
FROM [BogusBQ:TableID]
WHERE meterID = {ID}
AND meterDate BETWEEN {startDate} AND {endDate}
Am I missing anything else major here? Would adding 'AND NOT IS_NAN(reading)' cause the Window Function to return the next row, or nothing? (Then we could use NaN to signify "deleted".)
Your SQL looks good. A couple of suggestions:
- I would use FIRST_VALUE to be a bit more explicit, but otherwise it should work.
- If you can, use NULL instead of NaN. Or better yet, add a new BOOLEAN column to mark deleted rows.

javacc skip comments but need to keep useful comments

I need to use JavaCC to parse a data file like:
//This is comment to skip
//This is also comment to skip
//student Table
Begin:
header:( 1 //name
2 //age ) { "John" 21 } { "Jack" "22" }
#End
//The following is teacher table, this line is also comment to skip
//Teacher Table
Begin:
header:( 1 //name
2 //age 3 //class ) { "Eric" 31 "English" } { "Jasph" "32" "History" }
#End
Here I need to fetch data from the "student" and "teacher" tables; there are also some other tables formatted like the above. The data exported from the "student" table is:
Table Name: student
name age
John 21
Jack 22
That is, I need to skip comments like "//This is also comment to skip", but keep tokens like "//student Table", "//Teacher Table", "//name", "//age", etc. How do I write such a SKIP expression? Thanks.
Slightly late, but you might be looking at it wrong.
Surely in your case, // isn't really a comment; it is part of the syntax you are parsing. It just happens that sometimes the bit following // is irrelevant.
I would parse the comments and decide which ones to discard in your Java code.
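A minimal grammar sketch of that idea: tokenize every "//..." run instead of SKIPping it, then keep or drop the text in the Java action. The token and production names are my own, and this only shows the column-label case, not the full file format:

SKIP : { " " | "\t" | "\r" | "\n" }

TOKEN : {
    < BEGIN : "Begin:" >
  | < END : "#End" >
  | < LINE_COMMENT : "//" (~["\r","\n"])* >   // both "useful" and "skip" comments match this
  | < NUMBER : (["0"-"9"])+ >
  | < QUOTED : "\"" (~["\""])* "\"" >
}

void columnLabel(java.util.List<String> labels) :
{ Token t; }
{
    <NUMBER> t=<LINE_COMMENT>
    { labels.add(t.image.substring(2).trim()); }   // keeps "name", "age", etc. as column names
}

Stand-alone comment lines such as "//This is comment to skip" can be matched by the same LINE_COMMENT token in a separate production and simply discarded there, while lines like "//student Table" can be kept as the table name.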
