Duke deduplication engine: linking records not working?

I am attempting to use Duke to match records from one database to another. One db has song titles + writers. I am trying to match to another db to find duplicates and corresponding records.
I have gotten Duke to run and I can see some of the records getting matched. But no matter what I do, "Correct links found" is always 0% and I just can't write to the link file.
This is what I have done currently:
<duke>
<schema>
<threshold>0.79</threshold>
<maybe-threshold>0.70</maybe-threshold>
<path>test</path>
<property type="id">
<name>PublishingID</name>
</property>
<property type="id">
<name>AmgID</name>
</property>
<property>
<name>NAME</name>
<comparator>no.priv.garshol.duke.comparators.JaroWinkler</comparator>
<low>0.12</low>
<high>0.61</high>
</property>
<property>
<name>TITLE</name>
<comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
<low>0.09</low>
<high>0.93</high>
</property>
</schema>
<group>
<jdbc>
<param name="driver-class" value="com.mysql.jdbc.Driver"/>
<param name="connection-string" value="jdbc:mysql://127.0.0.1"/>
<param name="user-name" value="root"/>
<param name="password" value="root"/>
<param name="query" value="
SELECT pSongs.song_id, pSongs.songtitle, pSongs.publisher_id, pWriters.first_name AS writer_first_name, pWriters.last_name AS writer_last_name
FROM devel_matching.publisher_songs AS pSongs
INNER JOIN devel_matching.publisher_writers as pWriters ON pWriters.publisher_id = pSongs.publisher_id AND pWriters.song_id = pSongs.song_id
WHERE pSongs.writers LIKE '%LENNON, JOHN%'
LIMIT 20000;"/>
<column name="song_id" property="PublishingID"/>
<column name="songtitle" property="TITLE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="writer_first_name" property="NAME" cleaner = "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
</jdbc>
</group>
<group>
<jdbc>
<param name="driver-class" value="com.mysql.jdbc.Driver"/>
<param name="connection-string" value="jdbc:mysql://127.0.0.1"/>
<param name="user-name" value="root"/>
<param name="password" value="root"/>
<param name="query" value="
SELECT amgSong.id, amgSong.track, SUBSTRING_INDEX(SUBSTRING_INDEX(amgSong.composer, '/', numbers.n), '/', -1) composer
FROM
devel_matching.numbers INNER JOIN devel_matching.track as amgSong
ON CHAR_LENGTH(amgSong.composer) - CHAR_LENGTH(REPLACE(amgSong.composer, '/', '')) >= numbers.n - 1
WHERE amgSong.composer like '%lennon%'
LIMIT 5000;"/>
<column name="id" property = "AmgID"/>
<column name="track" property="TITLE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="composer" property="NAME" cleaner = "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
</jdbc>
</group>
</duke>
Output:
Total records: 5000
Total matches: 8284
Total non-matches: 1587
Correct links found: 0 / 0 (0.0%)
Wrong links found: 0 / 0 (0.0%)
Unknown links found: 8284
Percent of links correct 0.0%, wrong 0.0%, unknown 100.0%
Precision 0.0%, recall NaN%, f-number 0.0
Running on Spring STS:
program arguments = --progress --verbose --testfile=linked.txt --testdebug --showmatches duke.xml
It's not writing to linked.txt or finding any correct links. I'm not sure what I am doing wrong here. Any help would be awesome.

Actually, it is finding 8284 links. --testfile is for giving Duke a file containing known correct links, basically test data. What you want is --linkfile, which writes the links you've found into that file.
I guess I should add code which warns against an empty test file, since that very likely indicates a user error.
You'd probably be better off asking this question on the Duke mailing list, btw.
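The "0 / 0" and "NaN%" figures in the output above are what you would expect when the set of known links is empty. Here is a minimal sketch of that arithmetic, assuming the usual definitions of precision and recall. The class and method names are hypothetical and this is not Duke's actual reporting code.

```java
/** Toy link statistics; illustrative only, not Duke's implementation. */
final class LinkStats {
    private LinkStats() {}

    /** Fraction of reported links that are known to be correct. */
    static double precision(int correctFound, int totalFound) {
        return (double) correctFound / totalFound;
    }

    /** Fraction of known-correct links that were actually reported. */
    static double recall(int correctFound, int totalKnown) {
        return (double) correctFound / totalKnown;
    }

    public static void main(String[] args) {
        int found = 8284;    // "Total matches" from the output above
        int knownLinks = 0;  // an empty test file contributes no known links
        int correct = 0;     // nothing can be confirmed without known links
        System.out.println("precision = " + precision(correct, found));
        System.out.println("recall    = " + recall(correct, knownLinks));
    }
}
```

With no known links, recall is a floating-point 0/0, which Java evaluates to NaN, while precision is 0 divided by the number of links found.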

Related

Hibernate View showing different results in app and Workbench

I'm working on a Java app connected to a MySQL database through Hibernate.
I'm using POJOs to define the classes and the Session class to connect to the database.
The problem is with the following view:
CREATE OR REPLACE VIEW INVENTARIO AS
SELECT
ID_ARTICULO,
ID_ESTRUCTURA,
ID_ESTRUCTURA_ORIGEN,
SUM(STOCK)STOCK,
STOCK_MIN,
NECESITA_REPO
FROM
HISTORICO_INVENTARIO
LEFT JOIN TIPOS_MOVIMIENTO
ON HISTORICO_INVENTARIO.ID_TIPO_MOV = TIPOS_MOVIMIENTO.ID_TIPO_MOV
GROUP BY ID_ARTICULO , ID_ESTRUCTURA , ID_ESTRUCTURA_ORIGEN , STOCK_MIN , NECESITA_REPO;
In Java, I'm mapping the view this way:
<hibernate-mapping>
<class name="Pojos.Inventario" table="INVENTARIO">
<id name="id_articulo" type="string" column="ID_ARTICULO"/>
<property name="id_estructura" type="string" column="ID_ESTRUCTURA" />
<property name="id_estructura_origen" type="string" column="ID_ESTRUCTURA_ORIGEN" />
<property name="stock" type="float" column="STOCK" />
<property name="stock_min" type="float" column="STOCK_MIN" />
<property name="necesita_repo" type="string" column="NECESITA_REPO" />
</class>
I should mention that the field "id_articulo" is not really the ID, but I had to choose one.
If I run this view in MySQL Workbench, I see the correct results. If I run the same query from my app, I get different results.
Does anyone know why this could be happening?
Thanks in advance.
EDIT:
I've tried to define the XML putting the SQL in the subselect tag:
<class name="Pojos.Inventario">
<subselect>
SELECT
ID_ARTICULO,
ID_ESTRUCTURA,
ID_ESTRUCTURA_ORIGEN,
SUM(STOCK) STOCK,
STOCK_MIN,
NECESITA_REPO
FROM
HISTORICO_INVENTARIO
LEFT JOIN TIPOS_MOVIMIENTO
ON HISTORICO_INVENTARIO.ID_TIPO_MOV = TIPOS_MOVIMIENTO.ID_TIPO_MOV
GROUP BY ID_ARTICULO , ID_ESTRUCTURA , ID_ESTRUCTURA_ORIGEN , STOCK_MIN , NECESITA_REPO
</subselect>
<synchronize table="HISTORICO_INVENTARIO"/>
<synchronize table="TIPOS_MOVIMIENTO"/>
<id name="id_articulo" type="string" column="ID_ARTICULO"/>
<property name="id_estructura" type="string" column="ID_ESTRUCTURA" />
<property name="id_estructura_origen" type="string" column="ID_ESTRUCTURA_ORIGEN" />
<property name="stock" type="float" column="STOCK" />
<property name="stock_min" type="float" column="STOCK_MIN" />
<property name="necesita_repo" type="string" column="NECESITA_REPO" />
</class>
I'm still getting the wrong result set.

Set Hibernate's show_sql property to true. Then capture the generated SQL from your log and run it in MySQL Workbench to compare.
<property name="show_sql">true</property>

Done it!
The problem was caused by the ID. I've added one extra field, which is the new ID. Now I'm getting the correct result set.
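One likely reason a non-unique id gives a wrong result set: Hibernate resolves entities in its persistence context by identifier, so rows that share an ID_ARTICULO value can come back as the same object. Here is a rough sketch of that effect, using a plain map as a stand-in for the session cache. The Row class and load method are hypothetical and greatly simplify what Hibernate really does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an identity map keyed by entity id; not Hibernate code. */
final class IdentityMapDemo {
    /** One row of the INVENTARIO view, reduced to three columns. */
    static final class Row {
        final String idArticulo;
        final String idEstructura;
        final float stock;

        Row(String idArticulo, String idEstructura, float stock) {
            this.idArticulo = idArticulo;
            this.idEstructura = idEstructura;
            this.stock = stock;
        }
    }

    /** Hydrates rows the way an identity map would: the first row per id wins. */
    static List<Row> load(List<Row> resultSet) {
        Map<String, Row> cache = new HashMap<>();
        List<Row> entities = new ArrayList<>();
        for (Row row : resultSet) {
            // A row whose id is already cached is not re-read; the cached object is reused.
            entities.add(cache.computeIfAbsent(row.idArticulo, id -> row));
        }
        return entities;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("A1", "E1", 5f));
        rows.add(new Row("A1", "E2", 7f)); // same article, different structure
        for (Row r : load(rows)) {
            System.out.println(r.idArticulo + " " + r.idEstructura + " " + r.stock);
        }
    }
}
```

In this model both entries of the returned list are the first row's object, so the E2 row is lost even though the SQL returned it. A truly unique id column, such as the extra field described above, avoids the collision.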

cvc-complex-type.2.4.a: Invalid content was found starting with element

This is my XML mapping. The errors appear below the one-to-one element: every basic attribute after it is flagged. However, when I move the one-to-one to the bottom of the attribute list, the error goes away, which is odd.
<one-to-one name="page" fetch="LAZY">
<join-column name="PAGE_ID" referenced-column-name="PAGE_ID" updatable="false" insertable="false" />
</one-to-one>
<basic name="sublinkNm">
<column name="SUBLINK_NM" />
</basic>
<basic name="descriptionTxt">
<column name="DESCRIPTION_TXT" />
</basic>
<basic name="searchableFlg">
<column name="SEARCHABLE_FLG" />
</basic>
This is the error I get. It affects a few of my XML files.
Description: cvc-complex-type.2.4.a: Invalid content was found starting with element 'basic'. One of '{"http://www.eclipse.org/eclipselink/xsds/persistence/orm":many-to-one, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":one-to-many, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":one-to-one, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":variable-one-to-one, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":many-to-many, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":element-collection, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":embedded, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":transformation, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":transient, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":structure, "http://www.eclipse.org/eclipselink/xsds/persistence/orm":array}' is expected.
Resource: page.xml
Path: /ccservices/src/com/dstoutput/cp/resource/jpa/mapping
Location: line 68
Type: XML Problem
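A plausible explanation, consistent with the error going away when the one-to-one is moved to the bottom, is that the schema declares these child elements as an ordered sequence in which basic must come before one-to-one. The sketch below shows that behaviour with the JDK's javax.xml.validation API. The tiny schema is a hypothetical stand-in, not the real EclipseLink orm schema.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Validates two element orders against a toy schema that uses xs:sequence. */
final class SequenceOrderDemo {
    // Hypothetical mini-schema: every <basic> must come before any <one-to-one>.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='attributes'><xs:complexType><xs:sequence>"
      + "<xs:element name='basic' minOccurs='0' maxOccurs='unbounded'/>"
      + "<xs:element name='one-to-one' minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML document is valid against the toy schema. */
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Print the validator's own message for the rejected document.
            System.out.println(e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // one-to-one first, then basic: rejected by the sequence.
        System.out.println(isValid("<attributes><one-to-one/><basic/></attributes>"));
        // basic first, then one-to-one: accepted.
        System.out.println(isValid("<attributes><basic/><one-to-one/></attributes>"));
    }
}
```

If the real schema is ordered in the same way, keeping all basic elements ahead of the relationship mappings should satisfy the validator.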

Hibernate write followed by read causes object not found

I'm currently working on a Quizzing Tool that uses hibernate and spring. I'm actually building it as a Sakai LMS tool and that complicates this question a little more, but let me see if I can generalize.
In my current scenario, users go to a StartQuiz page. When they submit the form on that page, it initializes an Attempt object (stored by Hibernate) with the mapping below:
<class name="org.quiztool.model.Attempt" table="QT_ATTEMPTS">
<cache usage="transactional" />
<id name="id" type="long">
<generator class="native">
<param name="sequence">QT_ATTEMPTS_ID_SEQ</param>
</generator>
</id>
<many-to-one name="quizId" class="org.quiztool.model.Quiz" cascade="none" />
<property name="score" type="int" not-null="true" />
<property name="outOf" type="int" not-null="true" />
<list name="responses" cascade="none" table="QT_RESPONSES" lazy="false">
<key column="id"/>
<index column="idxr"/>
<many-to-many class="org.quiztool.model.QuizAnswer" />
</list>
<list name="questionList" cascade="none" table="QT_ATTEMPT_QUESTIONS" lazy="false">
<key column="id"/>
<index column="idxq"/>
<many-to-many class="org.quiztool.model.QuizQuestion" />
</list>
<property name="userId" type="string" length="99" />
<property name="siteRole" type="string" length="99" />
<property name="startTime" type="java.util.Date" not-null="true" />
<property name="finishTime" type="java.util.Date" />
</class>
It randomly picks out a set of questions and sets the start time and a few other properties, then redirects the user to the TakeTheQuiz page after saving the object through hibernate.
On the TakeTheQuiz page it loads the attempt object by its ID, which is passed as a request param, then prints and formats it into an HTML form for the user to fill out the quiz. About 2 in 5 concurrent users will see no questions. The attempt object loads, but its question list is empty.
My theory is that the question list in the Attempt object is either not being inserted into the database immediately (which is fine as long as the object goes to the Hibernate cache and I can then get it from the cache, which I can't seem to figure out how to do), or it is being saved to the database but my load of the object on the TakeTheQuiz page is reading an incomplete object from the cache.
Admittedly my Hibernate knowledge is limited, so if someone can help me understand what could be happening here and how to fix it, please let me know.

The answer, as I found out, was simple. It seemed that my save function was committing to the database lazily. Once I forced commits for that object at the end of each transaction the problem was solved.
I ended up writing my own hibernate session code which looks like this:
Session session = getSession();
session.beginTransaction();
session.saveOrUpdate(attempt);
session.getTransaction().commit();
session.close();
Problem solved.
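To illustrate why an explicit commit can matter here, the following toy model queues writes until commit() is called, so a second request that reads in between sees an empty question list. It is only a sketch of the "lazy commit" idea described above. Every name in it is hypothetical and it does not use or reproduce Hibernate's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy write-behind session: saves are queued until commit(). Not Hibernate code. */
final class WriteBehindDemo {
    /** Stand-in for the database table, shared by all sessions. */
    static final Map<Long, List<String>> DATABASE = new HashMap<>();

    /** Writes made in this session that have not been committed yet. */
    private final Map<Long, List<String>> pending = new HashMap<>();

    /** Queues the attempt's question list; other readers cannot see it yet. */
    void save(long attemptId, List<String> questions) {
        pending.put(attemptId, new ArrayList<>(questions));
    }

    /** Pushes all queued writes to the shared store. */
    void commit() {
        DATABASE.putAll(pending);
        pending.clear();
    }

    /** What a different request would read for this attempt right now. */
    static List<String> readFromOtherRequest(long attemptId) {
        return DATABASE.getOrDefault(attemptId, new ArrayList<>());
    }

    public static void main(String[] args) {
        WriteBehindDemo session = new WriteBehindDemo();
        session.save(1L, Arrays.asList("Q1", "Q2"));
        System.out.println("before commit: " + readFromOtherRequest(1L));
        session.commit();
        System.out.println("after commit:  " + readFromOtherRequest(1L));
    }
}
```

In this model the redirect to TakeTheQuiz corresponds to readFromOtherRequest, which only returns the questions once the saving side has committed.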

My theory is that there is something wrong with the piece of code that randomly picks the questions. Are you sure that it works? Please paste some of your code.
A second theory is that there is something wrong with your transaction boundaries. When do you flush the session? And when is your transaction committed? Give it a try and set the FlushMode on your session to ALWAYS. Does this change something?

Duke Fast Deduplication: java.lang.UnsupportedOperationException: Operation not yet supported?

I'm trying to use the Duke Fast Deduplication Engine to search for some duplicate records in the database at the company where I work.
I run it from the command line like this:
java -cp "C:\utils\duke-0.6\duke-0.6.jar;C:\utils\duke-0.6\lucene-core-3.6.1.jar" no.priv.garshol.duke.Duke --showmatches --verbose .\config.xml
But I get an error:
Exception in thread "main" java.lang.UnsupportedOperationException: Operation not yet supported
at sun.jdbc.odbc.JdbcOdbcResultSet.isClosed(Unknown Source)
at no.priv.garshol.duke.datasources.JDBCDataSource$JDBCIterator.close(JDBCDataSource.java:115)
at no.priv.garshol.duke.Processor.deduplicate(Processor.java:152)
at no.priv.garshol.duke.Duke.main_(Duke.java:135)
at no.priv.garshol.duke.Duke.main(Duke.java:38)
My configuration file looks like this:
<duke>
<schema>
<threshold>0.82</threshold>
<maybe-threshold>0.80</maybe-threshold>
<path>test</path>
<property type="id">
<name>ID</name>
</property>
<property>
<name>LNAME</name>
<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
<low>0.6</low>
<high>0.8</high>
</property>
<property>
<name>FNAME</name>
<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
<low>0.6</low>
<high>0.8</high>
</property>
<property>
<name>MNAME</name>
<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
<low>0.3</low>
<high>0.5</high>
</property>
<property>
<name>SSN</name>
<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
<low>0.0</low>
<high>1.0</high>
</property>
</schema>
<jdbc>
<param name="driver-class" value="sun.jdbc.odbc.JdbcOdbcDriver" />
<param name="connection-string" value="jdbc:odbc:VT_DeDupe" />
<param name="user-name" value="aleer" />
<param name="password" value="**" />
<param name="query" value="select SocialSecurityNumber, LastName, FirstName, MiddleName, empssn from T_Employees" />
<column name="SocialSecurityNumber" property="ID" />
<column name="LastName" property="LNAME" />
<column name="FirstName" property="FNAME" />
<column name="MiddleName" property="MNAME" />
<column name="empssn" property="SSN" />
</jdbc>
</duke>
It doesn't really tell me what is unsupported...I'm just trying it out, nothing serious with the configuration yet.

As mbonaci says, the problem is that the JDBC driver's isClosed() method is not implemented, even though implementing it would be no harder than simply writing "return closed".
I added an ugly workaround for this issue now. Please do an "hg pull" and try again.

Which Java version are you using?
sun.jdbc.odbc.JdbcOdbcResultSet.isClosed first appeared in Java 1.6, and it still looks like this in 1.7 (I haven't checked in Java 8):
public boolean isClosed() throws SQLException {
throw new UnsupportedOperationException("Operation not yet supported");
}
So simply don't call that method. Use some other way of checking whether the result set is closed.
Or, if you cannot change the code, ask the project's authors for help (I see there was an effort to fix the exception thrown when closing the result set).
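One way to avoid depending on isClosed() is to treat "the driver cannot tell" as "still open" and close the result set anyway. The helper below is a minimal sketch of that idea using only the standard java.sql API. The class and method names are hypothetical, and this is not necessarily what Duke's own workaround does.

```java
import java.sql.ResultSet;
import java.sql.SQLException;

/** Closes a ResultSet without relying on the driver implementing isClosed(). */
final class SafeResultSets {
    private SafeResultSets() {}

    /** Closes rs unless the driver reliably reports that it is already closed. */
    static void closeSafely(ResultSet rs) throws SQLException {
        if (rs == null) {
            return;
        }
        boolean closed;
        try {
            closed = rs.isClosed();
        } catch (UnsupportedOperationException | AbstractMethodError e) {
            // Driver (for example an old JDBC-ODBC bridge) cannot answer; assume it is open.
            closed = false;
        }
        if (!closed) {
            rs.close();
        }
    }
}
```

This relies on the JDBC contract that calling close() on an already closed ResultSet is a no-op, so an unnecessary close is harmless with a compliant driver.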

basic Hibernate setup question: why is this resulting in one million null objects?

I have two tables: foo (primary key: foo_id) and foo_entry (primary key: foo_entry_id; foreign key: foo_id).
Below is my Hibernate config.
My problem is, when I call getAttributes() on the FooModel class, I end up with a list of a little over one million null objects. (foo table has ~200 rows, foo_entry has ~10,000).
I'm new to Hibernate and suspect I am just overlooking or am just not understanding something very, very basic. Any help appreciated!
<hibernate-mapping package="com.blah.www">
<class name="FooModel" table="foo">
<id name="fooId" column="foo_id"></id>
<list name="attributes" table="foo_entry">
<key column="foo_id" />
<index column="entry_id" />
<one-to-many class="FooEntryModel" />
</list>
</class>
</hibernate-mapping>
<hibernate-mapping package="com.blah.www">
<class name="FooEntryModel" table="foo_entry">
<id name="fooEntryId" column="foo_entry_id">
<generator class="native" />
</id>
<property name="fooId" type="int" column="foo_id" />
<property name="attrName" type="string" column="attr_name" />
<property name="attrValue" type="string" column="attr_value" />
<property name="startDate" type="timestamp" column="start_date" />
<property name="endDate" type="timestamp" column="end_date" />
</class>
</hibernate-mapping>

The numbers imply you're getting a Cartesian join. Do you have the FK set up in the database?
As an aside: I used Hibernate for a year and never hand-coded a model or one of those mapping files like the ones you show. We always reverse-engineered the database.

The first step in debugging is to look in the logs at the query Hibernate generated for you. However, I suggest you try this:
<list name="attributes">
<key column="foo_id" />
<one-to-many class="FooEntryModel" />
</list>

Sigh...
This turned out to have a very logical (and very subtle) explanation.
I had misunderstood and hijacked the semantics of the <index> (also known as <list-index>) tag within <list>. Namely, given:
<list name="attributes">
<key column="foo_id" />
<index column="some_integer_value" />
<one-to-many class="FooEntryModel" />
</list>
... I thought <index> referred to the attribute by which you want to order the list. In fact, it refers to an attribute whose value denotes the index position within the list at which to insert the object. It's meant to be a placeholder attribute, maintained and used entirely by Hibernate.
The value of "some_integer_value" to which I was mapping varied in my test data. Sometimes the value was less than 100. Sometimes it was greater than a million.
Thus, upon mapping just one row where "some_integer_value" == e.g. 100,001, Hibernate would create a list with that object inserted in the 100,001st position. Every list member preceding it, naturally, would be null.
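The padding effect described above can be reproduced with a plain ArrayList. This is only a toy model of filling an indexed list from (index, value) rows. The class and method names are hypothetical and this is not Hibernate's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an indexed list being filled from (index, value) rows. */
final class SparseListDemo {
    private SparseListDemo() {}

    /** Places value at the given position, padding any gap before it with nulls. */
    static void placeAt(List<String> list, int index, String value) {
        while (list.size() <= index) {
            list.add(null);
        }
        list.set(index, value);
    }

    public static void main(String[] args) {
        List<String> attributes = new ArrayList<>();
        // A single row whose index column happens to hold 100,001.
        placeAt(attributes, 100_001, "some attribute");
        long nulls = attributes.stream().filter(a -> a == null).count();
        System.out.println("size = " + attributes.size() + ", nulls = " + nulls);
    }
}
```

Here one row yields a list of 100,002 entries, all but one of them null, which matches the symptom in the question.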
