I'm using KNIME and I need to be able to cross-reference a value within a CSV file against a value I get from an Oracle DB.
Specifically, I need to match a ZIP code I get from the DB against a CSV file I have that contains ZIP codes and their corresponding counties.
I'm not really sure how to approach it. I've tried joins and cross joins, but the data ends up looking garbled and I'm unable to make any sense of it. Worst case, I end up looking things up manually.
If you are adding only one column (the county in this example), I prefer the Cell Replacer node with the option to append a column. It's easier to configure and faster than the Joiner node.
Just started out with my hobby project and now I am here to get help with making the correct database design/query. I have made a simple Java program that loops through the contents of a folder. I want to save this content to a MySQL database, so I added a connector to my database in Java and created a table with the columns "file", "path", "id", and "date" in MySQL.
So now to the important/fun part: every time I want to add the filenames to MySQL from Java, I do this (when the GUI button is pressed, I call a method that does the following):
DELETE all entries with the same file path - this is to ensure that I will get new entries which are exactly the same as the content in the path.
Java loop: INSERT the file info into the columns id, path, filename, and the date when the file was added to the database.
This way I can always ensure that the filenames in the database are up to date; it doesn't matter if I rename a file or remove it, because the table gets its entries deleted and the new info written. Old info -> DELETE old info -> INSERT new info -> up to date.
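Roughly, the method looks like this (simplified sketch; I'm calling the table "files" here, with the columns described above):

import java.io.File;
import java.sql.*;

// Simplified sketch of what the button handler does today; the table name
// "files" is just what I call it here, the columns are the ones described above.
public class FolderSync {
    public static void refresh(Connection con, File folder) throws SQLException {
        // 1) DELETE all entries with the same file path.
        try (PreparedStatement del = con.prepareStatement(
                "DELETE FROM files WHERE path = ?")) {
            del.setString(1, folder.getAbsolutePath());
            del.executeUpdate();
        }
        // 2) INSERT the current folder contents.
        File[] children = folder.listFiles();
        if (children == null) return;
        try (PreparedStatement ins = con.prepareStatement(
                "INSERT INTO files (path, file, date) VALUES (?, ?, ?)")) {
            for (File f : children) {
                ins.setString(1, folder.getAbsolutePath());
                ins.setString(2, f.getName());
                ins.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
                ins.addBatch();
            }
            ins.executeBatch();
        }
    }
}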
I know this is probably not the best solution, but it works. Now I am stuck on the next thing I want to do: I want to record the difference between two inserts, so I know which files have been added and deleted, and here is my problem - since the entries are deleted before a new INSERT, I cannot compare. How would you change the design or the solution? All ideas are welcome, and since I am so fresh I would really appreciate it if you could show me what the query could look like.
Do not remove all rows first. Remove only the ones whose files were actually removed (or even better, just mark them "inactive", as I suggest below). Query your DB first to see what was there last time.
I would maintain an additional column in your table called "inactive". It will be FALSE by default and TRUE for removed files. Please keep in mind that since your file is uniquely identified by file+path+id, renaming a file is really an operation of deleting the old one and creating a new one.
Removing things from the DB is not a good idea anyway, as you might remove something by accident (a bug in the code) and not be able to get the data back.
An additional thing to do is to add a hash of the file to your table. This way you will be able to check whether the file really changed; there is no need to re-add a file to the DB if it has not changed. See Getting a File's MD5 Checksum in Java for more info.
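A minimal sketch of computing such a hash, assuming the files are small enough to read fully into memory (for large files you would stream them through the digest instead):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Minimal MD5 sketch; reads the whole file into memory and hex-encodes the digest.
public class FileHash {
    public static String md5Of(String path) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(Files.readAllBytes(Paths.get(path)));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}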
One way to achieve this is to implement auditing of your table. A common approach is to create a copy of the table where you store the folder contents and name it using a convention that indicates it holds audit information (e.g. a _AUD suffix). You then add extra columns to the AUD table, like "REV" (revision) and "REV_TYPE" (inserted, deleted, modified). Whenever you insert, update or delete rows in your main table, you insert a row into the AUD table describing what you've done. You can then find the operations associated with each revision by looking them up in the AUD table. A Java framework that provides this feature is Hibernate Envers (http://hibernate.org/orm/envers/).
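With Hibernate Envers, for example, the audit table and revision bookkeeping come from a single annotation. The entity below is a hedged sketch; its name and fields are invented for illustration, not the poster's actual schema:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.envers.Audited;

// Annotating the entity with @Audited makes Envers maintain an audit table
// (FolderFile_AUD by default) with revision and revision-type columns.
@Entity
@Audited
public class FolderFile {
    @Id
    @GeneratedValue
    private Long id;

    private String path;
    private String fileName;

    // getters/setters omitted for brevity
}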
I have a lot of data in one table and a small amount of data in another table. Is there any way to run a GoldenGate initial load so that the data that is already the same in both tables won't be changed, and only the rest of the data gets transferred from one table to the other?
Initial loads are typically for when you are setting up the replication environment; however, you can do this on single tables as well. Everything in the Oracle database is driven by System Change Numbers (SCNs), which GoldenGate tracks as Commit Sequence Numbers (CSNs).
By using the SCN/CSN, you can identify what the starting point in the table should be and start CDC from there. Anything prior to that SCN/CSN will not get captured and would have to be moved manually in some fashion; that can be done with Oracle Data Pump (export/import).
Oracle GoldenGate also provides a parameter called SQLPREDICATE that lets you apply a "where" clause to a table. This is handy with initial-load extracts because you would do something like TABLE <schema>.<table>, SQLPREDICATE "as of <SCN>". The data up to that point would then be captured and moved to the target side for a Replicat to apply into the table. You can reference that here:
https://www.dbasolved.com/2018/05/loading-tables-with-oracle-goldengate-and-rest-apis/
Official Oracle Doc: https://docs.oracle.com/en/middleware/goldengate/core/19.1/admin/loading-data-file-replicat-ma-19.1.html
On the Replicat side, you would use HANDLECOLLISIONS to handle any duplicate-row errors. Then, once the load is complete, remove it from the parameter file.
Lots of details, but I'm sure this is a good starting point for you.
That would require programming in Java:
1) First, read your database.
2) Decide which data has to be added to which table on the basis of the data that was read.
3) Execute update/insert queries to submit the data to the tables.
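In JDBC terms, those three steps might look roughly like this (the table and column names are placeholders, not names from the question):

import java.sql.*;

// Rough outline of the three steps above; "source_table", "target_table"
// and their columns are placeholders.
public class ManualLoad {
    public static void copy(Connection con) throws SQLException {
        try (Statement read = con.createStatement();
             ResultSet rs = read.executeQuery("SELECT id, name FROM source_table"); // 1) read
             PreparedStatement write = con.prepareStatement(
                     "INSERT INTO target_table (id, name) VALUES (?, ?)")) {
            while (rs.next()) {
                if (rs.getString("name") != null) {        // 2) decide per row
                    write.setLong(1, rs.getLong("id"));    // 3) submit the data
                    write.setString(2, rs.getString("name"));
                    write.addBatch();
                }
            }
            write.executeBatch();
        }
    }
}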
If you want to run an initial load using GoldenGate, the target tables should be empty:
"Data: Make certain that the target tables are empty. Otherwise, there may be duplicate-row errors or conflicts between existing rows and rows that are being loaded." (Link to Oracle documentation)
If they are not empty, you have to handle the conflicts. For instance, if a row you are inserting already exists in the target table (INSERTROWEXISTS), you could discard it, if that's what you want to do. (Link to Oracle documentation)
I have a table A [id, name] with about, say, 10 million records. I need to replace all the names with 10 million unique names. For this I have a text file that acts as a lookup file; it has 10 million names in it, separated by newlines. So, a bunch of questions:
How do I go about randomly replacing these 10 million names in the database with the 10 million names in the text file? I can think of a few approaches: cache the entire file and keep a map of what has been replaced, so that I never reuse entries from the lookup file, OR load the lookup file into a database table and make use of that table.
In general, what ratio of writes to reads would make the case for using a database instead of a file? Say your program reads a file a million times and writes to another file a million times - would you switch to using a database? What's the upper limit, really (if there is one)?
Well, let's think: you have so many names that they can't all be loaded into memory, so let's find a solution that is as usable as possible.
For the random approach, you can create a temp column in the database, create a unique key over it, and always do this:
1) Take the name on line "x" (chosen at random or however you want).
2) Pick a random record "y" in the database which has not been replaced yet (this can be tracked with just one boolean).
3) Try to write the name from line x into record y AND write x into the temp column of the same record.
4) If a unique-constraint error comes back, the name was already given to another record; repeat once more with another x.
If we can track "x" and be sure we are never reusing already-given names, we don't need the unique key at all.
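A hedged sketch of that loop in plain JDBC; the table/column names ("table_a", "name", "temp_line"), the sequential walk over ids, and the unique key sitting on temp_line are all my assumptions:

import java.sql.*;
import java.util.Random;

// Sketch of the random-assignment loop described above.
public class RandomRenamer {

    // How line x is fetched is up to you (indexed file, a second table, ...).
    public interface NameLookup { String nameAt(int x); }

    public static void assign(Connection con, NameLookup names, int nameCount,
                              long rowCount) throws SQLException {
        Random rnd = new Random();
        String sql = "UPDATE table_a SET name = ?, temp_line = ? WHERE id = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            // For simplicity this walks the records in order; names stay random.
            for (long id = 1; id <= rowCount; id++) {
                while (true) {
                    int x = rnd.nextInt(nameCount);      // 1) pick line x at random
                    ps.setString(1, names.nameAt(x));    // 3) name goes to record y...
                    ps.setInt(2, x);                     //    ...and x into the temp column
                    ps.setLong(3, id);                   // 2) record y, not replaced yet
                    try {
                        ps.executeUpdate();
                        break;                           // success, next record
                    } catch (SQLIntegrityConstraintViolationException dup) {
                        // 4) unique key says line x was already used; retry with another x
                    }
                }
            }
        }
    }
}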
Instead of replacing the names randomly, if I were in your place I would opt for a batch-based approach and process the data in chunks: a reader to read a chunk, a processor to apply the new values from the file, and a writer to write the updated values back to the database.
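A rough sketch of that chunked read/process/write loop with plain JDBC batching; the table and column names are assumptions, and it pairs the n-th row (by id) with the n-th name in the file:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.*;

// Chunked update sketch; "table_a" and its columns are assumptions, and the
// ids are assumed to be contiguous starting at 1.
public class ChunkedRenamer {
    private static final int CHUNK = 10_000;

    public static void run(Connection con, String namesFile) throws Exception {
        con.setAutoCommit(false);
        String sql = "UPDATE table_a SET name = ? WHERE id = ?";
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(namesFile));
             PreparedStatement ps = con.prepareStatement(sql)) {
            String name;
            long id = 1;
            int inBatch = 0;
            while ((name = reader.readLine()) != null) {    // read
                ps.setString(1, name);                       // process
                ps.setLong(2, id++);
                ps.addBatch();                               // write
                if (++inBatch == CHUNK) {
                    ps.executeBatch();
                    con.commit();                            // one chunk at a time
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {
                ps.executeBatch();
                con.commit();
            }
        }
    }
}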
Your second question is a bit unclear. The decision to go with files is purely based on the requirement: if you are getting the data in flat files, you will have to read from them. Even with billions of rows, moving all the data from a flat file into a database table and then using that table to update another table is overkill - you are unnecessarily persisting data that you will not use again once the intended table column is updated.
To avoid creating SQL statements as strings in a class, I've placed them as .sql files in the same package and read the contents into a string in a static initializer. The reason for this is that the SQL is very complex, due to the ERP system it is querying.
There's no real problem with this method, except that the reading mechanism simply reads the whole file, trimming excess whitespace and removing newlines as it goes; once the newlines are gone, a comment at the end of a line ends up commenting out everything that follows, so the read effectively fails. (Fully commented lines, i.e. lines beginning with --, are removed.)
I could enhance the simple reader to strip end-of-line comments as well, though I have to wonder whether there is something already available that can read an SQL file and clean it up.
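If you do end up rolling it yourself, the cleanup is only a few lines. A hedged sketch (the class and file names are placeholders, and it only handles -- comments, not /* */ blocks or -- inside string literals):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Loads a .sql resource from the same package, dropping -- comments (both
// full-line and end-of-line) before joining the lines with spaces.
public class Dao {
    private static final String COMPLEX_QUERY = loadSql("complexQuery.sql");

    static String loadSql(String resourceName) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                Dao.class.getResourceAsStream(resourceName), StandardCharsets.UTF_8))) {
            return in.lines()
                     .map(line -> {
                         int c = line.indexOf("--");
                         return c >= 0 ? line.substring(0, c) : line;
                     })
                     .map(String::trim)
                     .filter(line -> !line.isEmpty())
                     .collect(Collectors.joining(" "));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}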
I've seen this same problem solved in a project I've worked on by storing queries in XML, and loading the XML into a custom StoredQueriesCache object at runtime. To get a query, we would call a method on the StoredQueriesCache object and just pass the query name (which is defined in the XML), and it would return the query.
Writing something like this is fairly simple. The XML would look something like this below...
<Query>
<Name>SomeUniqueQueryName</Name>
<SQL>
SELECT someColumn FROM someTable WHERE somePredicate
</SQL>
</Query>
You would have one such element for every stored query. The XML would be loaded into memory from file at application startup or, depending on your needs, lazily loaded. Your StoredQueriesCache object holding the XML would then have methods that return individual queries by name. In my experience, having comments in the query has never caused any issue, since line breaks are part of the XML node's inner text; but if you want, the StoredQueriesCache methods that retrieve the queries could parse comments out.
I've found this to be the most organized way of storing queries without embedding them in code, and without using stored procedures. There should honestly be a library that does this for you; maybe I'll write one!
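A hedged sketch of such a cache using the JDK's DOM parser; the element names follow the XML above, it assumes the <Query> elements are wrapped in a single root element, and the file name is a placeholder:

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Loads <Query><Name>...</Name><SQL>...</SQL></Query> entries into a map.
public class StoredQueriesCache {
    private final Map<String, String> queries = new HashMap<>();

    public StoredQueriesCache(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml);
        NodeList nodes = doc.getElementsByTagName("Query");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element q = (Element) nodes.item(i);
            String name = q.getElementsByTagName("Name").item(0).getTextContent().trim();
            String sql = q.getElementsByTagName("SQL").item(0).getTextContent().trim();
            queries.put(name, sql);
        }
    }

    public String get(String name) {
        return queries.get(name);
    }
}

// Usage ("queries.xml" is a placeholder classpath resource):
// StoredQueriesCache cache = new StoredQueriesCache(
//         StoredQueriesCache.class.getResourceAsStream("queries.xml"));
// String sql = cache.get("SomeUniqueQueryName");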
I'm quite new to Java programming and am writing my first desktop app. The app takes a unique ISBN and first checks whether it's already held in the local DB; if it is, it just reads from the local DB, and if not, it requests the data from isbndb.com and enters it into the DB. The local DB is in XML format. What I'm wondering is which of the following two methods would create the least overhead when checking whether an entry already exists.
Method 1.) File Exists.
On creating a DB entry, the app would create a separate file for every ISBN, named after the ISBN number (i.e. 3846504937540.xml), and when checking it would use the file-exists method with the user-provided ISBN to see whether an entry already exists.
Method 2.) SAX XML Parser.
All entries would be written into a single large XML file, and when checking for existing entries the SAX XML parser would be used to parse the file and check the user-provided ISBN against those in the XML DB for a match.
Note:
The resulting entries could number in the thousands over time.
Any information would be greatly appreciated.
I don't think either of your methods is all that great. I strongly suggest using a DBMS to store the data. If you don't have a DBMS on the system, or if you want an app that can run on systems without an installed DBMS, take a look at using SQLite. You can use it from Java with SQLiteJDBC by David Crawshaw.
As far as your two methods are concerned, the first will generate a huge amount of file clutter, not to mention maintenance and consistency headaches. The second will be slow once you have a sizable number of entries, because you basically have to read (on average) half the database for every query. With a DBMS, you can avoid this by defining indexes for the info you need to look up quickly; the DBMS will maintain the indexes automatically.
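For a feel of how little code that is, here is a hedged sketch over JDBC; the books table and its columns are just an illustration, not an existing schema:

import java.sql.*;

// SQLite through JDBC: create the table once, then look books up by ISBN.
public class BookDb {
    public static void main(String[] args) throws Exception {
        Class.forName("org.sqlite.JDBC"); // needed for older drivers
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:books.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS books "
                         + "(isbn TEXT PRIMARY KEY, title TEXT, author TEXT)");
            }
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT title FROM books WHERE isbn = ?")) {
                ps.setString(1, "3846504937540");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("Cached locally: " + rs.getString("title"));
                    } else {
                        System.out.println("Not cached; fetch from isbndb.com and INSERT");
                    }
                }
            }
        }
    }
}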
I don't much like the idea of relying on the file system for that task: I don't know how critical your application is, but many things can happen to those XML files :) Plus, if the folder gets very, very big, you would need to think about splitting the files into some hierarchical folder structure to keep decent performance.
On the other hand, I don't see the point of using an XML file as a database if you need to update it frequently.
I would use a relational database, and add a new record in a table for each entry, with an index on the isbn_number column.
If you are in the thousands of records, you may very well go with SQLite, and you can replace it with a more powerful non-embedded DB if you ever need to, with no (or little :) ) code modification.
I think you'd better use a DBMS instead of either of your two methods.
If you want the least overhead just for checking existence, then option 1 is probably what you want, since it's a direct lookup. Parsing the XML for every check requires you to pass through the whole XML file in the worst case. You could add caching to option 2, but that gets more complicated than option 1.
With option 1, though, be aware that there is a limit to how many files you can store under a single directory, so you would probably have to spread the XML files over multiple layers (for example /xmldb/38/46/3846504937540.xml).
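Something along these lines could derive the layered path (the base directory here is a placeholder):

import java.io.File;

// Builds a two-level layout from the leading digits of the ISBN,
// e.g. /xmldb/38/46/3846504937540.xml ("/xmldb" is a placeholder base dir).
public class XmlDbLayout {
    static File fileFor(String isbn) {
        return new File("/xmldb/" + isbn.substring(0, 2) + "/"
                + isbn.substring(2, 4) + "/" + isbn + ".xml");
    }
}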
That said, neither of your options is a good way to store data in the long run; you will find them quite restrictive and hard to manage as the data grows.
People have already recommended using a DBMS and I agree. On top of that, I would suggest you look into a document-oriented database like MongoDB.
Extend your DB table to include not only the XML string but also the ISBN number.
Then you select the XML column based on the ISBN column.
Query (Java, with the ISBN escaped): "select XMLString from cacheTable where isbn='" + isbn + "'"
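Or, as an alternative that avoids escaping the ISBN by hand, a parameterized version (table and column names taken from the query above):

import java.sql.*;

// The same lookup with a PreparedStatement, so the ISBN never needs escaping.
public class XmlCacheDao {
    static String loadXml(Connection con, String isbn) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT XMLString FROM cacheTable WHERE isbn = ?")) {
            ps.setString(1, isbn);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("XMLString") : null;
            }
        }
    }
}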
A different approach could be to use an ORM like Hibernate.
With an ORM, instead of saving the whole XML document in one column, you use a different column for each element and attribute, and you could even split your document over several tables for a simpler long-term design.
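A hedged sketch of what such an entity could look like (all names here are invented for illustration):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Illustrative Hibernate/JPA entity: each piece of the book record becomes a
// column instead of one big XML string.
@Entity
@Table(name = "books")
public class Book {
    @Id
    @Column(name = "isbn")
    private String isbn;

    @Column(name = "title")
    private String title;

    @Column(name = "author")
    private String author;

    // getters/setters omitted for brevity
}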