Writing APIs
I used to validate all input parameters on the Java (or PHP, whatever) side, but now we have moved our DBs to PostgreSQL, which gives us great JSON features, like building JSON from table rows and a lot more (so far I haven't found anything we can't do with the PGSQL JSON functions). So I thought: what if I move all parameter validation to Postgres (also considering that I can return JSON straight from the database)?
In Java I did it like this:
if (!params.has("signature"))
    // params comes from @RequestBody, cast to JSONObject
    return errGenerator.genErrorResponse("e01"); // this also needs database access to get the error description
In Postgres I would do it like this (tested, works as expected):
CREATE OR REPLACE FUNCTION test.testFunc(_object JSON)
RETURNS TABLE(result JSON) AS
$$
BEGIN
    IF (_object -> 'signature') IS NULL -- so the needed param is missing
    THEN
        RETURN QUERY (SELECT row_to_json(errors)
                      FROM errors
                      WHERE errcode = 'e01');
    ELSE -- everything is okay
        RETURN QUERY (SELECT row_to_json(other_table)
                      FROM other_table);
    END IF;
END;
$$
LANGUAGE plpgsql;
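Calling the function from Java stays simple; here is a minimal JDBC sketch (the connection details are assumptions, and ?::json casts the string parameter to the function's JSON type):

import java.sql.*;

public class ValidateViaDb {
    // connection details are assumptions; adjust to your environment
    static final String URL = "jdbc:postgresql://localhost:5432/mydb";

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT result FROM test.testFunc(?::json)")) {
            ps.setString(1, "{\"signature\": \"abc\"}"); // the JSON object to validate
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // either the e01 error row or the data rows, already as JSON
                    System.out.println(rs.getString("result"));
                }
            }
        }
    }
}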
And so on...
The one problem I see so far is that if we move to MS SQL or Sybase we will need to rewrite all the procedures. But with NoSQL becoming more and more common, that seems unlikely, and if we moved to a NoSQL DB we would have to recode all the APIs anyway.
You have to consider basically two items:
The closer you put your checks to the data storage, the safer it is. If you have the database perform all the checks, they'll be performed no matter how you interface with it, whether through your application or through some third-party tool you might be using (if only for maintenance). In that sense, checking on the database side improves security (as in "data consistency"), so it makes perfect sense to have the database perform the checks.
The closer you put your checks to the user, the faster you can respond to their input. If you have a web application that needs fast response times, you probably want to have the checks on the client side.
And take one more important item into consideration:
You might also have to consider your team's knowledge: what the developers are more comfortable with. If you know your Java library much better than you know your database functions... it might make sense to perform all the checks Java-side.
There is a third way: do both checks in series, first application (client) side, then database (server) side. Unless you have some sophisticated automation, this involves extra work to keep all the checks consistent; that is, there shouldn't be any data blocked at the client side that would be allowed to pass when checked by the database. At least the most basic checks are performed in the first stages, and all of them (even if redundant) are performed by the database.
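As a rough sketch of what that series looks like in practice (the requests table and its NOT NULL constraint are assumptions, not something from your schema):

import java.sql.*;

public class SignatureDao {
    // Application-side check first, database constraint second.
    public void saveSignature(Connection con, String signature) throws SQLException {
        if (signature == null || signature.trim().isEmpty()) {
            throw new IllegalArgumentException("signature is required"); // fast client-side rejection
        }
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO requests (signature) VALUES (?)")) {
            ps.setString(1, signature);
            ps.executeUpdate(); // the database repeats the check; a constraint violation throws SQLException
        }
    }
}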
If you can afford the time to move the data through several application layers, I'd go with safety. However, the choice to be made is case-specific.
So I found some key points... The main one is that I can keep the error messages cached in my application, which avoids a database request when the input parameters fail validation, and only go to the database to fetch the result data.
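A minimal sketch of that cache, assuming the errors table above has a description column:

import java.sql.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ErrorCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Load the errors table once at startup.
    public void load(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT errcode, description FROM errors")) {
            while (rs.next()) {
                cache.put(rs.getString("errcode"), rs.getString("description"));
            }
        }
    }

    // Returns the cached error text, with no database round trip.
    public String describe(String errcode) {
        return cache.get(errcode);
    }
}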
Related
I tried this:
Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/sonoo","root","password");
but it's very easy for someone to extract the username and password strings.
Opening the application with zip, WinRAR or any other archive program lets them look inside and read the code.
How can I secure my connection?
You need to decide what permissions someone who gets a copy of your JAR has. Do they have permission to run database queries or not?
If they should not: delete the database connection. They don't have permission.
If they should: then they can have the password. They have permission.
What seems to be tripping you up is that you are giving out the root password for your database, and so you want the third option: "They should be able to do some database queries, but not others."
The JAR file is the wrong place to try to solve that problem. If you try to solve it at the JAR file level, one of two things will happen. Either your users were trustworthy all along and you wasted your time with whatever elaborate scheme you used, or some of your end users are untrustworthy and one of them will hack you. They will hack you by stepping through the code in the JVM and editing your query strings right before the JVM sends them out, at the very last second, if they absolutely have to. Everything you do at this level will be security theater, like getting frisked at the airport: it doesn't make you significantly safer, but there is a tiny chance that you can say "but we encrypted it!" and your clients might not dump you after the inevitable security breach.
That problem needs to be solved within the database, by creating a user account which does not have the permissions that they should not have. When you do SHOW GRANTS FOR enduser@'%' it will show you only the sorts of queries that they are allowed to do.
In many cases you want to give the user account more fine-grained permissions than just INSERT, SELECT, or UPDATE on a table. For example, you might have the logic "you can add to this table, but only if you also update the numbers in this other table." For these, you should use stored procedures, which can have their permissions set to either "definer" or "invoker": the procedure is defined by a user with the appropriate permissions, and the invoker temporarily gains those elevated permissions to do this particular query.
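From the application's point of view that might look like the following sketch; the account, procedure name and parameters are all hypothetical, the point being that the connecting user has EXECUTE rights on the procedure and nothing else:

import java.math.BigDecimal;
import java.sql.*;

public class OrderClient {
    public static void main(String[] args) throws SQLException {
        // "enduser" has EXECUTE on add_order only: no direct INSERT/UPDATE grants.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/shop", "enduser", "password");
             CallableStatement cs = con.prepareCall("{CALL add_order(?, ?)}")) {
            cs.setInt(1, 42);                             // customer id (hypothetical)
            cs.setBigDecimal(2, new BigDecimal("19.99")); // order amount (hypothetical)
            cs.execute(); // runs with the definer's rights inside the database
        }
    }
}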
In some cases you have a nasty situation where you want to distribute the same application to two different clients, but they would both benefit significantly (at the expense of the other!) from being able to read each other's data. For example you might be an order processor dealing with two rival companies; either one would love to see the order history of the other one. For these cases you have a few options:
Move even SELECT statements into stored procedures. A stored procedure can call user() which can still give you the logged-in user, even though they are not the definer.
Move the database queries out of the shared JAR. Like @g-lulu says above, you can create a web API which you lock down really well, or something like that.
Duplicate the database, move the authentication parameters to a separate file which you read on startup.
Option 3 requires you to write tooling to maintain multiple databases as perfect duplicates of each other's structure, which sucks. However, it has a nice benefit over (1) and (2): a shared database inevitably leaks some information. For example, an auto_increment ID column could leak how many orders are being created globally, and there might be ways to determine something like, "oh, they send all of their orders through this unusual table access, so that will also bump this ID at the same time, so I just need to check whether both IDs are bumped and that'll reveal an order for our rival company."
You can create a webservice in PHP (or Java or something else). This webservice lives on a server and contains the access credentials and the queries for your database.
From your desktop app, just send a request (POST, GET) to your webservice.
Example PHP webservice:
if (isset($_POST['getMember'])) {
    // $pdo is an existing PDO connection; the table and query are illustrative
    $rows = $pdo->query('SELECT * FROM member')->fetchAll(PDO::FETCH_ASSOC);
    // encode the result as JSON and send it back to the caller
    echo json_encode($rows);
}
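On the desktop side, the request can be a plain HTTP POST; here is a sketch using Java 11's built-in HTTP client (the URL and the getMember parameter mirror the PHP example and are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MemberClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/members.php")) // hypothetical endpoint
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("getMember=1"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the JSON built by the webservice
    }
}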
I have a lot of single-line select queries in my application, with multiple joins spanning 5-6 tables. These queries are generated with StringBuilders, based on many conditions from form input and so on. However, my team lead, who happens to be a SQL developer, has asked me to convert those single-line queries to stored procedures.
Is there any advantage to moving these single-line select queries to the backend and doing all the if/else logic there in stored procedures?
One advantage of having all your SQL in stored procedures is that you keep your queries in one place, the database, so it is a lot easier to change or modify them without making a lot of changes in the application or front-end layer.
Besides, DBAs or SQL developers can fine-tune the SQL if it is stored in database procedures. You can keep all your functions/stored procedures in a package, which is better in terms of performance and for organizing your objects (similar to creating packages in Java). And of course, in packages you can restrict direct access to their objects.
This is more a matter of team or department policy on where to keep the SQL, whether in the front end or in the database itself, and of course, as @Gimby mentioned, people have different views.
Update 1
If you have a select statement which returns something, use a function; if you have INSERT/UPDATE/DELETE or similar stuff, like sending emails or other business rules, then use a procedure. Call these from the front end by passing parameters.
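As a sketch of how both are called from the front end over JDBC (the connection details, get_member_count and add_member are hypothetical names):

import java.sql.*;

public class MemberCalls {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "app", "password")) {

            // Function: returns a value, registered as an OUT parameter.
            try (CallableStatement fn = con.prepareCall("{? = call get_member_count(?)}")) {
                fn.registerOutParameter(1, Types.INTEGER);
                fn.setString(2, "ACTIVE");
                fn.execute();
                System.out.println(fn.getInt(1)); // the function's return value
            }

            // Procedure: performs the INSERT (or other business rule) itself.
            try (CallableStatement proc = con.prepareCall("{call add_member(?, ?)}")) {
                proc.setString(1, "Alice");
                proc.setString(2, "alice@example.com");
                proc.execute();
            }
        }
    }
}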
I'm afraid that is a question that will result in many different answers based on many different personal opinions.
It's business logic you are talking about here; in any case, in -my- opinion that belongs in the application layer. But I know a whole club of Oracle devs who wholeheartedly disagree with me.
If you use PreparedStatement in Java, then there is no big difference in performance between Java queries and stored procedures. (If you use Statement in Java, then you have a problem.)
But stored procedures are a good way to organize and reuse your SQL code. You can group them in packages, you can change them without recompiling the Java code, and your DBA or SQL specialist can tune them.
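To illustrate the point (a sketch; the users table is an assumption):

import java.sql.*;

public class QueryStyles {
    static void printNames(Connection con, int[] ids) throws SQLException {
        // PreparedStatement: parsed once, plan reusable, parameters bound safely.
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT name FROM users WHERE id = ?")) {
            for (int id : ids) {
                ps.setInt(1, id); // only the bind value changes between executions
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        }
        // The problematic Statement version re-parses the SQL on every call
        // and invites SQL injection:
        //   st.executeQuery("SELECT name FROM users WHERE id = " + id);
    }
}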
There's always a "do everything in the application service layer" vs. "do everything in a DB procedure" argument in my workplace.
What I've gathered is that making the application server and the DB communicate too often is a rather expensive operation. My question is: just how expensive is it?
Say we have this example: I have a list of users in my Java application, and I need to bind a certain attribute to each of them. Let's say there are 20 users and 20 attributes to be stored. Just how much more expensive is it to make 20 calls to an Oracle procedure with parameters (employee_id, attribute_value), rather than making 1 call and sending all employee_ids and their matching attribute_values at once?
edit:
Ok, maybe I didn't state my case clearly - I'll "dumbify" it a bit :)
How much more expensive is it to make n calls to an Oracle procedure that does 1 insert, rather than making 1 call to an Oracle procedure that does n inserts (where the n inserts are basically 1 insert looped n times)? The reason for doing it in n calls rather than in 1 go is that, for a newbie, it's definitely easier to write a loop in Java that does n procedure calls with simple datatypes as input (i.e. integer, varchar2, etc.) than to figure out how to pass an array from Java to Oracle.
You need to take a case-by-case view of how expensive it is to get the data. It depends on the SLA you are adhering to.
In the example you gave, if not all users are logged into your application simultaneously and the "attribute" has a different value for each user, there is no point in fetching it all in one go.
If, however, some of the attributes represent static data, it makes sense to cache them in the application and use the cached data.
You really need to make a case-by-case decision. Just because it is expensive to fetch data doesn't mean you fetch it all in one go.
As for how expensive it will be: if you are using a datasource and connection pool (which almost all apps use these days), if you use prepared statements, if you use BULK COLLECT in your procedures, or if you are using Hibernate (with an optimal fetch size), it should not be very costly.
The relation is definitely not linear, i.e. it won't cost you 20 times a single call.
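As a rough sketch, plain JDBC batching already gives you the one-round-trip behaviour without touching Oracle array types (the table name and the attributes map are assumptions):

import java.sql.*;
import java.util.Map;

public class AttributeWriter {
    static void saveAll(Connection con, Map<Integer, String> attributes) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO employee_attributes (employee_id, attribute_value) VALUES (?, ?)")) {
            for (Map.Entry<Integer, String> e : attributes.entrySet()) {
                ps.setInt(1, e.getKey());
                ps.setString(2, e.getValue());
                ps.addBatch();     // queued locally; nothing is sent yet
            }
            ps.executeBatch();     // all 20 inserts travel to the database in one round trip
        }
    }
}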
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some of the processing in the database, rather than in the application, a good idea?
When is this recommended and when not?
Are there pros and cons?
Is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was slow and dirty. My worry is that I can't do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
My worry is that I can't do in the database what I can do in Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the processing involves a lot of disparate SQL statements that can be combined into a stored procedure, then you should do the processing in the stored procedure and call the stored proc from your Java application. This way you avoid making several network trips to the database server.
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; a lot of modern databases support it.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If there are parameters about the data that can be put into a WHERE clause to limit the amount of data being fetched, that is the way to go in my opinion. You will surely need to run EXPLAIN on your SQL, check that the correct indices are in place, and check the index clustering ratio and the type of index; all of that will make a difference. Now, if you can't fully eliminate all the "improper names", you should get rid of as many as you can with SQL and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want a batch application that stages the data before the web application queries it.
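As a sketch of what that looks like from Java (the token table's columns and the PROPER_NAME filter value are assumptions):

import java.sql.*;

public class TokenScanner {
    // Stream the rows instead of loading 10 million at once.
    static void scan(Connection con) throws SQLException {
        con.setAutoCommit(false); // some drivers (e.g. PostgreSQL) only stream inside a transaction
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, word FROM token WHERE category = ? ORDER BY id")) {
            ps.setString(1, "PROPER_NAME"); // push the filter into the WHERE clause
            ps.setFetchSize(1000);          // fetch 1000 rows per round trip, not all of them
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getInt("id"), rs.getString("word"));
                }
            }
        }
    }

    static void process(int id, String word) {
        // the actual proper-name logic goes here
    }
}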
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the DB for every single thing is tedious and affects performance. There are several ways to get around this: you can use indexing, caching, or tools such as Hibernate, which can keep data in memory so that you don't need to query the DB for every operation. There are also tools such as Lucene indexers, which are very popular and could solve your problem of hitting the DB every time.
I'm designing a multi-tiered database driven web application – SQL relational database, Java for the middle service tier, web for the UI. The language doesn't really matter.
The middle service tier performs the actual querying of the database. The UI simply asks for certain data and has no concept that it's backed by a database.
The question is how to handle large data sets? The UI asks for data but the results might be huge, possibly too big to fit in memory. For example, a street sign application might have a service layer of:
StreetSign getStreetSign(int identifier)
Collection<StreetSign> getStreetSigns(Street street)
Collection<StreetSign> getStreetSigns(LatLonBox box)
The UI layer asks for all street signs meeting some criteria. Depending on the criteria, the result set might be huge. The UI layer might divide the results into separate pages (for a browser) or just present them all (when serving Google Earth). The potentially huge result set could be a performance and resource problem (out of memory).
One solution is not to return fully loaded objects (StreetSign objects), but rather some sort of result set or iterator that lazily loads each individual object.
Another solution is to change the service API to return a subset of the requested data:
Collection<StreetSign> getStreetSigns(LatLonBox box, int pageNumber, int resultsPerPage)
Of course the UI can still request a huge result set:
getStreetSigns(box, 1, 1000000000)
I'm curious what is the standard industry design pattern for this scenario?
The very first question should be:
Does the user need to, or is the user even capable of, managing this amount of data?
Even though the result set should be paged, if its potential size is that huge the answer will probably be "no", so the UI shouldn't try to show it.
I worked on J2EE projects for health care systems that deal with enormous amounts of stored data, literally millions of patients, visits, forms, etc., and the general rule is not to show more than 100 or 200 rows for any user search, advising the user that that set of criteria produces more information than he can digest.
The way to implement this varies from one project to another. It is possible to force the UI to ask the service tier for the size of a query before launching it, or to throw an Exception from the service tier if the result set grows too large (though this couples the service tier to a limitation of the UI).
Be careful! This does not mean that every method in the service tier must throw an Exception if its result exceeds 100 rows. This general rule only applies to result sets that are shown to the user directly, which is a further reason to place the control in the UI rather than in the service tier.
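A sketch of the "ask the size first" variant, using the street-sign example from the question (the service and UI types and their methods are hypothetical):

public class SignSearchController {
    static final int MAX_DISPLAYABLE_ROWS = 200; // per the 100-200 row rule above

    private final SignService service; // hypothetical service-tier facade
    private final SignView ui;         // hypothetical UI callback

    SignSearchController(SignService service, SignView ui) {
        this.service = service;
        this.ui = ui;
    }

    void showSigns(LatLonBox box) {
        long matches = service.countStreetSigns(box); // a cheap COUNT(*) first
        if (matches > MAX_DISPLAYABLE_ROWS) {
            ui.warnTooManyResults(matches);  // ask the user to narrow the criteria
        } else {
            ui.show(service.getStreetSigns(box));
        }
    }
}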
The most frequent pattern I've seen for this situation is some sort of paging, usually done server-side to reduce the amount of information sent over the wire.
Here's a SQL Server 2000 version using a table variable (generally faster than a temp table), applied to your street signs example:
CREATE PROCEDURE GetPagedStreetSigns
(
    @Page int = 1,
    @PageSize int = 10
)
AS
SET NOCOUNT ON

-- This table variable will control paging
DECLARE @TempTable TABLE (RowNumber int identity, StreetSignId int)

INSERT INTO @TempTable
(
    StreetSignId
)
SELECT [Id]
FROM StreetSign
ORDER BY [Id]

-- select only the rows belonging to the requested page
SELECT SS.*
FROM StreetSign SS
INNER JOIN @TempTable TT ON TT.StreetSignId = SS.[Id]
WHERE TT.RowNumber BETWEEN ((@Page - 1) * @PageSize + 1)
      AND (@Page * @PageSize)
In SQL Server 2005, you can get more clever with stuff like Common Table Expressions and the new SQL Ranking functions. But the general theme is that you use the server to return only the information belonging to the current page.
Be aware that this approach can get messy if you're allowing the end-user to apply on-the-fly filters to the data that s/he's seeing.
I would say that if the potential exists for a large data set, then go the paging route.
You can still set a MAX that you do not want them to go over.
E.g., SO uses page sizes of 15, 30, 50...
One thing to be wary of when working with home-grown row-wrapper classes like you (apparently) have, is code that makes additional calls to the database without you (the developer) being aware of it. For example, you might call a method that returns a collection of Person objects and think that the only thing going on under the hood is a single "SELECT * FROM PERSONS" call. In actuality, the method you're calling might iterate through the returned collection of Person objects and make additional DB calls to populate each Person's Orders collection.
As you say, one of your solutions is to not return fully-loaded objects, so you're probably aware of this potential problem. One of the reasons I tend to avoid using row wrappers is that they invariably make it difficult to tune your application and minimize the size and frequency of database traffic.
In ASP.NET I would use server-side paging, where you only retrieve the page of data the user has requested from the data store. This is opposed to retrieving the entire result set, putting it into memory and paging through it on request.
JSF, or JavaServer Faces, has widgets for chunking large result sets to the browser. It can be parameterized as you suggest. I wouldn't call it a "standard industry design pattern" by any means, but it is worth a look to see how someone else solved the problem.
When I deal with this type of issue, I usually chunk the data sent to the browser (or thin/thick client, whichever is more appropriate for your situation), since regardless of the total size of the data that meets the criteria, only a small portion is really usable in any UI at one time.
I live in a Microsoft world, so my primary environment is ASP.Net with SQL Server. Here are two articles about paging (which mention some techniques for paging through result sets) that may be helpful:
Paging through lots of data efficiently (and in an Ajax way) with ASP.NET 2.0
Efficient Data Paging with the ASP.NET 2.0 DataList Control and ObjectDataSource
Another mechanism that Microsoft has shipped lately is their idea of "Dynamic Data" - you might be able to check out the guts of this for some guidance as to how they're dealing with this issue.
I've done similar things on two different products. In one case the data source is optionally paginated: in Java, it implements a Pageable interface similar to:
public interface Pageable
{
    public void setStartIndex( int index );
    public int getStartIndex();
    public int getRowsPerPage() throws Exception;
    public void setRowsPerPage( int rowsPerPage );
}
The data source implements another method for get() of items, and the implementation of a paginated data source just returns the current page. So you can set your start index, and grab a page in your controller.
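Usage from a controller might then look like this sketch (StreetSignDataSource and the exact signature of get() are hypothetical, following the description above):

Pageable source = new StreetSignDataSource(); // a Pageable data source implementation
source.setRowsPerPage( 50 );
source.setStartIndex( page * 50 );            // jump to the requested page
Collection<StreetSign> currentPage = source.get(); // returns just the current page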
One thing to consider is caching your cursors server-side. For a web app you'll have to expire them, but they'll really help performance-wise.
The Fedora digital repository project returns a maximum number of results along with a result-set-id. You then get the rest of the results by asking for the next chunk, supplying the result-set-id in the subsequent query. It works OK as long as you don't want to do any searching or sorting outside of the query.
For the data retrieval layer, the standard design pattern is to have two method interfaces, one for everything and one for a block size.
If you wish, you can layer components that do paging on top of it.
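A sketch of that two-method interface, reusing the street-sign example from the question (names are hypothetical):

public interface StreetSignRetriever
{
    public Collection<StreetSign> getAll();                                // everything
    public Collection<StreetSign> getBlock( int startIndex, int blockSize ); // one block at a time
}

A paging component can then be layered on top of getBlock() without the retriever knowing anything about pages.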