Find most common words in SQL - Java

I have a new problem. I have a database with a column that contains a wide variety of text. Is there any way I can get SQL to tell me which are the 10 most common words used in these fields? As an example:
1 I am coming home a bit late today.
2 Train is running late.
3 What is the train schedule like today?
4 Snow is really bad right now.
And the ideal output would be:
is: 3
late: 2
train: 2
today: 2
If it is not possible to do it with SQL, what else would you suggest I look into to get this information?

This might technically be doable in SQL, but it will be painful and very slow once you have more rows in your database.
The problem you are describing is a perfect use case for an indexing engine though, such as Lucene (I used this one as an example since your question originally contained the tag 'java' before being edited).

One option is to use a table-valued split function that returns each word as a row, count the words, and sort them by count in descending order.
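If you are open to doing the counting in application code instead, a rough Java sketch is to pull the column with JDBC and tally words in a HashMap. The connection URL, table name "messages", and column name "body" below are placeholders, not from your schema:

import java.sql.*;
import java.util.*;

public class TopWords {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; adjust for your database.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT body FROM messages")) {

            Map<String, Integer> counts = new HashMap<>();
            while (rs.next()) {
                // Lower-case the text and split on anything that is not a letter or digit.
                for (String word : rs.getString(1).toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }

            // Print the 10 most common words.
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .limit(10)
                  .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        }
    }
}

This keeps all counts in memory, which is fine until the table gets large; at that point an indexing engine like the Lucene suggestion above becomes more attractive.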

Related

What is a good way to remove duplicates?

I have a varchar column. It contains values separated by semicolon (;).
For example, it looks like
10;20;21;17;20;21;22;
It's not always 7 elements. It could contain anywhere from around 30 to 70. The reason they designed it this way is that the values are actually genome segments, and it makes sense to enter or retrieve them collectively.
I need to remove records with duplicate values in this column, so if I see another record with the same value as above, I need to remove it.
I also need to remove a record if it contains the same values as another record. For example, I need to remove
10;;21;17;20;21;22;
because it's the same as the first but it doesn't have the second value, 20. If it's more complete than the first, I will remove the first one instead.
1;2;3;4;5;6;7; and 1;2;3;4;5;6;7;8; are dups and I'm taking the 2nd one because it's more complete. 1;2;3;4;5;6;;7 is also a duplicate. In this case, if they have 13 or more matched numbers and no mismatch, we will merge them so it becomes a single value 1;2;3;4;5;6;7;7;.
I can scan each record in Java, but I'm afraid that it will be complicated and time consuming, given that the table contains millions of records. I was wondering if it's doable in Oracle itself.
My final goal is to calculate the frequency with which those numbers occur. For instance, if number 10 appears 5 out of 100 times, it will be 5%. The calculation will be simple. However, I can't calculate this unless I make sure there are no duplicates in the table in the first place.
Note: This answer is a placeholder because the question looks in danger of closure but I think it will be worthy of an answer once all the rules are established.
It's trivial to remove the exact duplicates:
delete from your_table y
 where y.rowid not in ( select min(x.rowid)
                          from your_table x
                         group by x.genome_string )
The hard part is identifying duplicate strings where some entries match exactly and others have missing (null) values. Merging rows makes the logic even more convoluted.
The SQL below is a solution ONLY IF:
1;2;3;4;5; is a more complete form of 1;2;;5
All your entries end with ;
The query was tested using SQLite, so it may need some changes for Oracle.
It expects a table "TEST" with a column "VALUE":
SELECT DISTINCT VALUE
  FROM TEST AS ORIGIN_TEST
 WHERE NOT EXISTS (SELECT VALUE
                     FROM TEST
                    WHERE VALUE <> ORIGIN_TEST.VALUE
                      AND (VALUE LIKE replace(ORIGIN_TEST.VALUE, ';;', ';_%;') OR
                           VALUE LIKE ORIGIN_TEST.VALUE || '_%;'))
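For the final goal (the percentage of rows in which each number occurs), the counting itself is easy to do client-side once the duplicates are gone. A small Java sketch, assuming each deduplicated row value is available as a string like "10;20;21;17;20;21;22;" and assuming "frequency" means the fraction of rows containing the number at least once:

import java.util.*;

public class SegmentFrequency {
    public static Map<String, Double> frequencies(List<String> rows) {
        Map<String, Integer> rowCounts = new HashMap<>();
        for (String row : rows) {
            Set<String> seen = new HashSet<>();
            for (String segment : row.split(";")) {
                // Skip the empty slots produced by ';;' and count each number once per row.
                if (!segment.isEmpty() && seen.add(segment)) {
                    rowCounts.merge(segment, 1, Integer::sum);
                }
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : rowCounts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / rows.size());
        }
        return result;
    }
}

With 5 rows out of 100 containing "10", frequencies(...).get("10") returns 5.0.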

Neo4j - Java heap space. Wrong query or settings?

I have a problem with Neo4j.
I don't know if problem is my query or something else.
Intro
I have to build an application that stores bus/train routes.
This is my schema:
Nodes:
Organization: a company that has routes, buses, etc.
Route: a bus route, like Paris - Berlin.
Vehicle (Bus in this case): a physical bus with a unique license plate.
Stop: a point on a map with latitude and longitude.
Important Relationships:
NEXT: This is a really important relationship.
NEXT relationships contain these properties:
startHour
startMinutes
endHour
endMinutes
dayOfWeek (from 0 to 6: Sun, Mon, etc.)
vehicleId
Problem
My query is:
MATCH (s1:Stop {id: {departureStopId}}), (s2:Stop {id: {arrivalStopId}})
OPTIONAL MATCH (s1)-[nexts:NEXT*]->(s2)
WHERE ALL(i in nexts WHERE toInt(i.dayOfWeek) = {dayOfWeek} AND toInt(i.startHour) >= {hour})
RETURN nexts
LIMIT 10
For example: I want to find all NEXT relationships where dayOfWeek is Sunday (0) and startHour > 11.
After that I usually parse and validate the final object on my Node.js backend.
This worked at the start, with 1k relationships.
Now I have 10k relationships and my query either hits a TIMEOUT or takes around 30s... too much time.
I have no idea how to solve this.
I use Neo4j with Docker and I tried to read the settings docs, but I have no idea how Java works.
Can you help me, guys?
UPDATE
Thank you all guys!
For now I solved it with "allShortestPaths", but I think I will rename all relationships (like Michael Hunger said).
Have you tried:
MATCH p=allShortestPaths((s1:Stop {id: {departureStopId}})-[:NEXT*]-> (s2:Stop {id: {arrivalStopId}}) )
WHERE ALL(i in RELS(p) WHERE toInt(i.dayOfWeek) = {dayOfWeek} AND toInt(i.startHour) >= {hour})
RETURN rels(p) as nexts
LIMIT 10
This should use the fast shortest path algorithm because:
Planning shortest paths in Cypher can lead to different query plans depending on the predicates that need to be evaluated. Internally, Neo4j will use a fast bidirectional breadth-first search algorithm if the predicates can be evaluated whilst searching for the path.
See https://neo4j.com/docs/developer-manual/current/cypher/execution-plans/shortestpath-planning/#_shortest_path_with_fast_algorithm for more details.
Can you share your query profile?
I presume you have a constraint on :Stop(id).
I would use shortest path or Dijkstra with costs instead of OPTIONAL MATCH.
OPTIONAL MATCH will try to find ALL such paths, which can be hundreds of millions, and filter them as it goes.
And it might make sense to group your NEXT relationships by day of week, e.g. :NEXT_MO, :NEXT_THU, so you only look at 1/7th of the data.
It's not settings; it's the fact that your query must visit each and every node in the graph in order to satisfy the query.
The problem would show itself in a relational database when a TABLE SCAN had to be used instead of an index.
I think the solution is to add buckets for hours, like you already have for days. If you have to have minutes, make 96 fifteen-minute buckets to cover a day. That will give the query optimizer its best chance.
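To make the bucket idea concrete (this is just an illustration, not code from the question): with 96 fifteen-minute buckets per day, the bucket index for a departure time is a one-liner, and relationships can then be grouped or labelled by that index.

// Returns a bucket index in the range 0..95, assuming 15-minute buckets over a 24-hour day.
static int bucketOf(int hour, int minute) {
    return hour * 4 + minute / 15;
}
// e.g. bucketOf(11, 30) == 46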

How to implement database engine UPDATE command

I am developing a simple database engine in Java (using text files as tables) and I have to implement code for CRUD operations. I have successfully written code for CREATE and INSERT commands already. Now I want to continue with UPDATE which should look like this:
UPDATE table-name SET attribute-name=literal {,attribute-name=literal} WHERE condition
But I have an issue here: I am stuck on the "condition" part. How can I approach the implementation of a condition? (WHERE attr1 = something AND attr2 >= something OR ...) I would very much appreciate your feedback.
Best regards.
The WHERE clause is one of the most important components of any database system. To find all the records satisfying the conditions in the WHERE part, you should build proper indexes for the columns included in the condition.
For example, given WHERE attr1 = something AND attr2 >= something OR ..., the columns attr1 and attr2 should be indexed; otherwise the query will take a terribly long time to execute.
Index techniques include hash indexes (for key-value lookups), B+ tree indexes, and their derived implementations.
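As for evaluating the condition itself once candidate rows have been found, one simple approach (a sketch only; the class and method names are illustrative, not part of the asker's engine) is to parse the WHERE text into a small tree of predicate objects and evaluate it against each row:

import java.util.*;
import java.util.function.BiPredicate;

// A row is just a map from attribute name to value (all stored as strings here).
interface Condition {
    boolean matches(Map<String, String> row);
}

// Leaf node: "attribute OP literal", e.g. attr2 >= 10
class Comparison implements Condition {
    private final String attribute;
    private final String literal;
    private final BiPredicate<String, String> op;

    Comparison(String attribute, String op, String literal) {
        this.attribute = attribute;
        this.literal = literal;
        switch (op) {
            case "=":  this.op = String::equals; break;
            case ">=": this.op = (a, b) -> a.compareTo(b) >= 0; break;
            case "<=": this.op = (a, b) -> a.compareTo(b) <= 0; break;
            case ">":  this.op = (a, b) -> a.compareTo(b) > 0;  break;
            case "<":  this.op = (a, b) -> a.compareTo(b) < 0;  break;
            default: throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }

    public boolean matches(Map<String, String> row) {
        String value = row.get(attribute);
        return value != null && op.test(value, literal);
    }
}

// Inner node: left AND right, or left OR right.
class Logical implements Condition {
    private final Condition left, right;
    private final boolean isAnd;

    Logical(Condition left, String op, Condition right) {
        this.left = left;
        this.right = right;
        this.isAnd = op.equalsIgnoreCase("AND");
    }

    public boolean matches(Map<String, String> row) {
        return isAnd ? left.matches(row) && right.matches(row)
                     : left.matches(row) || right.matches(row);
    }
}

A clause such as attr1 = x AND attr2 >= 10 then becomes new Logical(new Comparison("attr1", "=", "x"), "AND", new Comparison("attr2", ">=", "10")), and the UPDATE loop applies the SET part only to rows for which matches(row) returns true. Note that the comparisons here are lexicographic string comparisons; numeric columns would need their own comparison logic, and operator precedence between AND and OR is up to the parser you put in front of this.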

Exact Match in SOLR 5.1

I have set up Solr 5.1.0 with data imported from a MySQL database. It is working well.
But I want only exact-match or fully relevant results.
For example, for
Dancers in Mumbai
it returns all results containing "dancers + mumbai" as well as results containing only "dancers" or only "mumbai". I want only the results that contain both "dancers" and "mumbai", not the others.
This is not a complete answer, but it's the direction I'm trying to take with a similar problem. Comments are very welcome.
Step 1:
Implement multiple Solr cores: core 1 is "jobs" (dancers/lawyers/etc.), and core 2 is "cities" (mumbai/chennai/etc.).
Step 2:
Query each core for exact matches, so implement the KeywordTokenizerFactory on the relevant field to find exact matches only. This will give you all the matches across cores (e.g. jobs: dancers and cities: mumbai).
Step 3:
Perform your general query using eDisMax for a user-friendly search (e.g. searching for "dancers in mumbai" across many fields), and use the boost field to boost the jobs/cities found in the earlier query.
I would love to know if there is a better way of doing something this elaborate, but I have not found it yet. Hope it helps.
Using required terms, like: +dancers +mumbai
or a phrase query: "dancers in mumbai"
would work.
You can also set the default operator for your query to be "AND", using the q.op parameter.
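For completeness, here is roughly how setting q.op looks from Java with SolrJ (a sketch; the core name and URL are placeholders, and the exact client construction depends on your SolrJ version):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class AndSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and core name; adjust to your setup.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");

        SolrQuery query = new SolrQuery("dancers mumbai");
        query.set("q.op", "AND");   // require all terms instead of any term

        QueryResponse response = client.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc);
        }
        client.close();
    }
}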

Data Structure to represent a DFA

I was wondering, what would be the best data structure to represent a DFA?
I am looking at converting a regular expression to a DFA and make this particular functionality as a library in Java.
The main thing is that each entity in the regex carries a set of values rather than a single string value like "car". In my case, each entity would carry many properties like {car, Honda, 4x4, sedan, ...} (though I am not searching for cars; this is just an example).
Any suggestions?
If I understand your question correctly you want to have a matching/filtering library for an arbitrary regular language over an alphabet with dynamic types? Going with your car example, I'd imagine you'd want to be able to create an expression in order to match over a List where all Cars (have the color red, have between 2 and 6 Passengers and each Passenger is between 8 and 88 years of age) or (have 1 Passenger).
Coincidentally I've been looking for something like that myself (for document validation), and the closest I could get was Jing, a Java RELAX NG library. Unfortunately, the alphabet in Jing consists of XML nodes, so it didn't solve my problem. At the moment I'm attempting to write a library myself which does just this (matching against regular languages over an arbitrary type of alphabet), based on the pattern matching in Jing. If you'd like to help with this, please let me know ;).
A web search will yield some examples of DFAs in Java. However, the best representation depends on your specific application requirements; e.g. how your application is going to use the DFAs. I think you need to work this out for yourself.
I'm sure this answer won't be useful to the original question because of the data, but if anyone happens across this from Google...
DFAs and NFAs can be stored as state transition tables; you then perform a parse by moving through the table, following the links.
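To make the transition-table idea concrete, here is a minimal Java sketch of a DFA over the alphabet {a, b} (the example automaton and all names are illustrative; it accepts strings with an even number of b's):

public class Dfa {
    private final int[][] transitions; // transitions[state][symbol] -> next state, -1 = reject
    private final boolean[] accepting;
    private final int start;

    public Dfa(int[][] transitions, boolean[] accepting, int start) {
        this.transitions = transitions;
        this.accepting = accepting;
        this.start = start;
    }

    public boolean matches(String input) {
        int state = start;
        for (char c : input.toCharArray()) {
            if (c < 'a' || c > 'b') return false;   // symbol outside the alphabet
            state = transitions[state][c - 'a'];
            if (state < 0) return false;            // dead state
        }
        return accepting[state];
    }

    public static void main(String[] args) {
        // Two states: 0 = even number of b's (accepting), 1 = odd number of b's.
        int[][] t = {
            {0, 1},   // from state 0: 'a' -> 0, 'b' -> 1
            {1, 0}    // from state 1: 'a' -> 1, 'b' -> 0
        };
        Dfa dfa = new Dfa(t, new boolean[]{true, false}, 0);
        System.out.println(dfa.matches("abba"));  // true
        System.out.println(dfa.matches("ab"));    // false
    }
}

For the original question, where each symbol is a bundle of properties rather than a single character, the inner array index would be replaced by something like a Map from symbol (or a predicate over the entity's properties) to the next state; the table-driven structure stays the same.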
