MarkLogic - How to switch Language of Suggest - java

I want to use the suggestion feature of the MarkLogic database together with the Java Client API.
To make a suggest call, I need a field (or something similar) that serves as the suggestion source. The next step is to create query options that reference this suggestion source. The last step is to call:
SuggestionDefinition def = marklogicClient.newQueryManager().newSuggestionDefinition();
def.setLimit(10);
def.setOptionsName("my-query-options");
def.setStringCriteria("Test");
//setup lang?
The question is: how do I switch the language?
If my frontend can be switched between German and English, I have to switch the search/suggest language as well. To do this I have to switch the collation, but how?
The query options are static after upload, containing something like:
<default-suggestion-source>
<word collation="http://marklogic.com/collation/de">
<field name="my-suggest" />
</word>
</default-suggestion-source>

Perhaps you are looking for the use of dynamic query options as defined in the java API documentation:
https://docs.marklogic.com/guide/java/searches#id_76144
Furthermore, you can also register more than one query option file and use one for each language.
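A minimal sketch of the "one query options file per language" idea, assuming two option sets have already been uploaded under the hypothetical names `my-query-options-de` and `my-query-options-en` (each with its own collation). The class and method names are illustrative only; the sketch just picks the options name for the current frontend language.

```java
import java.util.Locale;
import java.util.Map;

final class SuggestOptionsSelector {
    // Hypothetical options names; each set is assumed to be uploaded
    // separately with the collation for its language.
    private static final Map<String, String> OPTIONS_BY_LANGUAGE = Map.of(
            "de", "my-query-options-de",
            "en", "my-query-options-en");

    // Fallback used when the frontend language has no dedicated option set.
    private static final String DEFAULT_OPTIONS = "my-query-options-en";

    static String optionsNameFor(Locale locale) {
        return OPTIONS_BY_LANGUAGE.getOrDefault(locale.getLanguage(), DEFAULT_OPTIONS);
    }
}
```

The result would then be passed to the call from the question, for example `def.setOptionsName(SuggestOptionsSelector.optionsNameFor(userLocale));`, where `userLocale` stands for whatever locale the frontend currently uses.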

The solution comes down to two possible (and practical) ideas:
Either create one query options file per language as suggested (with additional indexes per language), or just ignore the problem!
If the field behind the suggestion (the source) points to elements that are tagged with different xml:lang attributes, then a suggest call with, say, "books" will return only English suggestions and a call with the German "Bücher" will return only German suggestions.
The only exception is German text in an element tagged as English. This could lead to false positives.
Additional thought: Searching with a suggestion like "books" while the search language is set to German will return nothing.
Conclusion: Searching under a specific language is a complex topic. It really depends on how the user wants to search and how the application works.
P.S.: I used the second solution and am simply ignoring the problem for now.

Related

Is there an easy way to search for all LOV's in a given Application/Project?

I am working on a fairly large project. I need to find all LOVs in a single application and modify them. The application has about 4 projects, and there might be about 300 LOVs. Is there an easy way to search for these? Could I use a regex for this? Is there a way to get a data model diagram of all LOVs?
Any response is appreciated. Thanks in advance.
A LOV is defined by one or more tags depending on the kind of LOV (select one choice, combobox, input select one choice,...).
You can use any tool that can look for text inside files to search for specific tags.
As you did not tell us which framework you use, here is a sample: the tag that ADF uses for selectOneChoice:
af:selectOneChoice
So you can search the project folders for all files containing this text. As you tagged the question with JDeveloper, you can use JDeveloper's Find -> 'Find in Files...' menu option. In the dialog, enter where to look (scope) and what to look for (Search Text). There are more options you can use to get faster and better results. Click the '?' button for more help on how to use this feature.
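If you prefer a scripted search over the IDE dialog, the same idea can be sketched in plain Java. This is only an illustration: the class name `TagFinder` is made up, and the file suffixes you pass in (for example `.jspx`) are an assumption about where your page definitions live.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TagFinder {
    // Walks the folder tree below root and returns every regular file that
    // ends with one of the given suffixes and contains the search text.
    static List<Path> filesContaining(Path root, String needle, String... suffixes) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> Arrays.stream(suffixes).anyMatch(s -> p.toString().endsWith(s)))
                    .filter(p -> contains(p, needle))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file as UTF-8 text; fine for page-sized source files.
    private static boolean contains(Path file, String needle) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).contains(needle);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `TagFinder.filesContaining(projectRoot, "af:selectOneChoice", ".jspx")` once per tag of interest gives you a list of files to review.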

Configurable HTML information extraction

Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the rules for extraction are hardcoded (not the tags or things like that, but loops, nested elements, etc.).
For instance, one common task is as follows:
1. Obtain the table with ID X. If it doesn't exist, additional fallback mechanisms for finding the info may be triggered.
2. Find a row which contains some info. Usually the match is a regexp against a specific column.
3. Retrieve the data in a different column (usually marked in the td, or previously detected in the header).
The way I'm currently doing so is:
1. Query to get the body of the first table with id X (X is in the config file). Some websites on my list are buggy and duplicate that id on elements other than the table -.-
2. Iterate over the interesting cells, executing a regexp on cell.text() (the regexp is in the config file).
3. Get the parent row of the matching cells, and obtain the cell I need from the row (the identifier of the row is in the config file).
Having all this hardcoded for the most part (except column names, table ids, etc.) gives me the benefit of being easy to implement and more efficient than a generic parser. However, it is less configurable, and some changes in the target websites force me to deal with code, which makes it harder to delegate the task.
Question
Is there any language (preferably with a Java implementation available) which allows me to consistently define rules for extractions like those? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending them that would allow a non-programmer maintainer to add or modify rules on demand.
I would accept a Nutch-based answer, if there's one, as we're studying migrating our crawlers to nutch, although, I'd prefer a generic java solution.
I was thinking about writing a Parser generator and create my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.
I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, this had the advantage of being a very flexible and easy crawling framework to build upon. Things like delays between requests, the HTTP header language, etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd gen; the 2nd one is still running but is not as scalable) I used JSON files to hold the XPath / CSS rules for every page. When starting my crawler, I loaded the JSON file for the specific page being crawled, and a generic crawler knew what to extract based on the loaded JSON file.
This approach isn't easily scalable since one config file per domain has to be created.
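Since the question asks for Java, here is a rough sketch of the "rules in a config file, generic engine in code" idea in that language. It is not my actual crawler code: the class `RowRule` and the keys `matchColumn`, `pattern`, and `valueColumn` are made up for illustration, a `.properties` text stands in for the JSON file, and the table rows are assumed to have already been pulled out of the HTML (for example with a CSS selector) as lists of cell texts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Optional;
import java.util.Properties;
import java.util.regex.Pattern;

final class RowRule {
    private final int matchColumn;   // column whose text is tested
    private final Pattern pattern;   // regex that identifies the row
    private final int valueColumn;   // column whose text is returned

    private RowRule(int matchColumn, Pattern pattern, int valueColumn) {
        this.matchColumn = matchColumn;
        this.pattern = pattern;
        this.valueColumn = valueColumn;
    }

    // Parses a rule from .properties-style text. Note that backslashes in
    // the regex must be doubled in this format.
    static RowRule fromProperties(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RowRule(
                Integer.parseInt(p.getProperty("matchColumn")),
                Pattern.compile(p.getProperty("pattern")),
                Integer.parseInt(p.getProperty("valueColumn")));
    }

    // Returns the value cell of the first row whose match cell fits the regex.
    Optional<String> apply(List<List<String>> rows) {
        int needed = Math.max(matchColumn, valueColumn);
        for (List<String> row : rows) {
            if (row.size() > needed && pattern.matcher(row.get(matchColumn)).find()) {
                return Optional.of(row.get(valueColumn));
            }
        }
        return Optional.empty();
    }
}
```

A maintainer would then only edit the rule text, while the engine stays untouched.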
Currently, I'm still using Scrapy, with a starting list of 700 Domains to crawl and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of the tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract for instance all text.
Or you'd focus first on tables only, extract all tables into dicts and can then analyze with regex or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.

Extracting webpage information based on a template in Java

Right now I use Jsoup to periodically extract certain information (not all the text) from some third-party webpages. This works fine until the HTML of a webpage changes. Such a change forces a change in the existing Java code, which is tedious because these webpages change very frequently, and it requires a programmer to fix the Java code. Here is an example of the HTML I am interested in on a webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do, I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages these HTML templates will be the input to the Java program which will find out the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes, I will just save the webpage's HTML and replace the data with these keywords and rest will be taken care by the program. For example in future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also, this kind of template can be created by a non-programmer: anyone who can edit a file.
Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.
The approach you gave is pretty much the same as Gilbert's, except for the regex part. I don't want to step into the ugly regex world. I am planning to use the template approach for many other areas apart from movie info, e.g. prices, product specs extraction, etc.
1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template: a parsing pattern that is slurped up and discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions. That is, you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as "<b/>" in a much more flexible manner, ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions. That is, the template approach IS the parsing approach using a regular expression; they are one and the same. The question is: what form should the regular expression take?
3A. If you use java hand-coding to do the parsing then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden and is "non-standard" and will be hard to maintain.
3B. If you want to use an HTML-aware parser, then jsoup is a good solution. The problem is that you need more text/regular expression handling and flexibility than jsoup seems to provide. It seems too locked into specific HTML tags and structures and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR - a form of backus-naur inspired grammar is used to control the parsing and generator code is inserted to process parsed data. Here, the parsing grammar expressions can be very powerful indeed with complex rules for how text is ordered on the page and how text fields and values relate to each other. The power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip - such as markup tags etc. And wrestling with ANTLR for the first time involves educational investment before you get productivity payback.
3D. Is there a java tool that just uses a simple template type approach to give a simple answer? Well a google search doesn't give too much hope https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
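To illustrate how the template idea and (3A) meet, here is a minimal sketch that compiles a `{KEYWORD}` template into a java.util.regex pattern. It is an assumption-laden illustration, not a finished tool: the class name `TemplateExtractor` is made up, placeholders are assumed to be upper-case names in braces, values are assumed to contain no markup, and the template's markup is matched literally apart from whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateExtractor {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Z_]+)\\}");

    private final Pattern pattern;
    private final List<String> names = new ArrayList<>();

    TemplateExtractor(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            regex.append(literal(template.substring(last, m.start())));
            // Each placeholder becomes a capturing group for text without tags.
            // Numbered groups are used because Java group names cannot
            // contain underscores.
            regex.append("\\s*([^<]*?)\\s*");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(literal(template.substring(last)));
        pattern = Pattern.compile(regex.toString());
    }

    // Quotes the static text, but lets whitespace between chunks vary.
    private static String literal(String text) {
        StringBuilder sb = new StringBuilder();
        for (String part : text.trim().split("\\s+")) {
            if (part.isEmpty()) continue;
            if (sb.length() > 0) sb.append("\\s*");
            sb.append(Pattern.quote(part));
        }
        return sb.toString();
    }

    // Returns keyword -> value for the first match, or an empty map.
    Map<String, String> extract(String html) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                out.put(names.get(i), m.group(i + 1));
            }
        }
        return out;
    }
}
```

The non-programmer still only edits the template file. The regex is an implementation detail generated from it, which is exactly why the two approaches end up being the same thing.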
Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML, just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out what selector to actually use.
How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.
If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
Score:.*?([0-9]\.[0-9]/[0-9])
(I haven't tested this.)
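A small sketch of the "list of search expressions" idea in Java, under a few assumptions: the class name `FieldRegexExtractor` is made up, each rule is a regex whose first capturing group holds the value, and the two patterns in the usage note are illustrative rather than tuned for any real site.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FieldRegexExtractor {
    // Applies every rule to the page and fails loudly when a field is
    // missing, which usually means the page layout has changed.
    static Map<String, String> extract(String html, Map<String, Pattern> rules) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (!m.find()) {
                throw new IllegalStateException("Field not found: " + rule.getKey());
            }
            out.put(rule.getKey(), m.group(1).trim());
        }
        return out;
    }
}
```

For the example page, a rule such as `Score:\s*(?:<[^>]+>\s*)*([0-9](?:\.[0-9])?/[0-9])` skips any tags between the label and the value, and `Director:\s*(?:<[^>]+>\s*)*([^<]+)` does the same for the director.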
Or you can try a different approach, using what I would call 'rules' instead of templates: for each piece of information that you need from the page, you define jQuery expression(s) that extract the text. Often, when the page change is small, the same well-written jQuery expressions will still give the same results.
Then you can use Jerry (jQuery in Java), with almost the same expressions, to fetch the text you are looking for. So it's not only about selectors; you also have other jQuery methods for walking and filtering the DOM tree.
For example, a rule for some Director text would be (in a sort of pseudo-Java-Jerry code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in the rule, spread across several lines, that for example iterate some nodes etc.
If you are OO person, each rule may be defined in its own implementation. If you are groovy person, you can even rewrite rules when needed, without recompiling your project, and still being in java. Etc.
As you see, the core idea here is to define rules for how to find your text, and not to match against patterns, as that may be fragile to minor changes: imagine if just a space were added between two divs :). In this example of mine, I've used jQuery-like syntax (actually Jerry-like syntax, since we are in Java) to define rules. This is only because jQuery is popular and simple, and known by your web developer too. In the end you can define your own syntax (depending on the parsing tool you are using): for example, you may parse the HTML into a DOM tree and then write rules, using your own helper methods, for how to traverse it to the place of interest. Jerry also gives you access to the underlying DOM tree.
Hope this helps.
I used the following approach to do something similar in a personal project of mine that generates an RSS feed from the leading real estate website in Spain.
Using this tool I found the rented place I'm currently living in ;-)
1. Get the HTML code from the page.
2. Transform the HTML into XHTML. I used a library for this step; there may be better options available today.
3. Use XPath to navigate the XHTML to the information you're interested in.
Of course every time they change the original page you will have to change the XPath expression. The other approach I can think of -semantic analysis of the original HTML source- is far, far beyond my humble skills ;-)
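The XPath step can be sketched with the JDK's built-in `javax.xml.xpath` package. This assumes the page has already been converted to well-formed XHTML; the class name `XhtmlXPath` and the sample expression in the usage note are only illustrations.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XhtmlXPath {
    // Evaluates the expression against well-formed XHTML and returns the
    // string value of the first matching node, trimmed.
    static String evaluate(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }
}
```

For instance, `XhtmlXPath.evaluate(page, "//p[strong='Director:']/text()")` selects the text that follows a `Director:` label. Keeping such expressions in a configuration file means only the expression, not the code, changes when the page does.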

Search without accents must return words with accents

I have a web app developed with Hibernate, Spring and Java, that accesses an Informix database...
Imagine you are searching for a certain record with an accent on it, like "María", but you write "Maria" in the search box... now it doesn't show any result, but it must show the "María" record, as well as any other combination like "Maríá" or "Máríá" or "Mária", etc...
How could I achieve it? Thanks in advance...
You'll need to add another column with ascii-ized strings and compare it against an ascii-ized search string, but use the primary string as a result. There is no way to convince Informix to do that for you, especially if you want it fetched from an index.
On a side note, if you had all the strings in Java memory, you could use a SortedMap with a custom, Collator-based Comparator.
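Both ideas can be sketched with the JDK alone. This is a minimal illustration with made-up names (`AccentFolding`, `fold`, `accentInsensitiveMap`): `fold` produces the ascii-ized form you would store in the extra column and apply to the search string, and the map shows the Collator-based variant for strings held in memory. The Spanish locale is an assumption.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

final class AccentFolding {
    // Decomposes accented letters (NFD) and drops the combining marks,
    // so an accented i becomes a plain i.
    static String fold(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // A sorted map whose keys are compared at PRIMARY strength, which
    // ignores accent (and case) differences between keys.
    static SortedMap<String, String> accentInsensitiveMap() {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("es"));
        collator.setStrength(Collator.PRIMARY);
        return new TreeMap<>(collator);
    }
}
```

With the extra column, the query compares `fold(searchText)` against the folded column while still returning the original, accented value. Note that PRIMARY strength also treats upper and lower case as equal.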
When you need to search a database via Hibernate and have concerns about accents, an interest in better natural language matches, or simply want to explore options like auto-suggestion or typo corrections you should look at Hibernate Search which automates most of this via a couple of annotations.

Allow to enter language specific character from keyboard

I have an application that provides a language selection option to the user.
I want to let the user enter text from the keyboard in the selected language, e.g. if I select Hindi, my application takes input in Hindi.
I am using JSF (ICEfaces) and Hibernate.
Is it possible? How?
Use a language translation JavaScript function on the onkeyup event.
You need to include an external JS file for this, such as http://www.google.com/jsapi.
Please see this for reference:
http://www.labnol.org/internet/website-translation-with-google-language-api/4367/
I hope this helps :)
Everything is possible. The question is "how much is this?"
Go to translate.google.com and see that they are able to detect the language of the text automatically. If you are able to do the same, send the text typed by the user to the server using AJAX and validate that it is written in the chosen language.
But language detection is not a simple task. It is simple if the language uses its own unique script. For example, the Georgian language (as far as I know) uses its own script, and no other languages use the same script. You cannot say the same about European languages: they all use Latin letters. In this case more sophisticated methods are required, and Google does it. BTW, you can probably utilize this translate.google facility (if they have an API). Send the typed text to Google using AJAX and see which language it detects. It is not 100% correct, but much better than what any of us could implement ourselves.
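For the simple case where the language has its own script, the server-side check can be sketched with the JDK's `Character.UnicodeScript`. The class and method names are made up for illustration, and this only tells you the script, not the language: Devanagari, for example, is used for Hindi but also for other languages.

```java
final class ScriptCheck {
    // Returns true when every letter-like code point belongs to the
    // expected script. Digits, spaces, and punctuation (script COMMON)
    // and combining marks shared between scripts (INHERITED) are ignored.
    static boolean isWrittenIn(String text, Character.UnicodeScript expected) {
        boolean sawExpected = false;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            if (script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED) {
                continue;
            }
            if (script != expected) {
                return false;
            }
            sawExpected = true;
        }
        return sawExpected;
    }
}
```

A call such as `ScriptCheck.isWrittenIn(userInput, Character.UnicodeScript.DEVANAGARI)` could then back the validation when the user has selected Hindi.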
