The current Java API documentation for Elasticsearch does not say anything about creating an index template.
I know I can create an index template using CRUD calls, but my Elasticsearch index is going to grow depending on the data I get, and the data I have right now may change. So instead of manually creating an index and a template, I want to know if it can be done by writing code in Java.
You can use the IndicesAdminClient to create a template:
node.client().admin().indices().putTemplate(
    new PutIndexTemplateRequest("templatename").source(templateString)
);
PutIndexTemplateRequest has other methods to programmatically build the template if you'd prefer to build it as a Java map, etc.
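For completeness, a rough end-to-end sketch (transport/node-client era API; the source(...) overloads and the template JSON keys vary between Elasticsearch versions, so treat this as an outline rather than copy-paste code):

// Sketch only: older versions use "template" for the index-name pattern,
// newer ones use "index_patterns"; the settings shown are just an example.
String templateJson = "{"
        + "\"template\": \"logs-*\","
        + "\"settings\": { \"number_of_shards\": 1 }"
        + "}";

node.client().admin().indices()
        .putTemplate(new PutIndexTemplateRequest("templatename").source(templateJson))
        .actionGet(); // block until the template is acknowledged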
I am implementing the qsjava application from DocuSign. As part of the application, it takes the PDF file from a local folder, i.e., World_Wide_Corp_lorem.pdf. I want to take that file from my account link instead.
I have created the template and added standard fields like a text box.
My question is that in the API link, i.e., https://demo.docusign.net/restapi/v2.1/accounts/accountid/templates/teplateid, I am not getting the standard field that I have added to the template. From which API will I get it?
Once I get that field, how can I read my account's template via the API and use that template in my application?
I'll answer your question, but I don't think that's what you are trying to do.
To retrieve the bits of documents inside templates, you can use this endpoint:
GET /restapi/v2.1/accounts/{accountId}/templates/{templateId}/documents/{documentId}
You need to know the documentId of your document, which is typically a number; if you set it yourself, you can use "1", "2", etc.
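If all you need is to fetch those bits from Java, a plain HTTP call is enough. A minimal sketch using java.net.http (Java 11+); it assumes you already have an OAuth access token, and the ids below are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class DownloadTemplateDocument {
    public static void main(String[] args) throws Exception {
        String accessToken = "YOUR_ACCESS_TOKEN";
        String accountId   = "YOUR_ACCOUNT_ID";
        String templateId  = "YOUR_TEMPLATE_ID";
        String documentId  = "1"; // the id you assigned when adding the document

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://demo.docusign.net/restapi/v2.1/accounts/"
                        + accountId + "/templates/" + templateId
                        + "/documents/" + documentId))
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();

        // The response body is the raw PDF bytes of the template document.
        HttpClient.newHttpClient().send(request,
                HttpResponse.BodyHandlers.ofFile(Path.of("template_document.pdf")));
    }
}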
Now, I wonder if what you actually want is to use this template to have someone sign an envelope created from that template? That doesn't require you to download the bits.
See this code example that shows how to do this with your template if that's what you're after.
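If that is indeed the goal, the core of it is a single POST to the envelopes endpoint that references your template. A rough sketch with plain java.net.http (the token and ids are placeholders, and the roleName must match the role defined on your template; the linked example does the same thing through the official SDK):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SendFromTemplate {
    public static void main(String[] args) throws Exception {
        String accessToken = "YOUR_ACCESS_TOKEN";
        String accountId   = "YOUR_ACCOUNT_ID";
        String templateId  = "YOUR_TEMPLATE_ID";

        // Envelope definition that reuses the template and assigns a signer to its role
        String body = "{"
                + "\"templateId\": \"" + templateId + "\","
                + "\"templateRoles\": [{"
                + "\"email\": \"signer@example.com\","
                + "\"name\": \"Jane Signer\","
                + "\"roleName\": \"signer\""
                + "}],"
                + "\"status\": \"sent\""
                + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://demo.docusign.net/restapi/v2.1/accounts/"
                        + accountId + "/envelopes"))
                .header("Authorization", "Bearer " + accessToken)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // contains the new envelopeId on success
    }
}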
Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the rules for extraction are hardcoded (not the tags or things like that, but the loops, nested elements, etc.).
For instance, one common task is as follows:
Obtain the table with ID X. If it doesn't exist, additional mechanisms to find the info may be triggered.
Find a row which contains some info. Usually the match is a regexp against a specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header)
The way I'm currently doing so is:
Query to get the body of the first table with id X (X is in a config file). Some websites on my list are buggy and duplicate that id on elements other than tables -.-
Iterate over the interesting cells, executing the regexp on cell.text() (the regexp is in the config file)
Get the parent row of the matching cells, and obtain the cell I need from that row (the identifier of the row is in the config file)
Having all this mostly hardcoded (except column names, table ids, etc.) has the benefit of being easy to implement and more efficient than a generic parser. However, it is less configurable, and some changes in the target websites force me to touch code, which makes it harder to delegate the task.
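To make that concrete, the whole routine is roughly the following (shown here with jsoup purely for illustration; in the real code the table id, regexp and column indices come from the config file):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Pattern;

public class TableRuleExtractor {
    // tableId, rowPattern and the column indices stand in for values from the config file
    public static String extract(String html, String tableId,
                                 Pattern rowPattern, int matchCol, int valueCol) {
        Document doc = Jsoup.parse(html);
        // "table#id" guards against buggy pages that reuse the id on non-table elements
        Element table = doc.selectFirst("table#" + tableId);
        if (table == null) {
            return null; // this is where the additional lookup mechanisms would kick in
        }
        for (Element row : table.select("tr")) {
            Elements cells = row.select("td");
            if (cells.size() > Math.max(matchCol, valueCol)
                    && rowPattern.matcher(cells.get(matchCol).text()).find()) {
                return cells.get(valueCol).text(); // the data from the sibling column
            }
        }
        return null;
    }
}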
Question
Is there any language (preferably with a Java implementation available) which allows consistently defining rules for extractions like those? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending that which would let a non-programmer maintainer add/modify rules on demand.
I would accept a Nutch-based answer if there is one, as we're considering migrating our crawlers to Nutch, although I'd prefer a generic Java solution.
I was thinking about writing a parser generator and creating my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.
I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, that brought the advantage that it's very flexible and an easy crawling framework to build upon. Things like delays between requests, HTTP headers, language, etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd generation; the 2nd one is still running but not as scalable), I used JSON files to define the XPath/CSS rules for every page. So on starting my crawler, I loaded the JSON file for the specific page currently being crawled, and a generic crawler knew what to extract based on the loaded JSON file.
This approach isn't easily scalable since one config file per domain has to be created.
Currently, I'm still using Scrapy, with a starting list of 700 domains to crawl, and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of those tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script, style, etc.
Then you can extract, for instance, all of the text.
Or you could focus on tables only first, extract all tables into dicts, and then analyze them with regexes or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.
Let me just start by saying that this is a soft question.
I am rather new to application development, which is why I'm asking a question without presenting you with any actual code. I know the basics of Java coding, and I was wondering if anyone could enlighten me on the following topic:
Say I have an external website, Craigslist, or some other site that allows me to search through products/services/results manually by typing a query into a search box somewhere on the page. The trouble is that there is no API for this site for me to use.
However I do know that http://sfbay.craigslist.org/search/sss?query=QUERYHERE&sort=rel points me to a list of results, where QUERYHERE is replaced by what I'm looking for.
What I'm wondering here is: is it possible to store these results in an Array (or List or some form of Collection) in Java?
Is there perhaps some library or external tool that allows me to specify a query to search for, have it pasted into a search link, perform the search, and fill an Array with the results?
Or is what I am describing impossible without an API?
It depends. If the website can return the result as XML or JSON (usually with a .xml or .json at the end of the URL), you can parse it easily with a DOM parser for XML in Java, or download and use a JSON library to parse the JSON.
Otherwise you will receive HTML, i.e., the page a user would see in a browser. You can then try to parse it as XML, but you will have a lot of work mapping all the fields in the HTML to get the list you want.
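If you end up in the HTML case, a small scraping library such as jsoup handles both the fetching and the parsing into a Collection. A rough sketch (the CSS selector is a guess at Craigslist's markup, not something the site guarantees, so expect to adjust it):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CraigslistSearch {
    public static List<String> search(String query) throws Exception {
        String url = "http://sfbay.craigslist.org/search/sss?sort=rel&query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);

        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // some sites reject the default user agent
                .get();

        List<String> titles = new ArrayList<>();
        for (Element link : doc.select("a.result-title")) { // selector is an assumption
            titles.add(link.text());
        }
        return titles;
    }
}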
My problem is that I am going to develop a site where everyone uploads doc files, txt files, etc. I need a component which actually parses the files for some keywords and maintains an index of them. That index should also be updated based on structured data, such as whether a document is actively viewable and so forth. When another user tries to look up the list of documents based on some keyword and some structured data as mentioned earlier, the user should find the list quickly. It should also support multiple languages. We have an algorithm in place, but we need an open source API for reading the files and indexing them, together with unstructured data, based on keywords. Can anyone help with this?
Lucene with Solr is the best open source solution out there.
This is not a trivial task, so why reinvent when other people have already done that: try Apache Lucene.
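To give a sense of what that looks like, a minimal Lucene sketch (Lucene 8.x-style API; extracting text from .doc/.pdf files, e.g. with Apache Tika, is assumed to happen before this step):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one uploaded document's extracted text
        try (FSDirectory dir = FSDirectory.open(Paths.get("index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("filename", "report.txt", Field.Store.YES));
            doc.add(new TextField("content", "extracted file text goes here", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search it by keyword
        try (FSDirectory dir = FSDirectory.open(Paths.get("index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("content", analyzer).parse("text"), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("filename"));
            }
        }
    }
}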
I am new to IR techniques.
I am looking for a Java-based API or tool that does the following:
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDF per term in a document, maybe use this method, without ever touching Lucene.
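For illustration, a toy computation of those per-term scores (raw term frequency times log(N/df); real implementations add smoothing and normalization):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfExample {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"),
                Arrays.asList("the", "cat", "ran"));

        // document frequency: in how many documents each term appears
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            doc.stream().distinct().forEach(t -> df.merge(t, 1, Integer::sum));
        }

        int docId = 0; // score every term of the first document
        for (String term : docs.get(docId)) {
            long tf = docs.get(docId).stream().filter(term::equals).count();
            double idf = Math.log((double) docs.size() / df.get(term));
            System.out.printf("%s -> tf-idf %.3f in doc %d%n", term, tf * idf, docId);
        }
    }
}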
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.