Java: ArrayList, String manipulation and Parsing

Java: ArrayList, String manipulation and Parsing - java

I am writing an application that stores references for books, journals, sites and so on. I mean I have already done most.
What I want is a suggestion regarding what is the best way to implement above specs?
What format text file should I use to store the library? Not file type but format. I am using simple text file at the moment. But planning to implement format as in below.
<book><Art of Human Hacking><New York><2011><1>
<journal><Achieving Maximum Speed In 802.11n><London><2009><260-265>
1st tag <book> and <journal> are type identifier. I have used ArrayList. Should I use multi dimensional ArrayList and store each item like below?
[[Art of Human Hacking,New York,2011,1][Achieving Maximum Speed In 802.11n,London,2009,260-265]]
I have used StringTokenizer and I cannot differentiate spaces. How do I fix this?
I have already implemented all features including listing all, listing unique, searching, editing, removing, adding. But everything is done to content without spaces.

You should use an existing serializer instead of writing your own, unless the project forbids it.
For compatability and human readability, CSV would be your best bet. Use an existing CSV parser to get your escaping correct (not that hard to write yourself, but difficult enough to warrant using an existing parser to avoid bugs). Google turned up: http://sourceforge.net/projects/javacsv/
If human editing is not a priority, then JSON is also a good format (it is human readable, but not quite as simple as CSV, and won't display in excel for instance).
Then you have binary protocols, such as native Java serialization, or even Google protocol buffers. These may be faster, but obviously lose the ability to view the data store and would probably complicate your debugging.

Related

What's the right way to produce a XML content in Java?

I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java I'm still a little bit confused. The XML content is written with String concatenation and that seems to me wrong. It is the same to concatenate Strings to produce a JSON object instead of using a JSONObject class. That's the way the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research I think that almost everything I've read pointed to SAX as the best solution but I never really found something really useful to read about it, almost everything states that we have to create a handler and override startElement, endElement and characters methods.
I don't have to serialize the XML file in hard disk or database or anything else, I just need to pass its content to a webservice.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change node texts? My XML without the files would be like that:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template Engine since I don't actually have to serialize the XML and that's a approach that I really like because I've already worked with this framework and it's a really simple framework.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.

It all depends on context. There is no single "right way".
The biggest issues with concatenation is the combination of escaping the XML in to a String constant (which is fiddly), but also escaping the values that you can using so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of the other several libraries may be quite effective. It helps keep the template "native" (you don't have to wrap it in "'s and escape it), plus it keeps the XML out of your code, but easily lets you stamp in the parts that you need to do.

I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html

The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant MemoryIf you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is know as Document Object Model (DOM). One better-known implementation of this approach is Apache Xerces. There are other good implementations as well.
Scarce MemoryIf you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.

Is it possible to do this type of search in Java

I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily did with ftp4j and document name list), but now we need to sort through the data from the server. The client is doing work in Contracts and wants us to pull out relevant information such as: Licensor, Licensee, Product, Agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching it but I would have no idea how to pull out information from a paragraph such as the licensor and restrictions on the agreement. These are not hashes but instead are just long contracts. Even if I were to search for 'Licensor' it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are html, and I've even seen some that were as bad as being a scanned image in a pdf.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.

I'll do my best to give you some information, as this is not my area of expertise. I would highly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, python could be pretty useful. Javascript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this;
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. This can be extrapolated to other instances of searching.
For getting text from an image, try using the API's on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, PDF is just text, so you should be able to search through it using a regex most likely. That would be my method of attack, or possibly using string.split() and make a string buffer that you can append to.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!

Apache tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text as in "natural language recognition". Thats a field where; despite some advances have been made (by dedicated research teams, spending many person years!); computers still fail pretty bad (heck even humans fail at it, sometimes).
With the number of documents you mentioned (1000's), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have less misclassifications.

You can use tika for text extraction. If there is a fixed pattern, you can extract information using regex or xpath queries. Other solution is to use Solr as shown in this video.You don't need solr but watch the video to get idea.

Text file to string matrix java

I am making an auto chat client like Cleverbot for school. I have everything working, but I need a way to make a knowledge base of responses. I was going to make a matrix with all the responses that I need the bot to say, but I think it would be hard to edit the code every time I want to add a responses to the bot. This is the code that I have for the knowledge base matrix:
`String[][] Database={
{"hi","hello","howdy","hey"},//possible user input
{"hi","hello","hey"},//response
{"how are you", "how r u", "how r you", "how are u"},
{"good","doing well"}`
How would I make a matrix like this from a text file? Is there a better way than reading from a text file to deal with this?

You could...
Use a properties file
The properties file is something that can easily be read into (and stored from, but you're not interested in that) Java. The class java.util.Properties can make that easier, but it's fairly simple to load it and then you access it like a Map.
hello.input=hi,hello,howdy,hey
hello.output=hi,hello,hey
Note the matching formats there. This has its own set of problems and challenges to work with, but it lets you easily pull things in to and out of properties files.
Store it in JSON
Lots of things use JSON for a serialization format. And thus, there are lots of libraries that you can use to read and store from it. It would again make some things easier and have its own set of challenges.
{
"greeting":{
"input":["hi","hello","howdy","hey"],
"output":["hi","hello","hey"]
}
}
Something like that. And then again, you read this and store it into your data structures. You could store JSON in a number of places such as document databases (like couch) which would make for easy updates, changes, and access... given you're running that database.
Which brings us to...
Embedded databases
There are lots of databases that you can stick right in your application and access it like a database. Nice queries, proper relationships between objects. There are lots of advantages to using a database when you actually want a database rather than hobbling strings together and doing all the work yourself.
Custom serialization
You could create a class (instead of a 2d array) and then store the data in a class (in which it might be a 2d array, but that's an implementation detail). At this point, you could implement Serializable and write the writeObject and readObject methods and store the data somehow in a file which you could then read back into the object directly. If you have the administration ability of adding new things as part of this application (or another that uses the same class) you could forgo the human readable aspect of it and use the admin tool (that you write) to update the object.
Lots of others
This is just the tip of the iceberg. There are lots of ways to go about this.
P.S.
Please change the name of the variable from Database to something in lower case that better describes it such as input2output or the like. Upper case names are typically reserved for class names (unless its all upper case, in which case it's a final static field)

A common solution would be to dump the data in to a properties file, and then load it with the standard Properties.load(...) method.
Once you have your data like that, you can then access the data by a map-like interface.
You could find different ways of storing the data in the file like:
userinput=hi,hello,howdy,hey
response=hi,hello,hey
...
Then, when you read the file, you can split the values on the comma:
String[] expectHello = properties.getProperty("userinput").split(",");

Create an item and bag handler for a text-based RPG

For practice, I'm creating a text-based RPG using Java. I'm currently using .properties files to handle character info. I understand that YAML might be a better option, but I'm not quite sure how to implement it. Using properties, it would be easy to create an inventory handler (slot1, slot2, etc.), but creating items and reading the slots for item IDs is a little beyond me.
Could I get some assistance?
To further elaborate, I'd like to create a system with three types of items: items to be used on the environment, items to be held in hand (like weapons or shields), and items to be worn.

Don't try parsing YAML yourself(unless you want to practice parsing, or to write a parsing library). Instead, use a YAML library - you can find some for Java at the official YAML website(I don't want to recommend any because I've never done YAML in Java).
Anyways, unless the serialized data needs to be read and edited by hand frequently, I suggest you use JSON. YAML's biggest advantage is it's readability and editability(yea, JSON and XML can be read and edited by humans as well, but not as neatly and elegantly as YAML). If you don't need that advantage, JSON is better because it has excellent, renowned libraries - like GSON(From a quick glance, YAML's libraries don't seem nearly as good...)
Whatever you do - don't go with XML.

Palm Database (PDB) files in Java?

Has anybody written any classes for reading and writing Palm Database (PDB) files in Java? (I mean on a server, not on the Palm device itself.) I tried to google, but all I got were Protein Data Bank references.
I wrote a Perl program that does it using Palm::PDB.pm, but I want to turn it into a servlet for a GWT app.

The jSyncManager project at http://www.jsyncmanager.org/ is under the LGPL and includes classes to read and write PDB files -- look in jSyncManager/API/Protocol/Util/DLPDatabase.java in its source code. It looks like the core code you need from this could be isolated from the rest of the library with a little effort.

There are a few ways that you can go about this;
Easiest but slowest: Find a perl-> java bridge. This will not be quick, but it will work and it should involve the least amount of work.
Find a C++/C# implementation that you have the source to and convert it (this should be the fastest solution)
Find a Java reader ... there seems to be a few listed under google... however I do not have any experience with these.

Depending on what your intended usage is, you might look into writing a simple reader yourself. The format is pretty simple and you only need to handle a couple of simple fields to parse it.
Basically there is a header for the entire file which has a 2 byte integer at the end which specifies the number of record. So just skip your way through the bytes for all the other fields in the header and then read the last field which is the number of records in the file. Be aware that the PDB format writes integers with most significant byte first.
Following this, there will be a record header for each record, the first field of which is the actual offset into the file for the record itself. Again, be aware of the byte order.
So, now you have the offsets into the file for each record in the file, which should make it very easy to read the actual records as long as you know the format of these for the type of PDB file you are trying to read.
Wikipedia has a nice overview of the header formats.

Maybe JPilot can help? They must have a lot of Java code dealing with Palm OS data.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.