Filesystem-based web content - java

I plan to pull my Java web app's content from the filesystem, for the sake of simple editing. The files will most probably contain only text in a simple markup like JTexy or Markdown.
What I plan to implement is a tree-like structure keeping the content of the files.
It should be cached and eventually should handle authorization.
I am looking for something simple to use, not a full-blown CMS like OpenCms. But if it provides a simple API to access the content and keeps its dependencies small, extras such as thick-client content editors are a bonus.
Perhaps something from this list: http://java-source.net/open-source/content-managment-systems
What would you recommend?
Thanks,
Ondra
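The cached, tree-like structure described in the question could be sketched in plain Java roughly as follows. This is only a minimal sketch under assumptions: the class name FileContentStore and its methods are hypothetical, the directory hierarchy itself serves as the tree, files are assumed to be UTF-8, and authorization is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: serves text files below a root directory, cached by modification time. */
class FileContentStore {
    private static final class Entry {
        final FileTime modified;
        final String text;
        Entry(FileTime modified, String text) { this.modified = modified; this.text = text; }
    }

    private final Path root;
    private final Map<Path, Entry> cache = new ConcurrentHashMap<>();

    FileContentStore(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    /** Resolves a relative path and refuses anything that points outside the content root. */
    private Path resolve(String relativePath) throws IOException {
        Path resolved = root.resolve(relativePath).normalize();
        if (!resolved.startsWith(root)) {
            throw new IOException("Path escapes content root: " + relativePath);
        }
        return resolved;
    }

    /** Returns the raw markup stored at a relative path such as "docs/intro.md". */
    String get(String relativePath) throws IOException {
        Path file = resolve(relativePath);
        FileTime modified = Files.getLastModifiedTime(file);
        Entry cached = cache.get(file);
        // Re-read the file only when it is not cached yet or has changed on disk.
        if (cached == null || !cached.modified.equals(modified)) {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            cached = new Entry(modified, text);
            cache.put(file, cached);
        }
        return cached.text;
    }

    /** Lists the entry names directly below a relative directory ("" for the root). */
    List<String> children(String relativeDir) throws IOException {
        try (Stream<Path> entries = Files.list(resolve(relativeDir))) {
            return entries.map(p -> p.getFileName().toString())
                          .sorted()
                          .collect(Collectors.toList());
        }
    }
}
```

The returned text would still have to be rendered by whichever markup library is chosen; the sketch only covers lookup and caching.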

Why not use a simple Apache web server with an (F)CGI Perl script to convert the Markdown and mod_proxy to cache the results?
Beautiful in its simplicity!

If you do not intend to reinvent wheels then you should probably use a CMS.
As you write, OpenCms is a "full-blown CMS", but it should also be "simple-to-use" in your context.

Related

Generate HTML files using XML configuration

My goal is to assemble a static web site that has a lot of repeating code. I could use JSP includes for that purpose, but the site will be modified infrequently and will be under very heavy load, and it will use features like gzip, so I don't need the complications.
My idea is to set up a build process with a tool like Ant. That build process will concatenate all the HTML pieces, preprocess the HTML, JS, and CSS with a minifier, and finally apply gzip.
I want an XML configuration that will define the parts that need to go into every HTML page and their order.
I need advice on Ant or any similar tool: how should I approach the configuration, and are there any external tools that will help? Any suggestions are much appreciated.
XSLT is perfectly suited to transform XML into another format like HTML.
You can download Apache Xalan to give it a try. Ant has support for XSLT processing.
In the Java world, you can take a look at Apache Forrest, which does precisely that kind of thing.
In other worlds, there is also webgen, which is a competent Ruby site builder.
I also vaguely remember that there are other alternatives, but I can't recall their names.
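To make the XSLT suggestion concrete, here is a minimal sketch that uses the XSLT processor bundled with the JDK (javax.xml.transform) rather than a separate Xalan download. The page description format (a page element with ordered part children) and the stylesheet are assumptions made up for illustration, as is the class name PageAssembler.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PageAssembler {
    // Hypothetical page description: the parts of a page, in the order they should appear.
    static final String PAGE =
        "<page title='Home'><part>header</part><part>body</part><part>footer</part></page>";

    // Hypothetical stylesheet: emits one div per part, in document order.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/page'>"
      + "<html><head><title><xsl:value-of select='@title'/></title></head><body>"
      + "<xsl:for-each select='part'><div class='{.}'/></xsl:for-each>"
      + "</body></html>"
      + "</xsl:template></xsl:stylesheet>";

    /** Applies an XSLT stylesheet to an XML document and returns the result as a string. */
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Calling PageAssembler.transform(PageAssembler.PAGE, PageAssembler.STYLESHEET) yields an HTML page whose body contains the parts in the configured order. In an Ant build, the same stylesheet could be applied by Ant's XSLT support instead of custom code.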

Communication model: C++ and Java

Pals,
I have a requirement to establish a communication channel between the C++ and Java layers of my application for the exchange of objects and their properties.
I have got the following options:
XML / SOAP
PostgreSQL
Can you please advise me on the pros and cons of these? Please share your experiences of the implementation complexities.
Thanks,
Gtk
If the choice is between those two, I would choose XML.
Object <=> XML
Java side: Simple; C++ side: XML Objects
The reason: it's simpler for what you want, i.e. passing language objects rather than database records.
Also, could you specify the communication channel between the apps?
UPDATE
If you can use JSON I would recommend it instead of XML, here is why.
Another option would be JMS. There are C++ clients out there.
Every time I see XML I think RESTful web service. Both platforms you mentioned have some form of tooling to marshal & unmarshal XML. There are plenty of working examples out in the wild, so a Google/Bing search is good. A nice side-effect is once you have those interfaces built, anything can connect to them.
If you really want to bother with generating a WSDL, then feel free to go the SOAP route. However, speaking with several years of web service integration experience, RESTful is so gosh darned simple compared to anything else.
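As a rough illustration of marshalling and unmarshalling on the Java side, the sketch below round-trips a small object through XML using only the DOM and transformer APIs that ship with the JDK. The Device class and its XML layout are hypothetical; a binding library would normally generate this kind of code, and the C++ side would need a matching parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical message type exchanged between the Java and C++ layers. */
class Device {
    final int id;
    final String name;

    Device(int id, String name) {
        this.id = id;
        this.name = name;
    }

    /** Serializes this object as a device element with an id attribute and a name child. */
    String toXml() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("device");
        root.setAttribute("id", Integer.toString(id));
        Element nameElement = doc.createElement("name");
        nameElement.setTextContent(name);   // special characters are escaped on output
        root.appendChild(nameElement);
        doc.appendChild(root);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Rebuilds a Device from the XML produced by toXml(). */
    static Device fromXml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        int id = Integer.parseInt(root.getAttribute("id"));
        String name = root.getElementsByTagName("name").item(0).getTextContent();
        return new Device(id, name);
    }
}
```

Whatever transport is chosen (HTTP, a message queue, a plain socket), only the XML string crosses the language boundary, so both sides stay free to change their internal object model.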
I would like to suggest a third option: YAML.
There are YAML parsing libraries for both Java and C++. In my experience, it's easier to debug an exchange in YAML than in XML (especially if you have full-text fields or cyclic data structures).
It depends on the kind of messages you transfer.
If your messages are individual entities that have a short life, I would go for XML, YAML, or something similar.
If your messages contain information that is going to be used later on and refer to information in previous messages, I would use a database.

How to create WSDL file given SOAP WSDL operations

I haven't had any experience with web service related development. So, any ideas will be greatly appreciated.
Suppose I have a file listing a draft specification of WSDL operations. Following is one example. How would I go about creating the WSDL file? Is Notepad sufficient, or do I need a WSDL editor?
getHostSystemInfo
Returns detailed information about host systems specified via given IDs.
input HostSystemIdCollection(Collection of Strings)
Output HostSystemInfoCollection
HostSystemInfo
Id: mandatory
Properties: Following properties should be provided for host systems
HostSystemName
HostSystemProperty1
HostSystemProperty2
HostSystemProperty3
....
....
If the question is just "how do I create the WSDL" then you could indeed use Notepad and just write it; it's only XML after all. However, writing syntactically correct XML by hand is pretty dull and error-prone. So I would recommend using WSDL-aware tooling, for example an Eclipse editor.
An alternative is to write some Java which expresses the interface, and from it generate the WSDL. There are many ways of doing this, including starting with an EJB and annotating it accordingly. A few googles should help you find what you need.
My experience is that simple proof-of-concept situations tend to work well starting from the Java. Larger-scale projects benefit from considered designs starting from the WSDL.
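For the Java-first route, the draft operation from the question might be expressed as a plain interface like the one below. This is a sketch under assumptions: the property map and the stub implementation are invented for illustration, and the JAX-WS annotations are mentioned only in comments so that the snippet compiles without extra dependencies.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One host system; modelling the properties as a string map is an assumption. */
class HostSystemInfo {
    final String id;   // mandatory according to the draft specification
    final Map<String, String> properties = new LinkedHashMap<>();

    HostSystemInfo(String id) {
        if (id == null) {
            throw new IllegalArgumentException("Id is mandatory");
        }
        this.id = id;
    }
}

/**
 * Service contract. With JAX-WS on the classpath, this interface would carry
 * @WebService and the method @WebMethod, and tooling could derive the WSDL from it.
 */
interface HostSystemService {
    /** Returns detailed information about the host systems with the given IDs. */
    List<HostSystemInfo> getHostSystemInfo(Collection<String> hostSystemIds);
}

/** Stub returning placeholder data, only to show the contract in use. */
class StubHostSystemService implements HostSystemService {
    @Override
    public List<HostSystemInfo> getHostSystemInfo(Collection<String> hostSystemIds) {
        List<HostSystemInfo> result = new ArrayList<>();
        for (String id : hostSystemIds) {
            HostSystemInfo info = new HostSystemInfo(id);
            info.properties.put("HostSystemName", "host-" + id);   // placeholder value
            result.add(info);
        }
        return result;
    }
}
```

Starting from an interface like this keeps the operation names and types reviewable in ordinary code before any WSDL exists.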
Coding WSDL by hand is a big pain! I used an XML editor to create it and then generated the stubs with JAX-WS. It is important to understand the differences between the WSDL styles, which is not trivial (have a look at WSDL styles). A good help is to import the WSDL schema into your IDE (Eclipse, IDEA) and then work with autocompletion.
Just out of interest, why are you using WSDL + SOAP? If you have a choice and you are using HTTP anyway, have a look at REST. It can make implementing a web API a LOT easier, both on the server side and for API clients.
If you haven't done any web services before, I would strongly recommend a WSDL editor. NetBeans has a plugin that should help.
The other way of doing it, which may be easier, is to use the Java annotations defined in JSR 181.
Of course you could use the worst text editor in the world (!), but I'd seriously consider using any decent XML editor or IDE (Eclipse's WSDL support is pretty decent). This will save you a lot of pain and suffering.
Or, if this is an option, you could just annotate a Java class with JAX-WS annotations and have your WSDL dynamically generated from the Java code. Personally, I prefer the WSDL-first approach, the Java-first approach is just a suggestion to get you started.
You could use Axis2 to create that for you.

How do you grab a text from webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advise?
Updates/Remarks
Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
It should be really simple.
You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#openStream() to parse the content of HTML pages hosted on the Internet.
You could look at how HttpUnit does it. It uses a couple of decent HTML parsers, one of which is NekoHTML.
As far as getting the data goes, you can use what's built into the JDK (HttpURLConnection), or use Apache's
http://hc.apache.org/httpclient-3.x/
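For the JDK-only route, a minimal sketch of downloading a page into a string could look like this. The class name PageFetcher is hypothetical, and the code assumes the page is UTF-8, whereas a real page may declare another charset in its headers or markup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    /** Reads a whole stream into a string, assuming UTF-8, and closes the stream. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    /** Opens the URL with the JDK's built-in protocol handlers and returns the content. */
    static String fetch(String address) throws IOException {
        return readAll(new URL(address).openStream());
    }
}
```

The resulting string can then be handed to whichever HTML parser is chosen for the extraction step.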
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):
<table>
{
for $d in //td[contains(a/small/text(), "New York, NY")]
for $row in $d/parent::tr/parent::table/tr
where contains($d/a/small/text()[1], "New York")
return <tr><td>{data($row/td[1])}</td>
<td>{data($row/td[2])}</td>
<td>{$row/td[3]//img}</td> </tr>
}
</table>
In short, you may either parse the whole page and pick the things you need (for speed I recommend looking at SAXParser) or run the HTML through a regexp that trims off all of the HTML. You can also convert it all into a DOM, but that's going to be expensive, especially if you're aiming for decent throughput.
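The regexp variant mentioned above can be sketched in a few lines. This is a crude illustration, not a parser: it drops script and style blocks, removes the remaining tags, and collapses whitespace, but it does not decode entities and can be fooled by a ">" inside an attribute value. The class name TagStripper is hypothetical.

```java
class TagStripper {
    /** Returns the visible text of an HTML fragment, approximately. */
    static String stripTags(String html) {
        return html
            // Drop script and style elements together with their content.
            .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
            // Drop every remaining tag, keeping the text between tags.
            .replaceAll("(?s)<[^>]*>", " ")
            // Collapse runs of whitespace left behind by the removals.
            .replaceAll("\\s+", " ")
            .trim();
    }
}
```

For input that is not under your control, a real HTML parser is the safer choice; the regexp approach mainly pays off when only a quick text dump is needed.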
You seem to want to screen scrape. You would probably want to write a framework that, via an adapter/plugin per source site (as each site's format will differ), parses the HTML source and extracts the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
If you want to do it the old-fashioned way, you need to connect with a socket to the web server's port and then send the following data:
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
Then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
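The socket approach above might look roughly like this in Java. It is a sketch under assumptions: the class name RawHttpGet is hypothetical, each <ENTER> is taken to mean a CRLF line ending, and the response is read as ISO-8859-1 text without interpreting headers, redirects, or chunked encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpGet {
    /** Builds the request text: lines end in CRLF and an empty line ends the headers. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    /** Sends the request and returns the status line, headers, and body as one string. */
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder response = new StringBuilder();
            String line;
            // With HTTP/1.0 the server closes the connection when the response is complete.
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```

A call such as RawHttpGet.fetch("site.com", 80, "/file.html") corresponds to the request shown above; the caller still has to split headers from body.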
You can use nekohtml to parse your html document. You will get a DOM document. You may use XPATH to retrieve data you need.
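Once a DOM document is available, the XPath step could be sketched as follows. To stay dependency-free, the example parses a well-formed XHTML string with the JDK's own parser as a stand-in for NekoHTML, which would be needed for real-world, untidy HTML. The class name and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class QuestionExtractor {
    /** Returns the trimmed text of every node matched by the XPath expression. */
    static List<String> extract(String xhtml, String xpath) throws Exception {
        // Stand-in for the DOM that an HTML parser such as NekoHTML would produce.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }
}
```

With markup where each question title is a link inside an h3 of class "question", the expression //h3[@class='question']/a would return just the titles and skip the navigation and tag cloud.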
If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HtmlUnit.
This library, while targeted at testing, is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and lets you traverse a website easily.
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have them by default. There are also libraries and parsers available for locating and extracting microformats from web pages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
Check this out http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. They offer not only text extraction but also keyword analysis, etc.

server-side css selectors

I am creating a tool that will check dynamically generated XHTML and validate it against expected contents.
I need to confirm the structure is correct and that specific attributes exist/match. There may be other attributes which I'm not interested in, so a direct string comparison is not suitable.
One way of validating this is with XPath, and I have implemented this already, but I would also like something less verbose - I want to be able to use CSS Selectors, like I can with jQuery, but on the server - within CFML code - as opposed to on the client.
Is there a CFML or Java library that allows me to use CSS Selectors against an XHTML string?
I've just released an open source project which is a W3C CSS Selectors Level 3 implementation in Java. Please give it a try. I was looking for the same thing and decided to implement my own engine. It's inspired by the code in WebKit etc.
http://github.com/chrsan/css-selectors/tree
I don't know of a Java library itself, but there is a Ruby library called Hpricot that does exactly what you're looking for. In conjunction with JRuby, the Ruby implementation on the Java platform, it should be relatively straightforward to call Ruby methods from your Java code (using BSF, the JSR-223 Scripting APIs, or an internal API).
Are you using ColdFusion 8? ColdFusion 8, being based on Java 6, supports the JSR-223 Scripting APIs ("javax.script").
Take a look at this blog entry on embedding PHP within CFML. You should be able to do the same with Ruby. There is a ZIP file of example code linked from this blog posting, and if you crack open the CFML, you'll see a good example of embedding Ruby within CFML.
It might take a bit of work to make all the pieces work together, but with a bit of investment, it should give you the robust parsing and CSS selector querying that you're looking for.
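The scripting-API route can be sketched with the javax.script package from the JDK. The helper class ScriptRunner is hypothetical, and the sketch assumes that JRuby registers itself under the engine name "jruby" when its jar is on the classpath; without it, the lookup returns null and the helper reports that.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

class ScriptRunner {
    /** Evaluates a script with the named engine and returns the script's result. */
    static Object eval(String engineName, String script) throws ScriptException {
        // The manager discovers engines that are available on the classpath.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            throw new IllegalStateException(
                    "No script engine registered as '" + engineName + "'");
        }
        return engine.eval(script);
    }
}
```

From CFML, the same classes could be reached through ColdFusion's Java object support, with the Ruby script doing the Hpricot work and handing back plain strings.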
Hpricot is definitely a fantastic solution if the JRuby route is open to you.
Wrt. XPath being the "correct" way to access XML documents... sorry but this is rubbish. There are numerous ways to access elements of an XML document: DOM traversal, XPath, XQuery, CSS selectors to name a few. XPath is certainly popular but CSS selectors are very very powerful, assuming your XML document has HTML semantics.
If you can use PHP within your CFML (as mentioned above), you could take advantage of this excellent "jQuery for PHP" library, phpQuery.
Full CSS selector support, manipulation functions, traversing, etc. It should work great for what you need.
Hope it helps.
There is a theoretical difference between the server and client. To a web browser, the document is a living DOM hierarchy. To your server code it's merely an XML document of whatever type. XPath is the "correct" way to access elements of an XML document.
So unless you have a serious performance problem with your current XPath solution, or it doesn't actually work correctly, I suggest you stick with it. Trying something too clever brings the risk of breaking something that's working.
If you find the XPath to be too verbose and ugly to leave sitting around, or want more power to re-use the tool in different cases, or just can't resist trying to do something clever, then you could try writing a utility that compiles a given CSS selector into an XPath. You could then call this in one line whenever you needed.
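A utility of that kind could start out as small as the sketch below. It is deliberately limited and rests on assumptions: only tag names, #id, .class, and the descendant combinator (whitespace) are handled, the class name CssToXPath is hypothetical, and the generated expressions assume the document is parsed without namespace awareness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssToXPath {
    // An optional tag name followed by any number of #id or .class qualifiers.
    private static final Pattern SIMPLE =
            Pattern.compile("([A-Za-z][A-Za-z0-9]*)?((?:[#.][A-Za-z_][A-Za-z0-9_-]*)*)");
    private static final Pattern QUALIFIER =
            Pattern.compile("([#.])([A-Za-z_][A-Za-z0-9_-]*)");

    /** Converts a selector such as "div#main p.note" into an XPath expression. */
    static String convert(String selector) {
        StringBuilder xpath = new StringBuilder();
        for (String part : selector.trim().split("\\s+")) {
            xpath.append("//").append(convertSimple(part));   // whitespace means "descendant"
        }
        return xpath.toString();
    }

    private static String convertSimple(String part) {
        Matcher simple = SIMPLE.matcher(part);
        if (part.isEmpty() || !simple.matches()) {
            throw new IllegalArgumentException("Unsupported selector: " + part);
        }
        StringBuilder step = new StringBuilder(simple.group(1) == null ? "*" : simple.group(1));
        Matcher qualifier = QUALIFIER.matcher(simple.group(2));
        while (qualifier.find()) {
            if (qualifier.group(1).equals("#")) {
                step.append("[@id='").append(qualifier.group(2)).append("']");
            } else {
                // Match one whole word in the space-separated class attribute.
                step.append("[contains(concat(' ', normalize-space(@class), ' '), ' ")
                    .append(qualifier.group(2)).append(" ')]");
            }
        }
        return step.toString();
    }
}
```

For example, CssToXPath.convert("div#main") produces //div[@id='main'], which can be passed straight to the existing XPath validation code. Attribute selectors, child combinators, and pseudo-classes would each need their own translation rule.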
It may be easier to use cQuery.com, an API-based 'Content Query Engine' that extracts content from live websites by using CSS.
You can use it programmatically in your application.
