URL structure of a web page with different languages, different characters - java

I'm working on a news site built with JSP. I would like to change the link structure to use the news item's "title", not only its ID.
In the following screenshot, the website puts the exact title into the URL even though it contains non-ASCII characters.
I would like to generate URLs like: mydomain.com/news/id-title
I have some questions about that:
1. Is it a correct approach to use URLs like this with non-ASCII characters? If not, how can I create a URL for a Russian title (completely different characters)?
2. Should I convert these characters? What are the advantages and disadvantages (with regard to SEO)?
3. Does putting the title in the URL have any SEO benefit compared with a URL created from the content ID only?

The approach is good, but I would recommend against using language-specific special characters in URLs; they often lead to errors and confusion. The exception is parameter values, which sometimes need to hold special values (but even in those cases it's better to refrain from using special characters).
Instead, it would be good to use the English title, for example.
Take a look at this example:
http://www.teamliquid.net/forum/starcraft-2/437452-scelight-50-run-without-java-merged-accounts?page=21
It is much more readable, doesn't depend on who sees it, and if you use services like Google Analytics, the resource/page is directly readable from the URL.
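If you take this route, a common pattern is to build a URL "slug" from the title. A minimal sketch, assuming a Latin-based title (the class and method names are mine, and a Cyrillic title would still need a transliteration step or percent-encoding, as described in the next answer):
import java.text.Normalizer;
import java.util.Locale;

public final class Slugs {

    // Turns "Ängström & Co. wins award!" into "angstrom-co-wins-award"
    public static String toSlug(String title) {
        // Decompose accented characters and drop the combining marks
        String ascii = Normalizer.normalize(title, Normalizer.Form.NFD)
                                 .replaceAll("\\p{M}", "");
        // Replace every run of non-alphanumeric characters with a single hyphen
        return ascii.toLowerCase(Locale.ENGLISH)
                    .replaceAll("[^a-z0-9]+", "-")
                    .replaceAll("(^-|-$)", "");
    }
}
The final URL could then be built as mydomain.com/news/ + id + "-" + toSlug(title).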

[Stack Overflow is not the correct place for SEO advice (these questions are off-topic here). You could ask it on Webmasters SE, but this question is most likely answered there already. So my following answer will leave out any SEO aspects.]
You have to percent-encode the URL:
Some characters are allowed directly (e.g., a-z, A-Z, 0-9 etc.) -- you don’t have to percent-encode them,
some characters are allowed directly but have a reserved meaning -- you only have to percent-encode them if you don’t want this meaning,
and most characters are not allowed (including everything non-ASCII) -- you always have to percent-encode them.
Check the URI specification to learn which characters are allowed in which component.
Most programming languages have methods for percent-encoding URLs. For JSP, see for example these questions:
How to encode a URL with the special character "percentage"?
How to URL encode a URL in JSP?
Take for example the Russian Wikipedia page about bees. In your browser’s address bar, the URL will most likely look like
http://ru.wikipedia.org/wiki/Пчёлы
But the real URL is
http://ru.wikipedia.org/wiki/%D0%9F%D1%87%D1%91%D0%BB%D1%8B
You can easily check this yourself by copy-pasting the URL from the address bar to a text document.
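A minimal sketch of how that encoded form could be produced in Java. Note that URLEncoder targets application/x-www-form-urlencoded (spaces become "+"), so for a path segment you would typically swap that for "%20":
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncode {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String title = "Пчёлы";
        // Encode the title as UTF-8 percent-escapes
        String encoded = URLEncoder.encode(title, "UTF-8").replace("+", "%20");
        System.out.println("http://ru.wikipedia.org/wiki/" + encoded);
        // -> http://ru.wikipedia.org/wiki/%D0%9F%D1%87%D1%91%D0%BB%D1%8B
    }
}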

Related

How to compare a string value's encoding with a specific encoding in Java?

I'm told to write code that takes a string of text and checks whether its encoding equals a specific encoding that we want. I've searched a lot but didn't seem to find anything. I found a method (getEncoding()), but it only works with files, and that is not what I want. I'm also told that I should use the standard Java library, not methods from Mozilla or Apache.
I really appreciate any help. Thanks in advance.
What you are thinking of is "internationalization". There are libraries for this, like Loc4j, but you can also get it using java.util.Locale in Java. However, in general, text is just text. It is a token with a certain value; no localization information is stored in the character itself. This is why a file normally declares its encoding in a header, and why a console or terminal provides localization through certain commands/functions.
Unless you know the source encoding and the tokens used, you have only a limited ability to guess which encoding was used at the other end. If you still want to do this, you will need to go into deeper areas such as those used in decryption, where this kind of problem is usually attacked with statistical analysis. That in turn requires databases on the usage of different tokens, and depending on the quality of the text, databases and algorithms, a certain amount of text is required. Special cases, like writing Swedish with, e.g., a US encoding (using a for å and ä, or o for ö), will require more advanced analysis.
EDIT
Since I got a comment that encoding and internationalization are different entities, I will add some comments. It is possible to work with different encodings while dealing only with English text (for example some English special characters). It is also possible to work with encodings using, for example, Charset. However, for many applications that use different encodings it may still be useful to use Locale, since that class supports a lot of operations on text in different encodings.
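Note that a Java String itself has no encoding to compare (internally it is always UTF-16); what you can check is whether a given byte sequence is valid in a particular charset. A minimal sketch using only the standard library (class and method names are mine):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Returns true if the byte sequence decodes cleanly in the given charset
    public static boolean isValidIn(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = "Пчёлы".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidIn(data, StandardCharsets.UTF_8));    // true
        System.out.println(isValidIn(data, StandardCharsets.US_ASCII)); // false
    }
}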
Thanks for your answers and contributions, but these two links did the trick. I had already seen these two pages, but they didn't seem to work for me because I was thinking about getting the encoding directly and then comparing it with the specific one.
This is one of them
This is another one.

Do I need to enable canonicalization when using OWASP ESAPI?

We are adding ESAPI 2.x (owasp java security library) to an application.
The change is easy though quite repetitive. We are adding validations to all input parameters so we make sure all the characters they are composed of are within a whitelist.
This is it:
Validator instance = ESAPI.validator();
Assert.assertTrue(instance.isValidInput("test", "xxx@gmail.com", "Email", 100, false));
Then Email patterns is set in the validation.properties file like:
Validator.Email=^[A-Za-z0-9._%'-]+@[A-Za-z0-9.-]+\\.[a-zA-Z]{2,4}$
Easy!
We are not encoding output given that after the input validation, data becomes trusted.
I can see that ESAPI has a flag to canonicalize the input String. I understand that canonicalization is "de-encoding", so any encoded String is transformed into plain text.
The question is: why do we need to canonicalize?
Can anybody show a sample of an attack (in Java) that would be prevented by using canonicalization?
Thank you!
Here's one (of several thousand possible examples):
Take this simple XSS input:
<script>alert('XSS');</script>
//Now we URI encode it:
%3Cscript%3Ealert(%27XSS%27)%3B%3C%2Fscript%3E
//Now we URI encode it again:
%253Cscript%253Ealert(%2527XSS%2527)%253B%253C%252Fscript%253E
Canonicalization on the input that's been encoded once will result in the original input, but in ESAPI's case, the third input will throw an IntrusionException because there is NEVER a valid use case where user input will be URI-encoded more than once. In this particular example, canonicalization means "all URI data will be reduced into its actual character representation." ESAPI actually does more than just URI decoding, btw. This is important if you wish to perform both security and/or business validation using regular expressions--the primary use of regular expressions in most applications.
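A minimal sketch of that flow (canonicalize first, then validate the canonical form). Whether multiple or mixed encoding throws an IntrusionException or is merely logged depends on your ESAPI.properties; the "SearchTerm" context and "SafeString" validation rule below are assumptions:
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.errors.IntrusionException;

public class CanonicalizeThenValidate {

    // Returns the canonical form if the input is acceptable, otherwise null
    public static String checkSearchTerm(String rawInput) {
        try {
            // Reduce any percent/HTML/JS encoding to the actual characters
            String canonical = ESAPI.encoder().canonicalize(rawInput);
            // Validate the canonical form against a whitelist pattern
            // ("SafeString" is assumed to be defined in validation.properties)
            if (ESAPI.validator().isValidInput("SearchTerm", canonical, "SafeString", 200, false)) {
                return canonical;
            }
            return null;
        } catch (IntrusionException e) {
            // Double/mixed encoding detected -- treat as an attack
            return null;
        }
    }
}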
At a bare minimum, canonicalization gives you good assurance that sneaking malicious input into the application isn't easy: The goal is to restrict to known-good values (whitelist) and reject everything else.
In regard to your ill-advised comment here:
We are not encoding output given that after the input validation, data becomes trusted.
Here's the dirty truth: Javascript, XML, JSON, and HTML are not "regular languages." They're nondeterministic. What this means in practical terms is that it is mathematically impossible to write a regular expression to reject all attempts to insert HTML or Javascript into your application. Look at that XSS Filter Evasion Cheat sheet I posted above.
Does your application use jQuery? The following input is malicious:
$=''|'',_=$+!"",__=_+_,___=__+_,($)[_$=($$=(_$=""+{})[__+__+_])+_$[_]+(""+_$[-__])[_]+(""+!_)[___]+($_=(_$=""+!$)[$])+_$[_]+_$[__]+$$+$_+(""+{})[_]+_$[_]][_$]((_$=""+!_)[_]+_$[__]+_$[__+__]+(_$=""+!$)[_]+_$[$]+"("+_+")")()
So you must encode all data, for the proper context, when it is output to the user. This means that if a piece of data is first going to be passed into a JavaScript function and then displayed as HTML, you encode for JavaScript and then for HTML. If it is output into an HTML data field (such as a default input box), you encode it for an HTML attribute.
It's actually MORE IMPORTANT to do output encoding than input filtering when protecting against XSS. (If I HAD to choose just one...)
The pattern you want to follow in web development is one where any input that is coming from the outside world is treated as malicious at all times. You encode any time you're handing off to a dynamic interpreter.
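A minimal sketch of context-specific output encoding with the ESAPI Encoder (the variable names are only illustrative):
import org.owasp.esapi.ESAPI;

public class OutputEncodingExamples {
    public static void main(String[] args) {
        String userSupplied = "<script>alert('XSS');</script>";

        // Going into an HTML element's body
        String html = ESAPI.encoder().encodeForHTML(userSupplied);

        // Going into an HTML attribute, e.g. value="..."
        String attr = ESAPI.encoder().encodeForHTMLAttribute(userSupplied);

        // Going into a JavaScript string literal
        String js = ESAPI.encoder().encodeForJavaScript(userSupplied);

        System.out.println(html);
        System.out.println(attr);
        System.out.println(js);
    }
}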
Canonicalization of data is also about reducing the data to its basic form. So if we take a different scenario, where a file path (relative path or symlink) and its associated directory permissions are involved, we need to first canonicalize the path and then validate it; otherwise someone could explore files they have no permission for simply by passing in data that looks acceptable.

Extracting webpage information based on a template in Java

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of a webpage changes; that change then requires a change in the existing Java code, which is a tedious task, because these webpages change very frequently. It also requires a programmer to fix the Java code. Here is an example of the HTML I am interested in on one webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do: I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages, these HTML templates will be the input to the Java program, which will find the locations of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes; I would just save the webpage's HTML and replace the data with these keywords, and the rest would be taken care of by the program. For example, in the future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also, creating these kinds of templates can be done by a non-programmer, anyone who can edit a file.
Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.
The approach you gave is pretty similar to Gilbert's, except for the regex part. I don't want to step into the ugly regex world; I am planning to use the template approach for many other areas apart from movie info, e.g. prices, product spec extraction, etc.
1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template: it is a parsing pattern that is matched and discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions. That is, you want to commit to literally matching key text such as "Rating:" and to treating interleaving markup such as "<b/>" in a much more flexible manner, ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions. That is, the template approach IS the parsing approach using a regular expression; they are one and the same. The question is: what form should the regular expression take?
3A. If you hand-code the parsing in Java, then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden, is "non-standard", and will be hard to maintain.
3B. If you want to use an HTML-aware parser, then Jsoup is a good solution. The problem is that you need more text/regular-expression handling and flexibility than Jsoup seems to provide. It seems too locked into specific HTML tags and structures, and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR, where a form of Backus-Naur-inspired grammar controls the parsing and generator code is inserted to process the parsed data. Here, the parsing grammar expressions can be very powerful indeed, with complex rules for how text is ordered on the page and how text fields and values relate to each other. That power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip, such as markup tags. Also, wrestling with ANTLR for the first time involves an educational investment before you get a productivity payback.
3D. Is there a Java tool that just uses a simple template-type approach to give a simple answer? A Google search doesn't give too much hope: https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing, because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view; it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML; they just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out which selector to actually use.
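A minimal sketch of that idea, assuming a selectors.properties file with entries such as movie.rating and movie.director (the file name, property keys, selectors and URL are all hypothetical):
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConfiguredExtractor {

    public static void main(String[] args) throws IOException {
        // selectors.properties, edited by a non-programmer, e.g.:
        //   movie.rating=div.review p:contains(Score)
        //   movie.director=div.review p:contains(Director)
        Properties selectors = new Properties();
        try (FileInputStream in = new FileInputStream("selectors.properties")) {
            selectors.load(in);
        }

        Document doc = Jsoup.connect("http://example.com/movie/123").get();

        String rating = doc.select(selectors.getProperty("movie.rating")).text();
        String director = doc.select(selectors.getProperty("movie.director")).text();

        System.out.println(rating);
        System.out.println(director);
    }
}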
How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.
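A hedged sketch of one way to implement the matching step: parse the template with Jsoup, locate the elements whose own text contains a placeholder, derive a CSS path for each one, and apply those paths to the live page. Since the question proposes saving the whole page as the template, the derived paths should also resolve in the live page; the file name and URL below are hypothetical.
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TemplateExtractor {

    // Maps placeholder names like MOVIE_RATING to the CSS path of the element holding them
    public static Map<String, String> selectorsFromTemplate(Document template) {
        Map<String, String> selectors = new HashMap<>();
        for (Element el : template.getAllElements()) {
            String own = el.ownText();
            if (own.matches(".*\\{[A-Z_]+\\}.*")) {
                String placeholder = own.replaceAll(".*\\{([A-Z_]+)\\}.*", "$1");
                selectors.put(placeholder, el.cssSelector());
            }
        }
        return selectors;
    }

    public static void main(String[] args) throws IOException {
        Document template = Jsoup.parse(new File("movie-template.html"), "UTF-8");
        Document page = Jsoup.connect("http://example.com/movie/123").get();

        for (Map.Entry<String, String> e : selectorsFromTemplate(template).entrySet()) {
            // e.g. MOVIE_RATING -> "Score:2.5/5" (label text still included; trim as needed)
            System.out.println(e.getKey() + " -> " + page.select(e.getValue()).text());
        }
    }
}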
If the web page changes frequently, then you'll probably want to confine your search for fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work, and either kind of "template" can consist of a simple list of search expressions, regex or CSS, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
"Score:"(.)*[0-9]\.[0-9]\/[0-9]
(I haven't tested this.)
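In Java, a hedged sketch of the regex variant might look like this (the pattern is only an illustration and assumes the "Score: 2.5/5" style of text shown above, after the tags have been stripped):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexField {
    public static void main(String[] args) {
        // Text of the page with the tags already stripped (e.g. via Jsoup's .text())
        String pageText = "Score: 2.5/5 Director: Bryan Singer";

        Pattern rating = Pattern.compile("Score:\\s*([0-9](?:\\.[0-9])?/[0-9])");
        Matcher m = rating.matcher(pageText);
        if (m.find()) {
            System.out.println("MOVIE_RATING = " + m.group(1)); // 2.5/5
        } else {
            // The page layout changed; fail loudly instead of guessing
            System.err.println("Rating field not found");
        }
    }
}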
Or you can try a different approach, using what I would call 'rules' instead of templates: for each piece of information that you need from the page, you define one or more jQuery expressions that extract the text. Often, when the page change is small, the same well-written jQuery expressions will still give the same results.
Then you can use Jerry (jQuery in Java) with almost the same expressions to fetch the text you are looking for. So it's not only about selectors; you also have the other jQuery methods for walking/filtering the DOM tree.
For example, the rule for the Director text would be (in a sort of pseudo-Java/Jerry code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in the rule, spread across several lines, that, for example, iterate over some nodes, etc.
If you are an OO person, each rule may be defined in its own implementation. If you are a Groovy person, you can even rewrite rules when needed, without recompiling your project, while still staying in Java. Etc.
As you see, the core idea here is to define rules for how to find your text, and not to match patterns, as that may be fragile to minor changes (imagine if just a space were added between two divs :)). In this example of mine, I've used jQuery-like syntax (actually, Jerry-like syntax, since we are in Java) to define the rules. This is only because jQuery is popular and simple, and known by your web developer too; in the end you can define your own syntax (depending on the parsing tool you are using): for example, you may parse HTML into a DOM tree and then write rules using your helper methods for how to traverse it to the place of interest. Jerry also gives you access to the underlying DOM tree.
Hope this helps.
I used the following approach to do something similar in a personal project of mine that generates an RSS feed out of here, the leading real estate website in Spain.
Using this tool I found the rented place I'm currently living in ;-)
Get the HTML code from the page
Transform the HTML into XHTML. I used this library; I guess there might be better options available today.
Use XPath to navigate the XHTML to the information you're interested in.
Of course, every time they change the original page you will have to change the XPath expression. The other approach I can think of, semantic analysis of the original HTML source, is far, far beyond my humble skills ;-)
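A minimal sketch of the XPath step, assuming the page has already been converted to well-formed XHTML and saved as listing.xhtml (the file name and the element names in the expression are purely illustrative):
import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathExtract {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false); // keep the XPath expressions prefix-free
        Document doc = dbf.newDocumentBuilder().parse(new File("listing.xhtml"));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Illustrative expression: all link targets inside a div with id="results"
        NodeList links = (NodeList) xpath.evaluate(
                "//div[@id='results']//a/@href", doc, XPathConstants.NODESET);

        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}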

Cleaning up URLs to remove personal information

Are there rules to identify and remove any PII from URLs? I would like this to be generic and handle all sorts of URLs we might encounter on the internet.
Clarification: I have a list of URLs from people browsing the internet and want to remove PII from them.
To answer the question as restated in your reply to snemarch:
Yes I understand that. I meant what considerations I need to keep in mind to identify PII in urls? What are the various ways in which PII might occur in URls?
HTTP GET information can be transmitted in many different ways. Some, and likely most, will look like this:
example.com/form.php?key=value.
Other websites, including Stack Overflow, may use a URL rewrite to transform the link "example.com/form/value" into the equivalent: "example.com/form.php?key=value." This URL rewrite is completely dependent on the configuration of the server, and there is no simple way to detect and strip off PII presented this way.
With this in mind, there is really no way to 100% remove all PII from a list of different URLs, as such information can be indiscernible from a URL without any PII. You can, at the very least, strip out information that is DEFINITELY PII, such as a URL in the form "example.com/form.php?key=value." I would be willing to bet that any URL with a "=" has some sort of variable in it, and should be filtered. Past that, you're going to have to manually parse a majority of the list.
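A minimal sketch of that first, coarse pass (dropping the query string and fragment from every URL that has one; the class and method names are mine):
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

public class StripQueryStrings {

    // Removes the ?key=value part and any #fragment from a URL
    public static String stripQuery(String url) {
        try {
            URI u = new URI(url);
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            // Malformed URL: safer to drop it entirely than to keep possible PII
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/form.php?key=value",
                "http://example.com/articles/12345#comments");
        for (String url : urls) {
            System.out.println(stripQuery(url));
        }
        // http://example.com/form.php
        // http://example.com/articles/12345
    }
}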
Depending on how big the list is and how serious you are about filtering it, you could research popular mod_rewrite schemes for popular products and attempt to match them in your list, scrape the URLs to determine additional information about them, and use some complicated and likely ugly algorithms to guess at what may be a variable in a URL, possibly factoring in similar URLs a user has visited and comparing the tokens of the URLs: similar URLs with slightly different text in a given token are probably variables, and should be filtered.
Good luck!
You should never pass any user-sensitive information in a URL via GET. If you use POST instead, just make sure the connection is HTTPS.

How can I extract only the main textual content from an HTML page?

Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API does this, getting the different textual parts/blocks and splitting them in a way that keeps them separate (everything merged into one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the main content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes in the HTML, and consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
// block.isContent() tells you if it's likely to be content or not
// block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have an article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use some libs like goose. It works best on articles/news.
You can also check the JavaScript code that does similar extraction to Goose in the Readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the number of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. not retrieve div elements)?
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with Java.
You use a provided application to design XML files that are read by the RoboServer API to parse webpages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using third-party software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags and then build rules per site.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below to filter the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.
