Checkmarx :: HRA_JAVA_CGI_REFLECTED_XSS_ALL_CLIENTS issue - Java

I am struggling with one of the Checkmarx vulnerabilities and need some guidance on resolving it.
Below is my code:
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String content = null;
    while ((content = in.readLine()) != null) {
        // Logic to parse the JSON data and use it.
    }
}
Here con is (HttpURLConnection) new URL("some url").openConnection().
Checkmarx is flagging the issue at in.readLine().
Workarounds I tried:
1: StringEscapeUtils.unescapeJson(in.readLine()), but it did not help.
2: Used in.lines().collect(Collectors.joining()) in place of in.readLine(), based on something I found on Google.
That fixed this finding, but it introduced a new one of the same vulnerability at con.getInputStream().
Please help to fix this issue.
Thanks in advance.

Technically it should be StringEscapeUtils.escapeJson(in.readLine()), not StringEscapeUtils.unescapeJson(in.readLine()). The intent is to output-encode to prevent XSS, not the other way around.
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String content = null;
    while ((content = StringEscapeUtils.escapeJson(in.readLine())) != null) {
        // Logic to parse the JSON data and use it.
    }
}
Still, I don't think Checkmarx will recognize this as a sanitizer; as far as I can see, it only looks for the escapeXml, escapeHtml, escapeHtml3, and escapeHtml4 methods under StringEscapeUtils.
Work with your security team to update the Checkmarx query to include escapeJson, or use an alternative that Checkmarx does recognize, such as a replace-based approach that strips malicious tags (<, >, </, />). Keep in mind that tag replacement is not foolproof and should not be considered robust secure code on its own.
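A rough illustration of that replace-based idea, purely as a sketch (the helper name and the tag list are examples, not a Checkmarx-recognized sanitizer; proper contextual output encoding is still the real fix):
// Illustrative only: strips a handful of dangerous sequences before the data is used.
private static String stripTags(String line) {
    if (line == null) {
        return null;
    }
    return line.replace("</", "")
               .replace("/>", "")
               .replace("<", "")
               .replace(">", "");
}
It would be called as content = stripTags(in.readLine()) inside the loop above.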

Related

In Java, how would one go about loading a web page into a BufferedReader, as mine will not print?

Hey, I am having a little trouble here. I am doing file writing at school and we got the challenge of reading a web page. How is it possible to do it? I had a go with Jsoup and an Apache plugin, but neither worked, and I have to use the java.net imports anyway.
I am a bit of a noob at coding, so there will probably be a couple of errors!
Here is my code:
URL oracle = new URL("http://www.oracle.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = br.readLine()) != null) {
    System.out.println(inputLine);
}
br.close();
There is no output from the program. Earlier I managed to get output, but it was in the form of HTML; ironically, I deleted that code while looking for a fix for that issue.
Any help or solutions would be greatly appreciated! Thank you all very much!
The code example is from the Reading Directly from a URL tutorial, but the tutorial is old. The URL http://www.oracle.com now redirects to https://www.oracle.com/, but you don't follow the redirect.
If you use a URL that does not redirect, like http://www.google.com you will see that the code works.
If you want a more robust program that handles redirects, you'll probably want to use an HttpURLConnection instead of the basic URL, as it has more features for you to use.
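A minimal sketch of that approach, assuming the redirect target is given in the Location header (the class name and structure here are just an example):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchPage {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.oracle.com/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        // HttpURLConnection follows redirects within the same protocol, but not
        // http -> https, so check the status code and re-open the connection if needed.
        int status = con.getResponseCode();
        if (status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_MOVED_TEMP) {
            String location = con.getHeaderField("Location");
            con = (HttpURLConnection) new URL(location).openConnection();
        }
        try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
            String inputLine;
            while ((inputLine = br.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}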

Web Crawler specifically for downloading images and files

I am doing an assignment for one of my classes.
I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.
I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice pieces of software, but neither is perfect.
I used the default Java URLConnection to check the content type before processing the URL, but it becomes really slow as the number of links grows.
Question: does anyone know of a specialized parser API for images and links?
I could start writing my own using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there is a working solution out there? Any help would be appreciated.
I need to check the content type while looping through the links, to determine whether a link points to a file, in an efficient way, but Jsoup does not have what I need. Here's what I have:
HttpConnection mimeConn = null;
Response mimeResponse = null;
for (Element link : links) {
    String linkurl = link.absUrl("href");
    if (!linkurl.contains("#")) {
        // Skip URLs that have already been crawled.
        if (DownloadRepository.curlExists(link.absUrl("href"))) {
            continue;
        }
        mimeConn = (HttpConnection) Jsoup.connect(linkurl);
        mimeConn.ignoreContentType(true);
        mimeConn.ignoreHttpErrors(true);
        mimeResponse = (Response) mimeConn.execute();
        WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
        String contentType = mimeResponse.contentType();
        // Branch on the content type: HTML pages are crawled further,
        // images and other files are just recorded.
        if (contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }
        DownloadRepository.addCrawledURL(linkurl);
    }
}
UPDATE
Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java
Use Jsoup; I think this API is good enough for your purpose. You can also find a good cookbook on its site.
Several steps:
Jsoup: how to get an image's absolute url?
how to download image from any web page in java
You can write your own recursive method that walks through the links on a page that contain the necessary domain name, or relative links. Use this approach to grab all the links and find all the images on them. Writing it yourself is not bad practice.
You don't need to use the URLConnection class; Jsoup has a wrapper for it.
For example, you can get the DOM object with a single line of code:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
in.close();
Update 1
Try adding the following lines to your code:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
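A rough sketch of how those pieces could fit together into a depth-limited crawl (the class name, depth handling, and the decision to skip non-HTML responses are illustrative assumptions, not the only way to do it):
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SimpleCrawler {
    private final Set<String> visited = new HashSet<>();

    public void crawl(String url, int depth, int maxDepth) throws IOException {
        if (depth > maxDepth || !visited.add(url)) {
            return; // stop at the maximum depth or if the URL was already seen
        }
        Connection.Response res = Jsoup.connect(url)
                .ignoreContentType(true)
                .ignoreHttpErrors(true)
                .execute();
        String contentType = res.contentType();
        if (contentType == null || !contentType.contains("html")) {
            return; // an image or other file: record it here instead of recursing
        }
        Document doc = res.parse();
        for (Element link : doc.select("a[href]")) {
            String next = link.absUrl("href");
            if (!next.isEmpty()) {
                crawl(next, depth + 1, maxDepth);
            }
        }
    }
}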

Java URLConnection Issues with Integers

I've been trying to get information from a web page, specifically this site: http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D (among other similar ones). I'm using the URL and URLConnection classes to do so. I'm trying to get a certain number from the web page; on this page, I want the total number of articles (16428).
It says this near the top of the page: "Results: 1 to 20 of 16428", and when I look at the page source manually I can find it. However, when I use the Java connection to obtain this number from the page source, for some reason the number it gets is "863399" instead of "16428".
Code:
URL connection = new URL("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D");
URLConnection yc = connection.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String html = "";
String inputLine;
while ((inputLine = in.readLine()) != null) html += inputLine;
in.close();
int startMarker = html.indexOf("ncbi_resultcount");
int endMarker = html.indexOf("ncbi_op");
System.out.println(html.substring(startMarker, endMarker));
When I run this code, I get:
ncbi_resultcount" content="863399" />
rather than:
ncbi_resultcount" content="16428" />
Does anyone know why this is / how I can fix it?
Thanks!
I can't reproduce your problem and I have no idea why this is happening. Perhaps the site is sniffing specific Java user-agent versions. You'd then need to set the User-Agent header to something else to pretend to be a "real" web browser.
yc.setRequestProperty("User-Agent", "Mozilla");
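For that to take effect, the header has to be set before the input stream is opened; a minimal sketch of where it fits in the question's code (the user-agent string is just a placeholder):
URLConnection yc = connection.openConnection();       // connection is the PubMed URL from the question
yc.setRequestProperty("User-Agent", "Mozilla/5.0");   // must come before getInputStream()
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
If you switch to Jsoup as suggested below, the equivalent is Jsoup.connect(url).userAgent("Mozilla/5.0").get().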
Unrelated to the concrete problem, I'd suggest using a real HTML parser for this job, such as Jsoup. It's then as easy as:
Document document = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D").get();
Element ncbi_resultcount = document.select("meta[name=ncbi_resultcount]").first();
System.out.println(ncbi_resultcount.attr("content")); // 16433

Java - Best way to download a webpage's source html

I'm writing a little crawler. What is the best way to download a web page's source HTML? I'm currently using the little piece of code below, but sometimes the result is just half of the page's source, and I don't know what the problem is. Some people suggested that I should use Jsoup, but Jsoup's .get().html() also returns half of the page's source if it is too long. Since I'm writing a crawler, it's very important that the method supports Unicode (UTF-8), and efficiency also matters a lot. I wanted to know the best modern way to do it, so I'm asking you since I'm new to Java. Thanks.
Code:
public static String downloadPage(String url)
{
    try
    {
        URL pageURL = new URL(url);
        StringBuilder text = new StringBuilder();
        Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
        try {
            while (scanner.hasNextLine()) {
                // NL is a newline constant defined elsewhere, e.g. System.lineSeparator()
                text.append(scanner.nextLine() + NL);
            }
        }
        finally {
            scanner.close();
        }
        return text.toString();
    }
    catch (Exception ex)
    {
        return null;
    }
}
I use commons-io:
String html = IOUtils.toString(url.openStream(), "utf-8");
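A slightly fuller sketch of the same idea, closing the stream when done (this assumes commons-io is on the classpath; the method name is just an example):
import org.apache.commons.io.IOUtils;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public static String downloadPage(String url) throws IOException {
    try (InputStream in = new URL(url).openStream()) {
        return IOUtils.toString(in, "utf-8"); // reads the whole stream as UTF-8
    }
}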
Personally, I'm very pleased with the Apache HTTP client library, http://hc.apache.org/httpcomponents-client-ga/. If you're writing a web crawler, which I am also doing, you may greatly appreciate the control it gives you over things like cookies, client sharing, and the like.
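A minimal sketch of fetching a page with that library (this assumes the 4.x HttpClient API; the method name is an example, not part of the library):
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public static String fetch(String url) throws IOException {
    try (CloseableHttpClient client = HttpClients.createDefault();
         CloseableHttpResponse response = client.execute(new HttpGet(url))) {
        // EntityUtils reads the body using the charset from the response headers,
        // falling back to the one given here.
        return EntityUtils.toString(response.getEntity(), "UTF-8");
    }
}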

Java program to retrieve page source from Google search automatically [duplicate]

I have 30,000 dictionary words. I want to search for each word on Google and find the number of hits for each word using a Java program. Is that possible?
Look up <estimatedTotalResultsCount> using Google's SOAP search API.
You'll be limited to 1000 queries per day though. This limit is removed if you use their AJAX API.
Since your duplicate post is closed, I'll post my answer here as well:
Whether this is possible or not doesn't really matter: Google doesn't want you to do that. They have a public AJAX-search API developers can use: http://code.google.com/apis/ajaxsearch/web.html
Here is a Sun tutorial on Reading from and Writing to a URLConnection.
The simplest URL I can see to make a Google search is like:
http://www.google.com/#q=wombat
Reading from a URL with Java is pretty straightforward. A basic working example is as follows:
public Set<String> readUrl(String url) {
    String line;
    Set<String> lines = new HashSet<String>();
    try {
        URL pageUrl = new URL(url); // renamed so it does not shadow the String parameter
        URLConnection page = pageUrl.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(page.getInputStream()));
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        in.close();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lines;
}
I'd recommend separating your problem into pieces. Get each one working, then marry them together for the solution you want.
You have a few things going on here:
Downloading text from a URL
Scanning a stream of characters and breaking it up into words
Iterating through a list of words and tallying up the hits from your dictionary
Computer science is all about taking large problems and decomposing them into smaller ones. I'd recommend that you start learning how to do that now.
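As a rough sketch of the second and third pieces above, once the text has been downloaded (the method name and the lower-casing are just illustrative choices):
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// Breaks the downloaded text into words and tallies the ones found in the dictionary.
public static Map<String, Integer> tallyHits(String pageText, Set<String> dictionary) {
    Map<String, Integer> hits = new HashMap<>();
    try (Scanner scanner = new Scanner(pageText)) {
        while (scanner.hasNext()) {
            String word = scanner.next().toLowerCase();
            if (dictionary.contains(word)) {
                hits.merge(word, 1, Integer::sum); // increment the count for this word
            }
        }
    }
    return hits;
}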
