Having trouble reading in content of url using InputStream - java

So I run the code below and it prints "!DOCTYPE html". How do I get the content of the url, like the html for instance?
public static void main(String[] args) throws IOException {
URL u = new URL("https://www.whitehouse.gov/");
InputStream ins = u.openStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader websiteText = new BufferedReader(isr);
System.out.println(websiteText.readLine());
}
According to the Java tutorial at https://docs.oracle.com/javase/tutorial/networking/urls/readingURL.html: "When you run the program, you should see, scrolling by in your command window, the HTML commands and textual content from the HTML file located at ..." Why am I not getting that?

Your program has no while loop, so readLine() is called only once and only the first line is printed. Loop until readLine() returns null:
URL u = new URL("https://www.whitehouse.gov/");
InputStream ins = u.openStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader websiteText = new BufferedReader(isr);
String inputLine;
while ((inputLine = websiteText.readLine()) != null){
System.out.println(inputLine);
}
websiteText.close();

You are only reading one line of the text.
Try this and you will see that you get two lines:
System.out.println(websiteText.readLine());
System.out.println(websiteText.readLine());
Try reading it in a loop to get all the text.

Since Java 8, BufferedReader has a lines() method that returns a Stream<String>. To read the entire page you could do something like this:
String htmlText = websiteText.lines()
.reduce((text, nextLine) -> text + "\n" + nextLine) // no identity argument, so the result is an Optional<String>
.orElse(null);
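Putting the answers above together, here is a minimal sketch that reads every line, closes the stream with try-with-resources, and decodes the bytes explicitly. The class name `UrlText` and the helper `readAll` are illustrative names, and UTF-8 is an assumption about the page's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper class; the names are not from the original posts.
class UrlText {

    // Reads the whole stream as text, assuming it is UTF-8 encoded.
    static String readAll(InputStream in) throws IOException {
        // try-with-resources closes the reader (and the wrapped stream) even on error
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // lines() yields every remaining line; joining puts the line breaks back
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UrlText <url>");
            return;
        }
        System.out.println(readAll(new URL(args[0]).openStream()));
    }
}
```

Taking an InputStream rather than a URL keeps the helper usable for files and in-memory data as well.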

Related

How can I write my exported XML to an XML File?

My job is to search for a book on WikiBooks and export the corresponding XML file. This file should be edited later. However, the error occurs earlier.
My idea was to read the page and write it line by line to an XML file. Here is my code for it:
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Welches Buch wollen Sie suchen?");
book = (reader.readLine());
book = replaceSpace(book);
URL url = new URL("https://de.wikibooks.org/wiki/Spezial:Exportieren/" + book);
URLConnection uc = url.openConnection();
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
uc.connect();
BufferedReader xmlReader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
File file = new File("wiki.xml");
FileWriter writer = new FileWriter(file);
String inputLine;
while ((inputLine = xmlReader.readLine()) != null) {
writer.write(inputLine + "\n");
}
xmlReader.close();
The code is executed without an error message, but the saved file ends in the middle of a word and is therefore incomplete.
How can I work around this problem?
As the comment suggested, the problem is that the writer's buffered content is never flushed to the file. Calling close() on the writer flushes it automatically.
Here is your code with the added statement in the end:
BufferedReader xmlReader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
File file = new File("wiki.xml");
FileWriter writer = new FileWriter(file);
String inputLine;
while ((inputLine = xmlReader.readLine()) != null) {
writer.write(inputLine + "\n");
}
writer.close();
xmlReader.close();
A much easier solution that is built into Java is using the Files class. My suggestion is that you replace the above code with the following simple statement, which directly stores your InputStream into a file and automatically takes care of the streams.
Files.copy(uc.getInputStream(), Paths.get("wiki.xml"), StandardCopyOption.REPLACE_EXISTING);
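If the line-by-line loop has to stay (for example, to edit lines on the way through), try-with-resources makes the missing close() hard to forget. This is a rough sketch; the class `LineCopy` and the method `copyLines` are made-up names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper; the class and method names are illustrative.
class LineCopy {

    // Copies all lines from source to target and closes both when done.
    static void copyLines(Reader source, Writer target) throws IOException {
        try (BufferedReader in = new BufferedReader(source);
             BufferedWriter out = new BufferedWriter(target)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.write('\n');
            }
        } // close() runs here, which flushes the buffered tail to the target
    }
}
```

With the code from the question, the call might look like `copyLines(new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8), Files.newBufferedWriter(Paths.get("wiki.xml"), StandardCharsets.UTF_8))`, assuming the export is UTF-8.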

display the output-stream of a Process returned by Runtime.exec()

How do I print to stdout the output string obtained executing a command?
So Runtime.getRuntime().exec() would return a Process, and by calling getOutputStream(), I can obtain the object as follows, but how do I display the content of it to stdout?
OutputStream out = Runtime.getRuntime().exec("ls").getOutputStream();
Thanks
I believe you are trying to get the output from the process, and for that you need its InputStream:
InputStream is = Runtime.getRuntime().exec("ls").getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader buff = new BufferedReader (isr);
String line;
while ((line = buff.readLine()) != null)
System.out.println(line); // readLine() strips the line terminator, so println restores it
You use the OutputStream when you want to write (send) data to the process.
Convert the stream to a string as discussed in "Get an OutputStream into a String" and simply use System.out.print().
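For completeness, a sketch of the same idea using ProcessBuilder, which also lets stderr be merged into the stream being read. The class `ProcessOutput` and its methods are illustrative names; decoding uses the platform default charset, and the `ls` default assumes a Unix-like system.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names are illustrative, not from the original posts.
class ProcessOutput {

    // Collects every line from the stream, using the platform default charset.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Runs a command and returns what it printed (stderr is merged into stdout).
    static List<String> run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .start();
        // The process's stdout is an InputStream from the caller's point of view
        List<String> lines = readLines(process.getInputStream());
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // "ls" is the command from the question; it assumes a Unix-like system.
        String[] command = args.length > 0 ? args : new String[] {"ls"};
        for (String line : run(command)) {
            System.out.println(line);
        }
    }
}
```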

Corrupted JSON encoding?

I'm sending a JSON object of the same class from a servlet to an applet, but
all string variables in this class are missing some characters, such as 'ą', 'ę', 'ś', 'ń', 'ł'.
However, 'ó' is displayed normally (?). For example:
"Zaznacz prawid?ow? operacj? porównywania dwóch zmiennych typu"
Solution
I wish I could explain it more thoroughly, but as Henry noticed, it's the IDE causing this issue. I solved it using farmer1992's class from the Google ticket. It prints escaped Unicode characters (\u...), which was the only way my applet could decode the characters correctly. Also, I have to restart the NetBeans IDE from time to time to force the Tomcat servlet to work correctly (I have no idea why :) ).
Servlet code (updated with solution):
//begin of the servlet code extract
public void sendToApplet(HttpServletResponse response, String path) throws IOException
{
TestServlet x = new TestServlet();
x.load(path);
String json = new Gson().toJson(x);
response.setCharacterEncoding("UTF-8");
response.setContentType("application/json;charset=UTF-8");
PrintWriter out = response.getWriter();
//out.print(json);
//out.flush();
GhettoAsciiWriter out2 = new GhettoAsciiWriter(out);
out2.write(json);
out2.flush();
}
//end of the servlet code extract
Applet code:
//begin of the applet code extract
public void retrieveFromServlet(String path) throws MalformedURLException, IOException
{
String encoder = URLEncoder.encode(path, "UTF-8");
URL urlServlet = new URL("http://localhost:8080/ProjektServlet?action=" + encoder);
URLConnection connection = urlServlet.openConnection();
connection.setDoInput(true);
connection.setDoOutput(true);
connection.setUseCaches(false);
connection.setRequestProperty("Content-Type", "application/json;charset=UTF-8");
InputStream inputStream = connection.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
String json = br.readLine();
Test y = new Gson().fromJson(json, Test.class);
inputStream.close();
}
//end of the applet code extract
Those characters should be encoded in \uxxxx form.
See this ticket:
http://code.google.com/p/google-gson/issues/detail?id=388#c4
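To illustrate what the \uxxxx form means, here is a small sketch that escapes every non-ASCII character in a string. It is a stand-in written for this example, not the class from the ticket; `AsciiEscaper` and `escapeNonAscii` are made-up names.

```java
// Hypothetical stand-in for the escaping idea; this is not the class from the ticket.
class AsciiEscaper {

    // Replaces every char above 0x7F with its six-character backslash-u escape.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                // %04x pads the code unit to four hex digits
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input is written with escapes so this source file stays pure ASCII
        System.out.println(escapeNonAscii("prawid\u0142ow\u0105"));
    }
}
```

Because the output is pure ASCII, it survives any charset mismatch between servlet and applet, and a JSON parser turns the escapes back into the original characters.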
With this line
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
the platform character encoding will be used (which may or may not be UTF-8). Try to set the encoding explicitly with
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
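One way such "?" characters can arise, and which would also explain why 'ó' survives: 'ó' exists in ISO-8859-1, while 'ą', 'ę', 'ś', 'ń' and 'ł' do not, so encoding with that charset replaces them with '?'. Whether this is what happened in the question is an assumption; the sketch below (with the made-up names `CharsetDemo` and `roundTrip`) only demonstrates the effect.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative demo; the class and method names are hypothetical.
class CharsetDemo {

    // Encodes with one charset and decodes with another, as a sender/receiver pair would.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Escapes keep this file ASCII: U+0142 is l-with-stroke, U+00F3 is o-acute.
        String text = "\u0142\u00f3";
        String latin1 = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        String utf8 = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 round trip intact: " + text.equals(latin1));
        System.out.println("UTF-8 round trip intact: " + text.equals(utf8));
        System.out.println("U+0142 became '?': " + (latin1.charAt(0) == '?'));
    }
}
```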

Character looks like "?" at Reading the Content of an Uploaded File

I have a client that uploads a vcf file. On the server side I read its contents and save them to a txt file. But there is a character error when I read it: Turkish characters show up as "?". My read code is here:
FileItemStream item = null;
ServletFileUpload upload = new ServletFileUpload();
FileItemIterator iterator = upload.getItemIterator(request);
String encoding = null;
while (iterator.hasNext()) {
item = iterator.next();
if ("fileUpload".equals(item.getFieldName())) {
InputStreamReader isr = new InputStreamReader(item.openStream(), "UTF-8");
String str = "";
String temp="";
BufferedReader br = new BufferedReader(isr);
while((temp=br.readLine()) != null){
str +=temp;
}
br.close();
File f = new File("C:/sedat.txt");
BufferedWriter buf = new BufferedWriter(new FileWriter(f));
buf.write(str);
buf.close();
}
}
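You read the upload as UTF-8, but FileWriter writes with the default charset, which may not be UTF-8. Create the writer with an explicit encoding instead:
BufferedWriter buf = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), "UTF-8"));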
If this is production code, I would recommend writing the output straight to the file rather than accumulating it in a string first. You could also avoid any potential encoding issues by reading the source as an InputStream and writing it as an OutputStream, skipping the conversion to characters.
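The byte-for-byte approach suggested above could look roughly like this. The class `ByteCopy` and the method `copy` are illustrative names; since no bytes are decoded, no charset is involved at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; names are illustrative.
class ByteCopy {

    // Copies raw bytes without decoding them and returns how many were copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }
}
```

In the servlet from the question this would be called with `item.openStream()` and a `FileOutputStream` for the target file, both opened in a try-with-resources statement so they are closed afterwards.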

Extract HTML from URL

I'm using Boilerpipe to extract text from url, using this code:
URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);
The String text contains just the text of the HTML page, but I need to extract the whole HTML code from it.
Is there anyone who used this library and knows how to extract the HTML code?
You can check the demo page for more info on the library.
For something as simple as this you don't really need an external library:
URL url = new URL("http://www.google.com");
InputStream is = (InputStream) url.getContent();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line = null;
StringBuffer sb = new StringBuffer();
while((line = br.readLine()) != null){
sb.append(line);
}
String htmlContent = sb.toString();
Just use the KeepEverythingExtractor instead of the ArticleExtractor.
But this is using the wrong tool for the job. What you want is just to download the HTML content of a URL (right?), not extract content. So why use a content extractor?
With Java 7 and a trick of Scanner, you can do the following:
public static String toHtmlString(URL url) throws IOException {
Objects.requireNonNull(url, "The url cannot be null.");
try (InputStream is = url.openStream(); Scanner sc = new Scanner(is)) {
sc.useDelimiter("\\A");
if (sc.hasNext()) {
return sc.next();
} else {
return null; // or empty
}
}
}
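On Java 9 or later there is a shorter route that keeps the original line breaks and names the charset explicitly. This is a sketch under two assumptions: the runtime is Java 9+ (for `readAllBytes`) and the page is UTF-8. `HtmlFetch` and `toText` are made-up names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative.
class HtmlFetch {

    // Reads the stream to the end and decodes it, assuming UTF-8 content.
    static String toText(InputStream in) throws IOException {
        try (InputStream stream = in) {
            // readAllBytes() exists since Java 9
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HtmlFetch <url>");
            return;
        }
        System.out.println(toText(new URL(args[0]).openStream()));
    }
}
```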
