How to fetch source string from a utf-8 represented string

I have a page fetched from the Internet, and its content is UTF-8 text held in a String, which may look like this:
{"has_more": true, "items": [{"body": "\u6ca1\u6709\u4f20\u8bf4\u4e2d\u7684\u90a3\u4e48\u597d",...}
I tried to use URLDecoder.decode(), but it doesn't work: the output is exactly the same as the input. Any suggestions? This is a String object that already contains the escaped text, not an InputStream or anything like that. I have searched around but found little that is relevant.

The source code notation is u-encoded (\uXXXX), but once parsed, the String itself is indistinguishable from a normal string (Java/JavaScript), just as with \n or \t.
The JDK has a conversion tool though:
native2ascii -encoding UTF-8 -reverse mypage.json plain-utf8.json
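For a String that is already in memory (and bearing in mind that native2ascii was removed in JDK 9), the same reversal can be sketched in plain Java. This is only a minimal illustration: the class name UnicodeUnescape and the method unescape are made up for this example, and the sketch assumes every \uXXXX in the input really is an escape (it does not treat an escaped backslash followed by u specially, which a real JSON parser would).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern U_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each \\uXXXX sequence with the UTF-16 code unit it names.
    static String unescape(String input) {
        Matcher m = U_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below holds the characters \ u 6 c a 1 ..., as in the question.
        String raw = "\\u6ca1\\u6709\\u4f20\\u8bf4";
        System.out.println(unescape(raw));
    }
}
```

Surrogate pairs written as two consecutive escapes come out correctly because each escape is one UTF-16 code unit.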

That's JSON encoding, which escapes certain characters in a specific way. It is not URL encoding, which is why URLDecoder does not work.
Why don't you try using a JSON library? json-simple or Gson are good ones to start with.
As a curiosity, the encoding you're seeing is described in RFC 4627.

You can use Gson to convert them to a Map.

Check out the libs in Java - JSON in Java

Yes, it's JSON (JavaScript Object Notation), a lightweight data-interchange format.
Go through http://www.json.org/java/

Related

How to convert Protobuf ByteString to an octal-escaped String in java?

Could anyone please tell me how to convert protobuf's ByteString to an octal-escaped String in Java?
In my case, the ByteString value is \376\024\367, so when I print the string in the console using System.out.println(), I should get "\376\024\367".
Many thanks.

Normally, you'd convert a ByteString to a String using ByteString#toString(Charset). This method lets you specify what charset the text is encoded in. If it's UTF-8, you can also use the method toStringUtf8() as a shortcut.
From your question, though, it sounds like you actually want to produce the escaped format using C-style three-digit octal escapes. AFAIK there's no public function to do this, but you can see the code here. You could copy that code into your own project and use it.
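If copying the library code is not appealing, the escaping itself is short enough to sketch by hand. The following is not protobuf's own implementation, just an illustration under assumptions: it takes a plain byte[] (which ByteString#toByteArray() can provide), keeps printable ASCII other than the backslash and the double quote as-is, and writes every other byte as a three-digit octal escape. The class and method names are invented for this example.

```java
public class OctalEscape {
    // Hypothetical helper: renders bytes as text with C-style three-digit octal escapes.
    static String escapeBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 4);
        for (byte b : bytes) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean printable = v >= 0x20 && v < 0x7F && v != '\\' && v != '"';
            if (printable) {
                sb.append((char) v);
            } else {
                sb.append('\\')
                  .append((char) ('0' + ((v >> 6) & 3)))
                  .append((char) ('0' + ((v >> 3) & 7)))
                  .append((char) ('0' + (v & 7)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFE, 0x14, (byte) 0xF7};
        System.out.println(escapeBytes(data)); // prints \376\024\367
    }
}
```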

I have used akka's ByteString: http://doc.akka.io/japi/akka/2.3.7/akka/util/ByteString.ByteStrings.html
There you will find the method decodeString(java.lang.String charset).
Otherwise, see https://github.com/akka/akka/issues/18738

URL Decode Difference between C# and Java

I have a URL-encoded string, %B9q.
When I use this C# code:
string res = HttpUtility.UrlDecode("%B9q", Encoding.GetEncoding("Big5"));
it outputs 電, which is the correct result that I want.
But when I use Java's decode function:
String res = URLDecoder.decode("%B9q", "Big5");
I get the output ?q
Does anyone know why this happens and how I should solve it?
Thanks for any suggestions and help!

As far as I can tell from the relevant spec, it looks like Java's way of handling things is correct.
Especially the example presented when discussing URI to IRI conversion seems meaningful:
Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".

Maybe Java's URLDecoder ignores some rules of the Big5 encoding standard. C# does the same thing as browsers like Chrome, but Java's URLDecoder doesn't. See the related answer: https://stackoverflow.com/a/27635806/1321255
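One likely explanation, as far as I can tell from how java.net.URLDecoder works, is that it decodes each run of consecutive %XX escapes on its own, so in "%B9q" the lone byte 0xB9 reaches the Big5 decoder without its trail byte 0x71 (the literal q). A possible workaround is to collect escaped and literal characters into one byte buffer and decode it once at the end. The sketch below does that; PercentDecoder and decode are names invented here, and the code assumes the input contains only ASCII characters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class PercentDecoder {
    // Returns the value of a hex digit, or -1 if c is not one.
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: turns the whole input into bytes first, then decodes once.
    static String decode(String s, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("non-ASCII input at index " + i);
            }
            if (c == '%' && i + 2 < s.length()
                    && hex(s.charAt(i + 1)) >= 0 && hex(s.charAt(i + 2)) >= 0) {
                bytes.write((hex(s.charAt(i + 1)) << 4) | hex(s.charAt(i + 2)));
                i += 3;
            } else {
                // '+' means space in form encoding; a malformed '%' is copied literally.
                bytes.write(c == '+' ? ' ' : c);
                i++;
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Big5 ships with most full JDKs but is not a guaranteed charset, hence the check.
        if (Charset.isSupported("Big5")) {
            System.out.println(decode("%B9q", Charset.forName("Big5")));
        }
    }
}
```

The trade-off is that a literal ASCII character is now allowed to act as the second byte of a multi-byte character, which is what the C# result in the question implies but is not what URLDecoder does.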

How to add a URL String in a JSON object

I need to add a URL, typically in the format http:\somewebsite.com\somepage.asp.
When I create a string with the above URL and add it to a JSON object json
using
json.put("url",urlstring);
an extra "\" is added, and when I check the output it looks like http:\\\\somewebsite.com\\somepage.asp
When I give the URL as http://somewebsite.com/somepage.asp
the JSON output is http:\/\/somewebsite.com\/somepage.asp
Can you help me retrieve the URL as it is, please?
Thanks

Your JSON library automatically escapes characters like slashes. On the receiving end, you'll have to remove those backslashes by using a method like replace().
Here's an example:
String receivedUrlString = "http:\\/\\/somewebsite.com\\/somepage.asp"; // holds http:\/\/somewebsite.com\/somepage.asp
String cleanedUrlString = receivedUrlString.replace("\\", ""); // the String overload is needed; replace(char, char) cannot delete a character
cleanedUrlString should be "http://somewebsite.com/somepage.asp".
Hope this helps.
Reference: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replace(char,%20char)

Tichodroma's answer has nailed it. You can solve the "problem" by storing valid URLs.
In addition, the JSON format requires that backslashes in strings are escaped with a second backslash. If the 2nd backslash is left out, the result is invalid JSON. Refer to the JSON syntax diagrams at http://www.json.org
The fact that the double backslashes are giving you problems actually means that the software that is reading the files is broken. A properly written JSON parser will automatically de-escape the strings. The site I linked to above lists many JSON parser libraries written in many languages. You should use one of these rather than trying to write the JSON parsing code yourself.
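To make the escaping rule concrete: in a JSON string, \/ stands for /, \\ stands for a single backslash, and \" stands for a double quote. The sketch below undoes only those three escapes, in one left-to-right pass, so that an escaped backslash survives as one backslash instead of being deleted. It is an illustration of the rule rather than a substitute for a real JSON parser, and the class and method names are invented here.

```java
public class JsonSlashUnescape {
    // Hypothetical helper: undoes the JSON escapes \/ \\ and \" and leaves everything else alone.
    static String unescapeSlashes(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == '/' || next == '\\' || next == '"') {
                    out.append(next); // keep the escaped character, drop the backslash
                    i++;              // skip the character we just consumed
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal holds http:\/\/somewebsite.com\/somepage.asp
        System.out.println(unescapeSlashes("http:\\/\\/somewebsite.com\\/somepage.asp"));
        // prints http://somewebsite.com/somepage.asp
    }
}
```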

JSON decode issue

I'm trying to decode JSON output of a Java program (jackson) and having some issues.
The cause of the problem is the following snippet:
{
"description": "... lives\uMOVE™ OFFERS ",
}
Which causes ValueError: Invalid \uXXXX escape.
Any ideas on how to fix this?
EDIT: The output is from an Avro file, the Avro package uses jackson to emit records as JSON.
EDIT2: After poking about in the source files, it might be the case that the JSON is constructed manually (sorry jackson).

What's the original string supposed to look like? \uXXXX is a Unicode escape sequence, so the parser tries to read \uMOVE as a single escaped character, but MOVE is not a valid four-digit hex value. JSON is always assumed to be Unicode, so you'll likely need to fix the string in the originating app.

Try escaping the backslash in front of the u like this:
{
"description": "... lives\\uMOVE™ OFFERS ",
}
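If the producing application cannot be changed, that escaping could also be applied as a pre-processing step before the text reaches the parser. The sketch below is one way to do it, under the assumption that any \u not followed by four hex digits was meant as a literal backslash and u. JsonEscapeRepair and repair are hypothetical names, and valid escapes (including an already doubled backslash) are passed through untouched.

```java
public class JsonEscapeRepair {
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean hasFourHexDigits(String s, int start) {
        if (start + 4 > s.length()) return false;
        for (int k = start; k < start + 4; k++) {
            if (!isHex(s.charAt(k))) return false;
        }
        return true;
    }

    // Hypothetical helper: doubles the backslash of every invalid \\u escape.
    static String repair(String json) {
        StringBuilder out = new StringBuilder(json.length() + 8);
        int i = 0;
        while (i < json.length()) {
            char c = json.charAt(i);
            if (c != '\\' || i + 1 >= json.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = json.charAt(i + 1);
            if (next == 'u' && !hasFourHexDigits(json, i + 2)) {
                out.append("\\\\u"); // two backslashes, then u
            } else {
                out.append(c).append(next); // a valid escape pair, copied as-is
            }
            i += 2; // escapes are consumed in pairs so a doubled backslash is never re-read
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("{\"description\": \"... lives\\uMOVE OFFERS \"}"));
    }
}
```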

Basically, the input isn't valid JSON.
The spec at http://www.json.org/ defines how strings should be encoded. You will have to fix the JSON output from the other application.

This is a known bug in Avro versions < 1.6.0. See AVRO-851 for more details.

Jackson does not currently have a configuration feature to allow accepting such input. (Was it generated with Jackson?)
You could modify the stream parser to handle it. Follow the stack trace to the method(s) that would need changing.
You could submit a change request at http://jira.codehaus.org/browse/JACKSON for Jackson to be enhanced to provide such a feature, though I'm not sure how popular the request would be, and whether it would ever be implemented.

How do you unescape URLs in Java?

When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?

This is not XML escaping; it is URL-encoded text. It looks to me like you want to use the following on the URL strings.
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this.
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder class.

Starting from Java 10, use
URLDecoder.decode(url, StandardCharsets.UTF_8).
For earlier versions (Java 7/8/9), use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5.
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
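A minimal sketch of the two-argument form, written so that it compiles on older and newer JDKs alike. The wrapper class UrlUnescape is made up for this example; only URLDecoder.decode(String, String) comes from the JDK. Keep in mind that URLDecoder follows form-encoding rules, so a literal + in the input comes out as a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlUnescape {
    // Hypothetical wrapper that hides the checked exception; UTF-8 is always available.
    static String unescape(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required to be supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3"));
        // prints Rang De Basanti - Tu Bin Bataye.mp3
    }
}
```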

I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4

In my case the URL contained escaped HTML entities, so StringEscapeUtils.unescapeHtml4() from Apache Commons did the trick.
