I am parsing data from this XML url The text input varies, depending on the user. Whenever there are spaces in the text variable I get this exception:
org.apache.harmony.xml.ExpatParser$ParseException: At line 11, column 2: mismatched tag
org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:520)
org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:479)
org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:318)
org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:275)
gps.app.tkartor.XMLObjects.XMLParser.findStreet(XMLParser.java:99)
If there are no spaces it works fine. Parsing code:
public void findStreet(String searchWord) {
try {
url = new URL(
"http://maps.travelsouthyorkshire.com/iGNMSearchService.asmx/TextSearch?text="
+ searchWord + "&maxResults=100");
System.out.println(url.toString());
parserFactory = SAXParserFactory.newInstance();
parser = parserFactory.newSAXParser();
reader = parser.getXMLReader();
streetHandler = new StreetHandler();
reader.setContentHandler(streetHandler);
reader.parse(new InputSource(url.openStream()));//line 99
poi = streetHandler.getAllStreets();
} catch (Exception e) {
e.printStackTrace();
}
}
Indicate spaces with '+'
http://maps.travelsouthyorkshire.com/iGNMSearchService.asmx/TextSearch?text=West+bar&maxResults=100
So in your case you have to replace the spaces of your String "searchWord". Something like
searchWord = searchWord.replace(" ", "+");
When I view both of those URLs using my web browser, I get valid XML. Certainly, there is no error at line 11 as indicated by your stacktrace.
My conclusion is that fetching the URLs (particularly the one with the space in it) is giving a different result when you do it programmatically versus doing it in a web browser. I expect that is because the browser is "helpfully" fixing the URL to encode the space character before sending it. (And I suspect that you pasted that fixed URL into the question ...)
To confirm this diagnosis, you need to capture and view the actual file contents that your Android app gets from the server. My guess is that it is actually an HTML error page. That typically won't be valid XML, and hence the XML parse error.
If this turns out to be the problem, then you need to correctly encode the search string before embedding it in the URL. If you were using plain Java, then I'd suggest using URLEncoder.encode, or assembling the URL from its components using the URI class. There may be a better way on the Android platform ...
Related
Using Jsoup to scrape URLS and one of the URLS I keep getting has this  symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but it still remains in the code looking like this:
I cant find much online about this other than it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case I should be able to print the symbol if it is plain text but when I run
System.out.println("");
I get the following complication error:
and it reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}
That's not a compilation error. That's the eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in a cp1252 encoding, but that encoding can't express a .
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want, so you either configure your development environment to store source code using a more flexible encoding (such as UTF-8 the error message suggests), or avoid having that character in your source code, for instance by using its unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned,  is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.
"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option using CharsetDecoder
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();
Result
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/"
I found the issue resolved by just replacing URLs with this symbol because there are other URLs with Unicode symbols that were invisible that couldnt be converted ect..
So I just compared the urls to the following regex if it returns false then I just bypass it. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");
I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?
How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
Open the xml in notepad
Make sure you dont have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save
Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If it's anything else than UTF-8, just change the encoding part for the good one.
I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
I had the same problem in my JSF application which was having a comment line containing some special characters in the XMHTL page. When I compared the previous version in my eclipse it had a comment,
//Some � special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.
I had this problem, but the file was in UTF-8, it was just that somehow on character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.
I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".
This error comes when you are trying to load jasper report file with the extension .jasper
For Example
c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png
I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.
I have met the same problem and after long investigation of my XML file I found the problem: there was few unescaped characters like « ».
Those like me who understand character encoding principles, also read Joel's article which is funny as it contains wrong characters anyway and still can't figure out what the heck (spoiler alert, I'm Mac user) then your solution can be as simple as removing your local repo and clone it again.
My code base did not change since the last time it was running OK so it made no sense to have UTF errors given the fact that our build system never complained about it....till I remembered that I accidentally unplugged my computer few days ago with IntelliJ Idea and the whole thing running (Java/Tomcat/Hibernate)
My Mac did a brilliant job as pretending nothing happened and I carried on business as usual but the underlying file system was left corrupted somehow. Wasted the whole day trying to figure this one out. I hope it helps somebody.
I had the same issue. My problem was it was missing “-Dfile.encoding=UTF8” argument under the JAVA_OPTION in statWeblogic.cmd file in WebLogic server.
You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'
This error surprised me in production...
The error is because the char encoding is wrong, so the best solution is implement a way to auto detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
I'm building a small CMS-like application in Java, that takes a .txt file with shirt names/descriptions and loads the names/descriptions into an ArrayList of customShirts (small class I made). Then, it iterates through the ArrayList, and uses JSoup to parse a template (template.html) and insert the unique details of the shirt into the HTML. Finally, it pumps out each shirt into its own HTML file in an output folder.
When the descriptions are loaded into the ArrayList of customShirts, I replace all special characters with the appropriate character codes so they can be properly displayed (for example, replacing apostrophes with ’). The issue is, I've noticed that JSoup seems to automatically turn the character codes into the actual character, which is an issue, since I need the output to be displayable (which requires character codes). Is there anything I can do to fix this? I've looked at other workarounds, like at: Jsoup unescapes special characters, but they seem to require parsing the file before insertion with replaceAll, and I insert the character-code sensitive text with JSoup, which doesn't seem to make this an option.
Below is the code for the HTML generator I made:
public void generateShirtHTML(){
for(int i = 0; i < arrShirts.size(); i++){
File input = new File("html/template/template.html");
Document doc = null;
try {
doc = Jsoup.parse(input, "UTF-8", "");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Element title = doc.select("title").first();
title.append(arrShirts.get(i).nameToCapitalized());
Element headingTitle = doc.select("h1#headingTitle").first();
headingTitle.html(arrShirts.get(i).nameToCapitalized());
Element shirtDisplay = doc.select("p#alt1").first();
shirtDisplay.html(arrShirts.get(i).name);
Element descriptionBox = doc.select("div#descriptionbox p").first();
descriptionBox.html(arrShirts.get(i).desc);
System.out.println(arrShirts.get(i).desc);
PrintWriter output;
try {
output = new PrintWriter("html/output/" + arrShirts.get(i).URL);
output.println(doc.outerHtml());
//System.out.println(doc.outerHtml());
output.close();
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Shirt " + i + " HTML generated!");
}
}
Thanks in advance!
Expanding a little on my comment (since Stephan encouraged me..), you can use
doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
To tell Jsoup to escape / encode special characters in the output, eg. left double quotes (“) as “. To make Jsoup encode all special characters, you may also need to add
doc.outputSettings().charset("ASCII");
In order to ensure that all Unicode special characters will be HTML encoded.
For larger projects where you have to fill in data into HTML files, you can look at using a template engine such as Thymeleaf - it's easier to use for this kind of job (less code and such), and it offers many more features specifically for this process. For small projects (like yours), Jsoup is good (I've used it like this in the past), but for bigger (or even small) projects, you'll want to look into some more specialized tools.
I need to replace the spaces inside a string with the % symbol but I'm having some issues, what I tried is:
imageUrl = imageUrl.replace(' ', "%20");
But It gives me an error in the replace function.
Then:
imageUrl = imageUrl.replace(' ', "%%20");
But It still gives me an error in the replace function.
The I tried with the unicode symbol:
imageUrl = imageUrl.replace(' ', (char) U+0025 + "20");
But it still gives error.
Is there an easy way to do it?
String.replace(String, String) is the method you want.
replace
imageUrl.replace(' ', "%");
with
imageUrl.replace(" ", "%");
System.out.println("This is working".replace(" ", "%"));
I suggest you to use a URL Encoder for Encoding Strings in java.
String searchQuery = "list of banks in the world";
String url = "http://mypage.com/pages?q=" + URLEncoder.encode(searchQuery, "UTF-8");
I've ran into issues like this in the past with certain frameworks. I don't have enough of your code to know for sure, but what might be happening is whatever http framework you are using, in my case it was spring, is encoding the URL again. I spent a few days trying to solve a similar problem where I thought that string replace and the URI.builder() was broken. What ended up being the problem was that my http framework had taken my encoded url, and encoded it again. that means that any place it saw a "%20", it would see the '%' charictor and switch it out for '%' http code, "%25", resulting in. "%2520". The request would then fail because %2520 didn't translate into the space my server was expecting. While the issue apeared to be one of my encoding not working, it was really an issue of encoding too many times. I have an example from some working code in one of my projects below
//the Url of the server
String fullUrl = "http://myapiserver.com/path/";
//The parameter to append. contains a space that will need to be encoded
String param 1 = "parameter 1"
//Use Uri.Builder to append parameter
Uri.Builder uriBuilder = Uri.parse(fullUrl).buildUpon();
uriBuilder.appendQueryParameter("parameter1",param1);
/* Below is where it is important to understand how your
http framework handles unencoded url. In my case, which is Spring
framework, the urls are encoded when performing requests.
The result is that a url that is already encoded will be
encoded twice. For instance, if you're url is
"http://myapiserver.com/path?parameter1=param 1"
and it needs to be read by the server as
"http://myapiserver.com/path?parameter1=param%201"
it makes sense to encode the url using URI.builder().append, or any valid
solutions listed in other posts. However, If the framework is already
encoding your url, then it is likely to run into the issue where you
accidently encode the url twice: Once when you are preparing the URL to be
sent, and once again when you are sending the message through the framework.
this results in sending a url that looks like
"http://myapiserver.com/path?parameter1=param%25201"
where the '%' in "%20" was replaced with "%25", http's representation of '%'
when what you wanted was
"http://myapiserver.com/path?parameter1=param%201"
this can be a difficult bug to squash because you can copy the url in the
debugger prior to it being sent and paste it into a tool like fiddler and
have the fiddler request work but the program request fail.
since my http framework was already encoding the urls, I had to unencode the
urls after appending the parameters so they would only be encoded once.
I'm not saying it's the most gracefull solution, but the code works.
*/
String finalUrl = uriBuilder.build().toString().replace("%2F","/")
.replace("%3A", ":").replace("%20", " ");
//Call the server and ask for the menu. the Menu is saved to a string
//rest.GET() uses spring framework. The url is encoded again as
part of the framework.
menuStringFromIoms = rest.GET(finalUrl);
There is likely a more graceful way to keep a url from encoding twice. I hope this example helps point you on the right direction or eliminate a possability. Good luck.
Try this:
imageUrl = imageUrl.replaceAll(" ", "%20");
Replace spaces is not enought, try this
url = java.net.URLEncoder.encode(url, "UTF-8");
I'm making an android application in which I'm parsing an XML using SAX parser.
In the XML there is tag:
<title>Deals & Dealmakers: Technology, media and communications M&A </title>
As you can see it contains some special charters like &
The issue is I'm using SAX's implicit method:
#Override
public void characters(char[] ch, int start, int length) throws SAXException{}
Here, the parameter 'char[] ch' is supposed to fetch the entire line Deals & Dealmakers: Technology, media and communications M&A
But it is only getting "Deals ".
How can I solve this issue?
One issue might be because of the way I'm passing the XML to the SAX parser. Do I need to change the encoding or format?
Currently, I'm passing the XML as InputStream & using the below code:
HttpResponse httpResponse = utils.sendRequestAndGetHTTPResponse(URL);
if (httpResponse.getStatusLine().getStatusCode() == 200) {
HttpEntity entity = httpResponse.getEntity();
InputStream in = entity.getContent();
parseResponse(in);
}
// Inside parseResponse method:
try {
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xmlReader = sp.getXMLReader();
MyHandler handler = new MyHandler();
xmlReader.setContentHandler(handler);
xmlReader.parse(new InputSource(in));
} catch (Exception e) {
}
Here, the parameter 'char[] ch' is supposed to fetch the entire line Deals & Dealmakers: Technology, media and communications M&A But it is only getting "Deals ".
You seem to be assuming that you'll get the whole text in one call. There's no guarantee of that. I strongly suspect that your characters method will be called multiple times for the same text node, which is valid for the parser to do. You need to make sure your code handles that.
From the documentation:
SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
There may be a feature you can set to ensure you get all the data in one go; I'm not sure.
I guess UTF-8 is exactly the problem . In the file,you parsing the encoding is defined as ISO-8859-1
so just try following code:
InputSource is = new InputSource(yourInputStream);
is.setEncoding("ISO-8859-1");
xmlReader.parse(is);
hope this helps.