Lesson: Internationalization of Network Resources
In the modern Internet community, many users are no longer satisfied with using only ASCII symbols to identify a domain name or a web resource. For example, they would like to register a new domain using their native characters in Arabic or Chinese, or to define a new URI using Unicode characters. That is why the internationalization of network resources is a cornerstone in widening the horizons of the World Wide Web.
This lesson describes the internationalization of two kinds of network resources: domain names and resource identifiers.
Internationalized Domain Name
This section explains how to perform a mapping between a Unicode domain name and its ASCII form.
Internationalized Resource Identifier
This section shows how to use the mapping methods of the URI class to convert between an IRI and a URI.
Internationalized Domain Name
Historically, Internet domain names contained ASCII symbols only. Recently, however, the number of users who want to use Unicode characters when registering their domain names has increased steeply, while the domain name resolving system itself does not accept Unicode characters. Internationalizing Domain Names in Applications (IDNA) was adopted as the standard for this purpose: it converts Unicode characters into standard ASCII domain names and thus preserves the stability of the domain name system. Examples of internationalized domain names:
http:// .cn
http://www.транспорт.com
If you follow one of these links, you may notice that the Unicode domain name shown in the address bar is substituted by an ASCII string. You may be interested in how to perform such a conversion in your application. According to RFC 3490, IDNA does not extend the service offered by DNS to the applications. Instead, the applications (and, by implication, the users) continue to see an exact-match lookup service. Two main operations accomplish the conversion between ASCII and non-ASCII formats: ToASCII and ToUnicode.
In Java™ SE, the ToASCII operation is used before sending an IDN to the domain name resolving system or writing an IDN into a file where ASCII characters are expected (such as a DNS master file). The ToUnicode operation is used when displaying names to users, for example names obtained from a DNS zone.
The java.net.IDN class in Java™ SE performs these operations. This class has two methods for each operation. The toASCII(String input, int flag) method converts Unicode characters to ASCII. The flag parameter defines the behavior of the conversion process. The ALLOW_UNASSIGNED flag allows the processing of code points that are unassigned in Unicode 3.2, and the USE_STD3_ASCII_RULES flag enables the check against STD-3 ASCII rules. You can use these flags separately or logically OR'ed together. If no flag is needed, you can pass zero to the two-argument method or simply invoke the one-argument counterpart: toASCII(input);
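The two forms of toASCII described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name IdnToAsciiSketch and its helper methods are hypothetical, and the input is the www.транспорт.com example from this lesson, written with Unicode escapes so that the code does not depend on the encoding of the source file.

```java
import java.net.IDN;

final class IdnToAsciiSketch {
    // The www.<Cyrillic label>.com example from the lesson, spelled with
    // Unicode escapes so the file compiles regardless of its encoding.
    static final String UNICODE_NAME =
            "www.\u0442\u0440\u0430\u043d\u0441\u043f\u043e\u0440\u0442.com";

    // Hypothetical helper: the one-argument form, equivalent to a flag of zero.
    static String toAceDefault(String name) {
        return IDN.toASCII(name);
    }

    // Hypothetical helper: the two-argument form with both flags OR'ed together.
    static String toAceWithBothFlags(String name) {
        return IDN.toASCII(name, IDN.ALLOW_UNASSIGNED | IDN.USE_STD3_ASCII_RULES);
    }

    public static void main(String[] args) {
        // Both calls print an ASCII-only name whose converted label carries
        // the "xn--" ACE prefix.
        System.out.println(toAceDefault(UNICODE_NAME));
        System.out.println(toAceWithBothFlags(UNICODE_NAME));
    }
}
```

For this particular input both calls are expected to return the same string, because every character is assigned and the name contains no ASCII characters that the STD-3 rules forbid.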
If the input argument does not conform to RFC 3490, this method throws an IllegalArgumentException.
String ace_name = IDN.toASCII("http:// .cn/");
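To illustrate the exception described above, here is a small sketch under stated assumptions. The class IdnValidationSketch, the helper tryToAscii, and the name bad_label.example are hypothetical. The underscore is used because it is not a letter, digit, or hyphen, so it should be rejected once the USE_STD3_ASCII_RULES check is enabled, while a flag of zero skips that check.

```java
import java.net.IDN;
import java.util.Optional;

final class IdnValidationSketch {
    // Hypothetical helper: returns the ACE form, or an empty Optional when
    // toASCII rejects the input with an IllegalArgumentException.
    static Optional<String> tryToAscii(String name, int flag) {
        try {
            return Optional.of(IDN.toASCII(name, flag));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String withUnderscore = "bad_label.example"; // hypothetical input

        // Without the STD-3 check the name is passed through.
        System.out.println(tryToAscii(withUnderscore, 0).isPresent());

        // With the STD-3 check the underscore causes a rejection.
        System.out.println(
                tryToAscii(withUnderscore, IDN.USE_STD3_ASCII_RULES).isPresent());
    }
}
```

Wrapping the call this way is only one design choice; an application may prefer to let the exception propagate so that the caller can report why the name was rejected.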
The toUnicode method translates a string from ASCII Compatible Encoding (ACE) to Unicode code points. This method never fails; in case of any error, the input string is returned unmodified.
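The behavior of toUnicode can be sketched as a round trip plus a deliberately malformed input. The class IdnToUnicodeSketch, the helper roundTrip, and the label xn--!!! (which carries the ACE prefix but is not valid Punycode) are assumptions made for illustration only.

```java
import java.net.IDN;

final class IdnToUnicodeSketch {
    // Hypothetical helper: Unicode name -> ACE -> Unicode name.
    static String roundTrip(String unicodeName) {
        return IDN.toUnicode(IDN.toASCII(unicodeName));
    }

    public static void main(String[] args) {
        // The lesson's Cyrillic example, written with Unicode escapes.
        String name = "www.\u0442\u0440\u0430\u043d\u0441\u043f\u043e\u0440\u0442.com";
        System.out.println(roundTrip(name).equals(name));

        // A label that starts with the ACE prefix but cannot be decoded:
        // toUnicode does not throw and hands the input back unchanged.
        String malformed = "xn--!!!.example";
        System.out.println(IDN.toUnicode(malformed).equals(malformed));
    }
}
```

Because toUnicode reports no errors, a caller that needs to know whether decoding actually happened has to compare the result with the input, as the second check does.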
Security concern
A potential security risk arises because IDN allows websites to use Unicode names. This makes it easier to create a web site whose domain name, security certificates, or even outward appearance look exactly like your own site, while in fact it is used for phishing in order to collect private information about your site visitors. Such sites are called spoofed web sites. For example, somebody can register a site whose domain name looks identical to yours by substituting a small Latin "a" or "o" with a resembling Cyrillic "a" or "o". In this case, the new domain points users to another site and potentially exposes them to homograph attacks. This issue has been well known since the IDN concept was first introduced. You can avoid it by turning off IDN support in the browser entirely: type "about:config" into the address bar, find the "network.enableIDN" setting, and change its value to "false".
Also, both Mozilla and Opera have announced the use of per-domain whitelists to selectively switch on IDN for those domains that take appropriate anti-spoofing precautions. You can adjust the "network.IDN.whitelist." settings to enable or disable a whitelist for a particular language.
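An application that displays domain names can also apply its own check for the Latin and Cyrillic substitution described in this section. The following is only a heuristic sketch, not the mechanism that browsers use: the class MixedScriptSketch and its methods are hypothetical, and a label that legitimately mixes scripts would be flagged as well.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class MixedScriptSketch {
    // Collects the Unicode scripts of all letters in a single label.
    static Set<UnicodeScript> scriptsOf(String label) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        label.codePoints()
             .filter(Character::isLetter)
             .forEach(cp -> scripts.add(UnicodeScript.of(cp)));
        return scripts;
    }

    // Hypothetical check: true when a label contains both Latin and Cyrillic letters.
    static boolean mixesLatinAndCyrillic(String label) {
        Set<UnicodeScript> scripts = scriptsOf(label);
        return scripts.contains(UnicodeScript.LATIN)
                && scripts.contains(UnicodeScript.CYRILLIC);
    }

    public static void main(String[] args) {
        String latinOnly = "paypal";
        // Same appearance, but the second letter is the Cyrillic small "a".
        String spoofed = "p\u0430ypal";

        System.out.println(mixesLatinAndCyrillic(latinOnly));
        System.out.println(mixesLatinAndCyrillic(spoofed));
    }
}
```

The check works on one label at a time, so a caller would split the domain name on dots first and would typically run it on the Unicode form returned by IDN.toUnicode.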
Internationalized Resource Identifier
An Internationalized Resource Identifier (IRI), like an IDN, may contain Unicode characters, while a Uniform Resource Identifier (URI) is limited to ASCII symbols only. According to RFC 3987, IRIs are meant to replace URIs in identifying resources for protocols, formats, and software components that use a UCS-based character repertoire. At first sight, you might expect this task to be solved by the same means as for IDN, but that is not exactly the case. Let's look at the structure of a resource identifier. In its hierarchical form, as documented for the java.net.URI class, it is written [scheme:][//authority][path][?query][#fragment].
You may notice that it has several components. The authority component of a URI is parsed according to the following syntax:
[user-info@]host[:port]
where the characters @ and : stand for themselves. The host component can be an IP-literal, an IPv4 address, or just a name. In the case where the host is a domain name, the IDN approach, that is, the mapping, can be applied.
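The idea of applying the IDN mapping to the host alone can be sketched as follows. This is one possible approach, under the assumption that the components are already available separately: the class AuthorityIdnSketch and the helper withAsciiHost are hypothetical, as are the user name, port, and path used in main.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

final class AuthorityIdnSketch {
    // Hypothetical helper: maps only the host to its ACE form and then
    // assembles the URI from its components.
    static URI withAsciiHost(String scheme, String userInfo, String unicodeHost,
                             int port, String path) {
        String asciiHost = IDN.toASCII(unicodeHost);
        try {
            return new URI(scheme, userInfo, asciiHost, port, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The lesson's Cyrillic host name, written with Unicode escapes.
        String host = "www.\u0442\u0440\u0430\u043d\u0441\u043f\u043e\u0440\u0442.com";
        URI uri = withAsciiHost("http", "user", host, 8080, "/index.html");

        // The three parts of [user-info@]host[:port]
        System.out.println(uri.getUserInfo());
        System.out.println(uri.getHost());
        System.out.println(uri.getPort());
    }
}
```

Converting the host before the URI is built keeps the other components untouched, which matches the point that the IDN mapping applies only where the host is a domain name.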
But generally the URI structure is more complicated. Applications can use the URI-reference syntax to refer to a URI instead of always using the generic syntax rule above. A URI-reference is either a URI or a relative reference. If a URI-reference does not specify a scheme, it is said to be a relative reference. Usually, a relative reference expresses a URI reference relative to the name space of another URI. Nevertheless, instances of the java.net.URI class can represent IRIs whenever they contain non-ASCII characters. This class was enhanced with the following methods to perform the operations and conversions according to RFC 3987:
toASCIIString() - converts an IRI to a URI and returns its content as a US-ASCII string.
toString() - returns the content of this URI as a string in its original Unicode form.
toIRIString() - converts this URI to an IRI and returns its content as a string.
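The difference between the Unicode form and the US-ASCII form can be sketched with a short example. It relies only on toASCIIString() and toString(). The class name IriToUriSketch and the address http://example.com/ with a Cyrillic path segment are assumptions chosen for illustration, and the non-ASCII characters are kept in the path only.

```java
import java.net.URI;

final class IriToUriSketch {
    // Hypothetical identifier: an ASCII host followed by a four-letter
    // Cyrillic path segment, written with Unicode escapes.
    static final String IRI = "http://example.com/\u043f\u0443\u0442\u044c";

    public static void main(String[] args) {
        URI uri = URI.create(IRI);

        // US-ASCII form: the non-ASCII characters are percent-encoded
        // as UTF-8 octets.
        System.out.println(uri.toASCIIString());

        // Unicode form: the string is returned as it was given.
        System.out.println(uri.toString().equals(IRI));
    }
}
```

URI.create is used here instead of the constructor only to avoid the checked URISyntaxException in a short example.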
Consider the following code:
URI uri = new URI("http:// .cn/");
HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
conn.getResponseCode();
Unfortunately, this cannot be done at present; it is planned for the next release of Java™ SE.