German umlauts and UTF-8
By Antje Binas-Holz
Last Update: Thursday, December 29, 2005
Netcat is one of the best TCP tools available. It lets you implement TCP servers and clients without writing a single line of code. The tool is freeware and can be downloaded from vulnwatch.org.
We were recently asked in our forum about German umlauts in UTF-8 pages. The problem turned out to deserve a deeper discussion. It came up while working with CityDesk as a content manager. Everything looked fine and the CityDesk templates had been configured correctly, yet all characters above code 127 were displayed incorrectly in the browser. What went wrong?
CityDesk stores sites in Unicode, specifically in UTF-8. The browser therefore has to be told about it with the HTML header meta tag charset=UTF-8 (as mentioned in the CityDesk documentation). The best place for this is the appropriate template, where all HTML header tags should be defined, for example:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >
Unfortunately, this did not cure the problem. The browser seemed to ignore the tag, and after some research we found that the following response header sent by the web server was the troublemaker:
Content-Type: text/html; charset=ISO-8859-1
In other words, the character set declared in the HTML header was overridden by the corresponding response header entry, which carried a different value. In this case the web server had to be reconfigured. It is actually not a good idea to have the server define the character set, because then it is not possible to host sites with different character sets on the same server (Arabic, Russian and English sites, for example). Assuming the web server software is Apache, the problem can be fixed very easily by modifying the configuration file httpd.conf. Somewhere in that file there must be a line like
AddType "text/html; charset=ISO-8859-1" html
which should be replaced by
AddType "text/html" html
or
AddType "text/html; charset=UTF-8" html
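To see why a mismatched header garbles umlauts, the effect can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `mislabel` and the sample string are illustrative assumptions, not part of the original setup. It encodes text as UTF-8 (what CityDesk stores) and decodes the bytes the way a browser would if it trusted a `charset=ISO-8859-1` response header.

```python
def mislabel(text: str, declared_charset: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode it using the charset the server declared.

    Hypothetical helper for illustration: it mimics a browser that trusts
    the Content-Type response header instead of the page's real encoding.
    """
    utf8_bytes = text.encode("utf-8")           # bytes as stored and sent
    return utf8_bytes.decode(declared_charset)  # bytes as interpreted by the browser


sample = "äöü"  # each umlaut takes two bytes in UTF-8

garbled = mislabel(sample)           # declared as ISO-8859-1: two characters per umlaut
correct = mislabel(sample, "utf-8")  # declared correctly: round-trips unchanged
```

With the wrong declaration, `mislabel("äöü")` yields `Ã¤Ã¶Ã¼`, the typical picture of broken umlauts, while the correct declaration returns the text unchanged.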
Response headers are normally not visible in the browser, but they can easily be displayed by telnetting to the site or by using Netcat, a small but very powerful TCP tool.
Netcat is a Unix-based utility that has also been made available for Windows (see sidebar info). It is sometimes called the "TCP/IP Swiss army knife", which is a pretty good description overall. It can create connections to TCP servers and, moreover, it can act as a TCP server itself. For our problem it can be used to retrieve a web page including the HTTP response header that the server sends in front of the content. Assuming we are working on a Windows box, the following script (httptest.bat) stores the response header and content of the current web page in one text file:
del request.txt
echo GET /ger/sections/tips/utf8.html HTTP/1.0> request.txt
echo Accept: */*>> request.txt
echo User-Agent: Mozilla/4.0>> request.txt
echo Host: www.sqldbu.com>> request.txt
echo.>>request.txt
type request.txt | netcat -w 3 -i 1 www.sqldbu.com 80 > response.txt
The script can easily be changed to fetch another page, for example one that does not display German umlauts correctly. Assuming we are interested in /de/index.html located at www.badserver.de, the script can be adapted as follows:
del request.txt
echo GET /de/index.html HTTP/1.0> request.txt
echo Accept: */*>> request.txt
echo User-Agent: Mozilla/4.0>> request.txt
echo Host: www.badserver.de>> request.txt
echo.>>request.txt
type request.txt | netcat -w 3 -i 1 www.badserver.de 80 > response.txt
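For readers without Netcat at hand, the same request can be sketched in Python. This is a rough equivalent under stated assumptions, not the script from utf8.zip: the function names `build_request` and `fetch_raw` are made up for this example, and the 3 second timeout simply mirrors the `-w 3` option used above. The request lines are the ones the batch file writes to request.txt, joined with CRLF as HTTP expects.

```python
import socket


def build_request(host: str, path: str) -> bytes:
    """Build the same HTTP/1.0 request that the batch file writes to request.txt."""
    lines = [
        f"GET {path} HTTP/1.0",
        "Accept: */*",
        "User-Agent: Mozilla/4.0",
        f"Host: {host}",
        "",  # the empty line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host: str, path: str, port: int = 80, timeout: float = 3.0) -> bytes:
    """Send the request and return the raw response (headers and body) as bytes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling `fetch_raw("www.badserver.de", "/de/index.html")` and writing the result to response.txt in binary mode should give a file comparable to the one produced by the batch script.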
Run the script and then take a look at the file response.txt. If you have the problem described above, you will most probably see something like this:
HTTP/1.1 200 OK
Date: Wed, 14 Dec 2005 13:05:40 GMT
Server: Apache/1.3.28 (Unix) FrontPage/5.0.2.2634 PHP/4.3.3 AuthMySQL/2.20 mod_ssl/2.8.15 OpenSSL/0.9.7a
Last-Modified: Wed, 14 Dec 2005 12:59:50 GMT
ETag: "394263-3850-43a01746"
Accept-Ranges: bytes
Content-Length: 14416
Connection: close
Content-Type: text/html; charset=ISO-8859-1

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
...
</html>
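Instead of reading response.txt by eye, the charset declared by the server can also be extracted programmatically. The following is a small sketch using Python's standard `email` parser, which understands the same header syntax. The function name `header_charset` is an assumption made for this example.

```python
from email.parser import Parser


def header_charset(raw: bytes):
    """Return the charset from the Content-Type response header, or None if absent."""
    # Normalize line endings, then cut the header block off at the first empty line.
    head, _, _body = raw.replace(b"\r\n", b"\n").partition(b"\n\n")
    lines = head.decode("iso-8859-1").split("\n")
    # lines[0] is the status line ("HTTP/1.1 200 OK"); the rest are header fields.
    msg = Parser().parsestr("\n".join(lines[1:]), headersonly=True)
    return msg.get_content_charset()  # lower-cased charset name, or None
```

For a response like the one shown above, `header_charset(open("response.txt", "rb").read())` returns `iso-8859-1`. A return value of `None` means the server leaves the charset to the page's own meta tag.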
If you get the same result (charset set by the server), talk to your provider or web server administrator and ask them to change the Content-Type response header so that it is returned as follows:
Content-Type: text/html
To make sure in advance that this really solves the problem, you can easily test it yourself. Just use Netcat to start a server process on your computer that sends the response.txt received by httptest.bat to the client (your web browser). The following script (Server.bat) implements such a web server using Netcat:
:start
type response.txt | netcat -l -p 8080
goto start
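The same replay server can be sketched in Python if Netcat is not installed. This is only an approximation of what Server.bat does, with function names (`serve_once`, `serve_forever`) chosen for this example: it accepts one connection at a time, reads and ignores the browser's request, and sends back the saved bytes unchanged.

```python
import socket


def serve_once(listener: socket.socket, payload: bytes) -> None:
    """Accept a single connection and replay the saved response to it."""
    conn, _addr = listener.accept()
    with conn:
        conn.recv(65536)       # read (and ignore) the browser's request
        conn.sendall(payload)  # status line, headers and body, exactly as saved


def serve_forever(path: str = "response.txt", port: int = 8080) -> None:
    """Replay the file to every client, like the loop in Server.bat."""
    with open(path, "rb") as f:
        payload = f.read()
    with socket.create_server(("127.0.0.1", port)) as listener:
        while True:
            serve_once(listener, payload)
```

Calling `serve_forever()` in the directory that contains response.txt starts the server on port 8080. It runs until interrupted.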
Start the server and test it by typing the following URL into your browser's address bar:
http://localhost:8080/
I used port 8080 on purpose, in case you are already running a local web server on the default port 80. If not, you can start the Netcat-based server on port 80 and type just the following address into your browser:
http://localhost
You should not see any difference from your real-world web server's response, because Netcat is sending identical data. Now modify response.txt to delete or change the charset entry in the line
Content-Type: text/html; charset=ISO-8859-1
and repeat the test. With this test you can analyze locally how different Content-Type settings in the response header and in the HTML code affect the result, because you run both sides of the TCP communication on your own computer.
Note: The file utf8.zip contains the scripts described here, plus some additional Netcat-based HTTP server scripts that send one and the same content (body.txt) prefixed by different charset response headers to the client in order to demonstrate the problem discussed.
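Editing response.txt by hand works fine, but the change can also be scripted. The following sketch rewrites only the Content-Type line of a saved response and leaves the body untouched, so Content-Length stays valid. The function name `set_header_charset` is hypothetical, and the code assumes the file still has the CRLF line endings sent by the server.

```python
def set_header_charset(raw: bytes, charset=None) -> bytes:
    """Replace or remove the charset parameter of the Content-Type response header.

    charset=None drops the parameter; a name such as "UTF-8" sets it.
    Assumes CRLF line endings, as sent by the web server.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    new_lines = []
    for line in head.split(b"\r\n"):
        if line.lower().startswith(b"content-type:"):
            # Keep the media type (e.g. text/html), discard any old parameters.
            media_type = line.split(b":", 1)[1].split(b";")[0].strip()
            line = b"Content-Type: " + media_type
            if charset is not None:
                line += b"; charset=" + charset.encode("ascii")
        new_lines.append(line)
    return b"\r\n".join(new_lines) + sep + body
```

Reading response.txt in binary mode, passing the bytes through `set_header_charset`, and writing the result back gives the file the Netcat server should send in the repeated test.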