Fun with Unicode

杨国华

May 19, 2014

CPOL

Type Unicode directly into a text box, including support for surrogate pairs. Create simple web pages to display exotic fonts.

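Introduction

This article presents a technique that lets you type Unicode directly into a text box without a dedicated Input Method Editor (IME) or the Character Map tool. It also discusses surrogate pair encoding and the implementation of a fun tool for creating simple web pages that display exotic fonts.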

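Background

Some preliminary concepts

Unicode code points

A Unicode code point is referred to by writing "U+" followed by its hexadecimal number. Code points in the Basic Multilingual Plane (BMP) use four digits. For example, U+222B is the code point for the integral sign "∫". Code points in the other planes can have 5 hexadecimal digits. For example, the Aegyptus font used later in this article places its Egyptian hieroglyphs at U+F3000 - U+F4B92, a range that lies in a private use plane (the Unicode Standard itself encodes Egyptian hieroglyphs starting at U+13000).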

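Unicode encodings

All Unicode code points can be encoded in either of the two standard encoding forms discussed here: UTF-16 and UTF-8.

UTF-16 uses one 16-bit code unit (two bytes) for most characters; the exception is surrogate pairs. With big-endian byte order, U+222B is encoded as hexadecimal 22 2B; with little-endian byte order, it is encoded as 2B 22. Code points outside the Basic Multilingual Plane are encoded as two 16-bit code units, each written as four hexadecimal digits. For more details on how this encoding is done, see Surrogate Support in Microsoft Products.

UTF-8 is an encoding standard that uses one to four bytes to encode each Unicode code point.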
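To see these byte sequences for yourself, here is a minimal sketch using the standard System.Text.Encoding classes. The class name EncodingDemo and the helper ToHex are made up for illustration and are not part of the demo project.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the article's download.
static class EncodingDemo
{
    // Encodes the text with the given encoding and returns the bytes
    // as space-separated uppercase hexadecimal, e.g. "22 2B".
    public static string ToHex(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);   // GetBytes does not emit a byte order mark
        return BitConverter.ToString(bytes).Replace("-", " ");
    }
}
```

For the integral sign, EncodingDemo.ToHex("\u222B", Encoding.BigEndianUnicode) gives "22 2B", Encoding.Unicode (little-endian UTF-16) gives "2B 22", and Encoding.UTF8 gives the three bytes "E2 88 AB".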

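Glyphs

These are the graphics used to render, on a display, the characters that Unicode code points represent. Note that in languages such as Arabic, the glyph used for a given Unicode code point depends on the adjacent characters.

Fonts

These are collections of glyphs, usually grouped by language or usage. Each glyph in a font file is tagged with a Unicode code point. If you want to look at some interesting font files, visit this website: Unicode Fonts for Ancient Scripts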

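IME (Input Method Editor)

A language-specific tool for efficiently producing Unicode code points to enter into a Unicode-enabled text input interface. In Windows 7, you can install a new IME via Control Panel -> Region and Language -> Keyboards and Languages.

Character Map tool

A general-purpose tool from Microsoft that can produce all the Unicode code points of the Basic Multilingual Plane, which you can then copy and paste into a Unicode-enabled text input interface. You can open it via Start -> All Programs -> Accessories -> System Tools -> Character Map.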

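Private Character Editor

A little-known tool for creating and editing characters in the Private Use Area, U+E000 - U+F8FF. This area holds 6400 characters and is reserved for private use. The Private Character Editor can be launched from c:\windows\system32\eudcedit.exe.

The glyphs you create are stored in the files c:\windows\fonts\eudc.euf and c:\windows\fonts\eudc.tte. These files are hidden if you try to access them with Explorer. However, you can copy them out using the command prompt.

To view the glyphs you created, open Character Map and search for the font "All Fonts (Private Characters)". Alternatively, you can use the program developed here.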

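Using the code

The code below does the main work of generating Unicode code points. It runs when the user types into the bottom text box (textBox3). It checks whether the key entered is <space> and whether the preceding characters have the form "U+####" or "U+#####"; if so, it replaces those characters with the encoding of the Unicode code point they represent.

Note that the code works for the Basic Multilingual Plane ("U+####") as well as for the other planes ("U+#####"), where each Unicode code point is written with 5 hexadecimal digits.

You also need a Unicode font for the text box. I use Arial Unicode MS at 14.25pt, which comes with Windows 7.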

        private void HandleKeyPress(object sender, KeyPressEventArgs e)
        {
            TextBox textbox = (TextBox)sender;
            //System.Diagnostics.Debug.WriteLine(textbox.Text + " " + e.KeyChar);
            string s = "";
            
            if (e.KeyChar == ' ' && textbox.SelectionStart >= 6)
            {
                textbox.SelectedText = "";
                
                //n is the number of chars preceding the cursor position 
                //that we want to analyze

                //U+#### is 6 chars and U+##### is 7 chars
                //if possible we will analyze 7 chars
                //Otherwise if cursor position is < 7, we take n as 6
                int n = (textbox.SelectionStart == 6) ? 6 : 7;

                //take the n preceding chars from the cursor position
                //this would be the text that we want to analyze
                s = textbox.Text.Substring(textbox.SelectionStart - n, n);

                //n1 is the position of the search pattern header "U+" 
                int n1 = s.ToUpper().IndexOf("U+");

                // System.Diagnostics.Debug.WriteLine(s);

                //if we found "U+" header for the search patterns
                if (n1 >= 0)
                {
                    //get the chars after the "U+" header
                    //s1 are the following chars up till the cursor position
                    //s1 could be a valid unicode code point
                    string s1 = s.Substring(n1 + 2, s.Length - (n1 + 2));
                    //System.Diagnostics.Debug.WriteLine(s1);
                    
                    //we attempt to encode s1 in utf16 encoding
                    string s2 = "";
                    unicodepoint2utf16(s1, ref s2);

                    //if we have a valid utf16 encoding 
                    //we get actual character from the utf16 string representation
                    if (s2 != "")
                    {
                        uint d = Convert.ToUInt32(s2, 16);
                        uint maskb0 = Convert.ToUInt32("FF000000", 16);
                        uint maskb1 = Convert.ToUInt32("FF0000", 16);
                        uint maskb2 = Convert.ToUInt32("FF00", 16);
                        uint maskb3 = Convert.ToUInt32("FF", 16);
                        byte b0 = (byte)((d & maskb0) >> 24);
                        byte b1 = (byte)((d & maskb1) >> 16);
                        byte b2 = (byte)((d & maskb2) >> 8);
                        byte b3 = (byte)((d & maskb3));

                        byte[] bytes;

                        //b0 is the highest order byte and 
                        //b3 is the lowest order byte
                        //a code unit is 4 hex digits, ie 2 bytes
                        //b0,b1 is the high order code unit
                        //b2,b3 is the low order code unit

                        //Note that Windows uses Little Endian 
                        //for the byte ordering for char
                        //so to encode the code unit (each a char of 2 bytes)
                        //we have to put the lower order byte to the left
                        //the encoding for the code units would be as follows
                        //high order code unit: b1,b0
                        //low order code unit: b3,b2

                        if (b0 == 0 && b1 == 0)
                            //high order code unit is 0000
                            //we only need use the low order code unit
                            bytes = new byte[] { b3, b2 };
                        else
                            //high order code unit <> 0000
                            //this is a surrogate pair encoding
                            //we need 2 code units
                            //b1, b0 for high order code unit
                            //b3, b2 for low order code unit
                            bytes = new byte[] { b1, b0, b3, b2 };

                        UnicodeEncoding u = new UnicodeEncoding();
                        //generate the character from the byte array
                        //Note that if we send in a surrogate pair encoding of 4 bytes
                        //we would get a double char character
                        //a double char character is rendered as one glyph
                        //but has a length of 2,
                        //if s3 holds a double char character, s3.Length is 2
                        string s3 = u.GetString(bytes);
                        //System.Diagnostics.Debug.WriteLine(s3);
                        //(n-n1) is the number of chars that we want to replace
                        //it is the length of the U+#### or U+#####
                        //we select these chars to be replaced 
                        //by s3 (the Unicode character that we generated)
                        textbox.SelectionStart = textbox.SelectionStart - (n - n1);
                        textbox.SelectionLength = (n - n1);
                        textbox.SelectedText = s3;

                        //we have taken care of the <space> entered
                        //do not further process it
                        e.Handled = true;
                    }
                }
            }
        }

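The unicodepoint2utf16() function takes a Unicode code point string and a string reference, which it modifies to hold the resulting UTF-16 encoding. The resulting UTF-16 string contains either 4 hexadecimal digits (for U+####) or 8 hexadecimal digits (for U+#####). In an 8-digit output string, the first 4 and the last 4 hexadecimal digits form the surrogate pair of the UTF-16 encoding. For example, U+2040A is encoded as the pair D841, DC0A. A surrogate pair consists of 2 code units, whose ranges are

High: U+D800 - U+DBFF

Low: U+DC00 - U+DFFF

This encoding scheme allows for (DBFF-D800+1)*(DFFF-DC00+1) = 1048576 code points!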

        //unicode point to utf-16 including surrogate pair encoding
        private void unicodepoint2utf16(string unp, ref string utf16)
        {         
            utf16 = "";

            //test for 5 hexadecimal unicode point
            //remove any leading "0" or spaces
            uint testint=0;
            string simplified_unp = "";
            try
            {
                testint = Convert.ToUInt32(unp, 16);
            }
            catch
            {   //not a hexadecimal
                return;
            }
            
            simplified_unp=testint.ToString("x");
       
            if (simplified_unp.Length == 5)
            {
                try
                {
                    uint d = Convert.ToUInt32(simplified_unp, 16);

                    uint d1 = Convert.ToUInt32("10000", 16);
                    uint d2 = d - d1;
                    uint p1 = d2 >> 10;
                    uint m1 = Convert.ToUInt32("1111111111", 2);
                    uint p2 = d2 & m1;
                    uint d800 = Convert.ToUInt32("d800", 16);
                    uint dc00 = Convert.ToUInt32("dc00", 16);
                    uint s1 = d800 + p1;
                    uint s2 = dc00 + p2;
                    utf16 = s1.ToString("x4") + s2.ToString("x4");
                }
                catch
                {
                    return;
                }
                return;
            }

            //for checking of 4 hexadecimals, we include leading 0 but not leading spaces
            if (unp.Length == 4 && unp.TrimStart(' ').Length ==4)
            {
                try
                {
                    uint d = Convert.ToUInt32(simplified_unp, 16);
                    utf16 = d.ToString("x4");
                }
                catch
                {
                    return;
                }
               
            }
        }
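For comparison, here is a compact sketch of the same idea. It is not the code shipped with the demo. ToSurrogates spells out the surrogate arithmetic with integers instead of hex strings, and TryConvert leans on the framework's char.ConvertFromUtf32 to build the final string. The class and method names are hypothetical, and unlike the function above, this sketch also accepts 6-digit code points up to U+10FFFF and rejects lone surrogates.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, shown only to illustrate the technique.
static class CodePointSketch
{
    // Splits a code point in U+10000..U+10FFFF into its UTF-16 surrogate pair.
    public static void ToSurrogates(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");
        int offset = codePoint - 0x10000;          // 20-bit value
        high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
        low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits
    }

    // Converts the hex digits that follow "U+" into the character they denote.
    // Returns false when the digits are not a usable code point.
    public static bool TryConvert(string hexDigits, out string text)
    {
        text = null;
        if (hexDigits == null || hexDigits.Length < 4 || hexDigits.Length > 6)
            return false;

        int codePoint;
        if (!int.TryParse(hexDigits, NumberStyles.AllowHexSpecifier,
                          CultureInfo.InvariantCulture, out codePoint))
            return false;                          // not hexadecimal

        // Reject values beyond Unicode and unpaired surrogate code units.
        if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            return false;

        text = char.ConvertFromUtf32(codePoint);   // yields 1 or 2 chars
        return true;
    }
}
```

With this sketch, ToSurrogates(0x2040A, ...) yields D841 and DC0A, matching the example above, and TryConvert("2040a", out text) produces a string whose Length is 2.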

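Basic demo

When the demo starts, the bottom text box contains: ....U+265b<press space to get the character for this Unicode>

Press the <space> key, and the code U+265b is replaced by the character that U+265b represents. Guess what it is?

You can select "Help" from the combo box for help on using the top-left text box.

Here are some sample Unicode code points you can try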

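CJK (Simplified Chinese for "east"): U+4E1C. Type U+4e1c followed by a space

Greek (pi): U+03C0. Type U+03c0 followed by a space

Symbol (spade): U+2664. Type U+2664 followed by a space

A Unicode code point with 5 hexadecimal digits: U+2040B. Type U+2040b followed by a space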

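If you have downloaded the Aegyptus font from Unicode Fonts for Ancient Scripts, you can install it by copying the font file into the Windows fonts directory, c:\windows\fonts. Then change the font of the text box to Aegyptus. Double-clicking any text box brings up the Font dialog, where you can choose the font to assign to that text box.

You may want to try the Unicode code points shown in the picture below. For example, to get the owl character (the 9th item after the first item in the top row), the Unicode code point is U+10980 + 9 (9 in hexadecimal) = U+10989. Likewise, for the double wave (the 10th item after the first item), it is U+10980 + A (10 in hexadecimal) = U+1098A. You should be able to work out the Unicode code points for the rest of the glyphs quite easily.

Type U+#####<space>. For example, typing U+10989 followed by a space enters the owl into the text box.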

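Using the other two text boxes

Type into the top-left text box using the keyboard or an IME. You can also copy characters from the Character Map tool and paste them into this text box. To find the Unicode code point of any character, click just to the right of that character to set the cursor; a tooltip pops up showing the Unicode code point. For more features, select "Help" from the combo box.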
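The lookup behind such a tooltip could be sketched as below. This is an assumption about how it might be done, not the demo's actual code, and the names CaretInspector and CodePointBefore are hypothetical. The key point is that a character outside the BMP occupies two chars in a .NET string, so the code must step back over a whole surrogate pair.

```csharp
using System;

// Hypothetical helper, not taken from the demo project.
static class CaretInspector
{
    // Returns "U+XXXX" for the character that ends just before caretIndex,
    // or null when there is no complete character there.
    public static string CodePointBefore(string text, int caretIndex)
    {
        if (string.IsNullOrEmpty(text) || caretIndex < 1 || caretIndex > text.Length)
            return null;

        int start = caretIndex - 1;
        // If the caret sits after a low surrogate, include its high surrogate.
        if (start > 0 && char.IsLowSurrogate(text[start]) && char.IsHighSurrogate(text[start - 1]))
            start--;

        // An unpaired surrogate is not a character on its own.
        if (char.IsSurrogate(text[start]) && !char.IsSurrogatePair(text, start))
            return null;

        int codePoint = char.ConvertToUtf32(text, start);
        return "U+" + codePoint.ToString("X4");
    }
}
```

In a Windows Forms text box, caretIndex would typically come from the SelectionStart property after the click.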

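Click the -> button next to this text box to display the complete UTF-16 encoding in the text box on the right.

Likewise, you can enter space-separated sets of four hexadecimal digits of UTF-16 encoding into the right text box, then click the <-- button to see the characters in the left text box.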
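The round trip between text and space-separated UTF-16 code units can be sketched as follows. The class Utf16Hex and its two methods are hypothetical names chosen for this illustration; the demo's own implementation may differ.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not taken from the demo project.
static class Utf16Hex
{
    // "∫" -> "222b"; a surrogate pair produces two groups, e.g. "d841 dc0a".
    public static string ToHexUnits(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    // "d841 dc0a" -> the single character U+2040A (a string of Length 2).
    // Throws FormatException if a group is not valid hexadecimal.
    public static string FromHexUnits(string hexUnits)
    {
        StringBuilder sb = new StringBuilder();
        string[] parts = hexUnits.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            ushort unit = ushort.Parse(part, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
            sb.Append((char)unit);
        }
        return sb.ToString();
    }
}
```

Each char of a .NET string is one UTF-16 code unit, which is why no explicit encoding step is needed in either direction.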

Here are the Unicode groups you can select from the combo box

Meroitic U+10980 - U+109ff Aegyptus,36,BOLD

Hieroglyphs U+f3000 - U+f4b92 Aegyptus,36,BOLD

Chinese U+4e00 - U+9fa5 Arial Unicode MS,14,REGULAR

Phaistos Disc1 U+F01D0 - U+F01E7 Aegean,36,REGULAR

Phaistos Disc2 U+F0200 - U+F0247 Aegean,36,REGULAR

Cypro-Minoan U+F1000 - U+F1136 Aegean,36,REGULAR

Cypriot Syllabary U+F1700-U+F1853 Aegean,36,REGULAR

Many other groups have been added. See the picture at the top.

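Advanced demo

For this demo, you need to download the Aegyptus and Aegean fonts. They can be downloaded via the link near the top of this article.

Once these fonts are downloaded and installed, you should be able to display all the glyphs for each of the Unicode ranges above.

However, a Windows text box can only be assigned one font at a time, and there is currently no universal font that supports every possible Unicode code point.

If you type U+10980<space>U+F1000<space> into the bottom text box, at least one of the characters will not display correctly. This is because the glyph for U+10980 is in the Aegyptus font, while the glyph for U+F1000 is in the Aegean font. If you assign the Aegean font to the text box, U+10980 will not display correctly; if you assign the Aegyptus font, U+F1000 will not display correctly. Unless you can find a font that supports both code points, you are stuck.

Ah... but we can use the rich text box control, right? No. The current version of the rich text box control does not support surrogate pair encoding, even though it supports multiple fonts. Both U+10980 and U+F1000 are encoded with surrogate pairs, so we cannot display these characters in a rich text box.

One way around this is to use the web browser control. The current version of the web browser control supports surrogate pair encoding. To display the characters of a Unicode range correctly, we place them inside <div> or <span> tags whose CSS style assigns the correct font.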

<span style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>&#x@unicode@;</b></span> 

<div style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>@block@</b></div>

The above are the templates used to generate the <span> and <div> tags. We can replace the placeholders (the @xx@ items) using the getHTMLformatEntry(string s1, string font) function below.

        string getHTMLformatEntry(string s1, string font)
        {
/*
    <span style="font-family:@font@;color:@color@;font-size:@font-size@px"><b>&#x@unicode@;</b></span>

*/
            string s = Resource1.sSpan_Template;
            string f = font;
            string[] vf = f.Split(',');
            s=s.Replace("@font@", vf[0]);
            int font_size = (int.Parse(vf[1]) * 3) / 2;
            s = s.Replace("@font-size@", font_size+"");
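            Random r= new Random();
            string[] colors = new string[] {"red","green","blue","magenta",
                                            "cyan","black","orange","pink" };
            //the upper bound of Random.Next is exclusive,
            //so use the array length to make every color reachable
            int i=r.Next(0,colors.Length);
            string color = colors[i];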
            s=s.Replace("@color@",color);
            s=s.Replace("@unicode@", s1);
            if (vf[2] != "BOLD")
            {
                s = s.Replace("<b>", "");
                s= s.Replace("</b>","");
            }
            return s;
        }

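For example, to display U+F1000, we pass in the parameters

s1: "f1000"

font: "Aegean,36,REGULAR"

The output will be

<span style="font-family:Aegean;color:magenta;font-size:54px">&#xf1000;</span>

The color is assigned at random, while the remaining placeholders are replaced with data from the input parameters. The font size is 36 * 3 / 2 = 54, and because the font style is REGULAR rather than BOLD, the <b> tags are removed.

Likewise, we can replace the placeholders in the <div> template.

The main difference between the <div> tag and the <span> tag is that a <div> tag takes up an entire line of the web page (unless we use tables and cells). If we want two characters with different fonts to appear side by side, we use <span> tags. A <div> tag is used for a block of characters that share the same font.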

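Steps for this demo

1) Type some message into the bottom text box on the left

2) Click the -> button next to this text box

3) Select the "Cypro-Minoan U+F1000 - U+F1136" Unicode range from the combo box

4) Hold down the Alt key and left-click the first character in the top-left text box

5) Select the "Meroitic U+10980 - U+109ff" Unicode range from the combo box

6) Hold down the Alt key and left-click the first character in the top-left text box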

Analysis and explanation

In step 2, when the -> button is clicked, we use the <div> template to generate a <div> tag like the one below

<div style="font-family:Arial Unicode MS;color:black;font-size:14.25px"><b>Demo:
Putting "U+F1000" Aegean font with
"U+10980" side by side</b></div>

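In step 4, the mouse click sets the cursor position just after the target character so that we can get that character's Unicode code point, in this case "f1000". The Alt key signals that we also want to paste the character into the web browser control. We call the getHTMLformatEntry() function with this code point and the current font ("Aegean,36,REGULAR") to create the tag below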

<span style="font-family:Aegean;color:cyan;font-size:54px">&#xf1000;</span>

Likewise, step 6 generates a <span> tag, but this time the font is different, and the tag below is generated

<span style="font-family:Aegyptus;color:black;font-size:54px"><b>&#x10980;</b></span>

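Besides these two templates, we have one more template for creating the whole HTML page. Its placeholder @@ is replaced with the concatenation of the <div> and <span> tags generated earlier. As the tags are generated, we store them in the global variable htmlelements. Replacing @@ with the contents of htmlelements gives a well-formed HTML page that we can use to update the web browser control.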

<!DOCTYPE html><html><body>@@</body></html>
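The assembly step can be sketched in a few lines. The sketch below is an illustration only: HtmlPageSketch, MakeSpan and BuildPage are hypothetical names, the color is passed in rather than chosen at random, and the generated tags are assumed to be available as plain strings (the actual type of htmlelements in the demo is not shown in this article).

```csharp
using System;

// Hypothetical helper, not taken from the demo project.
static class HtmlPageSketch
{
    const string SpanTemplate =
        "<span style=\"font-family:@font@;color:@color@;font-size:@font-size@px\">&#x@unicode@;</span>";
    const string PageTemplate = "<!DOCTYPE html><html><body>@@</body></html>";

    // Fills the (non-bold) span template for one code point.
    public static string MakeSpan(int codePoint, string font, int fontSizePx, string color)
    {
        return SpanTemplate
            .Replace("@font@", font)
            .Replace("@color@", color)
            .Replace("@font-size@", fontSizePx.ToString())
            .Replace("@unicode@", codePoint.ToString("x"));
    }

    // Concatenates the generated tags and drops them into the page template.
    public static string BuildPage(params string[] elements)
    {
        return PageTemplate.Replace("@@", string.Concat(elements));
    }
}
```

Calling MakeSpan(0xF1000, "Aegean", 54, "cyan") reproduces the step 4 tag shown earlier, and passing the resulting strings to BuildPage yields the complete page. In a Windows Forms application, the page string could then be assigned to the WebBrowser control's DocumentText property.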

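After completing all 6 steps, click the "View Source" button to view the HTML page in Notepad. The file is created in the current directory with the default name temp.html.txt. Rename it to temp.html and view the page in any web browser.

Alternatively, you can click the "View in External Browser" button to launch the page directly in your system's default web browser.

I have successfully tested the generated page on IE 8 and Chrome. If the fonts referenced are installed on your Windows system, the page should render correctly, as most newer browser versions support surrogate pair encoding.

You can also click "Remove Last Inserted" to remove the item you last inserted into the web browser.

Finally, click "Clear" to clear the contents of the web browser control.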

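Points of interest

1) The code for typing Unicode directly into a text box is small and simple, and you can easily include it in your own project. To enable this feature on any text box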

        //To enable Unicode processing
        //**************************************************************
        private UnicodeProcessing uniprocessing = new UnicodeProcessing();
        //***************************************************************

        //To enable you to type unicode directly to the text-box 
        //*****************************************************************************
        textBox3.KeyPress += new KeyPressEventHandler(uniprocessing.HandleKeyPress);
        //******************************************************************************

2) With version 3.0, you can create exotic web pages with all these interesting glyphs.

Have fun!

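History

May 19, 2014: Version 1.0

May 21, 2014: Version 2.0: Added support for surrogate pairs.

May 23, 2014: Version 2d: Added a combo box for selecting Unicode ranges

May 24, 2014: Version 3.0: Added a web browser control to support multiple fonts

May 26, 2014: Version 3b: Encapsulated all the Unicode processing functions to make them easier to reuse. Added more features to the HTML handling in the demo, allowing removal of the last inserted item. Fixed a bug in handling leading spaces and commas in the HTML page

May 28, 2014: Version 3c: Added an extensive list of character groupings, including the private use area U+e000 - U+f8ff. Also included a discussion of the Private Character Editor, eudcedit.exe.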
