Package com.ibm.icu.dev.tool.translit
Class UnicodeSetCloseOver
java.lang.Object
com.ibm.icu.dev.tool.translit.UnicodeSetCloseOver
This class produces the data tables used by the closeOver() method
of UnicodeSet.
Whenever the Unicode database changes, this tool must be re-run
(AFTER the data file(s) underlying ICU4J are updated).
The output of this tool should then be pasted into the appropriate
files:
ICU4J: com.ibm.icu.text.UnicodeSet.java
ICU4C: /icu/source/common/uniset.cpp
-
Method Summary
Modifier and Type, Method - Description
(package private) static void analyzeCaseData(Map equivClasses, StringBuffer pairs, Vector nonpairs, Vector lengths) - Analyze the case fold equivalency classes.
(package private) static Map createCaseFoldEquivalencyClasses - Create a map of String => Set.
(package private) static void emitRangesString(PrintStream out, UnicodeSet set, String id) - Given a UnicodeSet, emit it as a Java string.
(package private) static void emitUCharRangesArray(PrintStream out, UnicodeSet set, String id) - Given a UnicodeSet, emit it as an array of UChar pairs.
(package private) static void generateCaseData
(package private) static UnicodeSet getCaseSensitive - Create the set of case-sensitive characters.
static void main
-
Field Details
-
JAVA_OUT
-
JAVA_CHARPROP_OUT
-
C_SET_OUT
-
C_UCHAR_OUT
-
WARNING
-
DEFAULT_CASE_MAP
static final boolean DEFAULT_CASE_MAP
-
-
Constructor Details
-
UnicodeSetCloseOver
UnicodeSetCloseOver()
-
-
Method Details
-
main
- Throws:
IOException
-
createCaseFoldEquivalencyClasses
Create a map of String => Set. The String in this case is a folded string for which UCharacter.foldCase(folded, DEFAULT_CASE_MAP).equals(folded). The Set contains all single-character strings x for which UCharacter.foldCase(x, DEFAULT_CASE_MAP).equals(folded), as well as folded itself.
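The map described above can be illustrated with a minimal sketch that uses only the JDK. It is not the tool's code: the class name CaseFoldClassesSketch and the methods fold and buildClasses are hypothetical, and fold is an assumed stand-in for UCharacter.foldCase(x, DEFAULT_CASE_MAP) that lowercases the uppercase form of each code point, which only approximates simple case folding.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CaseFoldClassesSketch {
    // Assumed stand-in for ICU's case folding: lowercase of the uppercase
    // form, code point by code point. This is an approximation, not ICU.
    static String fold(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp))));
        return sb.toString();
    }

    // Map each folded string to the set holding the folded string itself
    // plus every single-code-point string in [start, end] that folds to it.
    static Map<String, Set<String>> buildClasses(int start, int end) {
        Map<String, Set<String>> classes = new TreeMap<>();
        for (int cp = start; cp <= end; cp++) {
            if (!Character.isDefined(cp)) {
                continue;
            }
            String x = new String(Character.toChars(cp));
            String folded = fold(x);
            if (folded.equals(x)) {
                continue; // already folded; it enters a class only as the key
            }
            Set<String> members = classes.computeIfAbsent(folded, k -> new TreeSet<>());
            members.add(folded);
            members.add(x);
        }
        return classes;
    }

    public static void main(String[] args) {
        // Print the classes found in the ASCII range.
        buildClasses(0, 0x7F).forEach((folded, members) ->
            System.out.println(folded + " => " + members));
    }
}
```

Over the ASCII range this yields one class per letter, keyed by the lowercase form. The real tool would see additional classes whose folded key is longer than one code point, which this approximation cannot produce.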
analyzeCaseData
Analyze the case fold equivalency classes. Break them into two groups: 'pairs' and 'nonpairs'. Record which length configurations occur among the nonpairs.
Length configurations of equivalency classes, as of Unicode 3.2. Most of the classes (83%) have two single codepoints. Here "112:28" means there are 28 equivalency classes with 2 single codepoints and one string of length 2.
11:656
111:16
1111:3
112:28
113:2
12:31
13:12
22:38
Note: This method does not count the frequencies of the different length configurations (as shown above after ':'); it merely records which configurations occur.
- Parameters:
pairs - Accumulate equivalency classes that consist of exactly two codepoints here. This is 83+% of the classes. E.g., {"a", "A"}.
nonpairs - Accumulate other equivalency classes here, as lists of strings. E.g., {"st", "ﬅ", "ﬆ"}.
lengths - Accumulate a list of unique length structures, not including pairs. Each length structure is represented by a string of digits. The digit string "12" means the equivalency class contains a single code point and a string of length 2. Typical contents of 'lengths': { "111", "1111", "112", "113", "12", "13", "22" }. Note the absence of "11".
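The split into pairs and nonpairs, and the digit-string length structures, can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the class AnalyzeCaseDataSketch and its methods are hypothetical, modern collections replace the Vector and StringBuffer parameters, the sample classes in main are hand-written, and each member is assumed to be shorter than ten code points so that one digit per member is enough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class AnalyzeCaseDataSketch {
    // Build the length structure of one class: the code point length of
    // each member, sorted ascending and concatenated, e.g. "112".
    static String lengthSignature(Collection<String> cls) {
        int[] lens = cls.stream()
                        .mapToInt(s -> s.codePointCount(0, s.length()))
                        .sorted()
                        .toArray();
        StringBuilder sb = new StringBuilder();
        for (int n : lens) {
            sb.append(n);
        }
        return sb.toString();
    }

    // Classes with signature "11" go to 'pairs' as two adjacent code points;
    // everything else goes to 'nonpairs', and its signature to 'lengths'.
    static void analyze(Collection<? extends Collection<String>> classes,
                        StringBuilder pairs,
                        List<List<String>> nonpairs,
                        SortedSet<String> lengths) {
        for (Collection<String> cls : classes) {
            String sig = lengthSignature(cls);
            if (sig.equals("11")) {
                for (String s : cls) {
                    pairs.append(s);
                }
            } else {
                nonpairs.add(new ArrayList<>(cls));
                lengths.add(sig); // records occurrence only, not frequency
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> classes = Arrays.asList(
            Arrays.asList("a", "A"),
            Arrays.asList("st", "\uFB05", "\uFB06"),
            Arrays.asList("ss", "\u00DF"));
        StringBuilder pairs = new StringBuilder();
        List<List<String>> nonpairs = new ArrayList<>();
        SortedSet<String> lengths = new TreeSet<>();
        analyze(classes, pairs, nonpairs, lengths);
        System.out.println(lengths); // prints [112, 12]
    }
}
```

Because lengths is a set, a configuration that occurs many times is stored once, which matches the note that frequencies are not counted.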
-
generateCaseData
- Throws:
IOException
-
getCaseSensitive
Create the set of case-sensitive characters. These are characters that participate in any case mapping operation as a source or as a member of a target string.
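The "source or member of a target string" rule can be illustrated with the JDK's own case mappings. This is a rough sketch, not the tool's logic: CaseSensitiveSketch and caseSensitive are hypothetical names, a sorted set of Integer code points stands in for UnicodeSet, and only the JDK's upper, lower, and title mappings are consulted, so case folding and any ICU-specific data are left out.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

class CaseSensitiveSketch {
    // Collect every code point in [start, end] that changes under a case
    // mapping (a source), plus every code point of the mapped result
    // (a member of a target string).
    static SortedSet<Integer> caseSensitive(int start, int end) {
        SortedSet<Integer> result = new TreeSet<>();
        for (int cp = start; cp <= end; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates are not characters
            }
            String s = new String(Character.toChars(cp));
            String[] mapped = {
                s.toUpperCase(Locale.ROOT), // may expand to several code points
                s.toLowerCase(Locale.ROOT),
                new String(Character.toChars(Character.toTitleCase(cp)))
            };
            for (String m : mapped) {
                if (!m.equals(s)) {
                    result.add(cp);                      // the source
                    m.codePoints().forEach(result::add); // the target members
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SortedSet<Integer> set = caseSensitive(0, 0xFF);
        System.out.println(set.contains((int) 'a')); // prints true
        System.out.println(set.contains((int) '1')); // prints false
    }
}
```

Target members can lie outside the scanned range, so the result for a bounded scan is not necessarily confined to that range.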
emitUCharRangesArray
Given a UnicodeSet, emit it as an array of UChar pairs. Each pair will be the start/end of a range. Code points >= U+10000 will be represented as surrogate pairs.
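The surrogate-pair handling can be sketched with the JDK alone. The exact text the tool writes is not shown in this page, so the C array layout below is an assumption, as are the names UCharRangesSketch, emit, and appendUnits; an int[][] of inclusive {start, end} code point pairs stands in for the UnicodeSet, and the result is returned as a String rather than written to a PrintStream.

```java
class UCharRangesSketch {
    // Emit each range as its start and end, written as UTF-16 code units.
    // The surrounding C declaration is a hypothetical layout.
    static String emit(int[][] ranges, String id) {
        StringBuilder out = new StringBuilder();
        out.append("static const UChar ").append(id).append("[] = {\n");
        for (int[] range : ranges) {
            out.append("    ");
            appendUnits(out, range[0]); // start of range
            appendUnits(out, range[1]); // end of range
            out.append('\n');
        }
        out.append("};\n");
        return out.toString();
    }

    // Character.toChars yields one unit for BMP code points and a
    // high/low surrogate pair for code points >= U+10000.
    static void appendUnits(StringBuilder out, int codePoint) {
        for (char unit : Character.toChars(codePoint)) {
            out.append(String.format("0x%04X, ", (int) unit));
        }
    }

    public static void main(String[] args) {
        int[][] ranges = { {0x41, 0x5A}, {0x10400, 0x10427} };
        System.out.print(emit(ranges, "EXAMPLE_RANGES"));
    }
}
```

In this sketch a supplementary range occupies four UChars instead of two, so a reader of the array would need to recognize surrogates when pairing up starts and ends.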
emitRangesString
Given a UnicodeSet, emit it as a Java string. The most economical format is not the pattern, but instead a pairs list, with each range pair represented as two adjacent characters.
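A minimal sketch of the pairs-list idea follows. It is illustrative only: RangesStringSketch, toPairsString, and toJavaLiteral are hypothetical names, an int[][] of inclusive {start, end} code point pairs stands in for the UnicodeSet, and the escaping rules for the emitted literal are an assumption rather than the tool's actual output format.

```java
class RangesStringSketch {
    // Build the pairs list: for each range, its start and end code points
    // placed next to each other in one string.
    static String toPairsString(int[][] ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] range : ranges) {
            sb.appendCodePoint(range[0]).appendCodePoint(range[1]);
        }
        return sb.toString();
    }

    // Render a string as Java source text: printable ASCII is kept as is,
    // every other UTF-16 unit becomes a four-digit hex escape.
    static String toJavaLiteral(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F && c != '"' && c != '\\') {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        int[][] ranges = { {'a', 'z'}, {0xC0, 0xD6} };
        System.out.println(toJavaLiteral(toPairsString(ranges)));
    }
}
```

A consumer can rebuild the set by reading the string two code points at a time, with no pattern parsing involved.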
-