Translator's preface
This is a translation of the explainer part of the Intl.Segmenter proposal, which will probably be included in the next ECMAScript specification.
The proposal is already implemented in V8 and can be used without a flag in version 8.7 (more precisely, in 8.7.38 and later). It can therefore be tested in Google Chrome Canary (from version 87.0.4252.0) or in Node.js V8 Canary (from version v15.0.0-v8-canary202009025a2ca762b8; for Windows, binaries are available from v15.0.0-v8-canary202009173b56586162).
If you test in earlier versions with the --harmony-intl-segmenter flag, be careful: the specification has changed, and the implementation behind the flag may be outdated. Check your results against the output shown in the code examples.
Links to materials on the causes of the problems this proposal addresses are provided after the translation.
Intl.Segmenter: Unicode segmentation in JavaScript
The proposal is at Stage 3, championed by Richard Gibson.
Motivation
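A code point is not a «letter» or a displayed unit on the screen. That designation goes to the grapheme, which can consist of multiple code points (e.g., including accent marks or conjoining Korean characters). Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. This can be useful in implementing advanced editors and input methods, or other forms of text processing.
Unicode also defines an algorithm for finding breaks between words and sentences, which CLDR (Common Locale Data Repository) tailors per locale. These boundaries can be useful, for example, in implementing a text editor with commands for jumping to or highlighting words and sentences.
Grapheme, word and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation to function, and exposing it to JavaScript saves memory and network bandwidth compared with expecting developers to implement it themselves in JavaScript.
Chrome has shipped its own nonstandard segmentation API, Intl.v8BreakIterator, for several years. For a few reasons, however, that API does not seem suitable for standardization. This explainer outlines a new API that tries to follow modern JavaScript API design, as practiced since ES2015.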
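Objects returned by the segment() method of an Intl.Segmenter instance find boundaries and expose the segments between them through the Iterable interface.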
// Create a locale-specific word segmenter.
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
// Use it to get an iterable over the segments of a string.
// Note the two spaces after "?": they form the segment at [4, 6).
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);
// Iterate over the segments!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»
For flexibility and more advanced use cases, the API also supports direct random access.
// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;
current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }
current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }
current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }
current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }
current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }
current = segments.containing(current.index + current.segment.length)
// → undefined
API
new Intl.Segmenter(locale, options)
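Creates a new locale-dependent segmenter.
If options is provided, it is treated as an object whose granularity property specifies the segmenter granularity ("grapheme" (by graphemes), "word" (by words) or "sentence" (by sentences); the default is "grapheme").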
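The three granularities can be compared on one string. This is a minimal sketch: the sample text and the pieces() helper are made up for illustration, and resolvedOptions() is used only to read back the default granularity.

```javascript
// Assumed sample text; any string works.
const text = "Hi there. Bye.";

// Hypothetical helper: list the segments of `text` at one granularity.
function pieces(granularity) {
  const segmenter = new Intl.Segmenter("en", { granularity });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const graphemes = pieces("grapheme"); // one entry per user-perceived character
const words = pieces("word");         // words plus the spaces and punctuation between them
const sentences = pieces("sentence"); // each sentence keeps its trailing space

// Omitting options falls back to grapheme granularity.
const defaultGranularity = new Intl.Segmenter("en").resolvedOptions().granularity;

console.log(graphemes.length, words, sentences, defaultGranularity);
```

At every granularity the segments cover the whole input, so joining them reproduces the original string.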
Intl.Segmenter.prototype.segment(string)
Creates a new Iterable %Segments% instance for the input string, using the segmenter's locale and granularity.
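Segment data:
Segments are described by plain objects with the following data properties:
- segment: the string segment.
- index: the code unit index in the string at which the segment begins.
- input: the string being segmented.
- isWordLike: true when granularity is "word" and the segment is word-like (consisting of letters/numbers/ideographs/etc.); false when granularity is "word" and the segment is not word-like (consisting of spaces/punctuation/etc.); undefined when granularity is not "word".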
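The properties can be inspected directly. The following sketch uses a made-up input string and destructures the first segments from the iterable; it is an illustration rather than part of the proposal text.

```javascript
// Assumed sample input for illustration.
const sampleInput = "Hi, you";

// Word granularity: isWordLike is a boolean.
const wordSegments = new Intl.Segmenter("en", { granularity: "word" }).segment(sampleInput);
const [firstWord, comma] = wordSegments; // %Segments% is iterable, so destructuring works

// Grapheme granularity: isWordLike is undefined.
const [firstGrapheme] = new Intl.Segmenter("en", { granularity: "grapheme" }).segment(sampleInput);

console.log(firstWord.segment, firstWord.index, firstWord.isWordLike); // Hi 0 true
console.log(comma.segment, comma.index, comma.isWordLike);             // , 2 false
console.log(firstWord.input === sampleInput);                          // true
console.log(firstGrapheme.segment, firstGrapheme.isWordLike);          // H undefined
```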
%Segments%.prototype:
%Segments%.prototype.containing(index)
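Returns a segment data object describing the segment of the string that includes the code unit at the specified index, or undefined if the index is out of bounds.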
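One possible use of containing() is finding the word under a caret position. The wordAt() helper below is hypothetical and only sketches the idea; it treats out-of-bounds positions and non-word segments the same way.

```javascript
// Hypothetical helper: the word at a code unit position, or null when the
// position is out of bounds or falls on spaces or punctuation.
function wordAt(text, position, locale = "en") {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  const hit = segments.containing(position);
  return hit !== undefined && hit.isWordLike ? hit.segment : null;
}

const sample = "Hello world";
console.log(wordAt(sample, 7));  // "world" (index 7 lies inside the segment starting at 6)
console.log(wordAt(sample, 5));  // null (the space is not word-like)
console.log(wordAt(sample, 11)); // null (11 equals the length, so it is out of bounds)
console.log(wordAt(sample, -1)); // null (negative indexes are out of bounds)
```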
%Segments%.prototype[Symbol.iterator]
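Creates a new %SegmentIterator% instance, which finds segments in the input string lazily (on demand) using the segmenter's locale and granularity, keeping track of its position within the string.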
%SegmentIterator%.prototype:
%SegmentIterator%.prototype.next()
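The next() method implements the Iterator interface: it finds the next segment and returns the corresponding IteratorResult object, whose value property is a segment data object as described above.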
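Calling next() by hand makes it possible to stop early without walking the whole string. The firstWords() helper below is a hypothetical sketch of that pattern, not something defined by the proposal.

```javascript
// Hypothetical helper: collect at most `count` word-like segments,
// pulling segments one at a time through the iterator protocol.
function firstWords(text, count, locale = "en") {
  const iterator = new Intl.Segmenter(locale, { granularity: "word" })
    .segment(text)[Symbol.iterator]();
  const found = [];
  while (found.length < count) {
    const { value, done } = iterator.next();
    if (done) break;                              // reached the end of the string
    if (value.isWordLike) found.push(value.segment);
  }
  return found;
}

console.log(firstWords("One two three four", 2)); // [ 'One', 'two' ]
console.log(firstWords("One", 5));                // [ 'One' ]
```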
FAQ
? ?
â , . . . CLDR. , CLDR/ICU , .
API ?
, 3- , . TC39 . ; , , .
?
API, , API : , API (, ). API CSS Houdini.
?
API:
- .
- .
- , (.. Web API (Web Platform), ECMAScript).
- , . CLDR ICU . CSS, . . , , , ; .
?
%SegmentIterator%.prototype, (, seek([inclusiveStartIndex = thisIterator.index + 1]) seekBefore([exclusiveLastIndex = thisIterator.index]), . ECMA-262 ( ). , , .
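Why is this an Intl API instead of String methods?
All of these break types are actually locale-dependent, and some allow complex options. The result of the segment() method is a SegmentIterator. For many non-trivial cases like this, analogous APIs are put on the Intl object of ECMA-402. This allows the work that happens on each instantiation to be shared, which improves performance. A convenience method on String could be added as a follow-on proposal, if that turns out to be needed.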
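The sharing of per-instance work can be sketched as follows: one segmenter is created once and reused for many strings. The graphemeCount() helper is hypothetical and stands in for the kind of convenience function a String method might otherwise provide.

```javascript
// Create the segmenter once; locale data lookup happens here, not per call.
const sharedGraphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical convenience helper: number of user-perceived characters.
function graphemeCount(text) {
  let count = 0;
  for (const _segment of sharedGraphemeSegmenter.segment(text)) count++;
  return count;
}

console.log(graphemeCount("abc"));       // 3
console.log(graphemeCount("e\u0301"));   // 1 (letter plus combining accent)
console.log(graphemeCount("\u{1F499}")); // 1 (one code point, two code units)
```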
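What exactly does the index refer to?
An index n refers to the code unit index that might be the start of a segment. For example, when iterating by words over the string "Hello, world\u{1F499}", segments start at indexes 0, 5, 6, 7 and 12. That is, the string is segmented like this: ┃Hello┃,┃ ┃world┃\u{1F499}┃, with the final segment consisting of a surrogate pair of two code units that encode a single code point.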
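The indexes from this example can be checked with a short script. This is a sketch that assumes an "en" word segmenter; the variable names are made up for illustration.

```javascript
const greeting = "Hello, world\u{1F499}";
const greetingSegments = Array.from(
  new Intl.Segmenter("en", { granularity: "word" }).segment(greeting)
);

// Start index of every segment, in code units.
const starts = greetingSegments.map((s) => s.index);
const last = greetingSegments[greetingSegments.length - 1];

console.log(starts);                   // [ 0, 5, 6, 7, 12 ]
console.log(greeting.length);          // 14 code units in total
console.log(last.segment.length);      // 2 code units in the final segment
console.log([...last.segment].length); // 1 code point in the final segment
```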
?
, next().
, ?
, - QA ;)
Number: null 0, â 0 1, , , Symbol BigInt, undefined NaN *. , ( , ).
* . "fail". Chrome Canary, Symbol BigInt TypeError, undefined NaN , 0.
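How containing() treats unusual index values can be probed directly. The sketch below assumes the index goes through the usual conversion to an integer, as implemented in current engines; the sample string is made up for illustration.

```javascript
const probe = new Intl.Segmenter("en", { granularity: "word" }).segment("Hi there");

// Values that convert to 0 (or to NaN, which is then treated as 0).
const fromUndefined = probe.containing(undefined);
const fromNaN = probe.containing(NaN);
const fromNull = probe.containing(null);

// true converts to 1, which still lies inside "Hi"; "3" converts to 3.
const fromTrue = probe.containing(true);
const fromString = probe.containing("3");

// Symbol and BigInt values cannot be converted to a Number.
function throwsTypeError(value) {
  try {
    probe.containing(value);
    return false;
  } catch (error) {
    return error instanceof TypeError;
  }
}

console.log(fromUndefined.index, fromNaN.index, fromNull.index); // 0 0 0
console.log(fromTrue.segment, fromString.segment);               // Hi there
console.log(throwsTypeError(Symbol()), throwsTypeError(1n));     // true true
```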
Materials on Unicode problems in JavaScript.
- Joel Spolsky. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Dmitri Pavlutin. What every JavaScript developer should know about Unicode
- Dr. Axel Rauschmayer. JavaScript for impatient programmers: 17. Unicode – a brief introduction
- Dr. Axel Rauschmayer. JavaScript for impatient programmers: 18.6. Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
- Jonathan New. "\u{1F4A9}".length === 2
- Nicolás Bevacqua. ES6 Strings (and Unicode, ❤) in Depth
- Mathias Bynens. JavaScript has a Unicode problem
- Mathias Bynens. Unicode-aware regular expressions in ECMAScript 6
- Mathias Bynens. Unicode property escapes in JavaScript regular expressions
- Mathias Bynens. Unicode sequence property escapes
- Awesome Unicode: a curated list of delightful Unicode tidbits, packages and resources