<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="generator" content="rustdoc">
<meta name="description" content="API documentation for the Rust `utf8_ranges` crate.">
<meta name="keywords" content="rust, rustlang, rust-lang, utf8_ranges">
<title>utf8_ranges - Rust</title>
<link rel="stylesheet" type="text/css" href="../normalize.css">
<link rel="stylesheet" type="text/css" href="../rustdoc.css">
<link rel="stylesheet" type="text/css" href="../main.css">
</head>
<body class="rustdoc mod">
<!--[if lte IE 8]>
<div class="warning">
This old browser is unsupported and will most likely display funky
things.
</div>
<![endif]-->
<nav class="sidebar">
<p class='location'>Crate utf8_ranges</p><div class="block items"><ul><li><a href="#structs">Structs</a></li><li><a href="#enums">Enums</a></li></ul></div><p class='location'></p><script>window.sidebarCurrent = {name: 'utf8_ranges', ty: 'mod', relpath: '../'};</script>
</nav>
<nav class="sub">
<form class="search-form js-only">
<div class="search-container">
<input class="search-input" name="search"
autocomplete="off"
placeholder="Click or press ‘S’ to search, ‘?’ for more options…"
type="search">
</div>
</form>
</nav>
<section id='main' class="content">
<h1 class='fqn'><span class='in-band'>Crate <a class="mod" href=''>utf8_ranges</a></span><span class='out-of-band'><span id='render-detail'>
<a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs">
[<span class='inner'>−</span>]
</a>
</span><a class='srclink' href='../src/utf8_ranges/lib.rs.html#1-511' title='goto source code'>[src]</a></span></h1>
<div class='docblock'><p>Crate <code>utf8-ranges</code> converts ranges of Unicode scalar values to equivalent
ranges of UTF-8 bytes. This is useful for constructing byte-based automata
that need to embed UTF-8 decoding.</p>
<p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and
an example.</p>
<h1 id='wait-what-is-this' class='section-header'><a href='#wait-what-is-this'>Wait, what is this?</a></h1>
<p>This is simplest to explain with an example. Let's say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:</p>
<pre class="rust rust-example-rendered">
[<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
<p>This is simple enough: encode the boundaries (<code>0400</code> encodes to
<code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>), then create a range from each
corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p>
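<p>As a rough illustration of the boundary-encoding step, the sketch below uses
only the Rust standard library (<code>char::encode_utf8</code>). The helper names
<code>utf8_bytes</code> and <code>naive_ranges</code> are hypothetical and are not part of this crate.</p>

```rust
/// Encode a scalar value as UTF-8 and return its bytes.
/// (Hypothetical helper for illustration; not part of `utf8_ranges`.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Pair up the bytes of the two boundary encodings into inclusive byte
/// ranges. Returns `None` when the encodings differ in length. The result is
/// only a correct matcher when every scalar value between the boundaries
/// happens to fall inside the paired ranges.
fn naive_ranges(start: char, end: char) -> Option<Vec<(u8, u8)>> {
    let (s, e) = (utf8_bytes(start), utf8_bytes(end));
    if s.len() != e.len() {
        return None;
    }
    Some(s.into_iter().zip(e).collect())
}

fn main() {
    let ranges = naive_ranges('\u{0400}', '\u{04FF}').unwrap();
    for (lo, hi) in &ranges {
        print!("[{:02X}-{:02X}]", lo, hi);
    }
    println!(); // prints "[D0-D3][80-BF]"
}
```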
<p>However, what if you wanted to add the Cyrillic Supplement characters to
your range? Your range might then become <code>[0400-052F]</code>. The same procedure
as above doesn't quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges
you'd get from the previous transformation would be <code>[D0-D4][80-AF]</code>. However,
this isn't correct because it fails to match many characters in the range,
for example <code>04FF</code> (its last byte, <code>BF</code>, isn't in the range <code>80-AF</code>).</p>
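<p>To make the failure concrete, the following sketch counts how many scalar
values in <code>[0400-052F]</code> the naive ranges <code>[D0-D4][80-AF]</code> fail to match. It
uses only the standard library; <code>matches_ranges</code> and <code>count_missed</code> are
hypothetical helper names chosen for this example.</p>

```rust
/// Returns true if `bytes` has exactly one byte per range and each byte lies
/// within its inclusive range. (Hypothetical helper, not a crate API.)
fn matches_ranges(ranges: &[(u8, u8)], bytes: &[u8]) -> bool {
    ranges.len() == bytes.len()
        && ranges
            .iter()
            .zip(bytes)
            .all(|(&(lo, hi), &b)| lo <= b && b <= hi)
}

/// Count the scalar values in `start..=end` whose UTF-8 encoding is NOT
/// matched by `ranges`.
fn count_missed(ranges: &[(u8, u8)], start: char, end: char) -> usize {
    (start..=end)
        .filter(|c| {
            let mut buf = [0u8; 4];
            !matches_ranges(ranges, c.encode_utf8(&mut buf).as_bytes())
        })
        .count()
}

fn main() {
    // The ranges produced by naively pairing the boundary bytes.
    let naive = [(0xD0, 0xD4), (0x80, 0xAF)];
    println!(
        "04FF matched: {}",
        matches_ranges(&naive, "\u{04FF}".as_bytes())
    );
    println!(
        "scalar values missed: {}",
        count_missed(&naive, '\u{0400}', '\u{052F}')
    );
}
```

<p>The missed values are exactly those with a lead byte in <code>D0-D3</code> and a
trailing byte in <code>B0-BF</code>.</p>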
<p>Instead, you need multiple sequences of byte ranges:</p>
<pre class="rust rust-example-rendered">
[<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] <span class="attribute"># matches codepoints 0400-04FF</span>
[<span class="ident">D4</span>][<span class="number">80</span><span class="op">-</span><span class="ident">AF</span>] <span class="attribute"># matches codepoints 0500-052F</span></pre>
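<p>One way to read the two sequences above is as alternatives in a byte-level
matcher. The sketch below writes them as Rust slice patterns and uses only the
standard library; the function name <code>is_cyrillic</code> is an assumption made for this
example and does not come from this crate.</p>

```rust
/// Returns true if `bytes` is the UTF-8 encoding of a single scalar value in
/// 0400-052F. Each alternative mirrors one sequence of byte ranges:
///   [D0-D3][80-BF]  matches 0400-04FF
///   [D4][80-AF]     matches 0500-052F
fn is_cyrillic(bytes: &[u8]) -> bool {
    matches!(bytes, [0xD0..=0xD3, 0x80..=0xBF] | [0xD4, 0x80..=0xAF])
}

fn main() {
    for c in ['\u{0400}', '\u{04FF}', '\u{052F}', '\u{0530}', 'a'] {
        let mut buf = [0u8; 4];
        let matched = is_cyrillic(c.encode_utf8(&mut buf).as_bytes());
        println!("U+{:04X}: {}", c as u32, matched);
    }
}
```

<p>No decoding step is involved: the check only ever compares individual bytes
against fixed ranges, which is what makes it suitable for a byte-based
automaton.</p>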
<p>This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequences of byte
ranges for the Basic Multilingual Plane (<code>[0000-FFFF]</code>) look like this:</p>
<pre class="rust rust-example-rendered">
[<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
[<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
<p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.</p>
<p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p>
<pre class="rust rust-example-rendered">
[<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
[<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F0</span>][<span class="number">90</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F1</span><span class="op">-</span><span class="ident">F3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F4</span>][<span class="number">80</span><span class="op">-</span><span class="number">8F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
<p>This crate automates the process of creating these byte ranges from ranges of
Unicode scalar values.</p>
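<p>For comparison, here is what the nine sequences for all of Unicode look like
when written out by hand as Rust slice patterns. This is only a standard-library
sketch of the kind of table the crate generates; it does not use or reproduce the
crate's API, and <code>is_one_scalar</code> is a hypothetical name.</p>

```rust
/// Returns true if `bytes` is the UTF-8 encoding of exactly one Unicode
/// scalar value. Each alternative is one sequence of byte ranges from the
/// listing for [000000-10FFFF].
fn is_one_scalar(bytes: &[u8]) -> bool {
    matches!(
        bytes,
        [0x00..=0x7F]
            | [0xC2..=0xDF, 0x80..=0xBF]
            | [0xE0, 0xA0..=0xBF, 0x80..=0xBF]
            | [0xE1..=0xEC, 0x80..=0xBF, 0x80..=0xBF]
            | [0xED, 0x80..=0x9F, 0x80..=0xBF]
            | [0xEE..=0xEF, 0x80..=0xBF, 0x80..=0xBF]
            | [0xF0, 0x90..=0xBF, 0x80..=0xBF, 0x80..=0xBF]
            | [0xF1..=0xF3, 0x80..=0xBF, 0x80..=0xBF, 0x80..=0xBF]
            | [0xF4, 0x80..=0x8F, 0x80..=0xBF, 0x80..=0xBF]
    )
}

fn main() {
    // Every scalar value's encoding should be accepted. Iterating a range of
    // `char` skips the surrogate gap automatically.
    let all_accepted = ('\u{0}'..='\u{10FFFF}').all(|c| {
        let mut buf = [0u8; 4];
        is_one_scalar(c.encode_utf8(&mut buf).as_bytes())
    });
    println!("all scalar values accepted: {}", all_accepted);

    // Ill-formed inputs: an encoded surrogate (D800), an overlong encoding
    // of NUL, and a value past 10FFFF.
    let bad: [&[u8]; 3] = [
        &[0xED, 0xA0, 0x80],
        &[0xC0, 0x80],
        &[0xF4, 0x90, 0x80, 0x80],
    ];
    for bytes in bad {
        println!("{:02X?} accepted: {}", bytes, is_one_scalar(bytes));
    }
}
```

<p>Writing such a table by hand for an arbitrary range is error-prone, which is
the motivation for generating it.</p>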
<h1 id='why-would-i-ever-use-this' class='section-header'><a href='#why-would-i-ever-use-this'>Why would I ever use this?</a></h1>
<p>You probably won't ever need this. In the vast majority of cases, you just decode the byte
sequence into a Unicode scalar value and compare scalar values directly.
However, this explicit decoding step isn't always possible. For example, the
construction of some finite state machines may benefit from converting ranges
of scalar values into UTF-8 decoder automata (e.g., for character classes in
regular expressions).</p>
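<p>For reference, the usual decode-then-compare approach might look like the
sketch below, which relies only on <code>std::str::from_utf8</code>. The names
<code>decode_one</code> and <code>is_cyrillic_by_decoding</code> are hypothetical and chosen for this
example.</p>

```rust
/// Decode `bytes` as exactly one UTF-8 encoded scalar value.
/// Returns `None` for invalid UTF-8, empty input, or more than one character.
fn decode_one(bytes: &[u8]) -> Option<char> {
    let s = std::str::from_utf8(bytes).ok()?;
    let mut chars = s.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

/// Decode first, then compare scalar values directly against 0400-052F.
fn is_cyrillic_by_decoding(bytes: &[u8]) -> bool {
    decode_one(bytes).map_or(false, |c| ('\u{0400}'..='\u{052F}').contains(&c))
}

fn main() {
    println!("{}", is_cyrillic_by_decoding(&[0xD3, 0xBF])); // 04FF
    println!("{}", is_cyrillic_by_decoding(&[0xD4, 0xB0])); // 0530
    println!("{}", is_cyrillic_by_decoding(&[0xED, 0xA0, 0x80])); // invalid
}
```

<p>This works well when the whole input can be decoded up front. An automaton
that consumes one byte per transition has no place to perform that decoding
step, which is where byte range sequences come in.</p>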
<h1 id='lineage' class='section-header'><a href='#lineage'>Lineage</a></h1>
<p>I got the idea and general implementation strategy from Russ Cox in his
<a href="https://swtch.com/%7Ersc/regexp/regexp3.html">article on regexps</a> and RE2.
Russ Cox got it from Ken Thompson's <code>grep</code> (no source, folklore?).
I also got the idea from
<a href="https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>,
which uses it for executing automata on their term index.</p>
</div><h2 id='structs' class='section-header'><a href="#structs">Structs</a></h2>
<table>
<tr class=' module-item'>
<td><a class="struct" href="struct.Utf8Range.html"
title='struct utf8_ranges::Utf8Range'>Utf8Range</a></td>
<td class='docblock-short'>
<p>A single inclusive range of UTF-8 bytes.</p>
</td>
</tr>
<tr class=' module-item'>
<td><a class="struct" href="struct.Utf8Sequences.html"
title='struct utf8_ranges::Utf8Sequences'>Utf8Sequences</a></td>
<td class='docblock-short'>
<p>An iterator over ranges of matching UTF-8 byte sequences.</p>
</td>
</tr></table><h2 id='enums' class='section-header'><a href="#enums">Enums</a></h2>
<table>
<tr class=' module-item'>
<td><a class="enum" href="enum.Utf8Sequence.html"
title='enum utf8_ranges::Utf8Sequence'>Utf8Sequence</a></td>
<td class='docblock-short'>
<p>Utf8Sequence represents a sequence of byte ranges.</p>
</td>
</tr></table></section>
<section id='search' class="content hidden"></section>
<section class="footer"></section>
<aside id="help" class="hidden">
<div>
<h1 class="hidden">Help</h1>
<div class="shortcuts">
<h2>Keyboard Shortcuts</h2>
<dl>
<dt>?</dt>
<dd>Show this help dialog</dd>
<dt>S</dt>
<dd>Focus the search field</dd>
<dt>⇤</dt>
<dd>Move up in search results</dd>
<dt>⇥</dt>
<dd>Move down in search results</dd>
<dt>⏎</dt>
<dd>Go to active search result</dd>
<dt>+</dt>
<dd>Collapse/expand all sections</dd>
</dl>
</div>
<div class="infos">
<h2>Search Tricks</h2>
<p>
Prefix searches with a type followed by a colon (e.g.
<code>fn:</code>) to restrict the search to a given type.
</p>
<p>
Accepted types are: <code>fn</code>, <code>mod</code>,
<code>struct</code>, <code>enum</code>,
<code>trait</code>, <code>type</code>, <code>macro</code>,
and <code>const</code>.
</p>
<p>
Search functions by type signature (e.g.
<code>vec -> usize</code> or <code>* -> vec</code>)
</p>
</div>
</div>
</aside>
<script>
window.rootPath = "../";
window.currentCrate = "utf8_ranges";
</script>
<script src="../main.js"></script>
<script defer src="../search-index.js"></script>
</body>
</html>