pnmixer-rust/utf8_ranges/index.html

214 lines
13 KiB
HTML
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="generator" content="rustdoc">
<meta name="description" content="API documentation for the Rust `utf8_ranges` crate.">
<meta name="keywords" content="rust, rustlang, rust-lang, utf8_ranges">
<title>utf8_ranges - Rust</title>
<link rel="stylesheet" type="text/css" href="../normalize.css">
<link rel="stylesheet" type="text/css" href="../rustdoc.css">
<link rel="stylesheet" type="text/css" href="../main.css">
</head>
<body class="rustdoc mod">
<!--[if lte IE 8]>
<div class="warning">
This old browser is unsupported and will most likely display funky
things.
</div>
<![endif]-->
<nav class="sidebar">
<p class='location'>Crate utf8_ranges</p><div class="block items"><ul><li><a href="#structs">Structs</a></li><li><a href="#enums">Enums</a></li></ul></div><p class='location'></p><script>window.sidebarCurrent = {name: 'utf8_ranges', ty: 'mod', relpath: '../'};</script>
</nav>
<nav class="sub">
<form class="search-form js-only">
<div class="search-container">
<input class="search-input" name="search"
autocomplete="off"
placeholder="Click or press S to search, ? for more options…"
type="search">
</div>
</form>
</nav>
<section id='main' class="content">
<h1 class='fqn'><span class='in-band'>Crate <a class="mod" href=''>utf8_ranges</a></span><span class='out-of-band'><span id='render-detail'>
<a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs">
[<span class='inner'>&#x2212;</span>]
</a>
</span><a class='srclink' href='../src/utf8_ranges/lib.rs.html#1-511' title='goto source code'>[src]</a></span></h1>
<div class='docblock'><p>Crate <code>utf8-ranges</code> converts ranges of Unicode scalar values to equivalent
ranges of UTF-8 bytes. This is useful for constructing byte based automatons
that need to embed UTF-8 decoding.</p>
<p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and
an example.</p>
<h1 id='wait-what-is-this' class='section-header'><a href='#wait-what-is-this'>Wait, what is this?</a></h1>
<p>This is simplest to explain with an example. Let&#39;s say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:</p>
<pre class="rust rust-example-rendered">
[<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
<p>This is simple enough: simply encode the boundaries, <code>0400</code> encodes to
<code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>, and create ranges from each
corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p>
<p>However, what if you wanted to add the Cyrillic Supplementary characters to
your range? Your range might then become <code>[0400-052F]</code>. The same procedure
as above doesn&#39;t quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges
you&#39;d get from the previous transformation would be <code>[D0-D4][80-AF]</code>. However,
this isn&#39;t quite correct because this range doesn&#39;t capture many characters,
for example, <code>04FF</code> (because its last byte, <code>BF</code> isn&#39;t in the range <code>80-AF</code>).</p>
<p>Instead, you need multiple sequences of byte ranges:</p>
<pre class="rust rust-example-rendered">
[<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>] <span class="attribute"># <span class="ident">matches</span> <span class="ident">codepoints</span> <span class="number">0400</span><span class="op">-</span><span class="number">04FF</span>
[<span class="ident">D4</span>]</span>[<span class="number">80</span><span class="op">-</span><span class="ident">AF</span>] <span class="attribute"># <span class="ident">matches</span> <span class="ident">codepoints</span> <span class="number">0500</span><span class="op">-</span><span class="number">052F</span></pre>
<p>This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequence of byte
ranges for the basic multilingual plane (<code>[0000-FFFF]</code>) look like this:</p>
<pre class="rust rust-example-rendered">
[<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
[<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
<p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.</p>
<p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p>
<pre class="rust rust-example-rendered">
[<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
[<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F0</span>][<span class="number">90</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F1</span><span class="op">-</span><span class="ident">F3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F4</span>][<span class="number">80</span><span class="op">-</span><span class="number">8F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>
<p>This crate automates the process of creating these byte ranges from ranges of
Unicode scalar values.</p>
<h1 id='why-would-i-ever-use-this' class='section-header'><a href='#why-would-i-ever-use-this'>Why would I ever use this?</a></h1>
<p>You probably won&#39;t ever need this. In 99% of cases, you just decode the byte
sequence into a Unicode scalar value and compare scalar values directly.
However, this explicit decoding step isn&#39;t always possible. For example, the
construction of some finite state machines may benefit from converting ranges
of scalar values into UTF-8 decoder automata (e.g., for character classes in
regular expressions).</p>
<h1 id='lineage' class='section-header'><a href='#lineage'>Lineage</a></h1>
<p>I got the idea and general implementation strategy from Russ Cox in his
<a href="https://swtch.com/%7Ersc/regexp/regexp3.html">article on regexps</a> and RE2.
Russ Cox got it from Ken Thompson&#39;s <code>grep</code> (no source, folk lore?).
I also got the idea from
<a href="https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>,
which uses it for executing automata on their term index.</p>
</div><h2 id='structs' class='section-header'><a href="#structs">Structs</a></h2>
<table>
<tr class=' module-item'>
<td><a class="struct" href="struct.Utf8Range.html"
title='struct utf8_ranges::Utf8Range'>Utf8Range</a></td>
<td class='docblock-short'>
<p>A single inclusive range of UTF-8 bytes.</p>
</td>
</tr>
<tr class=' module-item'>
<td><a class="struct" href="struct.Utf8Sequences.html"
title='struct utf8_ranges::Utf8Sequences'>Utf8Sequences</a></td>
<td class='docblock-short'>
<p>An iterator over ranges of matching UTF-8 byte sequences.</p>
</td>
</tr></table><h2 id='enums' class='section-header'><a href="#enums">Enums</a></h2>
<table>
<tr class=' module-item'>
<td><a class="enum" href="enum.Utf8Sequence.html"
title='enum utf8_ranges::Utf8Sequence'>Utf8Sequence</a></td>
<td class='docblock-short'>
<p>Utf8Sequence represents a sequence of byte ranges.</p>
</td>
</tr></table></section>
<section id='search' class="content hidden"></section>
<section class="footer"></section>
<aside id="help" class="hidden">
<div>
<h1 class="hidden">Help</h1>
<div class="shortcuts">
<h2>Keyboard Shortcuts</h2>
<dl>
<dt>?</dt>
<dd>Show this help dialog</dd>
<dt>S</dt>
<dd>Focus the search field</dd>
<dt>&larrb;</dt>
<dd>Move up in search results</dd>
<dt>&rarrb;</dt>
<dd>Move down in search results</dd>
<dt>&#9166;</dt>
<dd>Go to active search result</dd>
<dt>+</dt>
<dd>Collapse/expand all sections</dd>
</dl>
</div>
<div class="infos">
<h2>Search Tricks</h2>
<p>
Prefix searches with a type followed by a colon (e.g.
<code>fn:</code>) to restrict the search to a given type.
</p>
<p>
Accepted types are: <code>fn</code>, <code>mod</code>,
<code>struct</code>, <code>enum</code>,
<code>trait</code>, <code>type</code>, <code>macro</code>,
and <code>const</code>.
</p>
<p>
Search functions by type signature (e.g.
<code>vec -> usize</code> or <code>* -> vec</code>)
</p>
</div>
</div>
</aside>
<script>
window.rootPath = "../";
window.currentCrate = "utf8_ranges";
</script>
<script src="../main.js"></script>
<script defer src="../search-index.js"></script>
</body>
</html>