pnmixer-rust/utf8_ranges/index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="generator" content="rustdoc">
    <meta name="description" content="API documentation for the Rust `utf8_ranges` crate.">
    <meta name="keywords" content="rust, rustlang, rust-lang, utf8_ranges">

    <title>utf8_ranges - Rust</title>

    <link rel="stylesheet" type="text/css" href="../normalize.css">
    <link rel="stylesheet" type="text/css" href="../rustdoc.css">
    <link rel="stylesheet" type="text/css" href="../main.css">


</head>
<body class="rustdoc mod">
    <!--[if lte IE 8]>
    <div class="warning">
        This old browser is unsupported and will most likely display funky
        things.
    </div>
    <![endif]-->


    <nav class="sidebar">

        <p class='location'>Crate utf8_ranges</p><div class="block items"><ul><li><a href="#structs">Structs</a></li><li><a href="#enums">Enums</a></li></ul></div><p class='location'></p><script>window.sidebarCurrent = {name: 'utf8_ranges', ty: 'mod', relpath: '../'};</script>
    </nav>

    <nav class="sub">
        <form class="search-form js-only">
            <div class="search-container">
                <input class="search-input" name="search"
                       autocomplete="off"
                       placeholder="Click or press ‘S’ to search, ‘?’ for more options…"
                       type="search">
            </div>
        </form>
    </nav>

    <section id='main' class="content">
<h1 class='fqn'><span class='in-band'>Crate <a class="mod" href=''>utf8_ranges</a></span><span class='out-of-band'><span id='render-detail'>
                   <a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs">
                       [<span class='inner'>&#x2212;</span>]
                   </a>
               </span><a class='srclink' href='../src/utf8_ranges/lib.rs.html#1-511' title='goto source code'>[src]</a></span></h1>
<div class='docblock'><p>Crate <code>utf8-ranges</code> converts ranges of Unicode scalar values to equivalent
ranges of UTF-8 bytes. This is useful for constructing byte based automatons
that need to embed UTF-8 decoding.</p>

<p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and
an example.</p>

<h1 id='wait-what-is-this' class='section-header'><a href='#wait-what-is-this'>Wait, what is this?</a></h1>
<p>This is simplest to explain with an example. Let&#39;s say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:</p>

<pre class="rust rust-example-rendered">
[<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>

<p>This is simple enough: simply encode the boundaries, <code>0400</code> encodes to
<code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>, and create ranges from each
corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p>

<p>However, what if you wanted to add the Cyrillic Supplementary characters to
your range? Your range might then become <code>[0400-052F]</code>. The same procedure
as above doesn&#39;t quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges
you&#39;d get from the previous transformation would be <code>[D0-D4][80-AF]</code>. However,
this isn&#39;t quite correct because this range doesn&#39;t capture many characters,
for example, <code>04FF</code> (because its last byte, <code>BF</code> isn&#39;t in the range <code>80-AF</code>).</p>

<p>Instead, you need multiple sequences of byte ranges:</p>

<pre class="rust rust-example-rendered">
[<span class="ident">D0</span><span class="op">-</span><span class="ident">D3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]  <span class="attribute"># <span class="ident">matches</span> <span class="ident">codepoints</span> <span class="number">0400</span><span class="op">-</span><span class="number">04FF</span>
[<span class="ident">D4</span>]</span>[<span class="number">80</span><span class="op">-</span><span class="ident">AF</span>]     <span class="attribute"># <span class="ident">matches</span> <span class="ident">codepoints</span> <span class="number">0500</span><span class="op">-</span><span class="number">052F</span></pre>

<p>This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequence of byte
ranges for the basic multilingual plane (<code>[0000-FFFF]</code>) look like this:</p>

<pre class="rust rust-example-rendered">
[<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
[<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>

<p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.</p>

<p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p>

<pre class="rust rust-example-rendered">
[<span class="number">0</span><span class="op">-</span><span class="number">7F</span>]
[<span class="ident">C2</span><span class="op">-</span><span class="ident">DF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E0</span>][<span class="ident">A0</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">E1</span><span class="op">-</span><span class="ident">EC</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">ED</span>][<span class="number">80</span><span class="op">-</span><span class="number">9F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">EE</span><span class="op">-</span><span class="ident">EF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F0</span>][<span class="number">90</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F1</span><span class="op">-</span><span class="ident">F3</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]
[<span class="ident">F4</span>][<span class="number">80</span><span class="op">-</span><span class="number">8F</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>][<span class="number">80</span><span class="op">-</span><span class="ident">BF</span>]</pre>

<p>This crate automates the process of creating these byte ranges from ranges of
Unicode scalar values.</p>

<h1 id='why-would-i-ever-use-this' class='section-header'><a href='#why-would-i-ever-use-this'>Why would I ever use this?</a></h1>
<p>You probably won&#39;t ever need this. In 99% of cases, you just decode the byte
sequence into a Unicode scalar value and compare scalar values directly.
However, this explicit decoding step isn&#39;t always possible. For example, the
construction of some finite state machines may benefit from converting ranges
of scalar values into UTF-8 decoder automata (e.g., for character classes in
regular expressions).</p>

<h1 id='lineage' class='section-header'><a href='#lineage'>Lineage</a></h1>
<p>I got the idea and general implementation strategy from Russ Cox in his
<a href="https://swtch.com/%7Ersc/regexp/regexp3.html">article on regexps</a> and RE2.
Russ Cox got it from Ken Thompson&#39;s <code>grep</code> (no source, folk lore?).
I also got the idea from
<a href="https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>,
which uses it for executing automata on their term index.</p>
</div><h2 id='structs' class='section-header'><a href="#structs">Structs</a></h2>
<table>
                       <tr class=' module-item'>
                           <td><a class="struct" href="struct.Utf8Range.html"
                                  title='struct utf8_ranges::Utf8Range'>Utf8Range</a></td>
                           <td class='docblock-short'>
                                <p>A single inclusive range of UTF-8 bytes.</p>
                           </td>
                       </tr>
                       <tr class=' module-item'>
                           <td><a class="struct" href="struct.Utf8Sequences.html"
                                  title='struct utf8_ranges::Utf8Sequences'>Utf8Sequences</a></td>
                           <td class='docblock-short'>
                                <p>An iterator over ranges of matching UTF-8 byte sequences.</p>
                           </td>
                       </tr></table><h2 id='enums' class='section-header'><a href="#enums">Enums</a></h2>
<table>
                       <tr class=' module-item'>
                           <td><a class="enum" href="enum.Utf8Sequence.html"
                                  title='enum utf8_ranges::Utf8Sequence'>Utf8Sequence</a></td>
                           <td class='docblock-short'>
                                <p>Utf8Sequence represents a sequence of byte ranges.</p>
                           </td>
                       </tr></table></section>
    <section id='search' class="content hidden"></section>

    <section class="footer"></section>

    <aside id="help" class="hidden">
        <div>
            <h1 class="hidden">Help</h1>

            <div class="shortcuts">
                <h2>Keyboard Shortcuts</h2>

                <dl>
                    <dt>?</dt>
                    <dd>Show this help dialog</dd>
                    <dt>S</dt>
                    <dd>Focus the search field</dd>
                    <dt>&larrb;</dt>
                    <dd>Move up in search results</dd>
                    <dt>&rarrb;</dt>
                    <dd>Move down in search results</dd>
                    <dt>&#9166;</dt>
                    <dd>Go to active search result</dd>
                    <dt>+</dt>
                    <dd>Collapse/expand all sections</dd>
                </dl>
            </div>

            <div class="infos">
                <h2>Search Tricks</h2>

                <p>
                    Prefix searches with a type followed by a colon (e.g.
                    <code>fn:</code>) to restrict the search to a given type.
                </p>

                <p>
                    Accepted types are: <code>fn</code>, <code>mod</code>,
                    <code>struct</code>, <code>enum</code>,
                    <code>trait</code>, <code>type</code>, <code>macro</code>,
                    and <code>const</code>.
                </p>

                <p>
                    Search functions by type signature (e.g.
                    <code>vec -> usize</code> or <code>* -> vec</code>)
                </p>
            </div>
        </div>
    </aside>


    <script>
        window.rootPath = "../";
        window.currentCrate = "utf8_ranges";
    </script>
    <script src="../main.js"></script>
    <script defer src="../search-index.js"></script>
</body>
</html>