Breaking Rust strings

May 5, 2018    rust

fn main() {
    let chinese_text = "可通過每頁左上角的連結隨時調整";
    let chinese_text_slice = &chinese_text[0..1]; // Focus here
    println!("{}", chinese_text_slice);
}

In the above code I want to extract the first character, which is 可. Since end indices are exclusive in Rust, I used the slice 0..1 on chinese_text.

If you expected the output to be 可, you are in for a surprise!

break_strings$ RUST_BACKTRACE=1 cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/break_strings`
thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside '可' (bytes 0..3) of `可通過每頁左上角的連結隨時調整`', src/libcore/str/mod.rs:2234:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:68
             at src/libstd/sys_common/backtrace.rs:57
             at src/libstd/panicking.rs:381
  << Blah blah blah>>
  17: std::sys_common::bytestring::debug_fmt_bytestring
             at src/libstd/panicking.rs:459
             at src/libstd/panic.rs:365
             at src/libstd/rt.rs:58
  18: std::rt::lang_start
             at /Users/travis/build/rust-lang/rust/src/libstd/rt.rs:74
  19: break_strings::main

The worst part is that this is a runtime error, and the backtrace does not even give the exact line number. 😓

The above was a simple three-liner. Now consider the same thing happening in a bigger project: say the string slice is extracted from a JSON REST response that gave you Chinese where you were expecting ASCII, and you can't tell which line the error occurred on. What will you do? Will you ditch slices altogether in Rust? I leave it to you…

Why did this happen

The error message says 'byte index 1 is not a char boundary; it is inside '可' (bytes 0..3) of `可通過每頁左上角的連結隨時調整`'. In Rust, string indices are actually byte indices, and a single character can occupy multiple bytes, because strings are stored as UTF-8. In this case the character 可 takes 3 bytes of storage. So when you create a string slice, it's up to you to make sure the start and end byte indices fall on char boundaries 😄. Good luck with that! 😄
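
To see the byte layout for yourself, here is a small sketch that uses only standard library methods (len, chars, is_char_boundary, char_indices, len_utf8). The helper name char_layout is made up for this example and is not part of the original program.

```rust
/// Hypothetical helper (not from the original post): for each character,
/// returns the byte offset where it starts, the character, and its UTF-8 width.
fn char_layout(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices()
        .map(|(offset, ch)| (offset, ch, ch.len_utf8()))
        .collect()
}

fn main() {
    let chinese_text = "可通過每頁左上角的連結隨時調整";

    // len() counts bytes, while chars().count() counts characters.
    println!(
        "bytes: {}, chars: {}",
        chinese_text.len(),
        chinese_text.chars().count()
    );

    // is_char_boundary() reports whether a byte index may start or end a slice.
    for i in 0..=3 {
        println!(
            "byte index {} is a char boundary: {}",
            i,
            chinese_text.is_char_boundary(i)
        );
    }

    // Show where the first few characters start and how wide they are.
    for (offset, ch, width) in char_layout(chinese_text).into_iter().take(3) {
        println!("{} starts at byte {} and is {} bytes wide", ch, offset, width);
    }
}
```

Byte indices 1 and 2 fall inside 可, which is why the slice 0..1 panics while 0..3 is fine.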

The fix

Since 可 takes 3 bytes of storage, we should slice out 3 bytes.

fn main() {
    let chinese_text = "可通過每頁左上角的連結隨時調整";
    let chinese_text_slice = &chinese_text[0..3]; // Focus here
    println!("{}", chinese_text_slice);
}

The slice 0..3 refers to bytes 0, 1 and 2.

break_strings$ RUST_BACKTRACE=1 cargo run
   Compiling break_strings v0.1.0 (file:///Users/kishanb/Programming/Personal/learn-rust/break_strings)
    Finished dev [unoptimized + debuginfo] target(s) in 0.47 secs
     Running `target/debug/break_strings`
可

Now your program works!
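
Hardcoding 3 only works because we happen to know how wide 可 is. As a rough sketch of how to avoid the magic number, you can ask the first char for its UTF-8 width, or use str::get, which returns None instead of panicking when an index is not on a char boundary. The helper name first_char_slice is my own invention for illustration. Also note that a Rust char is a Unicode scalar value, so this does not handle characters built from several code points.

```rust
/// Hypothetical helper (not from the original post): returns the first
/// character of `s` as a string slice, or None if `s` is empty.
fn first_char_slice(s: &str) -> Option<&str> {
    // Take the first char and use its UTF-8 width as the end of the slice,
    // so the slice always ends on a char boundary.
    let first = s.chars().next()?;
    Some(&s[..first.len_utf8()])
}

fn main() {
    let chinese_text = "可通過每頁左上角的連結隨時調整";

    // str::get is the non-panicking cousin of indexing.
    match chinese_text.get(0..1) {
        Some(slice) => println!("got: {}", slice),
        None => println!("0..1 is not on a char boundary"),
    }

    // The helper works without knowing the width in advance.
    if let Some(first) = first_char_slice(chinese_text) {
        println!("{}", first);
    }

    // It behaves the same way for plain ASCII input.
    if let Some(first) = first_char_slice("hello") {
        println!("{}", first);
    }
}
```

With get, bad indices become an Option you have to handle at the call site instead of a panic somewhere at run time.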


