# Strings and Characters
This document describes how the built-in Dafny types `string` and `char`
are compiled to the various supported target languages.
The `string` type is just an alias for `seq<char>`, so most of the interesting
decisions center around how to represent the `char` type faithfully in each
target language. Historically, the `char` type has acted much like the
following newtype declaration:

```dafny
newtype char = c: int | 0 <= c < 65536
```
That is, `char` really meant “UTF-16 code unit”. Since Dafny `string`
values can contain invalid uses of surrogates, and since this is suboptimal
for interoperating with other encodings such as UTF-8, the `--unicode-char`
option was added to enable defining `char` to mean “Unicode scalar value”
instead. This is equivalent to the following newtype declaration:

```dafny
newtype char = c: int | 0 <= c < 0xD800 || 0xE000 <= c < 0x11_0000
```
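For a concrete sense of what these definitions mean at the Dafny source level, here is a small illustrative snippet (not taken from the test suite; it uses only ASCII characters, so it behaves the same under either setting):

```dafny
method CharsAreCodePoints() {
  // A string literal is the same value as the corresponding sequence display.
  var s: string := "Hi";
  assert s == ['H', 'i'];
  // Each char converts to the numeric value constrained by the newtypes above.
  assert 'H' as int == 72;
  assert 0 <= 'H' as int < 0xD800;
}
```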
The target-language type for `char` is chosen primarily to
ensure enough bits to efficiently store and work with these ranges of values,
but also secondarily to align with the idiomatic character type where possible,
and to support outputting the expected representation when passing characters
and strings to the `print` statement.
| Language   | `--unicode-char=false`          | `--unicode-char=true`                            |
|------------|---------------------------------|--------------------------------------------------|
| C#         | `char`                          | `Dafny.Rune`                                     |
| Java       | `char`                          | `int` / `dafny.CodePoint`                        |
| JavaScript | `string` of length 1            | `_dafny.CodePoint` (`number` wrapper)            |
| Python     | `str` of length 1               | `_dafny.CodePoint` (`str` of length 1 wrapper)   |
| Go         | `_dafny.Char` (`rune` wrapper)  | `_dafny.CodePoint` (`rune` wrapper)              |
| C++        | `char`                          | (not supported)                                  |
The various runtimes for each language support both use cases,
without any need for conditional compilation or multiple packages.
Instead, some utility functions have two different implementations with similar
names but different signatures, and the corresponding compilers select
which function to reference based on the value of `--unicode-char`.
For example, most runtimes have a function named something like `SeqFromString`
for converting from the native string type to the appropriate type for the Dafny
runtime if `--unicode-char=false`, but also a version named something like
`UnicodeSeqFromString` for the `--unicode-char=true` case. Both accept the
same input type, but the former will return something like a `Seq<Character>`
whereas the latter will return something like a `Seq<CodePoint>`.
Here are some notes to explain and justify the choices in each language
when `--unicode-char=true`:
## C#
The `Dafny.Rune` struct is a wrapper around an `int` value,
and its API guarantees that invalid values (e.g. surrogates) will be rejected on construction.
Because C# optimizes this special case of a struct with a single value,
using this in place of a direct `int` value when `--unicode-char` is enabled
does not have any runtime overhead.
We would use the standard `System.Text.Rune` struct instead,
but the `DafnyRuntime` package is built for older .NET Framework versions
that don’t include this type.
## Java
Java does not include a dedicated type for Unicode scalar values,
and instead uses the 32-bit-wide `int` primitive type in general.
The Java backend uses `int` whenever the static type of a value is known,
but boxing of some kind is necessary to treat a primitive `int` value
as a reference value when interfacing with generic code.
The solution is to define our own `CodePoint` boxing type that wraps an `int`
just as `Integer` does, but knows the true intended type more precisely
and overrides `toString()` accordingly. We also define a `TypeDescriptor`
for `UNICODE_CHAR` that uses `int` as the primitive, unboxed type
but `CodePoint` as the boxing type,
and this enables using the optimized `dafny.Array` class to store a
sequence of code points in an `int[]` instead of in a `CodePoint[]`.
## JavaScript / Python / Go
In all three of these runtimes, the `_dafny.CodePoint` type is a wrapper
around a reasonable built-in type for code points.
Note that in Go, the underlying `rune` type allows surrogate values,
and hence can be used whether `--unicode-char` is enabled or not.
## C++
The C++ `char` type is unfortunately only 8 bits wide, and hence
is not an adequate choice for arbitrary UTF-16 code units.
Addressing this and supporting the 21-bit range of Unicode scalar values
is deferred for now, as it will require pulling in additional libraries
to support printing to standard out in order to implement the `print`
statement.
## Printing strings and characters
Although `string` is treated as a pure type alias within the core semantics
of Dafny, it is highly beneficial to `print` a string value as `Hello`
rather than `['H', 'e', 'l', 'l', 'o']`. With `--unicode-char=false`,
the various runtimes make best-effort attempts
to identify which sequence values contain character values and hence should be
printed as strings. This means the behavior depends on how easy it is
to track this type information at runtime, and hence is not consistent
between backends.
With `--unicode-char=true`, however, this is simplified
and consistent across backends: a value is only printed as a string if
its static type is known to be `seq<char>`. This means that when using
more generic datatypes, string values will be printed as sequence values:
when printing an `Option<string>` value, for example.
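As a small illustration of this distinction (using a locally defined `Option` datatype; the exact rendering of datatype values varies by backend, so the comments below only describe the string-versus-sequence difference):

```dafny
datatype Option<T> = None | Some(value: T)

method Main() {
  var s: string := "Hello";
  // The static type here is seq<char>, so this prints: Hello
  print s, "\n";
  var o := Some(s);
  // Inside the generic datatype the payload's static type is the type
  // parameter T rather than seq<char>, so with --unicode-char=true the
  // string is printed as a sequence of character values instead.
  print o, "\n";
}
```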
This is generally implemented by including a function named something similar
to `ToVerbatimString` on the sequence type in each runtime. This is only
invoked in generated code when the sequence is known to be a `seq<char>`,
and hence the runtimes are free to cast as needed to treat the values as
strings.
`--unicode-char` also changes how character values are printed by the
`print` statement: `'D'` will print as `D` when `--unicode-char=false`,
but as `'D'` when `--unicode-char=true`.
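For example, here is a tiny illustrative program; the outputs described in the comments follow the rule just stated:

```dafny
method Main() {
  var c: char := 'D';
  print c, "\n";
  // With --unicode-char=false this prints:  D
  // With --unicode-char=true  this prints:  'D'
}
```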
## String and character literals
The simplest way to compile literals would be to materialize them as
equivalent sequence displays of code point values, un-escaping escape sequences
as needed. In other words, this would compile a Dafny `"Hello"` string literal
to something like `Seq(72, 101, 108, 108, 111)`.
This would be a substantial hit to the readability of the compiled code,
however, making it harder for Dafny users and contributors to debug issues:
string literals are often helpful landmarks for orienting oneself
when trying to understand the compiled code. Therefore,
character and string literals are still translated to literals in the target
language where possible.
The `Util.MightContainNonAsciiCharacters` method is used to conservatively
identify string and character literals that should instead use the more general
approach of raw code point values.
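For instance (an illustrative sketch only; the exact compiled form differs per backend), a literal containing only ASCII characters can remain a readable literal in the target language, whereas one that might contain other characters is conservatively emitted from its code point values:

```dafny
method Literals() {
  var ascii := "Hello";     // eligible to stay a target-language string literal
  var accented := "héllo";  // conservatively compiled from raw code point values
}
```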