- Mastering Julia
- Malcolm Sherrington
- 2021-07-16 13:42:40
Char and strings
So far we have been dealing with numeric and boolean datatypes. In this section we will look at character representation and how Julia handles ASCII and UTF-8 strings of characters. We will also introduce the concept of regular expressions, widely used in pattern matching and filtering operations.
Characters
Julia has a built-in type Char to represent a character. A character occupies 32 bits, not 8, so it can represent any Unicode symbol, and it may be assigned in a number of ways:
julia> c = 'A'
julia> c = char(65)
julia> c = '\U0041'
All these represent the ASCII character capital A.
It is possible to specify a character code of '\Uffff', but char conversion does not check that every value is valid. However, Julia provides an is_valid_char() function:
julia> c = '\Udff3';
julia> is_valid_char(c) # => gives false.
Julia uses the special C-like syntax for certain ASCII control characters such as '\b', '\t', '\n', '\r', '\f' for backspace, tab, newline, carriage return and form feed. Otherwise the backslash acts as an escape character, so int('\s') gives 115 (the code for s) whereas int('\t') gives 9.
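As a rough sketch, the same character operations can be tried in current Julia (1.x). This is an assumption on our part: the text above uses the older lowercase char() and int() names, whereas 1.x spells the conversions Char() and Int() and checks validity with isvalid().

```julia
# Sketch for Julia 1.x (assumption: the section itself targets an older release).
c1 = 'A'          # character literal
c2 = Char(65)     # from an integer code point
c3 = '\u41'       # from a Unicode escape

println(c1 == c2 == c3)             # do all three denote capital A?
println(sizeof(Char))               # size of a Char in bytes
println(Int('s'), " ", Int('\t'))   # code points of 's' and of tab

# 0xdff3 lies in the surrogate range, so it is not a valid character.
println(isvalid(Char(0xdff3)))
println(isvalid('A'))
```

Note that 1.x rejects unknown escapes such as '\s', so the plain literal 's' is used here instead.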
Strings
The type of string we are most familiar with comprises a list of ASCII characters which, in Julia, are normally delimited with double quotes, that is:
julia> s = "Hello there, Blue Eyes";
julia> typeof(s)
ASCIIString (constructor with 2 methods)
In fact a string is an abstraction, not a concrete type, and ASCIIString is only one implementation of that abstraction. Looking at Base::boot.jl we see:
abstract String
abstract DirectIndexString <: String
immutable ASCIIString <: DirectIndexString
    data::Array{Uint8,1}
end
immutable UTF8String <: String
    data::Array{Uint8,1}
end
typealias ByteString Union(ASCIIString,UTF8String)
In Julia (as in Java), strings are immutable: that is, the value of a String object cannot be changed. To construct a different string value, you construct a new string from parts of other strings.
ASCII strings are also indexable, so from s as defined previously, s[14:17] gives "Blue". The values in the range are inclusive and, if we wish, we can change the increment, as in s[14:2:17], which gives "Bu", or reverse the slice, as in s[17:-1:14], which gives "eulB". Omitting the end of the range is equivalent to running to the end of the string: s[14:] gives "Blue Eyes".
However, s[:14] is somewhat unexpected and gives the character B, not the string up to and including B. This is because the : defines a 'symbol', and for a literal, :14 is equivalent to 14, so s[:14] is the same as s[14] and not s[1:14].
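The slicing rules can be checked with a short sketch. It assumes current Julia (1.x), where an open-ended range such as s[14:] is no longer accepted and the keyword end is written explicitly.

```julia
# Sketch for Julia 1.x; indices are byte positions, which coincide with
# character positions here because the string is pure ASCII.
s = "Hello there, Blue Eyes"

println(s[14:17])       # inclusive range
println(s[14:2:17])     # every second character in the range
println(s[17:-1:14])    # a negative step reverses the slice
println(s[14:end])      # `end` stands for the last index
println(s[14])          # a single index yields a Char, not a String
println(s[1:14])        # the string up to and including the B
```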
Strings allow for special characters such as '\n', '\t', and so on. If we wish to include the double quote we can escape it, but Julia also provides a """ delimiter. So s = "This is the double quote \" character" and s = """This is the double quote " character""" are equivalent:
julia> s = "This is a double quote \" character."; println(s);
This is a double quote " character.
Strings also support the $ convention for interpolating the value of a variable:
julia> age = 21; s = "I've been $age for many years now!"
"I've been 21 for many years now!"
Concatenation of strings can be done using the $ convention, but Julia also uses the * operator (rather than + or some other symbol):
julia> s = "Who are you?";
julia> t = " said the Caterpillar."
The following two expressions are directly equivalent:
julia> s*t
"Who are you? said the Caterpillar."
julia> "$s$t"
"Who are you? said the Caterpillar."
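A brief sketch, assuming current Julia (1.x), pulls together quoting, interpolation and concatenation. The $(...) form and the string() function are not mentioned in the text above; they are included only as closely related alternatives.

```julia
# Escaped quote versus triple-quoted delimiter
quoted1 = "This is the double quote \" character"
quoted2 = """This is the double quote " character"""
println(quoted1 == quoted2)

# Interpolation of a variable, and of an expression with $(...)
age = 21
println("I've been $age for many years now!")
println("Next year I'll claim to be $(age + 1).")

# Three ways to concatenate
s = "Who are you?"
t = " said the Caterpillar."
println(s * t == "$s$t" == string(s, t))
```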
Unicode support
We saw from the definition above that apart from ASCII strings Julia defines UTF-8 strings. In fact UTF-8 is not all that Julia supports and adding support for new encodings is quite easy. In particular, Julia also provides UTF16String and UTF32String types, constructed by the utf16(s) and utf32(s) functions respectively, for UTF-16 and UTF-32 encodings.
Julia provides a function endof(), which can be used to determine the last index of a string, and a symbol end to denote the last index position.
Because of variable-length encodings, the number of characters in a string, which is given by length(s), is not always the same as the last index. If you iterate through the indices 1 through endof(s) and index into s, the sequence of characters returned when errors aren't thrown is the sequence of characters comprising the string. Thus we have the inequality length(s) <= endof(s), since each character in a string must have its own index:
julia> s = "\u2200 x \u2203 y" # this is the mathematical expression .....
julia> typeof(s) # => UTF8String
julia> endof(s) # => 11
julia> length(s) # => 7
julia> s[end] # => 'y'
In this case, since the string s has two UTF characters each occupying 3 bytes, the only valid indices are s[1], s[4], s[5], s[6], s[7], s[10], s[11], so for example s[7] will return the character '\u2203'.
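The relationship between byte indices and characters can be inspected directly. The sketch below assumes current Julia (1.x), in which endof() has been renamed lastindex(); eachindex(), isvalid() and nextind() are standard functions not mentioned in the text above.

```julia
# Sketch for Julia 1.x: two 3-byte characters among single-byte ASCII ones.
s = "\u2200 x \u2203 y"

println(lastindex(s))             # last valid byte index (endof in older releases)
println(length(s))                # number of characters
println(collect(eachindex(s)))    # every valid character index
println(isvalid(s, 2))            # index 2 falls inside the first 3-byte character
println(s[7] == '\u2203')         # index 7 starts the second 3-byte character
println(nextind(s, 1))            # index of the character following the one at 1
```

Iterating with eachindex(s), or simply with for c in s, avoids the errors raised by indexing into the middle of a multi-byte character.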
Regular expressions
Regular expressions (Regex) came to prominence with their inclusion in the Perl programming language.
There is an old adage: "I had a problem and decided to solve it using regular expressions, now I have two problems".
Regular expressions are used for pattern matching. Numerous books have been written on them, and support is available in a variety of other programming languages post-Perl, notably Java and Python.
Julia supports regular expressions via a special form of string prefixed with an r.
Suppose we define the pattern empat as:
empat = r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"
The following example will give a clue to what the pattern is associated with:
julia> ismatch(empat, "fred.flintstone@bedrock.net") # => true
julia> ismatch(empat, "Fredrick Flintstone@bedrock.net") # => false
The pattern is for a valid e-mail address and in the second case the space in "Fredrick Flintstone" is not valid, so the match fails. (The pattern is also case-sensitive, so the capital F would be rejected even without the space.)
Since we may wish to know not only whether a string matches a certain pattern but also how it is matched, Julia has a function match():
julia> m = match(r"@bedrock","barney,rubble@bedrock.net")
RegexMatch("@bedrock")
If this matches, the function returns a RegexMatch object, otherwise it returns nothing (the single value of type Nothing):
julia> m.match # => "@bedrock"
julia> m.offset # => 14
julia> m.captures # => 0-element Array{Union(SubString{UTF8String},Nothing),1}
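The matching functions can be exercised as follows. The sketch assumes current Julia (1.x), where ismatch() has been replaced by occursin(). The second pattern, with two capture groups, and the "@slate" pattern are purely illustrative and not from the text.

```julia
# Sketch for Julia 1.x (occursin replaces the older ismatch).
empat = r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"

println(occursin(empat, "fred.flintstone@bedrock.net"))
println(occursin(empat, "Fredrick Flintstone@bedrock.net"))

m = match(r"@bedrock", "barney,rubble@bedrock.net")
println(m.match, " at offset ", m.offset)

# Parentheses create capture groups (illustrative pattern).
m2 = match(r"@([a-z]+)\.([a-z]+)$", "barney,rubble@bedrock.net")
println(m2.captures)

# A failed match returns `nothing`, so test for it before using the result.
println(match(r"@slate", "barney,rubble@bedrock.net") === nothing)
```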
Byte array literals
Another special form is the byte array literal, b"...", which enables string notation to express arrays of Uint8 values.
The rules for byte array literals are the following:
- ASCII characters and ASCII escapes produce a single byte
- \x and octal escape sequences produce the byte corresponding to the escape value
- Unicode escape sequences produce a sequence of bytes encoding that code point in UTF-8
Consider the following two examples:
julia> A = b"HEX:\xefcc" # => 7-element Array{Uint8,1}:[0x48,0x45,0x58,0x3a,0xef,0x63,0x63]
julia> B = b"\u2200 x \u2203 y" # => 11-element Array{Uint8,1}:[0xe2,0x88,0x80,0x20,0x78,0x20,0xe2,0x88,0x83,0x20,0x79]
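The three rules can be verified with a small sketch, assuming current Julia (1.x). There the type is spelled UInt8, and a b"..." literal yields a read-only byte view rather than a plain array, so collect() is used here to obtain an ordinary Vector{UInt8}.

```julia
# Sketch for Julia 1.x: collect() copies the literal's bytes into a Vector{UInt8}.
A = collect(b"HEX:\xefcc")
println(length(A))
println(A == UInt8[0x48, 0x45, 0x58, 0x3a, 0xef, 0x63, 0x63])

B = collect(b"\u2200 x \u2203 y")
println(length(B))                             # 3 + 1 + 1 + 1 + 3 + 1 + 1 bytes
println(B[1:3] == UInt8[0xe2, 0x88, 0x80])     # UTF-8 encoding of \u2200
println(String(B[7:9]) == "\u2203")            # bytes 7 to 9 decode to \u2203
```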
Version literals
Version numbers can easily be expressed with non-standard string literals such as v"...".
Version number literals create VersionNumber objects which follow the specifications of semantic versioning (http://semver.org), and therefore are composed of major, minor and patch numeric values, followed by pre-release and build alpha-numeric annotations.
So a full specification typically would be v"0.3.1-rc1": the major version is "0", the minor version "3", the patch level "1" and the release candidate is 1. Only the major version needs to be provided and the others assume default values. So v"1" is equivalent to v"1.0.0".
We met the use of version numbers previously when using the package manager to pin a package to a specific version: Pkg.pin("NumericExtensions",v"0.2.1").
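A short sketch shows how the components of a version literal can be inspected and compared. The field names major, minor, patch and prerelease belong to VersionNumber; the comparison against the built-in VERSION constant is added only as a typical use.

```julia
# Sketch: taking a version literal apart and comparing versions.
v = v"0.3.1-rc1"
println(v.major, " ", v.minor, " ", v.patch)
println(v.prerelease)             # pre-release annotations as a tuple
println(v"1" == v"1.0.0")         # omitted components take default values
println(v < v"0.3.1")             # a pre-release sorts before its final release
println(VERSION >= v"1.0.0")      # VERSION holds the running Julia version
```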
An example
Let us look at some code to play the game Bulls and Cows. A computer program, moo, written in 1970 at MIT in PL/I, was among the first computer implementations of Bulls and Cows.
It has been proven that any number can be solved in at most seven turns and that the minimal average game length is 5.21 turns.
The computer generates a four-digit random number from the digits 1 to 9, without duplication. The player inputs a guess, and the program should validate it, reject guesses that are malformed, and then print the 'score' in terms of the number of bulls and cows.
The score is computed as follows:
- One bull is accumulated for each digit in the guess that equals the corresponding digit in the randomly chosen initial number
- One cow is accumulated for each digit in the guess that also appears in the randomly chosen number, but in the wrong position
- The player wins if the guess is the same as the randomly chosen number, and the program ends
- Otherwise the program accepts a new guess, incrementing the number of 'tries'
Coding it up in Julia:
function bacs()
    bulls = cows = turns = 0
    A = {}
    srand(int(time()))
    # Draw digits until four distinct ones have been collected
    while length(unique(A)) < 4
        push!(A,rand('1':'9'))
    end
    bacs_number = unique(A)
    println("Bulls and Cows")
    while (bulls != 4)
        print("Guess? ")
        if eof(STDIN)
            s = "q"
        else
            s = chomp(readline(STDIN))
        end
        if (s == "q")
            print("My guess was ")
            [print(bacs_number[i]) for i=1:4]
            return
        end
        guess = collect(s)
        # Reject anything that is not four distinct digits
        if !(length(unique(guess)) == length(guess) == 4 && all(isdigit,guess))
            print("\nEnter four distinct digits or q to quit: ")
            continue
        end
        bulls = sum(map(==, guess, bacs_number))
        cows = length(intersect(guess,bacs_number)) - bulls
        println("$bulls bulls and $cows cows!")
        turns += 1
    end
    println("You guessed my number in $turns turns.")
end
The preceding code can be explained as follows:
- We define an array A as A = {} rather than A = []. This is because, although arrays were described as homogeneous collections, Julia provides a type Any which can, as the name suggests, store any form of variable. This is similar to the Microsoft variant datatype.
  julia> A = {"There are ",10, " green bottles", " hanging on the wall.\n"}
  julia> [print(A[i]) for i = 1:length(A)]
  There are 10 green bottles hanging on the wall.
- Digits are created as characters using the rand() function and pushed onto A with push!().
- The array A may consist of more than 4 entries, so a unique() function is applied, which reduces it to 4 by eliminating duplicates, and this is stored in bacs_number.
- User input is via readline(STDIN) and this will be a string including the trailing return (\n), so a chomp() function is applied to remove it and the input is compared with q to allow an escape before the number is guessed.
- A collect() function is applied to return an array of type Char, and it is checked that there are 4 distinct elements and that these are all digits.
- The number of bulls is determined by comparing each entry in guess and bacs_number. This is achieved by using a map() function to apply the == operator; if there are 4 bulls then we are done. Otherwise it's possible to construct a new array as the intersection of guess and bacs_number, which will contain all the elements which match. So subtracting the number of 'bulls' leaves the number of cows.
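The validation and scoring steps can be isolated from the input loop so that they are easy to test. The sketch below assumes current Julia (1.x) and takes strings rather than character arrays; the helper names is_valid_guess and score are hypothetical and do not appear in the listing above.

```julia
# Hypothetical helpers (Julia 1.x); the logic follows the scoring rules listed above.
is_valid_guess(g::AbstractString) =
    length(g) == 4 && all(isdigit, g) && allunique(g)

function score(guess::AbstractString, secret::AbstractString)
    g, s = collect(guess), collect(secret)
    bulls = count(g .== s)                    # right digit, right position
    cows = length(intersect(g, s)) - bulls    # right digit, wrong position
    return bulls, cows
end

println(is_valid_guess("1234"), " ", is_valid_guess("1123"), " ", is_valid_guess("12a4"))
println(score("1234", "1243"))
println(score("1234", "5678"))
println(score("1234", "1234"))
```

Returning the pair (bulls, cows) keeps the function free of printing, so the game loop decides how to report the result.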