Char and strings

So far we have been dealing with numeric and boolean datatypes. In this section we will look at character representation and how Julia handles ASCII and UTF-8 strings of characters. We will also introduce the concept of regular expressions, widely used in pattern matching and filtering operations.

Characters

Julia has a built-in type Char to represent a character. A character occupies 32 bits, not 8, so it can hold any Unicode code point, and it may be assigned in a number of ways:

julia> c = 'A'
julia> c = char(65)
julia> c = '\U0041'

All these represent the ASCII character capital A.
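
As a minimal sketch, assuming a current Julia 1.x release (where the constructor is spelled Char(65) and Int(c) replaces int(c), unlike the lowercase forms used in this section), the three assignments can be compared directly:

```julia
# Three ways of writing capital A (Julia 1.x spelling assumed).
c1 = 'A'          # character literal
c2 = Char(65)     # built from the integer code point
c3 = '\U0041'     # Unicode escape: hexadecimal 41 is decimal 65

println(c1 == c2 == c3)   # the three values are the same character
println(Int(c1))          # the code point as an integer
println(sizeof(c1))       # storage size of a Char in bytes
```

The sizeof call is there only to illustrate the 32-bit storage mentioned above.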

It is possible to specify a character code such as '\Uffff', but char conversion does not check that every value is valid. However, Julia provides an is_valid_char() function:

julia> c = '\Udff3';
julia> is_valid_char(c) # => gives false.
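
A minimal sketch of the same check, assuming a current Julia 1.x release, where it is spelled isvalid(Char, x) rather than is_valid_char(x). The variable names are illustrative only:

```julia
# Validity check on raw code points (Julia 1.x spelling assumed).
surrogate_ok = isvalid(Char, 0xdff3)   # 0xD800-0xDFFF is reserved for UTF-16 surrogates
letter_ok    = isvalid(Char, 0x0041)   # capital A

println(surrogate_ok)
println(letter_ok)
```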

Julia uses the special C-like syntax for certain ASCII control characters such as '\b', '\t', '\n', '\r' and '\f' for backspace, tab, newline, carriage return and form feed respectively. Otherwise the backslash acts as an escape character, so int('\s') gives 115 (the code for s) whereas int('\t') gives 9.

Strings

The type of string we are most familiar with comprises a list of ASCII characters which, in Julia, are normally delimited with double quotes, that is:

julia> s = "Hello there, Blue Eyes"; typeof(s)
ASCIIString (constructor with 2 methods)

In fact String is an abstract type, not a concrete one, and ASCIIString is only one of its concrete subtypes. Looking at Base::boot.jl we see:

abstract String
abstract DirectIndexString <: String
immutable ASCIIString <: DirectIndexString
    data::Array{Uint8,1}
end
immutable UTF8String <: String
    data::Array{Uint8,1}
end
typealias ByteString Union(ASCIIString,UTF8String)

In Julia (as in Java), strings are immutable: that is, the value of a String object cannot be changed. To construct a different string value, you construct a new string from parts of other strings.
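
The following sketch illustrates immutability, assuming a current Julia 1.x release. Assigning to a position in a string raises an error, so a modified value is built from parts of the old one. The names mutated and t are illustrative only:

```julia
s = "Hello there"

# Attempting to overwrite a character fails: strings have no setindex! method.
mutated = try
    s[1] = 'J'
    true
catch err
    false
end

# Instead, construct a new string from a literal and a slice of the old one.
t = "J" * s[2:end]

println(mutated)
println(t)
println(s)   # the original string is unchanged
```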

ASCII strings are also indexable, so with s as defined previously, s[14:17] gives "Blue". The values in the range are inclusive, and if we wish we can change the increment, as in s[14:2:17], which gives "Bu", or reverse the slice, as in s[17:-1:14], which gives "eulB". Omitting the end of the range is equivalent to running to the end of the string: s[14:] gives "Blue Eyes".

However, s[:14] is somewhat unexpected and gives the character B, not the string up to and including B. This is because the : defines a 'symbol', and for a literal, :14 is equivalent to 14, so s[:14] is the same as s[14] and not s[1:14].
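
The slices above can be sketched as follows, assuming a current Julia 1.x release, in which the open-ended slice is written with an explicit end:

```julia
s = "Hello there, Blue Eyes"

word     = s[14:17]      # inclusive range
stepped  = s[14:2:17]    # every second character in the range
reversed = s[17:-1:14]   # a negative step walks backwards
tail     = s[14:end]     # explicit `end` runs to the last index
single   = s[14]         # a single index yields a Char, not a string

println(word, " | ", stepped, " | ", reversed, " | ", tail, " | ", single)
```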

Strings allow for special characters such as '\n', '\t', and so on. If we wish to include a double quote we can escape it, but Julia also provides a """ delimiter. So s = "This is the double quote \" character" and s = """This is the double quote " character""" are equivalent:

julia> s = "This is a double quote \" character."; println(s);
This is a double quote " character.
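
As a small check of that equivalence (the variable names are illustrative only), the escaped and triple-quoted forms produce the same string:

```julia
escaped = "This is a double quote \" character."
triple  = """This is a double quote " character."""

println(escaped == triple)   # both literals denote the same string
println(triple)
```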

Strings also provide the $ convention for interpolating the value of a variable:

julia> age = 21; s = "I've been $age for many years now!"
"I've been 21 for many years now!"
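
A short sketch of interpolation beyond a bare variable name: parentheses allow an arbitrary expression, and a backslash keeps the dollar sign literal. The strings t and u are illustrative additions:

```julia
age = 21
s = "I've been $age for many years now!"
t = "Next year I'll claim to be $(age + 1)."   # parentheses allow any expression
u = "The price is \$5"                          # escaped dollar sign, no interpolation

println(s)
println(t)
println(u)
```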

Concatenation of strings can be done using the $ convention, but Julia also uses the * operator (rather than + or some other symbol):

julia> s = "Who are you?";
julia> t = " said the Caterpillar."

The following two expressions are directly equivalent:

julia> s*t
"Who are you? said the Caterpillar."
julia> "$s$t"
"Who are you? said the Caterpillar."
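
The equivalence can be checked directly. The sketch below also uses the string() function as a third route, which is an addition to the two forms shown above:

```julia
s = "Who are you?"
t = " said the Caterpillar."

by_operator      = s * t          # concatenation operator
by_interpolation = "$s$t"         # interpolation of both variables
by_function      = string(s, t)   # string() joins its arguments

println(by_operator == by_interpolation == by_function)
```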

Unicode support

We saw from the definition above that apart from ASCII strings Julia defines UTF-8 strings. In fact UTF-8 is not all that Julia supports and adding support for new encodings is quite easy. In particular, Julia also provides UTF16String and UTF32String types, constructed by the utf16(s) and utf32(s) functions respectively, for UTF-16 and UTF-32 encodings.

Julia provides a function endof(), which can be used to determine the last index of a string, and a keyword end, which denotes that last index position when indexing.

Because of variable-length encodings, the number of characters in a string which is given by length(s) is not always the same as the last index. If you iterate through the indices 1 through endof(s) and index into s, the sequence of characters returned when errors aren't thrown is the sequence of characters comprising the string. Thus we have the identity that length(s) <= endof(s), since each character in a string must have its own index:

julia> s = "\u2200 x \u2203 y" # this is the mathematical expression ∀ x ∃ y
julia> typeof(s) # => UTF8String
julia> endof(s) # => 11
julia> length(s) # => 7
julia> s[end] # => 'y'

In this case, since the string s has two non-ASCII characters, each occupying 3 bytes in UTF-8, the only valid indices are s[1], s[4], s[5], s[6], s[7], s[10] and s[11], so for example s[7] will return the character '\u2203'.
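
The valid indices can be listed programmatically. This is a minimal sketch assuming a current Julia 1.x release, where lastindex(s) plays the role of endof(s) and eachindex(s) yields only the indices at which a character starts:

```julia
s = "\u2200 x \u2203 y"            # ∀ x ∃ y

nchars  = length(s)                # number of characters
lastidx = lastindex(s)             # last valid byte index
valid   = collect(eachindex(s))    # every index that starts a character

for i in valid
    println(i, " => ", s[i])
end

inside = isvalid(s, 2)             # index 2 falls inside the 3-byte encoding of ∀
println(inside)
```

The gaps after indices 1 and 7 are where the two three-byte characters sit, which is why length(s) is smaller than the last index.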

Regular expressions

Regular expressions (Regex) came to prominence with their inclusion in the Perl programming language.

There is an old adage: "I had a problem and decided to solve it using regular expressions, now I have two problems".

Regular expressions are used for pattern matching. Numerous books have been written on them, and support is available in a variety of post-Perl programming languages, notably Java and Python.

Julia supports regular expressions via a special form of string prefixed with an r.

Suppose we define the pattern empat as:

empat = r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"

The following example will give a clue to what the pattern is associated with:

julia> ismatch(empat, "fred.flintstone@bedrock.net") # => true
julia> ismatch(empat, "Fredrick Flintstone@bedrock.net") # => false

The pattern is for a valid e-mail address, and in the second case the space in "Fredrick Flintstone" is not valid, so the match fails. The pattern is also case-sensitive, so the uppercase F would be rejected in any case.
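
A minimal sketch of the same test, assuming a current Julia 1.x release, where the check is written occursin(pattern, string) instead of ismatch(pattern, string). The names good and bad are illustrative only:

```julia
# Same pattern as above; the r"..." literal does not interpolate, so the final $ is the end anchor.
empat = r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$"

good = occursin(empat, "fred.flintstone@bedrock.net")
bad  = occursin(empat, "Fredrick Flintstone@bedrock.net")

println(good)
println(bad)
```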

Since we may wish to know not only whether a string matches a certain pattern but also how it is matched, Julia has a function match():

julia> m = match(r"@bedrock","barney,rubble@bedrock.net")
RegexMatch("@bedrock")

If this matches, the function returns a RegexMatch object, otherwise it returns nothing:

julia> m.match # => "@bedrock"
julia> m.offset # => 14
julia> m.captures # => 0-element Array{Union(SubString{UTF8String},Nothing),1}
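
The sketch below, assuming a current Julia 1.x release, repeats the match and adds two illustrative cases that are not in the text: a pattern that fails to match, and a pattern with parenthesised groups so that captures is non-empty:

```julia
subject = "barney,rubble@bedrock.net"

m       = match(r"@bedrock", subject)      # plain match, no capture groups
nomatch = match(r"@slate", subject)        # no match: the result is `nothing`
mc      = match(r"(\w+)@(\w+)", subject)   # two capture groups around the @

println(m.match, " at offset ", m.offset)
println(nomatch === nothing)
println(mc.captures)
```

Testing the result against nothing before touching its fields is the usual guard when a match may fail.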

Byte array literals

Another special form is the byte array literal, b"...", which enables string notation to express arrays of Uint8 values.

The rules for byte array literals are the following:

  • ASCII characters and ASCII escapes produce a single byte
  • \x and octal escape sequences produce the byte corresponding to the escape value
  • Unicode escape sequences produce a sequence of bytes encoding that code point in UTF-8

Consider the following two examples:

julia> A = b"HEX:\xefcc" # => 7-element Array{Uint8,1}:[0x48,0x45,0x58,0x3a,0xef,0x63,0x63]
julia> B = b"\u2200 x \u2203 y" # => 11-element Array{Uint8,1}:[0xe2,0x88,0x80,0x20,0x78,0x20,0xe2,0x88,0x83,0x20,0x79]
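
The two literals can be checked against the three rules. This is a sketch assuming a current Julia 1.x release, where the element type is spelled UInt8 and codeunits(s) exposes the bytes of an ordinary string:

```julia
A = b"HEX:\xefcc"          # \xef is one byte; the trailing "cc" is two ASCII bytes
B = b"\u2200 x \u2203 y"   # each \u escape expands to its 3-byte UTF-8 encoding

println(length(A))
println(length(B))
println(eltype(A))
println(B == codeunits("\u2200 x \u2203 y"))   # same bytes as the string itself
```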

Version literals

Version numbers can easily be expressed with non-standard string literals such as v"...".

Version number literals create VersionNumber objects which follow the specifications of semantic versioning (http://semver.org), and therefore are composed of major, minor and patch numeric values, followed by pre-release and build alpha-numeric annotations.

So a full specification typically would be v"0.3.1-rc1": the major version is 0, the minor version 3, the patch level 1, and the pre-release annotation rc1 marks release candidate 1. Only the major version needs to be provided and the others assume default values, so v"1" is equivalent to v"1.0.0".
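
A short sketch of the fields of a VersionNumber and of the ordering that semantic versioning implies. The names ver and short_form are illustrative only:

```julia
ver = v"0.3.1-rc1"

println(ver.major, ".", ver.minor, ".", ver.patch)   # numeric components
println(ver.prerelease)                              # pre-release annotations as a tuple

short_form = v"1"                 # omitted components default to zero
println(short_form == v"1.0.0")

println(ver < v"0.3.1")           # a pre-release sorts before its final release
```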

We met the use of version numbers previously when using the package manager to pin a package to a specific version: Pkg.pin("NumericExtensions",v"0.2.1").

An example

Let us look at some code to play the game Bulls and Cows. A computer program called moo, written in 1970 at MIT in PL/I, was among the first computer implementations of Bulls and Cows.

It has been proven that any number can be found within seven turns, and that the minimal average game length is 5.21 turns.

The computer generates a random four-digit number from the digits 1 to 9, without duplication. The player inputs a guess; the program should validate the guess, reject guesses that are malformed, and then print the 'score' in terms of the number of bulls and cows.

The score is computed as follows:

  • One bull is accumulated for each digit in the guess that equals the corresponding digit in the randomly chosen initial number
  • One cow is accumulated for each digit in the guess that also appears in the randomly chosen number, but in the wrong position
  • The player wins if the guess is the same as the randomly chosen number, and the program ends
  • Otherwise the program accepts a new guess, incrementing the number of 'tries'
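
The scoring rules above can be sketched in isolation. The helper names is_valid_guess and score are hypothetical (they do not come from the text), and the sketch assumes a current Julia 1.x release:

```julia
# Hypothetical helper: a guess is well formed if it is four distinct digit characters.
is_valid_guess(guess) = length(guess) == 4 && allunique(guess) && all(isdigit, guess)

# Hypothetical helper: return (bulls, cows) for a guess against the secret number.
function score(guess, secret)
    g, s = collect(guess), collect(secret)
    bulls = count(map(==, g, s))               # same digit in the same position
    cows  = length(intersect(g, s)) - bulls    # shared digits, minus those already bulls
    return bulls, cows
end

println(score("1234", "1243"))   # two digits in place, two misplaced
println(score("1234", "4321"))   # all digits present, none in place
println(is_valid_guess("1123"))  # repeated digit, so not a valid guess
```

Counting the intersection works here only because neither the guess nor the secret contains repeated digits.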

Coding it up in Julia:

function bacs()
  bulls = cows = turns = 0
  A = {}                          # untyped (Any) array to hold the digits
  srand(int(time()))              # seed the random number generator
  # Draw digit characters until four distinct ones have been collected
  while length(unique(A)) < 4
    push!(A, rand('1':'9'))
  end
  bacs_number = unique(A)
  println("Bulls and Cows")
  while (bulls != 4)
    print("Guess? ")
    # Treat end-of-input as a request to quit
    if eof(STDIN)
      s = "q"
    else
      s = chomp(readline(STDIN))
    end
    if (s == "q")
      print("My guess was "); [print(bacs_number[i]) for i=1:4]
      return
    end
    guess = collect(s)
    # Reject anything that is not four distinct digits
    if !(length(unique(guess)) == length(guess) == 4 && all(isdigit, guess))
      print("\nEnter four distinct digits or q to quit: ")
      continue
    end
    bulls = sum(map(==, guess, bacs_number))
    cows = length(intersect(guess, bacs_number)) - bulls
    println("$bulls bulls and $cows cows!")
    turns += 1
  end
  println("You guessed my number in $turns turns.")
end

The preceding code can be explained as follows:

  1. We define an array A as A = {} rather than A = []. This is because although arrays were described as homogeneous collections, Julia provides a type Any which can, as the name suggests, store any form of variable. This is similar to the Microsoft variant datatype.
    julia> A = {"There are ",10, " green bottles", " hanging on the wall.\n"}
    julia> [print(A[i]) for i = 1:length(A)]
    There are 10 green bottles hanging on the wall.
    
  2. Digits are generated as characters using the rand() function and pushed onto A with push!().
  3. The array A may consist of more than 4 entries, so a unique() function is applied, which reduces it to 4 by eliminating duplicates, and the result is stored in bacs_number.
  4. User input is via readline(STDIN) and this will be a string including the trailing return (\n), so a chomp() function is applied to remove it and the input is compared with q to allow an escape before the number is guessed.
  5. A collect() function is applied to return a 4-element array of type Char, and it is checked that there are 4 elements and that these are all digits.
  6. The number of bulls is determined by comparing each entry in guess and bacs_number. This is achieved by using a map() function to apply the == operator; if there are 4 bulls then we are done. Otherwise it's possible to construct a new array as the intersection of guess and bacs_number, which will contain all the elements that match. Subtracting the number of bulls then leaves the number of cows.
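
Steps 1, 4 and 5 can be tried on their own. This is a minimal sketch assuming a current Julia 1.x release, in which an untyped array literal is written Any[...] rather than {...}. The input line is hard-coded here instead of being read from standard input:

```julia
# Step 1: a heterogeneous array whose element type is Any.
A = Any["There are ", 10, " green bottles", " hanging on the wall.\n"]
sentence = join(A)            # print every element into one string
print(sentence)

# Steps 4 and 5: strip the trailing newline, then split into characters.
line  = chomp("1234\n")       # stands in for a line read from the keyboard
guess = collect(line)         # four-element array of Char

println(eltype(A))
println(guess)
```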