Regular expressions

To search for and match patterns in text and other data, regular expressions are an indispensable tool for the data scientist. Julia adheres to the Perl syntax of regular expressions. For a complete reference, refer to http://www.regular-expressions.info/reference.html. Regular expressions are represented in Julia as a double (or triple) quoted string preceded by r, such as r"..." (optionally, followed by one or more of the i, s, m, or x flags), and they are of type Regex. The regexp.jl script shows some examples.
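
As a brief illustration of the r"..." literal and its flags, the following sketch (not taken from regexp.jl; the strings are made up for the example) shows the type of a regex literal and the effect of the i and s flags:

```julia
# A regex literal is an ordinary value of type Regex.
pattern = r"julia"
println(typeof(pattern))                 #> Regex

# Without a flag, matching is case-sensitive.
println(occursin(pattern, "Julia"))      #> false

# The i flag makes the match case-insensitive.
println(occursin(r"julia"i, "Julia"))    #> true

# By default . does not match a newline; the s flag changes that.
println(occursin(r"a.b", "a\nb"))        #> false
println(occursin(r"a.b"s, "a\nb"))       #> true
```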

In the first example, we match an email address (#> shows the result):

email_pattern = r".+@.+" 
input = "john.doe@mit.edu" 
println(occursin(email_pattern, input)) #> true 

The subpattern .+ matches any non-empty sequence of characters. Thus, the whole pattern matches any string that contains an @ with at least one character before it and at least one character after it.
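
To make the behavior of this deliberately loose pattern concrete, here is a small sketch that tries it on a few made-up inputs (the candidate strings are illustrative assumptions, not from the book). Note that it accepts strings that are clearly not valid email addresses:

```julia
email_pattern = r".+@.+"

# Illustrative inputs: only the first is a plausible email address.
candidates = ["john.doe@mit.edu", "@mit.edu", "john.doe@", "a b@c d"]
results = [occursin(email_pattern, c) for c in candidates]

for (c, ok) in zip(candidates, results)
    println(c, " => ", ok)
end
#> john.doe@mit.edu => true
#> @mit.edu => false
#> john.doe@ => false
#> a b@c d => true
```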

In the second example, we will try to determine whether a credit card number is valid or not:

visa = r"^(?:4[0-9]{12}(?:[0-9]{3})?)$"  # the pattern 
input = "4457418557635128" 
occursin(visa, input)  #> true 
if occursin(visa, input) 
    println("credit card found") 
    m = match(visa, input) 
    println(m.match) #> 4457418557635128 
    println(m.offset) #> 1 
    println(m.offsets) #> Int64[] (empty: the pattern has no capturing groups)
end 

The occursin(regex, string) function returns true or false, depending on whether the given regex matches the string, so we can use it in an if expression. If you want the detailed information of the pattern matching, use match instead of occursin. This either returns nothing when there is no match, or an object of type RegexMatch when the pattern is found (nothing is, in fact, a value to indicate that nothing is returned or printed, and it has a type of Nothing).
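
Because match can return nothing, it is safest to test the result before accessing its fields. A minimal sketch, in which the helper name describe is a hypothetical one introduced only for this illustration:

```julia
const visa = r"^(?:4[0-9]{12}(?:[0-9]{3})?)$"

# Hypothetical helper: reports the outcome instead of failing on a non-match.
function describe(number)
    m = match(visa, number)
    if m === nothing            # no match: m has no fields to inspect
        return "no match"
    end
    return "matched $(m.match)"
end

println(describe("4457418557635128"))   #> matched 4457418557635128
println(describe("1234"))               #> no match
println(typeof(match(visa, "1234")))    #> Nothing
```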

The RegexMatch object has the following properties:

  • match contains the entire substring that matches (in this example, it contains the complete number)
  • offset states at what position the matching begins (here, it is 1)
  • offsets gives the same information as offset, but for each of the captured substrings
  • captures contains the captured substrings as an array (refer to the following example)
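
The four properties can be inspected side by side. The following sketch uses a made-up date pattern and input string (both are assumptions for illustration only):

```julia
# Two capturing groups: a four-digit year and a two-digit month.
m = match(r"(\d{4})-(\d{2})", "released 2018-08")

println(m.match)      #> 2018-08
println(m.offset)     #> 10       (the match starts at character 10)
println(m.offsets)    #> [10, 15] (start of each captured group)
println(m.captures)   # the two captured substrings, "2018" and "08"
println(m[1])         #> 2018     (indexing is shorthand for m.captures[1])
```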

Besides checking whether a string matches a particular pattern, regular expressions can also be used to capture parts of the string. We can do this by enclosing parts of the pattern in parentheses ( ). For instance, to capture the username and hostname in the email address pattern used earlier, we modify the pattern as follows:

email_pattern = r"(.+)@(.+)" 

Notice how the characters before and after @ are each enclosed in parentheses. This tells the regular expression engine that we want to capture these two groups of characters. To see how this works, consider the following example:

email_pattern = r"(.+)@(.+)" 
input = "john.doe@mit.edu" 
m = match(email_pattern, input) 
println(m.captures) #> Union{Nothing, SubString{String}}["john.doe", "mit.edu"]

Here is another example:

m = match(r"(ju|l)(i)?(a)", "Julia") 
println(m.match) #> lia 
println(m.captures) #> Union{Nothing, SubString{String}}["l", "i", "a"] 
println(m.offset) #> 3 
println(m.offsets) #> [3, 4, 5] 

The findfirst and replace functions also take regular expressions as arguments, for example, replace("Julia", r"u[\w]*l" => "red") returns "Jredia". If you want to work with all the matches, eachmatch comes in handy:
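
As a short sketch of searching and replacing with a regex in Julia 1.x, where findfirst takes the role of searching: the last line uses a substitution string s"..." with backreferences, which goes beyond what the text above describes and is included only as an illustration.

```julia
# Replace the first stretch that starts with u and ends with l.
println(replace("Julia", r"u[\w]*l" => "red"))    #> Jredia

# findfirst returns the index range of the first match, or nothing.
println(findfirst(r"u[\w]*l", "Julia"))           #> 2:3
println(findfirst(r"xyz", "Julia") === nothing)   #> true

# \1 and \2 in a substitution string refer to the captured groups.
swapped = replace("john.doe@mit.edu", r"(.+)@(.+)" => s"\2 hosts \1")
println(swapped)                                  #> mit.edu hosts john.doe
```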

str = "The sky is blue"
reg = r"[\w]{3,}" # matches words of 3 chars or more 
r = collect((m.match for m = eachmatch(reg, str)))
show(r) #> SubString{String}["The", "sky", "blue"]

iter = eachmatch(reg, str) 
for i in iter 
    println("\"$(i.match)\" ") 
end 

Here, collect gathers the matched substring (m.match) of each RegexMatch into an array. eachmatch returns an iterator, iter, over all the matches, which we can loop through with a simple for loop. The screen output is "The", "sky", and "blue", printed on consecutive lines.
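
Since each element produced by eachmatch is a full RegexMatch, the other properties are available inside the loop as well. A minimal sketch that records where each word starts (the variable name positions is introduced only for this example):

```julia
str = "The sky is blue"
reg = r"[\w]{3,}"   # words of 3 characters or more

# Pair every matched word with the position at which it starts.
positions = [(m.match, m.offset) for m in eachmatch(reg, str)]

for (word, start) in positions
    println(word, " starts at ", start)
end
#> The starts at 1
#> sky starts at 5
#> blue starts at 12
```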