Pattern matching

From MorphOS Library

Revision as of 11:25, 15 December 2009 by ASiegel

Introduction

String pattern matching means comparing a text string with a pattern. The pattern uses a specific notation to describe the matching criteria. A very simple and well-known form of pattern matching is the pair of MS-DOS wildcard characters "?" and "*". The most common use of pattern matching on computers is selecting files for shell commands. MorphOS uses pattern matching in its shell, and also in file requesters. Many applications use pattern matching to filter the information they present. For example, Snoopium and MediaLogger match the names of logged processes against user-specified patterns.

MorphOS uses its own pattern notation, inherited from AmigaOS. While not as powerful as Unix regular expressions, it is easier to read and understand. At the same time, it is more flexible than the MS-DOS wildcard characters. Here is an example:

delete "the #? - #?come#?.(mp3|aiff|wav)"

This command deletes all files that have an "mp3", "aiff" or "wav" extension, start with "the" as a separate word, and contain the string "come" (as a word or part of a word) anywhere after a hyphen surrounded by spaces. While this example may look complicated at first sight, it shows the power of MorphOS pattern matching. It will be clear and understandable after reading the following description of the MorphOS pattern notation. For now, note that the whole pattern is double-quoted only because it contains spaces.
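
For readers who already know regular expressions, the pattern above can be approximated by hand in Python. This is only a sketch: the regular expression is a manual translation, the file names are invented for illustration, and the comparison is case-sensitive, which is an assumption and may differ from how the MorphOS shell compares file names.

```python
import re

# Hand-written approximation of: the #? - #?come#?.(mp3|aiff|wav)
# "#?" becomes ".*", the literal dot is escaped, "|" keeps its meaning.
EXAMPLE = re.compile(r"the .* - .*come.*\.(mp3|aiff|wav)", re.DOTALL)

# Hypothetical file names, made up for this illustration.
names = [
    "the band - welcome home.mp3",
    "the band - come back.wav",
    "the band - goodbye.mp3",      # no "come" after the hyphen
    "theory - come along.mp3",     # "the" is not followed by a space
    "the band - come back.txt",    # extension is not in the list
]

# fullmatch() requires the whole name to fit the pattern.
selected = [name for name in names if EXAMPLE.fullmatch(name)]
print(selected)
```

Only the first two names are selected. The other three each fail one of the criteria described above.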

Notation

  • "?" (question mark) - matches exactly one character. For example, "????" matches all four-character strings, and "t??" matches all three-character strings starting with "t".
  • "#" (hash) - the repetition operator. It matches the expression to its right, repeated zero or more times. For example, "#a" matches any number of letters "a", including none, so it also matches an empty string. To match at least one "a", write the pattern as "a#a". The very common pattern "#?" matches any string, so delete #? is equivalent to the Unix command rm *.
  • "()" (parentheses) - used to group expressions and change the priority of other operators, as in math. For example, "a#?a" matches any string of at least two characters starting and ending with "a", but "a#(?a)" matches only strings of odd length in which every odd-numbered character is "a" and every even-numbered character is anything, such as "a", "axa" or "abaca". More examples can be found below.
  • "|" (vertical bar) - means alternative. For example, "(a|b)#?" matches any string starting with "a" or "b", and "#?(cat|dog)" matches any string ending with "cat" or with "dog". A typical use is matching a set of file extensions, as in the example in the introduction. An even simpler example: "#?.(txt|doc|rtf)" matches names ending with any of the three document extensions.
  • "~" (tilde) - means negation, and may be read as "everything except" the expression to its right. For example, "~a#?" matches all strings not starting with "a". Similarly, "~(foo)#?" matches all strings not starting with "foo". Note the use of parentheses: without them, "~foo#?" will match all strings not starting with "f", but having "o" as the second and the third character.
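
The operators listed above, except negation, can be sketched as a small translator from this notation to Python regular expressions. This is an illustration under stated assumptions, not a description of how MorphOS implements matching: the function names amiga_to_regex and matches are hypothetical, "~" is deliberately left unsupported because its translation is less direct, every other character is treated as a literal, and matching is case-sensitive.

```python
import re


def amiga_to_regex(pattern: str) -> str:
    """Translate a pattern built from ?, #, () and | into a Python regex."""
    pos = 0  # index of the next unread character in `pattern`

    def parse_alternatives() -> str:
        # One or more sequences separated by "|".
        nonlocal pos
        branches = [parse_sequence()]
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            branches.append(parse_sequence())
        return "|".join(branches)

    def parse_sequence() -> str:
        # Items are read until "|", ")" or the end of the pattern.
        parts = []
        while pos < len(pattern) and pattern[pos] not in "|)":
            parts.append(parse_item())
        return "".join(parts)

    def parse_item() -> str:
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        if ch == "#":
            # Repetition applies to the single item on the right.
            if pos >= len(pattern):
                raise ValueError("'#' needs an expression to its right")
            return "(?:" + parse_item() + ")*"
        if ch == "?":
            return "."
        if ch == "(":
            inner = parse_alternatives()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return "(?:" + inner + ")"
        if ch in "|)":
            raise ValueError("unexpected " + repr(ch))
        if ch == "~":
            # Assumption of this sketch: negation is not translated.
            raise NotImplementedError("negation is left out of this sketch")
        return re.escape(ch)  # any other character stands for itself

    regex = parse_alternatives()
    if pos != len(pattern):
        raise ValueError("unbalanced parentheses")
    return "(?:" + regex + ")"


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` fits `pattern`."""
    regex = amiga_to_regex(pattern)
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


print(matches("#?.(txt|doc|rtf)", "notes.txt"))  # True
print(matches("a#a", ""))                        # False, needs at least one "a"
print(matches("a#(?a)", "abaca"))                # True
print(matches("a#(?a)", "ax"))                   # False, even length
```

The recursive structure mirrors the notation: "#" wraps whatever single item follows it, which is why parentheses change what gets repeated.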

Examples

???#? matches all strings having at least three characters.

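#(0|1|2|3|4|5|6|7|8|9) matches only strings consisting entirely of digits. Because "#" also allows zero repetitions, it matches an empty string as well.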
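
The two examples above can be checked against hand-written Python regular expressions. This is a sketch for comparison only: the regular expressions are manual translations, and the constant names are hypothetical.

```python
import re

# ???#?  ->  three arbitrary characters, then anything
AT_LEAST_THREE = re.compile(r"...(?:.)*", re.DOTALL)

# #(0|1|2|3|4|5|6|7|8|9)  ->  zero or more digits
ONLY_DIGITS = re.compile(r"(?:0|1|2|3|4|5|6|7|8|9)*")

for text in ["ab", "abc", "abcdef"]:
    print(repr(text), AT_LEAST_THREE.fullmatch(text) is not None)

for text in ["2009", "", "12a"]:
    print(repr(text), ONLY_DIGITS.fullmatch(text) is not None)
```

The first loop reports False for "ab" and True for the two longer strings. The second reports True for "2009" and for the empty string, and False for "12a".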