Pattern matching

From MorphOS Library

Introduction

String pattern matching compares a text string with a pattern. The pattern uses a specific notation to express matching criteria. A very simple and well-known form of pattern matching is the pair of MS-DOS wildcard characters "?" and "*". The most common use of pattern matching in computing is selecting files for shell commands. MorphOS uses pattern matching in its shell, and also in file requesters. Many applications use it to filter the information displayed to users. For example, Snoopium and MediaLogger match the names of logged processes against user-specified patterns.

MorphOS uses its own pattern notation, inherited from AmigaOS. While not as powerful as Unix regular expressions, it is easier to read and understand, and it is more flexible than simple wildcard characters. Here is an example:

delete "the #? - #?come#?.(mp3|aiff|wav)"

This command deletes all files with the extension "mp3", "aiff" or "wav" whose names start with "the" as a separate word and contain the string "come" (as a word or part of a word) anywhere after a hyphen surrounded by spaces. While this example may look complicated at first sight, it shows the power of MorphOS pattern matching. It will become clear after reading the following description of the MorphOS pattern notation. For now, note that the whole pattern is double-quoted only because it contains spaces.
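As a rough cross-check, the pattern can be translated by hand into a Python regular expression. This is only an illustrative sketch: the function name `matches_example` and the sample file names are made up for the example, and case handling is an assumption (the regex below is case-sensitive), not a statement of how the MorphOS shell behaves.

```python
import re

# Hand translation of:  the #? - #?come#?.(mp3|aiff|wav)
# "#?" (any string) becomes ".*"; the literal dot before the extension is escaped.
EXAMPLE = re.compile(r"the .* - .*come.*\.(mp3|aiff|wav)", re.DOTALL)


def matches_example(name: str) -> bool:
    """Return True if the whole name fits the hand-translated pattern."""
    return EXAMPLE.fullmatch(name) is not None


# Hypothetical file names, used only to exercise the pattern.
for name in (
    "the best - welcome home.mp3",
    "the x - outcome.aiff",
    "them all - come.mp3",
    "the x - come.txt",
):
    print(name, matches_example(name))
```

The first two names match. The third fails because "the" is not followed by a space, and the fourth fails because of its extension.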

Notation

  • "?" (question mark) - matches exactly one character. For example, ???? matches all four-character strings and t?? matches all three-character strings starting with t.
  • "#" (hash) - the repetition operator. It matches the expression on its right, repeated zero or more times. For example, #a matches any number of letters a, including the empty string. To match at least one a, write the pattern as a#a. The very common pattern #? matches any string, so delete #? is equivalent to Unix rm *.
  • "()" (parentheses) - used to change the priority of other operators, as in math. For example, a#?a matches any string of at least two characters that starts and ends with a, but a#(?a) matches only strings of odd length in which every odd-positioned character is a and every even-positioned character is arbitrary. More examples can be found below.
  • "|" (vertical bar) - means alternative. For example, (a|b)#? matches any string starting with a or b, and #?(cat|dog) matches any string ending with cat or dog. A typical use is matching a set of file extensions, as in the example in the introduction. An even simpler example: #?.(txt|doc|rtf) matches names ending with any of three document extensions.
  • "~" (tilde) - means negation and may be read as "everything except" the expression on its right. For example, ~a#? matches all strings not starting with a. Similarly, ~(foo)#? matches all strings not starting with foo. Note the use of parentheses: without them, ~foo#? matches all strings that do not start with f but have o as the second and third character.
  • "%" (percent sign) - rarely used, matches the empty string.
  • "'" (single quotation mark) - the escape character. The special character that follows it is treated as an ordinary character. For example, #?'|#? matches all strings containing a vertical bar, and '##? matches all strings starting with a hash. Note that using special pattern matching characters in filenames is strongly discouraged and is only asking for trouble. Escaping is useful, however, for general pattern matching not related to the shell and files.
  • "[]" (square brackets) - denote a set or range of characters. A set is specified by listing all of its characters. A range is specified with a hyphen. For example, [AEIOUY] means "any one of A, E, I, O, U, Y" and [A-F] means "any one of A, B, C, D, E, F". A tilde at the start of the brackets negates the whole class, so [~EF] means "any character except E and F".
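The notation above can be explored outside MorphOS with a small translator to Python regular expressions. The following is a minimal sketch, not the MorphOS implementation: the names `to_regex` and `matches` are invented for this example, matching is case-sensitive by assumption, and the general "~" operator outside square brackets is deliberately left unsupported. The main point it illustrates is that "#" is a prefix operator, while the regex star is postfix.

```python
import re


def to_regex(pattern: str) -> str:
    """Translate a subset of the notation into Python regex syntax."""
    pos = 0

    def char_class() -> str:
        nonlocal pos
        end = pattern.find("]", pos)
        if end == -1:
            raise ValueError("missing closing bracket")
        body = pattern[pos:end]
        pos = end + 1
        negate = body.startswith("~")      # [~EF] -> [^EF]
        if negate:
            body = body[1:]
        if not body:
            raise ValueError("empty character class")
        # Keep "-" unescaped so ranges such as A-F survive; escape the rest.
        chars = "".join(c if c == "-" else re.escape(c) for c in body)
        return "[" + ("^" if negate else "") + chars + "]"

    def item() -> str:
        nonlocal pos
        if pos >= len(pattern):
            raise ValueError("pattern ends where an expression was expected")
        ch = pattern[pos]
        pos += 1
        if ch == "#":                      # prefix repetition -> postfix star
            return "(?:" + item() + ")*"
        if ch == "(":                      # grouping
            inner = alternatives()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return "(?:" + inner + ")"
        if ch == "?":                      # exactly one character
            return "."
        if ch == "%":                      # the empty string
            return "(?:)"
        if ch == "'":                      # escape: next character is literal
            if pos >= len(pattern):
                raise ValueError("dangling escape character")
            ch = pattern[pos]
            pos += 1
            return re.escape(ch)
        if ch == "[":
            return char_class()
        if ch == "~":
            raise NotImplementedError("general negation is outside this sketch")
        return re.escape(ch)

    def sequence() -> str:
        parts = []
        while pos < len(pattern) and pattern[pos] not in "|)":
            parts.append(item())
        return "".join(parts)

    def alternatives() -> str:
        nonlocal pos
        branches = [sequence()]
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            branches.append(sequence())
        return "|".join(branches)

    result = alternatives()
    if pos != len(pattern):
        raise ValueError("unbalanced closing parenthesis")
    return result


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text fits the translated pattern."""
    return re.fullmatch(to_regex(pattern), text, re.DOTALL) is not None


print(to_regex("#?.(txt|doc|rtf)"))
print(matches("a#(?a)", "axaya"))
print(matches("[~EF]", "E"))
```

A recursive-descent parser is used here because "#" applies to whatever single expression follows it (a character, a bracket class or a parenthesized group), so the translator has to read that expression before it can append the star.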

Examples

???#? matches all strings having at least three characters.

~(???#?) matches all strings having fewer than three characters.

#[0-9] matches strings consisting only of digits (including the empty string, because # allows zero repetitions).

[1-9]#[0-9] matches the same as above, but numbers with leading zero(s) are not allowed; the empty string and a lone 0 no longer match either.

#([0-9]|[A-F]|[a-f]) matches hexadecimal numbers written with lowercase or uppercase letters.

[0123][0-9](.|-)[01][0-9](.|-)[0-9][0-9][0-9][0-9] matches a European-style date (two-digit day, two-digit month, four-digit year), with dots or hyphens used as separators.
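The examples above can be tried out with hand-written Python regex counterparts. This is a sketch under stated assumptions: the regexes are manual translations, not output of any MorphOS function, and the names `EXAMPLES`, `check` and `fewer_than_three` exist only in this example. Negation is emulated by inverting the result of the positive match.

```python
import re

# MorphOS-style pattern -> hand-written regex counterpart (illustrative).
EXAMPLES = {
    "???#?": r".{3}.*",
    "#[0-9]": r"[0-9]*",
    "[1-9]#[0-9]": r"[1-9][0-9]*",
    "#([0-9]|[A-F]|[a-f])": r"(?:[0-9]|[A-F]|[a-f])*",
    "[0123][0-9](.|-)[01][0-9](.|-)[0-9][0-9][0-9][0-9]":
        r"[0123][0-9](?:\.|-)[01][0-9](?:\.|-)[0-9]{4}",
}

DATE = "[0123][0-9](.|-)[01][0-9](.|-)[0-9][0-9][0-9][0-9]"


def check(pattern: str, text: str) -> bool:
    """Match the whole text against the regex counterpart of a pattern."""
    return re.fullmatch(EXAMPLES[pattern], text, re.DOTALL) is not None


def fewer_than_three(text: str) -> bool:
    """Emulate ~(???#?) by negating the match of ???#?."""
    return not check("???#?", text)


print(check("#[0-9]", ""))            # zero repetitions are allowed
print(check("[1-9]#[0-9]", "007"))    # leading zeros are rejected
print(check(DATE, "24.12.2023"))
print(check(DATE, "39.19.0000"))      # the pattern checks shape, not validity
```

The last call shows a limitation worth keeping in mind: the date pattern only constrains the range of each digit, so it accepts strings that are not valid calendar dates.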