You can create user-defined text entities, allowing you to re-use commonly encountered words or phrases. You can also use Perl/POSIX regular expressions to configure user-defined text entities. This topic serves as an introduction and reference guide for constructing regular expressions.
Regular expressions can have a significant effect on system performance if used excessively. It is recommended that you avoid the use of regular expressions where possible. |
Atom |
A single character, or an expression enclosed in ‘()’ or ‘[]’. An atom is affected as a whole by repeating metacharacters such as ‘*’ and by bounds ‘{}’. |
Expression |
A simple keyword, an ordered list of words within a phrase or a regular expression. |
Normal |
Alphanumeric characters (A-Z, a-z, 0-9). Used in user-defined lexical expressions. |
Punctuation |
Any character other than those classified as normal or whitespace. This class includes the hyphen and underscore characters. Used in user-defined lexical expressions. |
Whitespace |
Whitespace character. Avoid building custom expressions that begin or end with whitespace. |
Word |
A string of alphanumeric characters delimited by whitespace or punctuation (or the beginning or end of the string). This leads to words in plain-text expressions or in user-defined expressions containing punctuation being split into multiple ‘Lex words’. |
The regular expression definition for a ‘word’ character is slightly different. See Character classes for details. |
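The ‘Lex word’ definition above can be illustrated with a minimal Python sketch. The function name `lex_words` is hypothetical, and the sketch only mirrors the definition given in this topic (runs of A-Z, a-z, 0-9 delimited by anything else). It is not the product's actual tokenizer.

```python
import re

def lex_words(text):
    """Split text into runs of 'normal' characters (A-Z, a-z, 0-9)."""
    # Whitespace and punctuation, including hyphen and underscore,
    # act as delimiters and are discarded.
    return re.findall(r"[A-Za-z0-9]+", text)

print(lex_words("user-defined e_mail address"))
# ['user', 'defined', 'e', 'mail', 'address']
```

Note how the hyphen and the underscore each split what looks like one word into two.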
The following characters can be used to configure regular expressions:
. |
Any single character. |
^ |
Anchor character, matches the start of a line. |
$ |
Anchor character, matches the end of a line. |
() |
Sub-expression. Separates the expression within the brackets allowing it to be subjected to the repeating metacharacters or referenced with the back reference function. See below for details of these functions. |
| |
Or operator. Matches either the expression preceding or succeeding the operator. For example, “a|b” matches either “a” or “b”. |
* |
Zero or more occurrences of the preceding atom. For example, “ab*c” matches “ac”, “abc”, “abbbbbbbbc” and so on; “(he)*” matches “he”, “hehehehehehe” and so on. |
+ |
One or more occurrences of the preceding atom. |
? |
Zero or one occurrence of the preceding atom. |
{x} |
Bounded repeat. Matches exactly ‘x’ occurrences of the preceding atom. |
{x,y} |
Matches between ‘x’ and ‘y’ (inclusive) occurrences of the preceding atom. |
{x,} |
Matches ‘x’ or more (inclusive) occurrences of the preceding atom. |
x? |
Non-greedy repeat of repeating metacharacter ‘x’. The repeating metacharacters above attempt to match as much as possible (they are greedy). Following any of them with a “?” makes the repeat non-greedy, that is, the match is as short as possible. For example, in the string “It went on and on and on.”, “went.{2,}on” matches “went on and on and on” whereas “went.{2,}?on” matches only “went on and on”. |
\n |
Back reference. Matches the string that was matched by sub-expression ‘n’, where ‘n’ is a number from 1 to 9. For example, “(.+)-\1” matches “abc-abc” and “1234-1234” but not “abc-1234”. |
\x |
Where ‘x’ is a metacharacter, this syntax indicates that ‘x’ is to be treated literally and not as a metacharacter. For example, ‘\$’ matches ‘$’ and ‘\\’ matches ‘\’. Where ‘x’ is a predefined escape sequence character (or sequence of characters), this syntax matches the character or character class defined by that escape sequence. For more information about escape sequences, see the table below. |
[...] |
Character set. Matches any one character from the list. For example, “[abc]” matches either “a” or “b” or “c”. The whole character set may be subjected to the repeating metacharacters. For example, ‘[abc]*’ matches ‘aabcac’ but not ‘abcdcba’. |
[^...] |
Negated character set. Matches any one character which is not in the character set. For example, “[^bc]” matches “a” and “d” but not “b” or “c”. |
[x-y] |
Character range. Matches one character in the range ‘x’ to ‘y’. For example, “[a-c]” matches “a”, “b” or “c”. The range endpoints must be in the correct order. That is, the first endpoint must precede the second endpoint in the Unicode codepoint sequence. To include a literal ‘-’ character in a character set enter it as the first or last character. |
[:class:] |
Character class. A predefined set of characters. Character classes may only be used within a character set. For example, “[[:digit:]]” matches any numeric character. See below for a table of available character classes. |
[.coll.] |
Collating element. A single character or sequence of characters that collates as a single element. Collating elements may only be used within a character set. For example, “[[.ae.]]” (this assumes that ‘ae’ is a collating element in the current system locale). Collating elements may also be used as a form of escape character, as most special characters lose their special significance when used within a character set. For example, to specify a ‘-’ as a range endpoint, declare it as a collating element, as in “[!-[.-.]]”. Additionally, some characters can be declared in a collating element by referring to the character's symbolic name. See below for a table of available symbolic names. |
[=x=] |
Equivalence class. A character set of all characters with the same primary sort key as character ‘x’. An equivalence class may only be used within a character set. For example, “[[=e=]]” would match any of the following characters: “eÈÉÊËèéêëĒēĔĕĖėĘęĚě”. Note: This function is locale-specific, so use it with caution. Different locales or platforms may result in different behaviour. |
(?#...) |
Comment. Text between the ‘#’ and the closing ‘)’ is ignored. This can be used to explain how the expression works for future reference. For example, “(?#3 letters)[[:alpha:]]{3}(?#followed by 5 digits)[[:digit:]]{5}”. |
(?=pattern) |
Positive lookahead. Returns a match if ‘pattern’ matches. The current point of reference is not moved so subsequent expressions match from the same point. This can be used to logically ‘and’ two or more regular expressions. For example, “(?=.*[[:lower:]])(?=.*[[:upper:]])” confirms that there are upper and lower case characters in the string. |
(?!pattern) |
Negative lookahead. Returns a match if ‘pattern’ does not match. The current point of reference is not moved so subsequent expressions match from the same point. |
(?<=pattern) |
Positive lookbehind. Returns a match if ‘pattern’ matches immediately before the current point of reference. The current point of reference is not moved so subsequent expressions match from the same point. |
(?<!pattern) |
Negative lookbehind. Returns a match if ‘pattern’ does not match immediately before the current point of reference. The current point of reference is not moved so subsequent expressions match from the same point. |
(?>pattern) |
Independent sub-expression. A sub-expression that does not allow backtracking into ‘pattern’ in order to try and satisfy the larger expression. This function can be used to achieve significant performance improvements. For example, “([ab]+)[bc]+” matches “abb”. But “(? |
(?(condition)true|false) |
Conditional expression. If ‘condition’ is true, attempts to match the ‘true’ pattern. If ‘condition’ is false, attempts to match the ‘false’ pattern. The condition may be either a lookahead or the index of a marked sub-expression. |
(?(condition)true) |
Conditional expression. If ‘condition’ is true, attempts to match the ‘true’ pattern. If ‘condition’ is false the expression returns no match. The condition may be either a lookahead or the index of a marked sub-expression. |
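Several of the examples given in the table above can be tried out with a short Python sketch. Python's `re` module is used here only as a stand-in: it shares the repeat, back reference and lookahead syntax shown, but it is not the engine used by the product and does not support POSIX character classes such as “[[:lower:]]”, so ASCII ranges are substituted (an assumption of this sketch). The helper names are hypothetical.

```python
import re

text = "It went on and on and on."

# Greedy repeat: '.{2,}' takes as much as it can, so the match
# ends at the last "on" in the string.
greedy = re.search(r"went.{2,}on", text).group()

# Non-greedy repeat: '.{2,}?' takes as little as it can while
# still allowing the rest of the expression to match.
lazy = re.search(r"went.{2,}?on", text).group()

def is_doubled(s):
    """Back reference: True if s has the form X-X."""
    return re.fullmatch(r"(.+)-\1", s) is not None

def has_mixed_case(s):
    """Two lookaheads act as a logical AND of two conditions."""
    # [a-z] and [A-Z] stand in for [[:lower:]] and [[:upper:]] (ASCII only).
    return re.match(r"(?=.*[a-z])(?=.*[A-Z])", s) is not None

print(greedy)  # went on and on and on
print(lazy)    # went on and on
print(is_doubled("abc-abc"), is_doubled("abc-1234"))       # True False
print(has_mixed_case("abcDEF"), has_mixed_case("abcdef"))  # True False
```

Because neither lookahead moves the current point of reference, both are tested from the start of the string.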
[:alnum:] |
All alphanumeric characters. Note: This is not restricted to the Latin alphabetic characters. |
[:alpha:] |
All alphabetic characters. Note: This is not restricted to the Latin alphabetic characters. |
[:blank:] |
All whitespace characters apart from line separator characters. |
[:cntrl:] |
All control characters. |
[:d:] [:digit:] |
All decimal digit characters. |
[:graph:] |
All graphical characters. |
[:l:] [:lower:] |
All lower case characters. This character class is not affected by configuring the expression to match case insensitively. |
[:print:] |
All printable characters. |
[:punct:] |
All punctuation characters. |
[:s:] [:space:] |
All whitespace characters. |
[:unicode:] |
All extended characters with a code point of greater than 255. |
[:u:] [:upper:] |
All upper case characters. This character class is not affected by configuring the expression to match case insensitively. |
[:w:] [:word:] |
All alphanumeric characters and the underscore character. |
[:xdigit:] |
All hexadecimal digit characters. |
The above may only be used within a character set. |
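To get a feel for what a few of these classes cover, the sketch below maps them to rough equivalents in Python's `re` module. Python does not support the “[[:name:]]” syntax, so the mapping table and the helper `matches_class` are assumptions made for illustration. They approximate the classes described above and may differ from the product's engine at the edges.

```python
import re

# Rough Python stand-ins for some of the classes above (approximations).
POSIX_TO_PYTHON = {
    "digit": r"\d",             # decimal digits
    "space": r"\s",             # whitespace
    "word": r"\w",              # alphanumerics plus underscore
    "alpha": r"[^\W\d_]",       # word characters minus digits and underscore
    "xdigit": r"[0-9A-Fa-f]",   # hexadecimal digits
}

def matches_class(name, ch):
    """Return True if the single character ch belongs to the named class."""
    return re.fullmatch(POSIX_TO_PYTHON[name], ch) is not None

print(matches_class("alpha", "é"))   # True: not restricted to Latin A-Z
print(matches_class("alpha", "_"))   # False
print(matches_class("word", "_"))    # True: underscore is a 'word' character
print(matches_class("xdigit", "g"))  # False
```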
\a |
‘bell’ character. |
\e |
‘escape’ character. |
\f |
‘form feed’ character. |
\n |
‘newline’ character. |
\r |
‘carriage return’ character. |
\t |
‘tab’ character. |
\v |
‘vertical tab’ character. |
\b |
‘backspace’ character, but only inside a character set declaration. |
\cd |
An ASCII escape sequence – the character whose code point is d % 32. |
\xhh |
A hexadecimal escape sequence – the character whose code point is 0xhh. |
\x{hhhh} |
A hexadecimal escape sequence – the character whose code point is 0xhhhh. |
\0ddd |
(\zero) An octal escape sequence – the character whose code point is 0ddd. |
\N{name} |
Matches the single character which has the symbolic name ‘name’ (see below for a table of available symbolic names). |
\d |
Matches any digit character. |
\l |
Matches any lower case character. This escape sequence is not affected by configuring the expression to match case insensitively. |
\s |
Matches any whitespace character. |
\u |
Matches any upper case character. This escape sequence is not affected by configuring the expression to match case insensitively. |
\w |
Matches any alphanumeric character or underscore character. |
\D |
Matches any character that is not a digit. |
\L |
Matches any character that is not lower case. There is a distinction between this and matching any upper case character as some characters do not have case and would therefore match this escape sequence. |
\S |
Matches any character that is not whitespace. |
\U |
Matches any character that is not upper case. There is a distinction between this and matching any lower case character as some characters do not have case and would therefore match this escape sequence. |
\W |
Matches any character that is neither alphanumeric nor an underscore character. |
\px |
Equivalent to the single-character character class “[[:x:]]”. For example, “\pd” matches any digit character. |
\p{name} |
Equivalent to the character class “[[:name:]]”. For example, “\p{punct}” matches any punctuation character. |
\Px |
Equivalent to the negated single-character character class “[^[:x:]]”. For example, “\Pd” matches any character that is not a digit. |
\P{name} |
Equivalent to the negated character class “[^[:name:]]”. For example, “\P{punct}” matches any character that is not punctuation. |
\< |
Matches the null string at the beginning boundary of a word. Word in this context is in regular expression terms, not Lexical terms. See character classes for details. |
\> |
Matches the null string at the end boundary of a word. Word in this context is in regular expression terms, not Lexical terms. See character classes for details. |
\b |
Matches the null string at either the beginning or end boundary of a word. Word in this context is in regular expression terms, not Lexical terms. See character classes for details. |
\B |
Matches when not at a word boundary. Word in this context is in regular expression terms, not Lexical terms. See character classes for details. |
\` |
Matches the start of the text being searched. This is slightly different to the ‘^’ anchor which matches the beginning of a line. |
\' |
Matches the end of the text being searched. This is slightly different to the ‘$’ anchor which matches the end of a line. |
\A |
Matches the start of the text being searched. This is identical in function to ‘\`’. |
\z |
Matches the end of the text being searched. This is identical in function to ‘\’’. |
\C |
Matches any single code point. This is identical in function to ‘.’. |
The escape sequences listed above are case-sensitive. All other escape sequences, other than escaped metacharacters, are undefined and may result in unexpected behavior; they should not be used. |
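The difference between line anchors and text anchors, and between `\b` and `\B`, can be seen in a small Python sketch. Python's `re` module is again only a stand-in for the product's engine: it needs the `re.MULTILINE` flag before ‘^’ matches at the start of every line, and it does not support `\<`, `\>`, `` \` `` or `\'`.

```python
import re

text = "first line\nsecond line"

# With MULTILINE, ^ matches at the start of every line...
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# ...whereas \A matches only at the start of the text being searched.
text_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)

sample = "cat concat cats"

# \b on both sides: "cat" must stand alone as a whole word.
whole_word = re.findall(r"\bcat\b", sample)

# \B before "cat": the match must start inside a word (here, in "concat").
inside_word = re.findall(r"\Bcat", sample)

print(line_starts)       # ['first', 'second']
print(text_start)        # ['first']
print(len(whole_word))   # 1
print(len(inside_word))  # 1
```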
Name | Character |
---|---|
NUL |
\x00 |
SOH |
\x01 |
STX |
\x02 |
ETX |
\x03 |
EOT |
\x04 |
ENQ |
\x05 |
ACK |
\x06 |
alert |
\x07 |
backspace |
\x08 |
tab |
\t |
newline |
\n |
vertical-tab |
\v |
form-feed |
\f |
carriage-return |
\r |
SO |
\x0E |
SI |
\x0F |
DLE |
\x10 |
DC1 |
\x11 |
DC2 |
\x12 |
DC3 |
\x13 |
DC4 |
\x14 |
NAK |
\x15 |
SYN |
\x16 |
ETB |
\x17 |
CAN |
\x18 |
EM |
\x19 |
SUB |
\x1A |
ESC |
\x1B |
IS4 |
\x1C |
IS3 |
\x1D |
IS2 |
\x1E |
IS1 |
\x1F |
space |
\x20 |
exclamation-mark |
! |
quotation-mark |
" |
number-sign |
# |
dollar-sign |
$ |
percent-sign |
% |
ampersand |
& |
apostrophe |
' |
left-parenthesis |
( |
right-parenthesis |
) |
asterisk |
* |
plus-sign |
+ |
comma |
, |
hyphen |
- |
period |
. |
slash |
/ |
zero |
0 |
one |
1 |
two |
2 |
three |
3 |
four |
4 |
five |
5 |
six |
6 |
seven |
7 |
eight |
8 |
nine |
9 |
colon |
: |
semicolon |
; |
less-than-sign |
< |
equals-sign |
= |
greater-than-sign |
> |
question-mark |
? |
commercial-at |
@ |
left-square-bracket |
[ |
backslash |
\ |
right-square-bracket |
] |
circumflex |
^ |
underscore |
_ |
grave-accent |
` |
left-curly-bracket |
{ |
vertical-line |
| |
right-curly-bracket |
} |
tilde |
~ |
DEL |
\x7F |
Symbolic names may only be used with a collating element. |
© 1995–2018 Clearswift Ltd.