User Defined Entities
You can create user-defined text entities, allowing you to re-use commonly encountered words or phrases. You can also apply PERL/POSIX expressions to configure user-defined text entities. This topic can be used as an introduction and reference guide for constructing regular expressions.
Regular expressions can have a significant effect on system performance if used excessively. It is recommended that you avoid the use of regular expressions where possible.
Terminology used in this topic
Atom
A single character, or an expression surrounded by ‘()’ or ‘[]’. An atom can be wholly affected by repeating metacharacters such as ‘*’ and by bounds ‘{}’.
Expression
A simple keyword, an ordered list of words within a phrase or a regular expression.
Normal
Alphanumeric characters (A-Z, a-z, 0-9) (used in user-defined lexical expressions).
Punctuation
Any characters other than those classified as normal or whitespace. This class includes the hyphen and underscore characters (used in user-defined lexical expressions).
Whitespace
A whitespace character. Avoid building custom expressions that begin or end with whitespace.
Word
A string of alphanumeric characters delimited by whitespace or punctuation (or the beginning or end of the string). This leads to words in plain-text expressions or in user-defined expressions containing punctuation being split into multiple ‘Lex words’.
The regular expression definition for a ‘word’ character is slightly different. See Character classes for details.
The following characters can be used to configure regular expressions:
PERL/POSIX Regular Expression Syntax
.
Any single character.
^
Anchor character, matches the start of a line.
$
Anchor character, matches the end of a line.
()
Sub-expression. Separates the expression within the brackets allowing it to be subjected to the repeating metacharacters or referenced with the back reference function.
See below for details of these functions.
|
Or operator. Matches either the expression preceding or succeeding the operator.
For example, “a|b” matches either “a” or “b”.
*
Zero or more occurrences of the preceding atom.
For example, “ab*c” matches “ac”, “abc”, “abbbbbbbbc” etc... “(he)*” matches “he”, “hehehehehehe” etc...
+
One or more occurrences of the preceding atom.
?
Zero or one occurrences of the preceding atom.
{x}
Bounded repeat. Matches exactly ‘x’ occurrences of the preceding atom.
{x,y}
Matches between ‘x’ and ‘y’ (inclusive) occurrences of the preceding atom.
{x,}
Matches ‘x’ or more (inclusive) occurrences of the preceding atom.
x?
Non greedy repeat of repeating metacharacter ‘x’.
The repeating functions above attempt to match as much as possible (they are greedy). Following any of the repeating metacharacters with a “?” causes the repeat to be non greedy. That is, the match is as short as possible.
For example, in the string “It went on and on and on.”, “went.{2,}on” matches “went on and on and on” whereas “went.{2,}?on” only matches “went on and on”.
\n
Back reference. Matches the string that was matched by sub-expression ‘n’, where ‘n’ is a number from 1 to 9.
For example, “(.+)-\1” matches “abc-abc” and “1234-1234” but not “abc-1234”.
\x
Where ‘x’ is a metacharacter this syntax indicates that ‘x’ is to be treated literally and not as a metacharacter. For example, ‘\$’ matches ‘$’. ‘\\’ matches ‘\’.
Where ‘x’ is a predefined escape sequence character (or sequence of characters) match the character or character class defined by that escape sequence.
For more information about escape sequences, see the table below.
[]
Character set. Matches any one character from the list. For example, “[abc]” matches either “a” or “b” or “c”.
The whole character set may be subjected to the repeating metacharacters. For example, ‘[abc]*’ matches ‘aabcac’ but not ‘abcdcba’.
[^]
Negated character set. Matches any one character which is not in the character set.
For example, “[^bc]” matches “a” and “d” but not “b” or “c”.
[x-y]
Character range. Matches one character in the range ‘x’ to ‘y’. For example, “[a-c]” matches “a”, “b” or “c”.
The range endpoints must be in the correct order. That is, the first endpoint must precede the second endpoint in the Unicode codepoint sequence. To include a literal ‘-’ character in a character set enter it as the first or last character.
[:x:]
Character class. A predefined set of characters. Character classes may only be used within a character set. For example, “[[:digit:]]” matches any numeric character.
See below for a table of available character classes.
(?#comment)
Comment. Text between the ‘#’ and the closing ‘)’ is ignored. This can be used to explain how the expression works for future reference. For example, “(?#3 letters)[[:alpha:]]{3}(?#followed by 5 digits)[[:digit:]]{5}”.
(?=pattern)
Positive lookahead. Returns a match if ‘pattern’ matches. The current point of reference is not moved so subsequent expressions match from the same point. This can be used to logically ‘and’ two or more regular expressions. For example, “(?=.*[[:lower:]])(?=.*[[:upper:]])” confirms that there are upper and lower case characters in the string.
(?!pattern)
Negative lookahead. Returns a match if ‘pattern’ does not match. The current point of reference is not moved so subsequent expressions match from the same point.
(?<=pattern)
Positive lookbehind. Returns a match if ‘pattern’ matches immediately before the current point of reference. The current point of reference is not moved so subsequent expressions match from the same point.
(?<!pattern)
Negative lookbehind. Returns a match if ‘pattern’ does not match immediately before the current point of reference. The current point of reference is not moved so subsequent expressions match from the same point.
(?>pattern)
Independent sub-expression. A sub-expression that does not allow backtracking into ‘pattern’ in order to satisfy the larger expression. This function can be used to achieve significant performance improvements. For example, “([ab]+)[bc]+” matches “abb”, but “(?>[ab]+)[bc]+” does not match “abb”. It does, however, match “abc”.
(?(condition)true|false)
Conditional expression. If ‘condition’ is true, attempts to match the ‘true’ pattern. If ‘condition’ is false, attempts to match the ‘false’ pattern. The condition may be either a lookahead or the index of a marked sub-expression.
(?(condition)true)
Conditional expression. If ‘condition’ is true, attempts to match the ‘true’ pattern. If ‘condition’ is false the expression returns no match. The condition may be either a lookahead or the index of a marked sub-expression.
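The greedy and non-greedy repeats, back references and lookaheads described above can be tried out with a short script. This is a minimal sketch that uses Python's built-in `re` module purely as a convenient test bench; it is a different engine from the one the product uses and is an assumption of this example, so only constructs that behave the same way in both are shown. Because Python does not accept POSIX character classes such as “[[:lower:]]”, the lookahead example substitutes the ASCII ranges “[a-z]” and “[A-Z]”.

```python
import re

text = "It went on and on and on."

# Greedy repeat: .{2,} consumes as much as it can before the final "on".
greedy = re.search(r"went.{2,}on", text).group()

# Non-greedy repeat: .{2,}? stops at the earliest "on" that still leaves
# at least two characters after "went".
lazy = re.search(r"went.{2,}?on", text).group()

# Back reference: \1 must repeat exactly what sub-expression 1 matched.
backref = re.compile(r"(.+)-\1")
same_halves = backref.fullmatch("abc-abc") is not None
different_halves = backref.fullmatch("abc-1234") is not None

# Two lookaheads act as a logical AND: both must succeed at the same point.
# [a-z] and [A-Z] stand in for [[:lower:]] and [[:upper:]] (ASCII only).
mixed_case = re.compile(r"(?=.*[a-z])(?=.*[A-Z])")
has_both = mixed_case.search("Hello") is not None
lower_only = mixed_case.search("hello") is not None

print(greedy)            # went on and on and on
print(lazy)              # went on and on
print(same_halves)       # True
print(different_halves)  # False
print(has_both)          # True
print(lower_only)        # False
```

Testing an expression against a handful of sample strings in this way, before configuring it as a user-defined entity, helps catch a greedy repeat that matches more text than intended.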
Character Classes
[:alnum:]
All alphanumeric characters.
Note: This is not restricted to the Latin alphabetic characters.
[:alpha:]
All alphabetic characters.
Note: This is not restricted to the Latin alphabetic characters.
[:blank:]
All whitespace characters apart from line separator characters.
[:cntrl:]
All control characters.
[:d:]
[:digit:]
All decimal digit characters.
[:graph:]
All graphical characters.
[:l:]
[:lower:]
All lower case characters. This character class is not affected by configuring the expression to match case insensitively.
[:print:]
All printable characters.
[:punct:]
All punctuation characters.
[:s:]
[:space:]
All whitespace characters.
[:unicode:]
All extended characters with a code point of greater than 255.
[:u:]
[:upper:]
All upper case characters. This character class is not affected by configuring the expression to match case insensitively.
[:w:]
[:word:]
All alphanumeric characters and the underscore character.
[:xdigit:]
All hexadecimal digit characters.
The above may only be used within a character set.
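To illustrate that a character class is only meaningful inside a character set, the sketch below rewrites a few of the classes above into escapes that Python's `re` module understands and then tests the result. The helper `expand_posix_classes` and its mapping table are hypothetical and exist only for this example; Python's `re` is not the engine used by the product and does not accept “[:name:]” directly. Only the digit, space and word classes are mapped, because their Python counterparts have the same meaning; other class names are rejected rather than approximated.

```python
import re

# Hypothetical mapping from class names to equivalent Python escapes.
POSIX_TO_PYTHON = {
    "d": r"\d", "digit": r"\d",
    "s": r"\s", "space": r"\s",
    "w": r"\w", "word": r"\w",
}

def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] token with its Python escape."""
    def swap(match: re.Match) -> str:
        name = match.group(1)
        if name not in POSIX_TO_PYTHON:
            raise ValueError(f"no Python equivalent configured for [:{name}:]")
        return POSIX_TO_PYTHON[name]
    return re.sub(r"\[:(\w+):\]", swap, pattern)

# "[[:digit:]]{5}" becomes "[\d]{5}": the outer brackets are the character
# set, the inner [:digit:] is the class placed inside it.
five_digits = expand_posix_classes("[[:digit:]]{5}")
words_and_spaces = expand_posix_classes("[[:word:][:space:]]+")
no_digits = expand_posix_classes("[^[:digit:]]+")

print(five_digits)                                            # [\d]{5}
print(re.fullmatch(five_digits, "12345") is not None)         # True
print(re.fullmatch(five_digits, "1234") is not None)          # False
print(re.fullmatch(words_and_spaces, "two words") is not None)  # True
print(re.fullmatch(no_digits, "a1") is not None)              # False
```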
Escape Sequences
\a
‘bell’ character.
\e
‘escape’ character.
\f
‘form feed’ character.
\n
‘newline’ character.
\r
‘carriage return’ character.
\t
‘tab’ character.
\v
‘vertical tab’ character.
\b
‘backspace’ character, but only inside a character set declaration.
\cd
An ASCII escape sequence – the character whose code point is d % 32.
\xhh
A hexadecimal escape sequence – the character whose code point is 0xhh.
\x{hhhh}
A hexadecimal escape sequence – the character whose code point is 0xhhhh.
\0ddd
An octal escape sequence (a backslash followed by the digit zero) – the character whose code point is 0ddd.
\N{name}
Matches the single character which has the symbolic name ‘name’. (See below for a table of available symbolic names).
\d
Matches any digit character.
\l
Matches any lower case character. This escape sequence is not affected by configuring the expression to match case insensitively.
\s
Matches any whitespace character.
\u
Matches any upper case character. This escape sequence is not affected by configuring the expression to match case insensitively.
\w
Matches any alphanumeric character or underscore character.
\D
Matches any character that is not a digit.
\L
Matches any character that is not lower case. There is a distinction between this and matching any upper case character as some characters do not have case and would therefore match this escape sequence.
\S
Matches any character that is not whitespace.
\U
Matches any character that is not upper case. There is a distinction between this and matching any lower case character as some characters do not have case and would therefore match this escape sequence.
\W
Matches any character that is neither alphanumeric nor an underscore character.
\px
Equivalent to the single-character character class “[[:x:]]”. For example, “\pd” matches any digit character.
\p{name}
Equivalent to the character class “[[:name:]]”. For example, “\p{punct}” matches any punctuation character.
\Px
Equivalent to the negated single-character character class “[^[:x:]]”. For example, “\Pd” matches any character that is not a digit.
\P{name}
Equivalent to the negated character class “[^[:name:]]”. For example, “\P{punct}” matches any character that is not punctuation.
\<
Matches the null string at the beginning boundary of a word. Word in this context is in regular expression terms, not Lexical terms. See character classes for details.
\>
Matches the null string at the end boundary of a word. Word in this context is in regular expression terms, not Lexical terms.
See character classes for details.
\b
Matches the null string at either the beginning or end boundary of a word. Word in this context is in regular expression terms, not Lexical terms. See character classes for details.
\B
Matches when not at a word boundary. Word in this context is in regular expression terms, not Lexical terms. See character classes for details.
\`
Matches the start of the text being searched. This is slightly different to the ‘^’ anchor which matches the beginning of a line.
\'
Matches the end of the text being searched. This is slightly different to the ‘$’ anchor which matches the end of a line.
\A
Matches the start of the text being searched. This is identical in function to ‘\`’.
\z
Matches the end of the text being searched. This is identical in function to ‘\'’.
\C
Matches any single code point. This is identical in function to ‘.’.
The escape sequences listed above are case-sensitive.
All other escape sequences, other than escaped metacharacters, are undefined and may result in unexpected behavior; they should not be used.
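Several of the escape sequences above can be explored with a short script. As before, this is a sketch that assumes Python's `re` module as a stand-in test bench, not the engine used by the product. Two differences are worth noting: Python only treats ‘^’ as a start-of-line anchor when `re.MULTILINE` is set, and it does not support the ‘\cd’ form, so the “d % 32” rule is shown as plain arithmetic instead.

```python
import re

sample = "on and upon"

# \b matches at a word boundary, \B only where there is no boundary.
whole_word_at = re.search(r"\bon\b", sample).start()   # the standalone "on"
word_ending_at = re.search(r"\Bon\b", sample).start()  # the "on" ending "upon"

# \D matches anything that is not a digit, so removing it keeps the digits.
digits_only = re.sub(r"\D", "", "Tel: 555-0100")

# ^ anchors at the start of each line, \A only at the start of the text.
lines = "first line\nsecond line"
line_starts = re.findall(r"^\w+", lines, re.MULTILINE)
text_start = re.findall(r"\A\w+", lines, re.MULTILINE)

# A backslash makes a metacharacter literal: \$ matches a dollar sign.
dollar_at = re.search(r"\$", "cost: $5").start()

# \xhh names a character by its hexadecimal code point (0x41 is "A").
hex_matches_a = re.fullmatch(r"\x41", "A") is not None

# The \cd rule: the code point is d % 32, so \cG is code point 7 (bell).
control_g = ord("G") % 32

print(whole_word_at, word_ending_at)  # 0 9
print(digits_only)                    # 5550100
print(line_starts)                    # ['first', 'second']
print(text_start)                     # ['first']
print(dollar_at)                      # 6
print(hex_matches_a)                  # True
print(control_g)                      # 7
```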
Symbolic Names
Name
Character
NUL
\x00
SOH
\x01
STX
\x02
ETX
\x03
EOT
\x04
ENQ
\x05
ACK
\x06
alert
\x07
backspace
\x08
tab
\t
newline
\n
vertical-tab
\v
form-feed
\f
carriage-return
\r
SO
\x0E
SI
\x0F
DLE
\x10
DC1
\x11
DC2
\x12
DC3
\x13
DC4
\x14
NAK
\x15
SYN
\x16
ETB
\x17
CAN
\x18
EM
\x19
SUB
\x1A
ESC
\x1B
IS4
\x1C
IS3
\x1D
IS2
\x1E
IS1
\x1F
space
\x20
exclamation-mark
!
quotation-mark
"
number-sign
#
dollar-sign
$
percent-sign
%
ampersand
&
apostrophe
'
left-parenthesis
(
right-parenthesis
)
asterisk
*
plus-sign
+
comma
,
hyphen
-
period
.
slash
/
zero
0
one
1
two
2
three
3
four
4
five
5
six
6
seven
7
eight
8
nine
9
colon
:
semicolon
;
less-than-sign
<
equals-sign
=
greater-than-sign
>
question-mark
?
commercial-at
@
left-square-bracket
[
backslash
\
right-square-bracket
]
circumflex
^
underscore
_
grave-accent
`
left-curly-bracket
{
vertical-line
|
right-curly-bracket
}
tilde
~
DEL
\x7F
Symbolic names may only be used with a collating element.