Java Regex - Java Regular Expressions

Jakob Jenkov
Last update: 2019-03-05

Java regex is the official Java regular expression API. The term Java regex is an abbreviation of Java regular expression. The Java regex API is located in the java.util.regex package which has been part of standard Java (JSE) since Java 1.4. This Java regex tutorial will explain how to use this API to match regular expressions against text.

Although Java regex has been part of standard Java since Java 1.4, this Java regex tutorial covers the Java regex API released with Java 8.

Regular Expressions

A regular expression is a textual pattern used to search in text. You do so by "matching" the regular expression against the text. The result of matching a regular expression against a text is either:

  • A true / false specifying if the regular expression matched the text.
  • A set of matches - one match for every occurrence of the regular expression found in the text.

For instance, you could use a regular expression to search a Java String for email addresses, URLs, telephone numbers, dates etc. This would be done by matching different regular expressions against the String. The result of matching each regular expression against the String would be a set of matches - one set of matches for each regular expression (each regular expression may match more than one time).

I will show you some examples of how to match regular expressions against text with the Java regex API further down this page. But first I will introduce the core classes of the Java regex API in the following section.

Java Regex Core Classes

The Java regex API consists of two core classes. These are:

  • Pattern
  • Matcher

The Pattern class is used to create patterns (regular expressions). A pattern is a precompiled regular expression in object form (as a Pattern instance), capable of matching itself against a text.

The Matcher class is used to match a given regular expression (Pattern instance) against a text multiple times. In other words, to look for multiple occurrences of the regular expression in the text. The Matcher will tell you where in the text (character index) it found the occurrences. You can obtain a Matcher instance from a Pattern instance.

Both the Pattern and Matcher classes are covered in detail in their own texts in this Java regex tutorial trail.

Java Regular Expression Example

As mentioned above the Java regex API can either tell you if a regular expression matches a certain String, or return all the matches of that regular expression in the String. The following sections will show you examples of both of these ways to use the Java regex API.

Pattern Example

Here is a simple Java regex example that uses a regular expression to check if a text contains the substring http:// :

String text    =
        "This is the text to be searched " +
        "for occurrences of the http:// pattern.";

String regex = ".*http://.*";

boolean matches = Pattern.matches(regex, text);

System.out.println("matches = " + matches);

The text variable contains the text to be checked with the regular expression.

The regex variable contains the regular expression as a String. The regular expression matches all texts which contain zero or more characters (.*) followed by the text http:// followed by zero or more characters (.*).

The third line uses the Pattern.matches() static method to check if the regular expression (pattern) matches the text. If the regular expression matches the text, then Pattern.matches() returns true. If the regular expression does not match the text Pattern.matches() returns false.

The example does not actually check if the found http:// string is part of a valid URL, with domain name and suffix (.com, .net etc.). The regular expression just checks for an occurrence of the string http://.
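Pattern.matches() compiles the regular expression every time it is called. When the same regular expression is checked against many texts, it is usually better to compile it once and reuse the Pattern instance. Here is a minimal sketch of that idea. The class name UrlTextCounter and the method countMatching() are made up for this illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UrlTextCounter {

    // Compiled once and reused for every text that is checked.
    private static final Pattern HTTP = Pattern.compile(".*http://.*");

    // Counts the texts that contain the substring http:// somewhere.
    static int countMatching(List<String> texts) {
        int count = 0;
        for (String text : texts) {
            if (HTTP.matcher(text).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "see http://jenkov.com", "no link here", "http://");
        System.out.println("matching texts = " + countMatching(texts));
    }
}
```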

Matcher Example

Here is another Java regex example which uses the Matcher class to locate multiple occurrences of the substring "is" inside a text:

String text    =
        "This is the text which is to be searched " +
        "for occurrences of the word 'is'.";

String regex = "is";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);

int count = 0;
while(matcher.find()) {
    count++;
    System.out.println("found: " + count + " : "
            + matcher.start() + " - " + matcher.end());
}

From the Pattern instance a Matcher instance is obtained. Via this Matcher instance the example finds all occurrences of the regular expression in the text.
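The same find() loop can also collect the matches instead of printing them. The sketch below gathers the start and end index of every occurrence into a list. The class name MatchCollector, the method findAll() and the short input text are assumptions made for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCollector {

    // Returns one "start-end" entry for every occurrence of regex in text.
    static List<String> findAll(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        List<String> spans = new ArrayList<>();
        while (matcher.find()) {
            // end() is the index of the first character after the match.
            spans.add(matcher.start() + "-" + matcher.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Prints [2-4, 5-7]
        System.out.println(findAll("is", "This is it"));
    }
}
```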

Java Regular Expression Syntax

A key aspect of regular expressions is the regular expression syntax. Java is not the only programming language that has support for regular expressions. Most modern programming languages support regular expressions. The syntax used in each language to define regular expressions is not exactly the same, though. Therefore you will need to learn the syntax used by your programming language.

In the following sections of this Java regex tutorial I will give you examples of the Java regular expression syntax, to get you started with the Java regex API and regular expressions in general. The regular expression syntax used by the Java regex API is covered in detail in the text about the Java regular expression syntax.

Matching Characters

The first thing to look at is how to write a regular expression that matches characters against a given text. For instance, the regular expression defined here:

String regex = "http://";

will match all strings that are exactly the same as the regular expression, when the regular expression is matched against the whole text (as Pattern.matches() does). There can be no characters before or after the http:// - or the regular expression will not match the text. For instance, the above regex will match this text:

String text1 = "http://";

But not this text:

String text2 = "The URL is: http://mydomain.com";

The second string contains characters both before and after the http:// that is matched against.
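The difference comes from how the regular expression is applied. Pattern.matches() requires the whole text to match, whereas Matcher.find() looks for an occurrence anywhere in the text. The following sketch shows both with the two texts above. The class and method names are made up for this illustration:

```java
import java.util.regex.Pattern;

class WholeVersusPartial {

    // True only if the entire text matches the regular expression.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // True if the regular expression matches somewhere inside the text.
    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "http://";
        String text1 = "http://";
        String text2 = "The URL is: http://mydomain.com";

        System.out.println(matchesWhole(regex, text1));   // true
        System.out.println(matchesWhole(regex, text2));   // false
        System.out.println(containsMatch(regex, text2));  // true
    }
}
```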

Metacharacters

Metacharacters are characters in a regular expression that are interpreted to have special meanings. These metacharacters are:

< > ( ) [ ] { } \ ^ - = $ ! | ? * + .

What exactly these metacharacters mean will be explained further down this Java Regex tutorial. Just keep in mind that if you include e.g. a "." (fullstop) in a regular expression it will not match a fullstop character, but match something else which is defined by that metacharacter (also explained later).

Escaping Characters

As mentioned above, metacharacters in Java regular expressions have a special meaning. If you really want to match these characters in their literal form, and not their metacharacter meaning, you must "escape" the metacharacter you want to match. To escape a metacharacter you use the Java regular expression escape character - the backslash character. Escaping a character means preceding it with the backslash character. For instance, like this:

\.

In this example the . character is preceded (escaped) by the \ character. When escaped the fullstop character will actually match a fullstop character in the input text. The special metacharacter meaning of an escaped metacharacter is ignored - only its actual literal value (e.g. a fullstop) is used.

Java regular expression syntax uses the backslash character as escape character, just like Java Strings do. This gives a little challenge when writing a regular expression in a Java string. Look at this regular expression example:

String regex = "\\.";

Notice that the regular expression String contains two backslashes after each other, followed by a . character. The reason is that the Java compiler first interprets the two \\ characters as an escaped Java String character. After the Java compiler is done, only one \ is left, as \\ means the character \. The string thus looks like this:

\.

Now the Java regular expression interpreter kicks in, and interprets the remaining backslash as an escape character. The following character . is now interpreted to mean an actual full stop, not to have the special regular expression meaning it otherwise has. The remaining regular expression thus matches for the full stop character and nothing more.

Several characters have a special meaning in the Java regular expression syntax. If you want to match for that explicit character and not use it with its special meaning, you need to escape it with the backslash character first. For instance, to match for the full stop character, you need to write:

String regex = "\\.";

To match for the backslash character itself, you need to write:

String regex = "\\\\";

Getting the escaping of characters right in regular expressions can be tricky. For advanced regular expressions you might have to play around with it a while before you get it right.
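If all you need is to match a piece of text literally, the Pattern.quote() method can do the escaping for you. It returns a regular expression that matches the given string character for character. Here is a small sketch comparing manual escaping with Pattern.quote(). The class name LiteralEscaping and its two helper methods are made up for this illustration:

```java
import java.util.regex.Pattern;

class LiteralEscaping {

    // True if the whole text matches the regular expression.
    static boolean matchesRegex(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // True if the whole text equals the literal. Pattern.quote() turns off
    // the special meaning of every metacharacter in the literal.
    static boolean matchesLiteral(String literal, String text) {
        return Pattern.matches(Pattern.quote(literal), text);
    }

    public static void main(String[] args) {
        System.out.println(matchesRegex("a.c", "abc"));    // unescaped . matches the b
        System.out.println(matchesRegex("a\\.c", "abc"));  // escaped . requires a full stop
        System.out.println(matchesLiteral("a.c", "a.c"));  // quoted literal matches itself
        System.out.println(matchesRegex("\\\\", "\\"));    // matches a single backslash
    }
}
```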

Matching Any Character

So far we have only seen how to match specific characters like "h", "t", "p" etc. However, you can also just match any character without regard to what character it is. The Java regular expression syntax lets you do that using the . character (period / full stop). Here is an example regular expression that matches any character:

String regex = ".";

This regular expression matches a single character, no matter what character it is.

The . character can be combined with other characters to create more advanced regular expressions. Here is an example:

String regex = "H.llo";

This regular expression will match any Java string that contains the characters "H" followed by any character, followed by the characters "llo". Thus, this regular expression will match all of the strings "Hello", "Hallo", "Hullo", "Hxllo" etc.
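One detail worth knowing is that the . character does not match line terminators (such as \n) by default. It only does so if the pattern is compiled with the Pattern.DOTALL flag. The sketch below checks the H.llo expression against a few strings. The class and method names are made up for this illustration:

```java
import java.util.regex.Pattern;

class AnyCharacter {

    // Matches with default settings: . does not match line terminators.
    static boolean matches(String text) {
        return Pattern.matches("H.llo", text);
    }

    // Matches with DOTALL enabled: . also matches line terminators.
    static boolean matchesDotAll(String text) {
        return Pattern.compile("H.llo", Pattern.DOTALL).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("Hello"));          // true
        System.out.println(matches("Hxllo"));          // true
        System.out.println(matches("Hllo"));           // false, . needs one character
        System.out.println(matches("H\nllo"));         // false
        System.out.println(matchesDotAll("H\nllo"));   // true
    }
}
```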

Matching Any of a Set of Characters

Java regular expressions support matching any of a specified set of characters using what is referred to as character classes. Here is a character class example:

String regex = "H[ae]llo";

The character class (set of characters to match) is enclosed in the square brackets - the [ae] part of the regular expression, in other words. The square brackets are not matched - only the characters inside them.

The character class will match one of the enclosed characters regardless of which, but no more than one. Thus, the regular expression above will match either of the two strings "Hallo" or "Hello", but no other strings. Only an "a" or an "e" is allowed between the "H" and the "llo".

You can match a range of characters by specifying the first and the last character in the range with a dash in between. For instance, the character class [a-z] will match all characters between a lowercase a and a lowercase z, both a and z included.

You can have more than one character range within a character class. For instance, the character class [a-zA-Z] will match all letters between a and z or between A and Z .

You can also use ranges for digits. For instance, the character class [0-9] will match the characters between 0 and 9, both included.

If you want to actually match one of the square brackets in a text, you will need to escape them. Here is how escaping the square brackets looks:

String regex = "H\\[llo";

The \\[ is the escaped square left bracket. This regular expression will match the string "H[llo".

If you want to match the square brackets inside a character class, here is how that looks:

String regex = "H[\\[\\]]llo";

The character class is this part: [\\[\\]]. The character class contains the two square brackets escaped (\\[ and \\]).

This regular expression will match the strings "H[llo" and "H]llo".
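The character class examples from this section can be verified with a few calls to Pattern.matches(). This is only a small sketch. The class name CharacterClassCheck and the test strings are chosen for this illustration:

```java
import java.util.regex.Pattern;

class CharacterClassCheck {

    // True if the whole text matches the regular expression.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("H[ae]llo", "Hallo"));       // true
        System.out.println(matches("H[ae]llo", "Hello"));       // true
        System.out.println(matches("H[ae]llo", "Hullo"));       // false
        System.out.println(matches("H[ae]llo", "Haello"));      // false, only one character
        System.out.println(matches("H[a-z]llo", "Hxllo"));      // true
        System.out.println(matches("H[\\[\\]]llo", "H[llo"));   // true
        System.out.println(matches("H[\\[\\]]llo", "H]llo"));   // true
    }
}
```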

Matching a Range of Characters

The Java regex API allows you to specify a range of characters to match. Specifying a range of characters is easier than explicitly specifying each character to match. For instance, you can match the characters a to z like this:

String regex = "[a-z]";

This regular expression will match any single character from a to z in the alphabet.

The character classes are case sensitive. To match all characters from a to z regardless of case, you must include both uppercase and lowercase character ranges. Here is how that looks:

String regex = "[a-zA-Z]";

Matching Digits

You can match digits of a number with the predefined character class with the code \d. The digit character class corresponds to the character class [0-9].

Since the \ character is also an escape character in Java, you need two backslashes in the Java string to get a \d in the regular expression. Here is how such a regular expression string looks:

String regex = "Hi\\d";

This regular expression will match strings starting with "Hi" followed by a digit (0 to 9). Thus, it will match the string "Hi5" but not the string "Hip".

Matching Non-digits

Matching non-digits can be done with the predefined character class \D (uppercase D). Here is a regular expression containing the non-digit character class:

String regex = "Hi\\D";

This regular expression will match any string which starts with "Hi" followed by one character which is not a digit.

Matching Word Characters

You can match word characters with the predefined character class with the code \w . The word character class corresponds to the character class [a-zA-Z_0-9].

String regex = "Hi\\w";

This regular expression will match any string that starts with "Hi" followed by a single word character.

Matching Non-word Characters

You can match non-word characters with the predefined character class \W (uppercase W). Since the \ character is also an escape character in Java, you need two backslashes in the Java string to get a \W in the regular expression.

Here is a regular expression example using the non-word character class:

String regex = "Hi\\W";
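To get a feel for the four predefined character classes \d, \D, \w and \W, you can count how many characters in a text each of them matches. The sketch below does that with a Matcher. The class name PredefinedClassCounter, the count() method and the sample texts are assumptions made for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassCounter {

    // Counts how many times the regular expression is found in the text.
    static int count(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("\\d", "Hi5 and Hi42"));  // 3 digits: 5, 4, 2
        System.out.println(count("\\D", "Hi5"));           // 2 non-digits: H, i
        System.out.println(count("\\w", "a_1 !"));         // 3 word characters: a, _, 1
        System.out.println(count("\\W", "a_1 !"));         // 2 non-word characters: space, !
    }
}
```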

Boundaries

The Java Regex API can also match boundaries in a string. A boundary could be the beginning of a string, the end of a string, the beginning of a word etc. The Java Regex API supports the following boundaries:

Symbol  Description
^       The beginning of a line.
$       The end of a line.
\b      A word boundary (where a word starts or ends, e.g. space, tab etc.).
\B      A non-word boundary.
\A      The beginning of the input.
\G      The end of the previous match.
\Z      The end of the input but for the final terminator (if any).
\z      The end of the input.

Some of these boundary matchers are explained below.

Beginning of Line (or String)

The ^ boundary matcher matches the beginning of a line according to the Java API specification. By default, however, it only matches the beginning of the input string. It matches the beginning of each line only when the pattern is compiled with the Pattern.MULTILINE flag. For instance, the following example only gets a single match at index 0:

String text = "Line 1\nLine2\nLine3";

Pattern pattern = Pattern.compile("^");
Matcher matcher = pattern.matcher(text);

while(matcher.find()){
    System.out.println("Found match at: "  + matcher.start() + " to " + matcher.end());
}

Even if the input string contains several line breaks, the ^ character by default only matches the beginning of the input string, not the beginning of each line (after each line break).

The beginning of line / string matcher is often used in combination with other characters, to check if a string begins with a certain substring. For instance, this example checks if the input string starts with the substring http:// :

String text = "http://jenkov.com";

Pattern pattern = Pattern.compile("^http://");
Matcher matcher = pattern.matcher(text);

while(matcher.find()){
    System.out.println("Found match at: "  + matcher.start() + " to " + matcher.end());
}

This example finds a single match of the substring http:// from index 0 to index 7 in the input string. Even if the input string had contained more instances of the substring http:// they would not have been matched by this regular expression, since the regular expression started with the ^ character.
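To make ^ match at the beginning of every line, the pattern can be compiled with the Pattern.MULTILINE flag. The following sketch compares the two modes on the three-line text used earlier in this section. The class name LineStartFinder and the method lineStarts() are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineStartFinder {

    // Returns the index of every position where ^ matches in the text.
    static List<Integer> lineStarts(String text, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher matcher = Pattern.compile("^", flags).matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Line 1\nLine2\nLine3";
        System.out.println(lineStarts(text, false)); // [0]
        System.out.println(lineStarts(text, true));  // [0, 7, 13]
    }
}
```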

End of Line (or String)

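The $ boundary matcher matches the end of a line according to the Java specification. By default, however, it only matches at the end of the input string (or just before a final line terminator). It matches the end of each line only when the pattern is compiled with the Pattern.MULTILINE flag.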

The end of line (or string) matcher is often used in combination with other characters, most commonly to check if a string ends with a certain substring. Here is an example of the end of line / string matcher:

String text = "http://jenkov.com";

Pattern pattern = Pattern.compile("\\.com$");
Matcher matcher = pattern.matcher(text);

while(matcher.find()){
    System.out.println("Found match at: "  + matcher.start() + " to " + matcher.end());
}

This example will find a single match at the end of the input string.
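The difference between $, \Z and \z shows up when the input ends with a line break. With default settings, $ and \Z both match just before a final line terminator, while \z only matches at the very end of the input. Here is a small sketch of that behavior. The class name EndAnchorCheck, the found() method and the input strings are chosen for this illustration:

```java
import java.util.regex.Pattern;

class EndAnchorCheck {

    // True if the regular expression is found somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String withNewline = "jenkov.com\n";

        System.out.println(found("com$", withNewline));    // true
        System.out.println(found("com\\Z", withNewline));  // true
        System.out.println(found("com\\z", withNewline));  // false, the \n comes last
        System.out.println(found("com\\z", "jenkov.com")); // true
    }
}
```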

Word Boundaries

The \b boundary matcher matches a word boundary, meaning a location in an input string where a word either starts or ends.

Here is a Java regex word boundary example:

String text = "Mary had a little lamb";

Pattern pattern = Pattern.compile("\\b");
Matcher matcher = pattern.matcher(text);

while(matcher.find()){
    System.out.println("Found match at: "  + matcher.start() + " to " + matcher.end());
}

This example matches all word boundaries found in the input string. Notice how the word boundary matcher is written as \\b - with two \\ (backslash) characters. The reason for this is explained in the section about escaping characters. The Java compiler uses \ as an escape character, and thus requires two backslash characters after each other in order to insert a single backslash character into the string.

The output of running this example would be:

Found match at: 0 to 0
Found match at: 4 to 4
Found match at: 5 to 5
Found match at: 8 to 8
Found match at: 9 to 9
Found match at: 10 to 10
Found match at: 11 to 11
Found match at: 17 to 17
Found match at: 18 to 18
Found match at: 22 to 22

The output lists all the locations where a word either starts or ends in the input string. As you can see, the indices of word beginnings point to the first character of the word, whereas the indices of word endings point to the first character after the word.

You can combine the word boundary matcher with other characters to search for words beginning with specific characters. Here is an example:

String text = "Mary had a little lamb";

Pattern pattern = Pattern.compile("\\bl");
Matcher matcher = pattern.matcher(text);

while(matcher.find()){
    System.out.println("Found match at: "  + matcher.start() + " to " + matcher.end());
}

This example will find all the locations where a word starts with the letter l (lowercase). Each match covers only the lowercase l itself, so the end index of each match is the index right after that letter.
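If you want the whole word and not just its first letter, one option is to follow the letter with \w* so the match continues to the end of the word. The sketch below collects the matched words with matcher.group(). The class name WordFinder and the method wordsStartingWithL() are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {

    // Returns every word in the text that starts with a lowercase l.
    static List<String> wordsStartingWithL(String text) {
        // \b = word boundary, l = the letter, \w* = the rest of the word.
        Matcher matcher = Pattern.compile("\\bl\\w*").matcher(text);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [little, lamb]
        System.out.println(wordsStartingWithL("Mary had a little lamb"));
    }
}
```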

Non-word Boundaries

The \B boundary matcher matches non-word boundaries, meaning every position that is not a word boundary. Typically this is a position between two characters which are both part of the same word (a position between two non-word characters is also a non-word boundary). In other words, the character combination is not a word-to-non-word character sequence (which is a word boundary). Here is a simple Java regex non-word boundary matcher example:

String text = "Mary had a little lamb";

Pattern pattern = Pattern.compile("\\B");
Matcher matcher = pattern.matcher(text);

while(matcher.find()){
    System.out.println("Found match at: "  + matcher.start() + " to " + matcher.end());
}

This example will give the following output:

Found match at: 1 to 1
Found match at: 2 to 2
Found match at: 3 to 3
Found match at: 6 to 6
Found match at: 7 to 7
Found match at: 12 to 12
Found match at: 13 to 13
Found match at: 14 to 14
Found match at: 15 to 15
Found match at: 16 to 16
Found match at: 19 to 19
Found match at: 20 to 20
Found match at: 21 to 21

Notice how these match indexes correspond to boundaries between characters within the same word.
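Like \b, the \B matcher can be combined with other characters. For instance, \Ba only matches an a that is not at the start of a word. Here is a small sketch of this. The class name InnerLetterFinder and the method innerA() are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerLetterFinder {

    // Returns the index of every a that does not start a word.
    static List<Integer> innerA(String text) {
        Matcher matcher = Pattern.compile("\\Ba").matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The standalone word "a" at index 9 is skipped.
        // Prints [1, 6, 19]
        System.out.println(innerA("Mary had a little lamb"));
    }
}
```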

Quantifiers

Quantifiers can be used to match characters more than once. There are several types of quantifiers which are listed in the Java Regex Syntax. I will introduce some of the most commonly used quantifiers here.

The first two quantifiers are the * and + characters. You put one of these characters after the character you want to match multiple times. Here is a regular expression with a quantifier:

String regex = "Hello*";

This regular expression matches strings with the text "Hell" followed by zero or more o characters. Thus, the regular expression will match "Hell", "Hello", "Helloo" etc.

If the quantifier had been the + character instead of the * character, the string would have had to end with 1 or more o characters.

If you want to match either of the two quantifier characters you will need to escape them. Here is an example of escaping the + quantifier:

String regex = "Hell\\+";

This regular expression will match the string "Hell+".

You can also match an exact number of a specific character using the {n} quantifier, where n is the number of characters you want to match. Here is an example:

String regex = "Hello{2}";

This regular expression will match the string "Helloo" (with two o characters in the end).

You can set an upper and a lower bound on the number of characters you want to match, like this:

String regex = "Hello{2,4}";

This regular expression will match the strings "Helloo", "Hellooo" and "Helloooo". In other words, the string "Hell" with 2, 3 or 4 o characters in the end.
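The quantifier examples above can be checked with Pattern.matches(). Note that in all of them the quantifier only applies to the character right before it, the o. This is a small sketch. The class name QuantifierCheck and the test strings are chosen for this illustration:

```java
import java.util.regex.Pattern;

class QuantifierCheck {

    // True if the whole text matches the regular expression.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("Hello*", "Hell"));          // true, zero o characters
        System.out.println(matches("Hello+", "Hell"));          // false, at least one o needed
        System.out.println(matches("Hello+", "Hellooo"));       // true
        System.out.println(matches("Hell\\+", "Hell+"));        // true, escaped +
        System.out.println(matches("Hello{2}", "Helloo"));      // true
        System.out.println(matches("Hello{2}", "Hello"));       // false
        System.out.println(matches("Hello{2,4}", "Helloooo"));  // true, four o characters
        System.out.println(matches("Hello{2,4}", "Hellooooo")); // false, five o characters
    }
}
```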

Logical Operators

The Java Regex API supports a set of logical operators which can be used to combine multiple subpatterns within a single regular expression. The Java Regex API supports two logical operators: The and operator and the or operator.

The and operator is implicit. If two characters (or other subpatterns) follow each other in a regular expression, that means that both the first and the second subpattern must match the target string. Here is an example of a regular expression that uses an implicit and operator:

String text = "Cinderella and Sleeping Beauty sat in a tree";

Pattern pattern = Pattern.compile("[Cc][Ii].*");
Matcher matcher = pattern.matcher(text);

System.out.println("matcher.matches() = " + matcher.matches());

Notice the 3 subpatterns [Cc], [Ii] and .*

Since there are no characters between these subpatterns in the regular expression, there is implicitly an and operator in between them. This means, that the target string must match all 3 subpatterns in the given order to match the regular expression as a whole. As you can see from the string, the expression matches the string. The string should start with either an uppercase or lowercase C, followed by an uppercase or lowercase I and then zero or more characters. The string meets these criteria.

The or operator is explicit and is represented by the pipe character |. Here is an example of a regular expression that contains two subexpressions with the logical or operator in between:

String text = "Cinderella and Sleeping Beauty sat in a tree";

Pattern pattern = Pattern.compile(".*Ariel.*|.*Sleeping Beauty.*");
Matcher matcher = pattern.matcher(text);

System.out.println("matcher.matches() = " + matcher.matches());

As you can see, the pattern will match either the subpattern Ariel or the subpattern Sleeping Beauty somewhere in the target string. Since the target string contains the text Sleeping Beauty, the regular expression matches the target string.
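If you also want to know which of the alternatives was found, you can drop the surrounding .* parts and use find() together with group(). The sketch below returns the first alternative found in the text. The class name AlternativeFinder and the method firstMatch() are made up for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternativeFinder {

    // Compiled once; matches either of the two names.
    private static final Pattern NAMES = Pattern.compile("Ariel|Sleeping Beauty");

    // Returns the first name found in the text, or null if none is found.
    static String firstMatch(String text) {
        Matcher matcher = NAMES.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String text = "Cinderella and Sleeping Beauty sat in a tree";
        // Prints: first match = Sleeping Beauty
        System.out.println("first match = " + firstMatch(text));
    }
}
```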

Java String Regex Methods

The Java String class has a few regular expression methods too. I will cover some of those here:

matches()

The Java String matches() method takes a regular expression as parameter, and returns true if the regular expression matches the string, and false if not.

Here is a matches() example:

String text = "one two three two one";

boolean matches = text.matches(".*two.*");

split()

The Java String split() method splits the string into N substrings and returns a String array with these substrings. The split() method takes a regular expression as parameter and splits the string at all positions in the string where the regular expression matches a part of the string. The regular expression is not returned as part of the returned substrings.

Here is a split() example:

String text = "one two three two one";

String[] twos = text.split("two");

This example will return the three strings "one ", " three " and " one". Notice that the spaces around each "two" are kept, since they are not part of the match.
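Since split() takes a regular expression, the surrounding whitespace can be made part of the separator. The following sketch compares a plain "two" separator with one that also consumes whitespace. The class name SplitExample and its two methods are made up for this illustration:

```java
import java.util.Arrays;

class SplitExample {

    // Splits on the word "two" only; surrounding spaces stay in the parts.
    static String[] splitPlain(String text) {
        return text.split("two");
    }

    // Splits on "two" including any whitespace around it.
    static String[] splitTrimmed(String text) {
        return text.split("\\s*two\\s*");
    }

    public static void main(String[] args) {
        String text = "one two three two one";
        System.out.println(Arrays.toString(splitPlain(text)));   // [one ,  three ,  one]
        System.out.println(Arrays.toString(splitTrimmed(text))); // [one, three, one]
    }
}
```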

replaceFirst()

The Java String replaceFirst() method returns a new String with the first match of the regular expression passed as first parameter replaced with the string value of the second parameter.

Here is a replaceFirst() example:

String text = "one two three two one";

String s = text.replaceFirst("two", "five");

This example will return the string "one five three two one".

replaceAll()

The Java String replaceAll() method returns a new String with all matches of the regular expression passed as first parameter replaced with the string value of the second parameter.

Here is a replaceAll() example:

String text = "one two three two one";

String t = text.replaceAll("two", "five");

This example will return the string "one five three five one".
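The replacement string is not always taken literally. A $ followed by a number refers to a capturing group (a part of the regular expression enclosed in parentheses), and the backslash is an escape character here too. If the replacement should be used exactly as written, Matcher.quoteReplacement() can be applied to it. Here is a small sketch of both cases. The class name ReplaceExample and its methods are made up for this illustration:

```java
import java.util.regex.Matcher;

class ReplaceExample {

    // Wraps every word starting with t in angle brackets.
    // $1 refers to the text matched by the first group (t\w+).
    static String markWords(String text) {
        return text.replaceAll("(t\\w+)", "<$1>");
    }

    // Replaces every "two" with the literal replacement text,
    // even if it contains $ or \ characters.
    static String replaceLiterally(String text, String replacement) {
        return text.replaceAll("two", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "one two three two one";
        System.out.println(markWords(text));              // one <two> <three> <two> one
        System.out.println(replaceLiterally(text, "$2")); // one $2 three $2 one
    }
}
```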
