Groups

Groups, as the name suggests, are meant to be used to “group” components of regular expressions. These groups can be used to:

Extract subsets of matches
Repeat groups an arbitrary number of times
Refer to previously matched substrings
Enhance readability
Allow complex alternations

We’ll see how to do a lot of this in later chapters, but learning how groups work will allow us to study some great examples in these later chapters.

Capturing groups

Capturing groups are denoted by ( … ). Here’s an expository example:

/a(bcd)e/g

1 match: abcde
1 match: abcdefg?
1 match: abcde

Capturing groups allow extracting parts of matches.

/\{([^{}]*)\}/g

1 match: {braces}
2 matches: {two} {pairs}
1 match: { {nested} }
1 match: { incomplete } }
1 match: {}
0 matches: {unmatched

Using your language’s regex functions, you would be able to extract the text between the matched braces for each of these strings.
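As a minimal sketch of that extraction, here is how it might look with Python's `re` module. The choice of Python and the helper name `extract_braced` are assumptions made for illustration; other languages expose the same idea through their own functions.

```python
import re

# The pattern from the example above: a brace pair with no braces inside.
BRACES = re.compile(r"\{([^{}]*)\}")

def extract_braced(text):
    """Return what group 1 captured for every match in `text`."""
    # With exactly one capturing group, findall returns the captures
    # themselves rather than the full matches.
    return BRACES.findall(text)

print(extract_braced("{two} {pairs}"))  # ['two', 'pairs']
print(extract_braced("{ {nested} }"))   # ['nested']
print(extract_braced("{unmatched"))     # []
```

Note that the braces themselves are part of the match but not part of the capture.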

Capturing groups can also bundle part of a regex so that the whole group can be repeated. We will cover repetition in detail in the chapters that follow, but here’s an example that demonstrates the utility of groups.

/a(bcd)+e/g

1 match: abcdefg
1 match: abcdbcde
1 match: abcdbcdbcdef
0 matches: ae
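The sketch below, assuming Python's `re` module, shows two things about a repeated group: the quantifier applies to the whole group rather than to the last character, and the capture holds only the final repetition. The name `REPEATED` is illustrative.

```python
import re

REPEATED = re.compile(r"a(bcd)+e")

match = REPEATED.search("abcdbcdbcdef")
print(match.group(0))  # abcdbcdbcde  (the whole match)
print(match.group(1))  # bcd          (only the last repetition is kept)

# At least one repetition of the group is required.
print(REPEATED.search("ae"))  # None

# Without the group, + applies only to the preceding "d".
print(re.search(r"abcd+e", "abcdbcde"))  # None
```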

Other times, they are used to group logically similar parts of the regex for readability.

/(\d\d\d\d)-W(\d\d)/g

1 match: 2020-W12
1 match: 1970-W01
1 match: 2050-W50-6
1 match: 12050-W50
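Besides readability, the two groups make the year and the week number available separately. A minimal sketch, assuming Python's `re` module and a hypothetical helper named `parse_iso_week`:

```python
import re

ISO_WEEK = re.compile(r"(\d\d\d\d)-W(\d\d)")

def parse_iso_week(text):
    """Return (year, week) as integers from the first match, or None."""
    match = ISO_WEEK.search(text)
    if match is None:
        return None
    year, week = match.groups()  # group 1 is the year, group 2 the week
    return int(year), int(week)

print(parse_iso_week("2020-W12"))      # (2020, 12)
print(parse_iso_week("1970-W01"))      # (1970, 1)
print(parse_iso_week("no week here"))  # None
```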

Backreferences

Backreferences allow referring to previously captured substrings.

The match from the first group would be \1, that from the second would be \2, and so on…

/([abc])=\1=\1/g

1 match: a=a=a
1 match: ab=b=b
0 matches: a=b=c

Backreferences cannot be used to reduce duplication in regexes. A backreference matches the exact text that a group captured, not the group’s pattern.

/[abc][abc][abc]/g

1 match: abc
1 match: a cable
1 match: aaa
1 match: bbb
1 match: ccc

/([abc])\1\1/g

0 matches: abc
0 matches: a cable
1 match: aaa
1 match: bbb
1 match: ccc
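The difference between the two patterns can be checked side by side. This is a small sketch assuming Python's `re` module; the names `ANY_THREE` and `SAME_THREE` are made up for the comparison.

```python
import re

ANY_THREE = re.compile(r"[abc][abc][abc]")  # three independent choices
SAME_THREE = re.compile(r"([abc])\1\1")     # one choice, repeated verbatim

for text in ["abc", "a cable", "aaa"]:
    print(text, bool(ANY_THREE.search(text)), bool(SAME_THREE.search(text)))
# abc True False
# a cable True False
# aaa True True
```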

Here’s an example that demonstrates a common use-case:

/\w+([,|])\w+\1\w+/g

1 match: comma,separated,values
1 match: pipe|separated|values
0 matches: wb|mixed,delimiters
0 matches: wb,mixed|delimiters

This cannot be achieved with repeated character classes, because each class matches independently of the others.

/\w+[,|]\w+[,|]\w+/g

1 match: comma,separated,values
1 match: pipe|separated|values
1 match: wb|mixed,delimiters
1 match: wb,mixed|delimiters
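Because the delimiter is captured, code can also report which one was used. A minimal sketch, assuming Python's `re` module and a hypothetical helper named `consistent_delimiter`:

```python
import re

DELIMITED = re.compile(r"\w+([,|])\w+\1\w+")

def consistent_delimiter(text):
    """Return the delimiter if it is used twice in a row, else None."""
    match = DELIMITED.search(text)
    # Group 1 holds whichever delimiter matched first; \1 then demands
    # that exact character again.
    return match.group(1) if match else None

print(consistent_delimiter("comma,separated,values"))  # ,
print(consistent_delimiter("pipe|separated|values"))   # |
print(consistent_delimiter("wb|mixed,delimiters"))     # None
```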

Non-capturing groups

Non-capturing groups are very similar to capturing groups, except that they don’t create “captures”. They take the form (?: … ).

Non-capturing groups are usually used in conjunction with capturing groups. Perhaps you are extracting some parts of the matches using capturing groups and need another group purely for structure. A capturing group there would shift the numbering of the captures you care about. This is where non-capturing groups come in handy.
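To illustrate the effect on capture numbering, here is a small sketch assuming Python's `re` module. The honorific pattern and the sample string are invented for this illustration and do not come from the examples in this chapter.

```python
import re

text = "Dr. Novak"

# The alternation needs a group either way; here it also creates a capture.
capturing = re.search(r"(Mr|Ms|Dr)\. (\w+)", text)
print(capturing.groups())      # ('Dr', 'Novak')

# With (?: ... ) the alternation is still grouped, but the name is group 1.
non_capturing = re.search(r"(?:Mr|Ms|Dr)\. (\w+)", text)
print(non_capturing.groups())  # ('Novak',)
```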

Examples

Query String Parameters

/^\?(\w+)=(\w+)(?:&(\w+)=(\w+))*$/g

0 matches: (empty string)
0 matches: ?
1 match: ?a=b
1 match: ?a=b&foo=bar

We match the first key-value pair separately because that allows us to use &, the separator, as part of the repeating group.
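One caveat is worth seeing in code: in many engines, including Python's `re` (assumed here), a repeated group keeps only its last repetition. So this pattern validates the whole query string but does not hand back every pair. The second pattern, `PAIR`, is a hypothetical helper for collecting all pairs once the string is known to be valid.

```python
import re

QUERY = re.compile(r"^\?(\w+)=(\w+)(?:&(\w+)=(\w+))*$")
PAIR = re.compile(r"[?&](\w+)=(\w+)")  # illustrative, not from the text above

print(QUERY.search("?a=b").groups())
# ('a', 'b', None, None)

# Groups 3 and 4 hold only the final repetition: foo=bar is not retained.
print(QUERY.search("?a=b&foo=bar&x=y").groups())
# ('a', 'b', 'x', 'y')

print(PAIR.findall("?a=b&foo=bar&x=y"))
# [('a', 'b'), ('foo', 'bar'), ('x', 'y')]
```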

(Basic) HTML tags

As a rule of thumb, do not use regex to match XML/HTML.¹²³⁴

However, it’s a relevant example:

/<([a-z]+)>(.*)<\/\1>/gi

1 match: <p>paragraph</p>
1 match: <li>list item</li>
1 match: <p><span>nesting</span></p>
0 matches: <p>hmm</li>
1 match: <p><p>not clever</p></p></p>
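A rough sketch of how the two groups might be read back, assuming Python's `re` module. The pattern is written with a single quantifier on the tag name, and the second input string is made up for illustration.

```python
import re

# re.IGNORECASE plays the role of the "i" flag.
TAG = re.compile(r"<([a-z]+)>(.*)</\1>", re.IGNORECASE)

match = TAG.search("<li>list item</li>")
print(match.group(1))  # li         (tag name, reused by \1 in the closing tag)
print(match.group(2))  # list item  (everything between the tags)

# The closing tag must repeat the captured name.
print(TAG.search("<em>text</strong>"))  # None
```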

Names

Find: \b(\w+) (\w+)\b

Replace: $2, $1⁵

Before

John Doe
Jane Doe
Sven Svensson
Janez Novak
Janez Kranjski
Tim Joe

After

Doe, John
Doe, Jane
Svensson, Sven
Novak, Janez
Kranjski, Janez
Joe, Tim
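The same find-and-replace can be sketched in Python, with one assumption to keep in mind: Python's `re.sub` writes replacement references as `\1`, `\2` rather than `$1`, `$2`. The function name `last_name_first` is illustrative.

```python
import re

NAME = re.compile(r"\b(\w+) (\w+)\b")

def last_name_first(text):
    """Swap each 'First Last' pair into 'Last, First'."""
    # \2 and \1 insert the text captured by groups 2 and 1.
    return NAME.sub(r"\2, \1", text)

print(last_name_first("John Doe\nSven Svensson"))
# Doe, John
# Svensson, Sven
```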

Backreferences and plurals

Find: \bword(s?)\b

Replace: phrase$1⁵

Before

This is a paragraph with some words.

Some instances of the word "word" are in their plural form: "words".

Yet, some are in their singular form: "word".

After

This is a paragraph with some phrases.

Some instances of the phrase "phrase" are in their plural form: "phrases".

Yet, some are in their singular form: "phrase".
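A minimal sketch of this replacement, again assuming Python's `re` module, where the captured text is written as `\1` in the replacement. The optional "s" is captured, so it is carried over only when it was present.

```python
import re

PLURAL_AWARE = re.compile(r"\bword(s?)\b")

def replace_word(text):
    """Replace 'word' with 'phrase', keeping a trailing 's' if there was one."""
    # Group 1 is either "s" or the empty string.
    return PLURAL_AWARE.sub(r"phrase\1", text)

print(replace_word('the word "word" has a plural form: "words".'))
# the phrase "phrase" has a plural form: "phrases".
```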

¹ https://stackoverflow.com/a/590789
² https://stackoverflow.com/a/6751339
³ https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
⁴ https://web.archive.org/web/20071018202901/http://oubliette.alpha-geek.com/2004/01/12/bring_me_your_regexs_i_will_create_html_to_break_them
⁵ In replacement contexts, $1, $2, … are usually used in place of \1, \2, … to refer to captured strings.