Learn Regex: A Beginner’s Guide

    Michael Wanyoike
    Share

    In this guide, you’ll learn regex, or regular expression syntax. By the end, you’ll be able to apply regex solutions in most scenarios that call for it in your web development work.

    What is Regex?

    Regex, or regular expressions, are special sequences used to find or match patterns in strings. These sequences use metacharacters and other syntax to represent sets, ranges, or specific characters. For example, the expression [0-9] matches the range of numbers between 0 and 9, and humor|humour matches both the strings “humor” and “humour”.

    Regular expressions have many real world uses cases, which include:

    • form input validation
    • web scraping
    • search and replace
    • filtering for information in massive text files such as logs

    They may look complicated and intimidating for new users. Take a look at this example:

    /^[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/
    

    It just look like garbled text. But don’t despair, there’s method behind this madness.

    In this guide I’ll show you how to master regular expressions but first, let’s clarify the terminology used in this guide:

    • pattern: regular expression pattern
    • string: test string used to match the pattern
    • digit: 0-9
    • letter: a-z, A-Z
    • symbol: !$%^&*()_+|~-=`{}[]:”;'<>?,./
    • space: single white space, tab
    • character: refers to a letter, digit or symbol

    xkcd.com/208

    Credit: xkcd

    Basics

    To learn regex quickly with this guide, visit Regex101, where you can build regex patterns and test them against strings (text) that you supply.

    When you open the site, you’ll need to select the JavaScript flavor, as that’s what we’ll be using for this guide. (Regex syntax is mostly the same for all languages, but there are some minor differences.)

    Next, you need to disable the global and multi line flags in Regex101. We’ll cover them in the next section. For now, we’ll look at the simplest form of regular expression we can build. Input the following:

    • regex input field: cat
    • test string: rat bat cat sat fat cats eat tat cat mat CAT

    basic regex

    Take note that regular expressions in JavaScript start and end with /. If you were to write a regular expression in JavaScript code, it would look like this: /cat/ without any quotation marks. In the above state, the regular expression matches the string “cat”. However, as you can see in the image above, there are several “cat” strings that are not matched. In the next section, we’ll look at why.

    Global and Case Insensitive Regex Flags

    By default, a regex pattern will only return the first match it finds. If you’d like to return additional matches, you need to enable the global flag, denoted as g. Regex patterns are also case sensitive by default. You can override this behavior by enabling the insensitive flag, denoted by i. The updated regex pattern is now fully expressed as /cat/gi. As you can see below, all “cat” strings have been matched including the one with a different case.

    global insensitive flag

    Character Sets

    In the previous example, we learned how to perform exact case-sensitive matches. What if we wanted to match “bat”, “cat”, and “fat”. We can do this by using character sets, denoted with < brackets — code>[]. Basically, you put in multiple characters that you want to get matched. For example, [bcf]at will match multiple strings as follows:

    character sets

    Character sets also work with digits.

    Ranges

    Let’s assume we want to match all words that end with at. We could supply the full alphabet inside the character set, but that would be tedious. The solution is to use ranges like this [a-z]at:

    ranges

    Here’s the full string that’s being tested: rat bat cat sat fat cats eat tat cat dog mat CAT.

    As you can see, all words are matching as expected. I’ve added the word dog just to throw in an invalid match. Here are other ways you can use ranges:

    • Partial range: selections such as [a-f] or [g-p].
    • Capitalized range: [A-Z].
    • Digit range: [0-9].
    • Symbol range: for example, [#$%&@].
    • Mixed range: for example, [a-zA-Z0-9] includes all digits, lower and upper case letters. Do note that a range only specifies multiple alternatives for a single character in a pattern.To further understand how to define a range, it’s best to look at the full ASCII table in order to see how characters are ordered.

    ascii table

    Repeating Characters

    Let’s say you’d like to match all three-letter words. You’d probably do it like this:

    [a-z][a-z][a-z]
    

    This would match all three-letter words. But what if you want to match a five- or eight-character word. The above method is tedious. There’s a better way to express such a pattern using the {} curly braces notation. All you have to do is specify the number of repeating characters. Here are examples:

    • a{5} will match “aaaaa”.
    • n{3} will match “nnn”.
    • [a-z]{4} will match any four-letter word such as “door”, “room” or “book”.
    • [a-z]{6,} will match any word with six or more letters.
    • [a-z]{8,11} will match any word between eight and 11 letters. Basic password validation can be done this way.
    • [0-9]{11} will match an 11-digit number. Basic international phone validation can be done this way.

    Metacharacters

    Metacharacters allow you to write regular expression patterns that are even more compact. Let’s go through them one by one:

    • \d matches any digit that is the same as [0-9]
    • \w matches any letter, digit and underscore character
    • \s matches a whitespace character — that is, a space or tab
    • \t matches a tab character only

    From what we’ve learned so far, we can write regular expressions like this:

    • \w{5} matches any five-letter word or a five-digit number
    • \d{11} matches an 11-digit number such as a phone number

    Special Characters

    Special characters take us a step further into writing more advanced pattern expressions:

    • +: One or more quantifiers (preceding character must exist and can be optionally duplicated). For example, the expression c+at will match “cat”, “ccat” and “ccccccccat”. You can repeat the preceding character as many times as you like and you’ll still get a match.
    • ?: Zero or one quantifier (preceding character is optional). For example, the expression c?at will only match “cat” or “at”.
    • *: Zero or more quantifier (preceding character is optional and can be optionally duplicated). For example, the expression c*at will match “at”, “cat” and “ccccccat”. It’s like the combination of + and ?.
    • \: this “escape character” is used when we want to use a special character literally. For example, c\* will exactly match “c*” and not “ccccccc”.
    • [^]: this “negate” notation is used to indicate a character that should not be matched within a range. For example, the expression b[^a-c]ld will not match “bald” or “bbld” because the second letters a to c are negative. However, the pattern will match “beld”, “bild”, “bold” and so forth.
    • .: this “do” notation will match any digit, letter or symbol except newline. For example, .{8} will match a an eight-character password consisting of letters, numbers and symbols. for example, “password” and “P@ssw0rd” will both match.

    From what we’ve learned so far, we can create an interesting variety of compact but powerful regular expressions. For example:

    • .+ matches one or an unlimited number of characters. For example, “c” , “cc” and “bcd#.670” will all match.
    • [a-z]+ will match all lowercase letter words irrespective of length, as long as they contain at least one letter. For example, “book” and “boardroom” will both match.

    Groups

    All the special characters we just mentioned only affect a single character or a range set. What if we wanted the effect to apply to a section of the expression? We can do this by creating groups using round brackets — (). For example, the pattern book(.com)? will match both “book” and “book.com”, since we’ve made the “.com” part optional.

    Here’s a more complex example that would be used in a realistic scenario such as email validation:

    • pattern: @\w+\.\w{2,3}(\.\w{2,3})?
    • test string: abc.com abc@mail @mail.com @mail.co.ke

    partial email validation

    Alternate Characters

    In regex, we can specify alternate characters using the “pipe” symbol — |. This is different from the special characters we showed earlier as it affects all the characters on each side of the pipe symbol. For example, the pattern sat|sit will match both “sat” and “sit” strings. We can rewrite the pattern as s(a|i)t to match the same strings.

    The above pattern can be expressed as s(a|i)t by using () parentheses.

    Starting and Ending Patterns

    You may have noticed that some positive matches are a result of partial matching. For example, if I wrote a pattern to match the string “boo”, the string “book” will get a positive match as well, despite not being an exact match. To remedy this, we’ll use the following notations:

    • ^: placed at the start, this character matches a pattern at the start of a string.
    • $: placed at the end, this character matches a pattern at the end of the string.

    To fix the above situation, we can write our pattern as boo$. This will ensure that the last three characters match the pattern. However, there’s one problem we haven’t considered yet, as the following image shows:

    start-end notations

    The string “sboo” gets a match because it still fulfills the current pattern matching requirements. To fix this, we can update the pattern as follows: ^boo$. This will strictly match the word “boo”. If you use both of them, both rules are enforced. For example, ^[a-z]{5}$ strictly matches a five-letter word. If the string has more than five letters, the pattern doesn’t match.

    Regex in JavaScript

    // Example 1
    const regex1=/a-z/ig
    
    //Example 2
    const regex2= new RegExp(/[a-z]/, 'ig')
    

    If you have Node.js installed on your machine, open a terminal and execute the command node to launch the Node.js shell interpreter. Next, execute as follows:

    Node regex interactive

    Feel free to play with more regex patterns. When done, use the command .exit to quit the shell.

    Real World Example: Email Validation with regex

    As we conclude this guide, let’s look at a popular usage of regex, email validation. For example, we might want to check that an email address a user has entered into a form is a valid email address.

    This subject is more complicated than you might think. The email address syntax is quite simple: {name}@{domain}. In theory, an email address can contain a limited number of symbols such as #-@&%. etc. However, the placement of these symbols matters. Mail servers also have different rules on the use of symbols. For example, some servers treat the + symbol as invalid. In other mail servers, the symbol is used for email subaddressing.

    As a challenge to test your knowledge, try to build a regular expression pattern that matches only the valid email addresses marked below:

    # invalid email
    abc
    abc.com
    
    # valid email address
    abc@mail.com
    abc@mail.nz
    abc@mail.co.nz
    abc123@mail.com
    abc.def@music.com
    
    # invalid email prefix
    abc-@mail.com
    abc..def@mail.com
    .abc@mail.com
    abc#def@mail.com
    
    # valid email prefix
    abc-d@mail.com
    abc.def@mail.com
    abc@mail.com
    abc_def@mail.com
    
    # invalid domain suffix
    abc.def@mail.c
    abc.def@mail#archive.com
    abc.def@mail
    abc.def@mail..com
    
    # valid domain suffix
    abc.def@mail.cc
    abc.def@mail-archive.com
    abc.def@mail.org
    abc.def@mail.com
    fully-qualified-domain@example.com
    

    Do note some email addresses marked as valid may be invalid for certain organizations, while some that are marked as invalid may actually be allowed in other organizations. Either way, learning to build custom regular expressions for the organizations you work for is paramount in order to cater for their needs. In case you get stuck, you can look at the following possible solutions. Do note that none of them will give you a 100% match on the above valid email test strings.

    • Possible Solution 1:
    ^\w*(\-\w)?(\.\w*)?@\w*(-\w*)?\.\w{2,3}(\.\w{2,3})?$
    
    • Possible Solution 2:
    ^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
    

    Summary

    I hope you’ve now learned the basics of regular expressions. We haven’t covered all regex features in this quick beginner guide, but you should have enough information to tackle most problems that call for a regex solution. To learn more, read our guide on best practices for the practical application of regex in real-world scenarios.