  • Regular Expressions in C++

    In an earlier post, I discussed how regular expressions can be used. Now, I will show you how to use them in your own C++ program.

    The regex Library

    The regular expression library was added in C++11, so any reasonably recent compiler should support it. We can start with some basic boilerplate, including the <regex> header (along with some other headers that will make our lives easier) and the standard I/O library.

    C++
    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
      return 0;
    }

    Compiling a Regular Expression

    In C++, a regular expression must be compiled before it is used: the pattern string is turned into a regex object. When I say I am passing a regex string or a regex to a function, I am actually talking about this compiled object.

    Compiling a regex is easy. Below, we compile the regex <html>.+</html> and assign the compiled expression to a variable called re.

    Then, we do the same thing again, naming the expression reg. I am doing this twice to show the two forms you can use: copy-initializing the variable from a temporary regex object, or passing the pattern straight to the regex class’s constructor.

    C++
    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
      cout << "Compiling regex 1..." << endl;
      regex re = regex("<html>.+</html>");
      cout << "Compiled regex 1!" << endl;
      
      cout << "Compiling regex 2..." << endl;
      regex reg("<html>.+</html>");
      cout << "Compiled regex 2!" << endl;
    
      return 0;
    }
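
    One thing the example above does not show is what happens when the pattern itself is malformed. In that case, the regex constructor throws a std::regex_error. Below is a minimal sketch of how you might guard against that; the helper name isValidRegex is my own invention for illustration and not part of the library.

```cpp
#include <iostream>
#include <regex>
#include <string>

using namespace std;

// Hypothetical helper (not part of <regex>): returns true if the pattern
// compiles, and false if the regex constructor rejects it.
bool isValidRegex(const string& pattern) {
  try {
    regex re(pattern); // Compilation happens here, in the constructor
    return true;
  } catch (const regex_error& e) {
    // e.what() gives an implementation-defined description of the problem
    cerr << "Invalid regex \"" << pattern << "\": " << e.what() << endl;
    return false;
  }
}
```

    Calling isValidRegex("<html>.+</html>") should return true, while a pattern with an unclosed parenthesis such as "(unclosed" should return false.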

    Determining if a Regular Expression Matches an Entire String

    The regex_match function will determine whether an entire string is matched by a certain regex. For example, if we pass the regex hi to it and match it with the string hi, the function will return true, as the regular expression provided matches the entire target string of hi.

    However, if we kept the regex the same but changed the target string to shi, the function would return false because while shi contains hi, the regex hi does not match the entirety of shi.

    The example below asks for a regex and a target string, then reports whether the regex matches the entire target.

    C++
    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
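      string reStr;
      cout<<"Enter a regular expression to use the regex_match function on:\n>";
      getline(cin,reStr); // Read the whole line, so the regex may contain spaces
      
      string target;
      cout<<"Enter a target string to use the regex_match function on:\n>";
      getline(cin,target); // Same for the target string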
      
      regex reCompiled = regex(reStr); // Compiling our regex
      
      // Actual matching process
      if (regex_match(target,reCompiled)) {
        cout<<"\nRegex Matched Entirely!\n";
        return 0;
      }
      else {
        cout<<"\nRegex Did Not Match Entirely!\n";
      }
    
      return -1;
    }
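
    If you would rather not type the inputs by hand, the hi and shi cases described earlier can be checked with a small helper. This is only a sketch; the name matchesEntirely is hypothetical.

```cpp
#include <regex>
#include <string>

using namespace std;

// Hypothetical helper: true only if the regex matches the whole target,
// not just a part of it.
bool matchesEntirely(const string& target, const string& pattern) {
  regex re(pattern); // Compile the pattern
  return regex_match(target, re);
}
```

    Here, matchesEntirely("hi", "hi") is true and matchesEntirely("shi", "hi") is false. Changing the regex to .*hi makes it cover the leading s as well, so matchesEntirely("shi", ".*hi") is true.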

    A Quick Note: Capturing Groups

    Capturing groups in regexes are denoted by parentheses, and the text each group matched is usually returned alongside the full match. To make things simpler, let’s use the regex (sub)(mar)(ine). Here we can see that sub, mar, and ine each have their own capturing group.

    Now if we were to use this on the text submarinesubmarine, the regex would match on both submarines separately, so we would get two matches.

    Let’s take a closer look at the matches.

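    These matches would end up having three submatches each due to these capturing groups. If we were to visualize this as a hierarchy, we would get the following:

    • Match 1: submarine
        • Capturing group 1: sub
        • Capturing group 2: mar
        • Capturing group 3: ine
    • Match 2: submarine
        • Capturing group 1: sub
        • Capturing group 2: mar
        • Capturing group 3: ine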

    In C++, each match is stored in its own object of type smatch (you will learn more about this type in the next section). Index 0 of that object holds the entire match, and indexes 1 and up hold the capturing groups in order. For example, if m is the smatch holding match one, we would get its first capturing group with the code below:

    C++
    string m1c1 = m[1];

    A Quick Note: The smatch Type

    The smatch type stores the results of a single regex match on a string. It behaves like a read-only container of submatches: m[0] is the entire match, m[1] is the first capturing group, and so on, and m.size() is the number of capturing groups plus one. Each element can be turned into a string, for example with m[1].str(). To go through several matches, you run the search repeatedly, as shown in the next section.
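
    To make the layout of a match concrete, here is a small sketch that copies every element of the first match into a vector. The helper name firstMatchParts is hypothetical, and it uses the regex_search function that is covered in the next section.

```cpp
#include <regex>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: returns m[0], m[1], ... of the first match as strings.
// Returns an empty vector if the regex does not match anywhere.
vector<string> firstMatchParts(const string& target, const regex& re) {
  vector<string> parts;
  smatch m;
  if (regex_search(target, m, re)) {
    // m.size() is the number of capturing groups plus one (for m[0])
    for (size_t i = 0; i < m.size(); ++i) {
      parts.push_back(m[i].str());
    }
  }
  return parts;
}
```

    With the regex (sub)(mar)(ine) and the target submarinesubmarine, this should give four strings: submarine, sub, mar, and ine. Only the first of the two matches is looked at.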

    Determining if a Regular Expression Matches any Substrings

    Remember how above, I said that the regex_match function only tells you whether the entire string is matched by a regex? Well, if we want to include substrings, it can get a little more complicated (this is coming from someone with a Python background, where we are pampered with the re library).

    For this part of the guide, we will be using the regex_search function, which will tell you if any substring of the target string matches the regex.

    The regex_search function typically takes three to four arguments. Let’s look at the first method of calling it.

    For this method, the function takes three parameters and outputs a Boolean. The parameters are below.

    • Target (std::string) – This is the string you want to match the regex against
    • Match Results (std::smatch) – This is the variable of type smatch that will store match results. We will not be using it in this example
    • Regex (std::basic_regex) – This is the compiled regex that the target is being matched against

    The function will return true if any substring of the target string matches the regex, and false otherwise.

    C++
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main() {
        string s="this variable is called s";
        smatch m;
        regex e = regex("s");
        if (regex_search(s,m,e) /* Will return true */) {
            cout<<"Matched! (but not always the entire string)"<<endl;
        }
    
        return 0;
    }
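
    The example above ignores the smatch it passes in. As a sketch of what that variable makes available, the hypothetical helper below (the name firstMatchPosition is mine) uses the position member function of the match results to report where the first match starts.

```cpp
#include <regex>
#include <string>

using namespace std;

// Hypothetical helper: index in target where the first match starts,
// or -1 if no substring of target matches the regex.
long firstMatchPosition(const string& target, const string& pattern) {
  regex re(pattern);
  smatch m;
  if (regex_search(target, m, re)) {
    return static_cast<long>(m.position(0)); // Start of the entire match
  }
  return -1;
}
```

    For the target string used above, "this variable is called s", and the regex s, the first match is the s at the end of "this", so the helper should return 3.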

    We can also call regex_search using another method, whose parameters are listed below. In this method, we are not only finding out whether the regex matches, but also making use of the stored match results.

    • String Begin and String End – Iterators that tell the function to only search the part of the string between them
    • Match Results – This is the smatch that will store match results
    • Regex – The compiled regex that will be used to match against the target string

    The function will return true using the same conditions I stated in the previous method, but here what we care about is the fact that the match results are being stored.

    The code below will find the first match of the regex, print its first capturing group, and then search again starting from the end of that match. It will keep doing this until there are no other matches. I highly recommend you read the comments in the code below for a better understanding of what it’s doing.

    C++
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main() {
      string target = "submarine submarine submarine";
      regex re = regex("(sub)(mar)(ine)");
      smatch m;
      
      auto searchFrom = target.cbegin(); // Where the next search starts
      
      // Begin iterating
      while (regex_search(searchFrom,target.cend(),m,re)) {
        
        // We don't want to keep returning the same match every time, so the next search starts right after this match
        searchFrom = m.suffix().first;
        
        // It is important to know that m[0] would return the entire match ("submarine"), so m[1] will return the first capturing group ("sub")
        cout<<"We have got ourselves a match! \""<<m[1].str() /* First capturing group of match */ <<"\"\n";
        
      }  
    }
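
    The loop above moves the search position by hand. The standard library also provides sregex_iterator, which walks over all matches for you. This is an alternative that the example above does not use; the sketch below, with the hypothetical helper name allGroupMatches, collects one capturing group from every match.

```cpp
#include <regex>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: collects capturing group `group` (0 = entire match)
// from every match of re in target.
vector<string> allGroupMatches(const string& target, const regex& re, size_t group) {
  vector<string> out;
  // A default-constructed sregex_iterator acts as the end marker
  for (sregex_iterator it(target.begin(), target.end(), re), end; it != end; ++it) {
    out.push_back((*it)[group].str()); // *it is the smatch of the current match
  }
  return out;
}
```

    Called with the target "submarine submarine submarine", the regex (sub)(mar)(ine), and group 1, this should return three copies of sub, the same strings the loop above prints.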

    Regular Expression Find and Replace

    The regex_replace function will find and replace all sequences that match the regex.

    In the example below, we are telling it to replace every word (each run of non-space characters) with “and”. The spaces between the words are left alone.

    We are giving it three parameters.

    • Target – The text to search and replace in
    • Regex – The compiled regular expression that will be used on the target
    • Replace With – The text that replaces each match of the regex in the target

    C++
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main() {
      regex re("([^ ]+)"); // Matches every word
      cout<<"ORIGINAL: this is text\n";
      cout<<regex_replace("this is text",re,"and"); // prints "and and and"
      return 0;
    }

    You can also use formatters in the replacement text to reuse parts of what was matched. They are listed below.

    • $number (where “number” is replaced by any positive number less than 100), for example $2 – Replaced at runtime with the text matched by the numberth capturing group of the match that triggered the replace. Numbering starts from 1, so $1 gets the first capturing group, not $0.
      Example: Replacing regex matches of “(sub)(.+)” with “2nd CG: $2” using a target string of “submarine” will yield a result of “2nd CG: marine”
    • $& – Replaced with a copy of the entire match, regardless of capturing groups.
      Example: Replacing regex matches of “(sub)(.+)” with “String: $&” using the same target string above will result in “String: submarine”
    • $` – Replaced at runtime with whatever came before the match.
      Example: When we have a regex of “sub” with target string “a submarine goes underwater”, “$`” will get replaced with “a ”
    • $' – Replaced at runtime with whatever came after the match.
      Example: When we have a regex of “sub” with target string “a submarine goes underwater”, “$'” will get replaced with “marine goes underwater”
    • $$ – I wouldn’t call it a formatter exactly; it’s more of an escape sequence. When you want a literal “$” in the replacement text, type “$$” so that the regex library does not mistake it for the start of a formatter.

    For example, the code below will put the letters “t” and “e” in parentheses.

    C++
    // regex_replace example
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main ()
    {
        regex re("([te])"); // Matches either "t" or "e"
        cout<<"ORIGINAL: thetechmaker.com\n";
        cout<<regex_replace("thetechmaker.com",re,"($&)"); // Prints "(t)h(e)(t)(e)chmak(e)r.com"
        return 0;
    }
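
    The other formatters can be tried out the same way. Below is a minimal sketch with a hypothetical wrapper, replaceAll (my own name, not a library function), so that each formatter can be tested with a single call. It assumes the default ECMAScript-style format rules, which is what regex_replace uses when you pass no extra flags.

```cpp
#include <regex>
#include <string>

using namespace std;

// Hypothetical helper: replaces every match of `pattern` in `target`
// with `fmt`, expanding any formatters that `fmt` contains.
string replaceAll(const string& target, const string& pattern, const string& fmt) {
  regex re(pattern);
  return regex_replace(target, re, fmt);
}
```

    With this helper, replaceAll("submarine", "(sub)(.+)", "2nd CG: $2") gives “2nd CG: marine”, and replaceAll("cost: 5", "5", "$$5") gives “cost: $5”. Keep in mind that only the matched text is replaced, so replaceAll("a submarine goes underwater", "sub", "[$`]") gives “a [a ]marine goes underwater”: the text around the match is still there, and the match itself has been swapped for the bracketed prefix.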