Category: C and C++

  • How Does Base64 Decoding Work?

    Base64 is an ASCII representation of binary data often used to pass raw binary data in environments that don’t support it, such as SMTP. I covered Base64 encoding earlier and wrote my own C++ implementation of a Base64 encoder in a post, but this article will explain how Base64 decoding works.

    Following the Base64 Decoding Process

    Let’s decode SGVsbG8gV29ybGQ=.

    We start by splitting it up by four-letter groups.

    • SGVs
    • bG8g
    • V29y
    • bGQ=

    The Base64 standard alphabet can be used to convert every character back into a number. For example, looking up the first quartet (S, G, V, and s) in that alphabet gives you 18, 6, 21, and 44.

    Then, you convert these numbers to six-bit binary, which gives you 010010, 000110, 010101, and 101100. Concatenating these binary numbers (in other words, just joining them end to end) gives us 010010000110010101101100.

    Do the same for all groups and concatenate the binary strings. For the last group, keep only the complete bytes (the padding character and the leftover zero bits are discarded). This will leave us with 0100100001100101011011000110110001101111001000000101011101101111011100100110110001100100.

    We are now left with a long string of binary data. Converting that to ASCII gives us the text Hello World.
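
    The lookup and concatenation steps above can be reproduced in a few lines. This is only a sketch: the helper name quartetToBits is made up for illustration, and it assumes the quartet contains only characters from the standard alphabet (no padding).

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: maps each Base64 character to its alphabet index
// and concatenates the six-bit binary forms of those indexes.
std::string quartetToBits(const std::string& quartet) {
    const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string bits;
    for (char c : quartet) {
        const std::size_t index = alphabet.find(c);
        if (index == std::string::npos) return ""; // not in the alphabet
        bits += std::bitset<6>(index).to_string();
    }
    return bits;
}
```

    Calling quartetToBits("SGVs") yields the same 24-bit string as the manual walk-through.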

    If the final quartet has padding at the end, the number of padding characters tells you how many bytes of that last quartet are actual data rather than filler added to pad the encoded string.

    • One padding character (=): output only two bytes from the final quartet; the rest is padding
    • Two padding characters (==): output just one byte
    • None: output all three bytes
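
    The three cases in the list above boil down to "three minus the number of padding characters". A minimal sketch, assuming the quartet is exactly four characters long and that = only appears at the end (the name bytesInQuartet is hypothetical):

```cpp
#include <string>

// Hypothetical helper: how many decoded bytes a quartet carries.
// Assumes `quartet` has four characters with '=' only at the end.
int bytesInQuartet(const std::string& quartet) {
    int padCount = 0;
    for (char c : quartet) {
        if (c == '=') ++padCount;
    }
    return 3 - padCount; // "=" leaves 2 bytes, "==" leaves 1, none leaves 3
}
```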

    Writing a C++ Base64 Decoder

    Let’s begin by including iostream for writing to the console.

    C++
    #include <iostream>

    Then for reading and writing files, we can use fstream.

    C++
    #include <fstream>

    After that, we can use cstdint to give us better control over the numbers we store.

    C++
    #include <cstdint>

    For keeping the Base64 input, we can use a string.

    C++
    #include <string>

    Now, let’s define our main function.

    C++
    int main() {
      return 0;
    }

    Within our main function, we can hardcode a decoding table for the Base64 standard alphabet.

    C++
    // constexpr makes these values compile-time constants, so the table can be baked into the executable instead of being filled in while the program runs
    // static gives the variables static storage duration, so they exist once for the whole run of the program instead of being recreated each time the function is entered
    static constexpr unsigned char padChar = '=';
    
    // A character is stored as a number (its ASCII code), so that code can be used directly as an array index. For every ASCII code, the decode table holds the character's index in the Base64 alphabet
    // Here, 0xFF is used to denote a character that is not part of the Base64 alphabet
    static constexpr uint8_t decodeTable[256] = {
      // 0x00–0x0F
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0x10–0x1F
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0x20–0x2F   ' ' … '/'
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,62  ,0xFF,0xFF,0xFF,63  , // '+'=62 '/'=63
      // 0x30–0x3F   '0' … '?'
      52  ,53  ,54  ,55  ,56  ,57  ,58  ,59  ,
      60  ,61  ,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0x40–0x4F   '@' … 'O'
      0xFF,0   ,1   ,2   ,3   ,4   ,5   ,6   ,
      7   ,8   ,9   ,10  ,11  ,12  ,13  ,14  ,
      // 0x50–0x5F   'P' … '_'
      15  ,16  ,17  ,18  ,19  ,20  ,21  ,22  ,
      23  ,24  ,25  ,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0x60–0x6F   '`' … 'o'
      0xFF,26  ,27  ,28  ,29  ,30  ,31  ,32  ,
      33  ,34  ,35  ,36  ,37  ,38  ,39  ,40  ,
      // 0x70–0x7F   'p' … DEL
      41  ,42  ,43  ,44  ,45  ,46  ,47  ,48  ,
      49  ,50  ,51  ,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0x80–0x8F
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0x90–0x9F
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0xA0–0xAF
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0xB0–0xBF
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0xC0–0xCF
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0xD0–0xDF
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0xE0–0xEF
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      // 0xF0–0xFF
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
      0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF
    };
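
    If typing out 256 entries by hand feels error-prone, the same table can be generated from the alphabet string. The sketch below is an alternative to the hardcoded table, not what the rest of this article uses; buildDecodeTable is a hypothetical name, and the table is filled at runtime rather than at compile time.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Sketch: fill a 256-entry decode table from the alphabet string.
// 0xFF marks bytes that are not part of the Base64 alphabet.
std::array<std::uint8_t, 256> buildDecodeTable() {
    const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::array<std::uint8_t, 256> table;
    table.fill(0xFF);
    for (std::size_t i = 0; i < alphabet.size(); ++i) {
        // The character's code is the slot; its position is the value
        table[static_cast<unsigned char>(alphabet[i])] = static_cast<std::uint8_t>(i);
    }
    return table;
}
```

    The trade-off is a tiny amount of startup work in exchange for a table that cannot drift out of sync with the alphabet.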
    

    Then, we can prompt the user.

    C++
    std::string inputPath, outputPath;
    std::cout << "Enter input file path (Base64 text)\n>";
    std::getline(std::cin, inputPath);
    std::cout << "Enter output file path (binary)\n>";
    std::getline(std::cin, outputPath);

    After that, we can open the input and output files.

    C++
    std::ifstream in(inputPath, std::ios::in);
    if (!in.is_open()) {
      std::cerr << "Failed to open input file\n";
      return 1;
    }
    std::ofstream out(outputPath, std::ios::binary);
    if (!out.is_open()) {
      std::cerr << "Failed to create output file\n";
      return 1;
    }

    Next, we can make variables to keep track of the current quartet we are processing and the current index.

    C++
    char quartet[4];
    size_t qIndex = 0;

    Next, we can enter a loop that will process our quartets.

    C++
    while (true) {
    	
    }

    Within our while loop, we need to keep track of the current character.

    C++
    char c;
    if (!in.get(c)) break; // in.get(c) will place the next character (reading the file from left to right) into the variable c. If it returns false, we have reached the end of the file and need to break out of our while loop

    We then should see if the current character is whitespace (a new line, a space, or a tab) and skip the character’s processing if it is.

    C++
    // The continue keyword will skip all further instructions in the loop and immediately move to the next iteration (in this case, we are immediately moving to the next character in the file)
    if (c == '\n' || c == '\r' || c == ' ' || c == '\t') continue;

    We also need to update the quartet variable with the current character.

    C++
    // Using qIndex++ as an expression like this yields the current value of qIndex (which is used as the array index) and then increments qIndex by 1
    quartet[qIndex++] = c;
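
    Since the order of "use the value" and "increment" matters here, a tiny sketch may help. The function name below is hypothetical and exists only to make the behavior of the post-increment operator observable.

```cpp
#include <cstddef>

// Returns the index that an expression like buffer[qIndex++] would write to.
std::size_t slotUsedByPostIncrement(std::size_t& qIndex) {
    return qIndex++; // yields the old value, then qIndex grows by one
}
```

    Starting from qIndex = 0, the first call returns 0 and leaves qIndex at 1, so the first character lands in quartet[0].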

    If qIndex is four, we have reached the end of the current quartet and need to process it.

    C++
    if (qIndex == 4) {
    
    }

    Within this if statement, we can declare an array to hold the four six-bit values of the quartet (each Base64 character encodes six bits).

    C++
    uint8_t v[4];

    We can also keep track of the padding we have encountered so far.

    C++
    int padCount = 0;

    Then we can loop over the four characters of the current quartet and fill in the matching slot of v.

    C++
    for (int i = 0; i < 4; i++) {
    
    }

    Within this loop, we can check if the current character is padding.

    C++
    if (quartet[i] == padChar) {
    
    } else {
    
    }

    Within the first branch of this if statement (the code to be executed if the current character is padding), we can set v at the current index to zero. A padding character carries no data, and zero is better than leaving undefined garbage in the array.

    C++
    v[i] = 0;

    We can then increment padCount, which we will use after the for loop to decide how many bytes to output.

    C++
    padCount++;

    In the second branch, we can add the Base64 alphabet index to v according to the decoding table.

    C++
    uint8_t val = decodeTable[(unsigned char)quartet[i]];
    if (val == 0xFF) { std::cerr << "Invalid Base64 char\n"; return 1; } // Earlier, we used 0xFF for characters not part of the Base64 alphabet
    v[i] = val;

    Outside of this if statement and after the for loop, we can take the entire quartet and combine it.

    C++
    uint32_t triple = (v[0] << 18) | (v[1] << 12) | (v[2] << 6) | v[3]; // We concatenate the four six-bit groups into one 24-bit value using left shifts and bitwise OR

    The amount of padding determines how many of the three bytes are actual data rather than filler, so we can extract bytes from the full triple based on that.

    C++
    // The right shift and bitwise AND are the exact opposite of what we did before. We essentially "peel apart" the bytes based on the number of padding characters.
    if (padCount < 3) {
        char b1 = (triple >> 16) & 0xFF; // Shifts the top 8 bits (bits 23-16) down into the lowest byte position. Then, we mask with 0xFF so only those 8 bits remain. This gives you the first decoded byte.
        out.put(b1);
        std::cout.put(b1);
    }
    if (padCount < 2) {
        char b2 = (triple >> 8) & 0xFF; // Shifts bits 15..8 down. Then masks with 0xFF. This is the second decoded byte.
        out.put(b2);
        std::cout.put(b2);
    }
    if (padCount < 1) {
        char b3 = triple & 0xFF; // Takes bits 7..0 directly. No shift is needed here because these are already the lowest bits of the triple; the mask removes everything above them. That gives the third decoded byte.
        out.put(b3);
        std::cout.put(b3);
    }
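
    To check the shifting and masking in isolation, the combine-and-extract steps can be wrapped in one function. This is a sketch under the assumption that padCount is 0, 1, or 2; the name sextetsToBytes is made up for this example.

```cpp
#include <cstdint>
#include <string>

// Sketch: turn four six-bit values into up to three bytes.
// `padCount` is the number of '=' characters in the quartet (0, 1, or 2).
std::string sextetsToBytes(const std::uint8_t v[4], int padCount) {
    const std::uint32_t triple = (static_cast<std::uint32_t>(v[0]) << 18) |
                                 (static_cast<std::uint32_t>(v[1]) << 12) |
                                 (static_cast<std::uint32_t>(v[2]) << 6) |
                                 static_cast<std::uint32_t>(v[3]);
    std::string bytes;
    bytes += static_cast<char>((triple >> 16) & 0xFF);                   // bits 23..16
    if (padCount < 2) bytes += static_cast<char>((triple >> 8) & 0xFF);  // bits 15..8
    if (padCount < 1) bytes += static_cast<char>(triple & 0xFF);         // bits 7..0
    return bytes;
}
```

    Feeding it the values 18, 6, 21, and 44 from the first quartet of our example gives back the first three letters of the decoded text.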

    Then we can reset qIndex.

    C++
    qIndex = 0;

    After the while loop (so we are back in the body of main now), we can close the input and output files and exit with code zero (for “success”).

    C++
    in.close();
    out.close();
    return 0;

    The Result

    This is what your final code should look like.

    C++
    #include <iostream>
    #include <fstream>
    #include <cstdint>
    #include <string>
    
    int main() {
        static constexpr uint8_t decodeTable[256] = {
          // 0x00–0x0F
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0x10–0x1F
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0x20–0x2F   ' ' … '/'
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,62  ,0xFF,0xFF,0xFF,63  , // '+'=62 '/'=63
          // 0x30–0x3F   '0' … '?'
          52  ,53  ,54  ,55  ,56  ,57  ,58  ,59  ,
          60  ,61  ,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0x40–0x4F   '@' … 'O'
          0xFF,0   ,1   ,2   ,3   ,4   ,5   ,6   ,
          7   ,8   ,9   ,10  ,11  ,12  ,13  ,14  ,
          // 0x50–0x5F   'P' … '_'
          15  ,16  ,17  ,18  ,19  ,20  ,21  ,22  ,
          23  ,24  ,25  ,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0x60–0x6F   '`' … 'o'
          0xFF,26  ,27  ,28  ,29  ,30  ,31  ,32  ,
          33  ,34  ,35  ,36  ,37  ,38  ,39  ,40  ,
          // 0x70–0x7F   'p' … DEL
          41  ,42  ,43  ,44  ,45  ,46  ,47  ,48  ,
          49  ,50  ,51  ,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0x80–0x8F
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0x90–0x9F
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0xA0–0xAF
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0xB0–0xBF
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0xC0–0xCF
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0xD0–0xDF
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0xE0–0xEF
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          // 0xF0–0xFF
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
          0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF
        };
    
        std::string inputPath, outputPath;
        std::cout << "Enter input file path (Base64 text)\n>";
        std::getline(std::cin, inputPath);
        std::cout << "Enter output file path (binary)\n>";
        std::getline(std::cin, outputPath);
    
        std::ifstream in(inputPath, std::ios::in);
        if (!in.is_open()) {
            std::cerr << "Failed to open input file\n";
            return 1;
        }
        std::ofstream out(outputPath, std::ios::binary);
        if (!out.is_open()) {
            std::cerr << "Failed to create output file\n";
            return 1;
        }
    
        char quartet[4];
        size_t qIndex = 0;
        while (true) {
            char c;
            if (!in.get(c)) break;
    
            if (c == '\n' || c == '\r' || c == ' ' || c == '\t') continue;
    
            quartet[qIndex++] = c;
            if (qIndex == 4) {
                uint8_t v[4];
                int padCount = 0;
                for (int i = 0; i < 4; i++) {
                    if (quartet[i] == '=') {
                        v[i] = 0;
                        padCount++;
                    } else {
                        uint8_t val = decodeTable[(unsigned char)quartet[i]];
                        if (val == 0xFF) { std::cerr << "Invalid Base64 char\n"; return 1; }
                        v[i] = val;
                    }
                }
    
                uint32_t triple = (v[0] << 18) | (v[1] << 12) | (v[2] << 6) | v[3];
    
                if (padCount < 3) {
                    char b1 = (triple >> 16) & 0xFF;
                    out.put(b1);
                    std::cout.put(b1);
                }
                if (padCount < 2) {
                    char b2 = (triple >> 8) & 0xFF;
                    out.put(b2);
                    std::cout.put(b2);
                }
                if (padCount < 1) {
                    char b3 = triple & 0xFF;
                    out.put(b3);
                    std::cout.put(b3);
                }
                qIndex = 0;
            }
        }
    
        in.close();
        out.close();
        return 0;
    }
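
    For experimenting without files, the same logic can be packed into a function that works on strings. The sketch below is a variant, not the program above: the name decodeBase64 is hypothetical, it looks characters up with std::string::find instead of the decode table, and it additionally rejects misplaced padding and input that ends in the middle of a quartet (the file-based program silently ignores such leftovers).

```cpp
#include <cstdint>
#include <string>

// Sketch of an in-memory variant. Returns false on an invalid character,
// misplaced padding, or a trailing group of fewer than four characters.
bool decodeBase64(const std::string& input, std::string& output) {
    const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::uint32_t triple = 0; // accumulates four six-bit values
    int count = 0;            // characters collected in the current quartet
    int padCount = 0;         // '=' characters seen in the current quartet
    output.clear();

    for (char c : input) {
        if (c == '\n' || c == '\r' || c == ' ' || c == '\t') continue;
        std::uint32_t value = 0;
        if (c == '=') {
            ++padCount;
        } else {
            const std::size_t index = alphabet.find(c);
            // Reject unknown characters and data that follows padding
            if (index == std::string::npos || padCount > 0) return false;
            value = static_cast<std::uint32_t>(index);
        }
        triple = (triple << 6) | value;
        if (++count == 4) {
            if (padCount > 2) return false;
            output += static_cast<char>((triple >> 16) & 0xFF);
            if (padCount < 2) output += static_cast<char>((triple >> 8) & 0xFF);
            if (padCount < 1) output += static_cast<char>(triple & 0xFF);
            triple = 0;
            count = 0;
            padCount = 0;
        }
    }
    return count == 0; // leftover characters mean truncated input
}
```

    Decoding the string from the beginning of this article with it should give back Hello World.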
    

  • Writing a Base64 Encoder in C++

    This is a follow-up to an earlier post where I explained how Base64 encoding and decoding works. If you have not read that yet, I would recommend you do as it provides helpful background information.

    C++ Implementation of Base64 Encoding

    Now that we know how Base64 encoding works, we can write our own encoder using C++.

    I would like to be able to encode files, so this encoder will load the entire file into a vector and then encode it from there. This is not really RAM-efficient, but I won’t be encoding large files.

    First, let’s import our standard IO.

    C++
    #include <iostream>

    Since we will be reading from the file to be encoded and writing to the file that needs to store all of the Base64 as text, we need to import an inbuilt library used for file operations.

    C++
    #include <fstream>

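    Since we will be working with bits, we can also include the bitset header from the standard library, which is handy for inspecting bit patterns while debugging. (The finished program below does its bit manipulation with shifts and masks and never actually uses bitset, so this include is optional.)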

    C++
    #include <bitset>

    Then, since we will be storing the data in a vector, we will need to import that type.

    C++
    #include <vector>

    I want to work with strings, not C-style arrays of characters, so let’s import that.

    C++
    #include <string>

    Because we will be using typedefs (type aliases) like uint8_t that will give us more granular control over how much space each number we store takes up in RAM, we will also need to include cstdint.

    C++
    #include <cstdint>

    Now, we can start with the main function.

    C++
    int main() {
      return 0;
    }

    Within this main function, we should first start by asking the user which file we should read. We could use std::cin with the >> operator, but that treats all whitespace characters as terminators, meaning that it will not read anything after a whitespace character. So, if there are spaces in our filename, we won’t be able to read it properly.

    Instead, we should use std::getline, which only treats the newline character as a terminator. That is useful because newlines practically never appear in file paths.

    Below, we collect the input and output files.

    C++
    // Standard Base64 alphabet
    std::string base64Standard = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; // Values 0-63
    const char padChar = '='; // Technically, the padding character is NOT part of the alphabet and so including it above would be misleading
    
    // Variable that holds our input file
    std::string filePathInput;
    
    std::cout << "Enter a file path to read from\n>";
    
    // Read the line and place it directly in filePathInput
    std::getline(std::cin, filePathInput);
    
    // Variable that holds our output file
    std::string filePathOutput;
    
    std::cout << "Enter a file path to write to\n>";
    
    // Read the line and place it directly in filePathOutput
    std::getline(std::cin, filePathOutput);
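
    To see the difference between the two ways of reading in isolation, here is a small sketch that feeds the same typed text to both. The function names are hypothetical, and a std::istringstream stands in for std::cin so the example does not need keyboard input.

```cpp
#include <sstream>
#include <string>

// Reads the way `stream >> word` does: stops at the first whitespace.
std::string readWithExtraction(const std::string& typed) {
    std::istringstream stream(typed);
    std::string word;
    stream >> word;
    return word;
}

// Reads the way std::getline does: stops only at the newline.
std::string readWithGetline(const std::string& typed) {
    std::istringstream stream(typed);
    std::string line;
    std::getline(stream, line);
    return line;
}
```

    With the input my file.txt followed by a newline, the first function only sees my, while the second one returns the whole path.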

    After that, we can try opening the file that we need to read from.

    C++
    std::cout << "\nAttempting to open input file...\n";
    
    // Opening a read-only file stream to whichever path the user specified, in binary mode (since we are reading raw bytes, not text)
    // To be more specific: On Windows systems, text mode translates \r\n into \n while reading, which would change the bytes we encode and result in incorrect output. Binary mode prevents that
    std::ifstream inputFileStream(filePathInput, std::ios::binary);

    Then, we will see if our file stream has actually been opened. If it hasn’t, that likely means some error occurred that prevented the file from opening.

    C++
    if (!inputFileStream.is_open()) {
      std::cerr << "Failed to open file!\n";
      return 1;
    }
    std::cout << "File open!\n";

    To keep things fast, we should allocate the vector that will hold the file data up front. To do this, we seek to the end of the file and then use the tellg function to tell us how far we are from the beginning of the file. Then, we store this in a variable.

    C++
    std::cout << "Detecting file size...\n";
    inputFileStream.seekg(0, std::ios::end);
    
    // There is a dedicated type for file stream positions in C++
    const std::streampos endPos = inputFileStream.tellg();
    if (endPos < 0) { std::cerr << "Failed to get file size\n"; return 1; }
    const size_t fileSizeBytes = static_cast<size_t>(endPos); // We convert the std::streampos mentioned above to size_t, the unsigned type used for sizes and vector lengths
    
    std::cout << "Detected file size: " << fileSizeBytes << " bytes\n";
    inputFileStream.seekg(0, std::ios::beg);

    Now that the file stream is open, we can load the entire thing into a vector, and then close it as we have no use for it anymore.

    C++
    // Each element of the vector holds one raw byte of the file
    std::vector<uint8_t> fileData(fileSizeBytes);
    
    std::cout << "Reading file...\n";
    
    // read(...) takes a char* (where to store the bytes) and a std::streamsize (how many bytes to read)
    // In C++, a static_cast is a conversion between related types that the compiler checks for us
    // We use one here to convert the result of fileData.size() (a size_t) into std::streamsize
    // A reinterpret_cast tells the compiler to treat a pointer as if it pointed to a different type (uint8_t* as char* in this case)
    // It is resolved at compile time and does no conversion work at runtime
    // It is generally unsafe, but viewing raw bytes through a char* is one of the cases the language allows
    inputFileStream.read(reinterpret_cast<char*>(fileData.data()),
                         static_cast<std::streamsize>(fileData.size()));
    inputFileStream.close();
    
    std::cout << "File data read! It is now safe to modify or delete the file!\n";
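
    The size detection and the read can be bundled into one helper that also checks whether the read delivered every byte. This is a sketch under a few assumptions: the name readAllBytes is hypothetical, and returning an empty vector on failure is just one possible way to report errors (it cannot distinguish a failed read from an empty file).

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch: load a file into memory, returning an empty vector on failure.
std::vector<std::uint8_t> readAllBytes(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file.is_open()) return {};

    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    if (size < 0) return {};
    file.seekg(0, std::ios::beg);

    std::vector<std::uint8_t> data(static_cast<std::size_t>(size));
    file.read(reinterpret_cast<char*>(data.data()),
              static_cast<std::streamsize>(data.size()));
    // gcount() reports how many bytes the last read actually delivered
    if (file.gcount() != static_cast<std::streamsize>(data.size())) return {};
    return data;
}
```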

    After that, we can create the output file.

    C++
    std::cout << "Creating output file...\n";
    std::ofstream encodedDataOutputFile(filePathOutput, std::ios::binary);
    
    // Fires if an error occurs during the file creation process
    if (!encodedDataOutputFile.is_open()) {
      std::cerr<< "Error opening output file\n";
      return 1;
    }

    Now, we can process each and every bit to convert it into a Base64 character.

    C++
    size_t totalBits = fileData.size() * 8; // There are 8 bits in a byte
    
    std::cout << "Processing "
      << totalBits 
      << " bits...\n";
      
    size_t bitsProcessed = 0; // Let's keep track of the bits processed here
    
    while (bitsProcessed < totalBits) {
    
    }

    Within this while loop comes the actual processing of bits.

    Let’s start by keeping track of the current position of the byte being processed and the bit being processed, which will be useful for future calculations.

    Keeping track of the byte index is simple – you just divide the number of bits processed by 8, since there are 8 bits in a byte.

    Keeping track of the bit index required some more thinking for me. We need the bit index to know where, inside the current byte, the next six-bit group starts. The remainder of the number of bits processed divided by 8 gives us exactly that.

    C++
    size_t byteIndex = bitsProcessed / 8;
    size_t bitIndex = bitsProcessed % 8; // The bit offset inside the current byte where the next six-bit group starts

    Next, we will keep track of the current byte and the next byte, since it is possible for six bits to span across two bytes.

    C++
    uint8_t currentByte = fileData[byteIndex];
    uint8_t nextByte = (byteIndex + 1 < fileData.size()) ? fileData[byteIndex + 1] : 0; // Six bits may span across two bytes, so we need to keep track of the next byte

    Next, we extract six bits to cross-reference it to our alphabet.

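    To do this, we put the two bytes together (16 bits), store the result in a variable called combined, and extract six bits from it. The six bits we want start bitIndex bits from the left of that 16-bit value, which leaves 16 - bitIndex - 6 = 10 - bitIndex bits to their right, so that is how far we shift. We do it this way because, like I said before, six bits can span across two bytes.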

    C++
    uint16_t combined = (static_cast<uint16_t>(currentByte) << 8) | nextByte;
    uint8_t sixBits = (combined >> (10 - bitIndex)) & 0x3F; // Extracting six bits
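
    To convince ourselves that the shift amount is right, the extraction can be pulled into a standalone function and tried on known bytes. This is a sketch: extractSixBits is a hypothetical name, and it assumes the bit offset lies inside the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: read the six bits that start `bitOffset` bits into `data`.
// Assumes bitOffset < data.size() * 8; bits past the end are read as zero.
std::uint8_t extractSixBits(const std::vector<std::uint8_t>& data, std::size_t bitOffset) {
    const std::size_t byteIndex = bitOffset / 8;
    const std::size_t bitIndex = bitOffset % 8;
    const std::uint8_t currentByte = data[byteIndex];
    const std::uint8_t nextByte = (byteIndex + 1 < data.size()) ? data[byteIndex + 1] : 0;
    const std::uint16_t combined = static_cast<std::uint16_t>((currentByte << 8) | nextByte);
    return static_cast<std::uint8_t>((combined >> (10 - bitIndex)) & 0x3F);
}
```

    For the bytes of Hel (0x48, 0x65, 0x6C), offsets 0, 6, 12, and 18 give 18, 6, 21, and 44, which are the alphabet positions of S, G, V, and s.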

    Then, we can finally cross-reference the number created from this six-bit value with the alphabet we will be using.

    C++
    char resultChar = base64Standard[static_cast<int>(sixBits)];

    Last but not least, we output this character to both the console and the output file the user specified at the beginning and increment the bitsProcessed by 6.

    C++
    encodedDataOutputFile << resultChar;
    std::cout << resultChar;
    bitsProcessed += 6; // Advance the loop

    C++ Implementation: Solving the Padding Issues

    Now, we just need to add padding to our C++ program.

    Keep in mind that we have now moved on from the while loop, and any future snippets of code will take place in the main function.

    Let’s start by computing the amount of padding characters we need using the implementation I showed you at the beginning of this tutorial.

    C++
    size_t remaining = fileData.size() /*The size of the input file, in bytes*/ % 3;
    size_t padCount = (3 - remaining) % 3; // yields 0, 1, or 2

    Then, we add however many padding characters we need to the console and the output file.

    C++
    for (size_t i = 0; i < padCount; ++i) {
        encodedDataOutputFile << padChar;
        std::cout << padChar;
    }
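
    The padding formula is easy to check on its own. In this sketch, paddingFor is a hypothetical name and byteCount stands for the size of the input in bytes.

```cpp
#include <cstddef>

// Sketch: number of '=' characters needed for an input of `byteCount` bytes.
std::size_t paddingFor(std::size_t byteCount) {
    const std::size_t remaining = byteCount % 3; // bytes in the last, incomplete group
    return (3 - remaining) % 3;                  // the outer % 3 turns 3 into 0
}
```

    An 11-byte input such as Hello World leaves two bytes in the last group, so it needs one padding character.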

    Final Code

    And we are done! Your code should look something like this:

    C++
    #include <iostream>
    #include <fstream>
    #include <cstdint>
    #include <bitset>
    #include <vector>
    #include <string>
    
    int main() {
        // Standard Base64 alphabet
        std::string base64Standard = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; // Values 0-63
        const char padChar = '='; // Technically, the padding character is NOT part of the alphabet and so including it above would be misleading
    
        // Variable that holds our input file
        std::string filePathInput;
    
        std::cout << "Enter a file path to read from\n>";
    
        // Read the line and place it directly in filePathInput
        std::getline(std::cin, filePathInput);
    
        // Variable that holds our output file
        std::string filePathOutput;
    
        std::cout << "Enter a file path to write to\n>";
    
        // Read the line and place it directly in filePathOutput
        std::getline(std::cin, filePathOutput);
        std::cout << "\nAttempting to open input file...\n";
    
        // Opening a read-only file stream to whichever path the user specified, in binary mode (since we are reading raw bytes, not text)
        // To be more specific: On Windows systems, text mode translates \r\n into \n while reading, which would change the bytes we encode and result in incorrect output. Binary mode prevents that
        std::ifstream inputFileStream(filePathInput, std::ios::binary);
        if (!inputFileStream.is_open()) {
            std::cerr << "Failed to open file!\n";
            return 1;
        }
        std::cout << "File open!\n";
        std::cout << "Detecting file size...\n";
        inputFileStream.seekg(0, std::ios::end);
    
        // There is a dedicated type for file stream positions in C++
        const std::streampos endPos = inputFileStream.tellg();
        if (endPos < 0) {
            std::cerr << "Failed to get file size\n";
            return 1;
        }
        const size_t fileSizeBytes = static_cast<size_t>(endPos); // We convert the std::streampos mentioned above to size_t, the unsigned type used for sizes and vector lengths
    
        std::cout << "Detected file size: " << fileSizeBytes << " bytes\n";
        inputFileStream.seekg(0, std::ios::beg);
        // Each element of the vector holds one raw byte of the file
        std::vector<uint8_t> fileData(fileSizeBytes);
    
        std::cout << "Reading file...\n";
    
        // read(...) takes a char* (where to store the bytes) and a std::streamsize (how many bytes to read)
        // In C++, a static_cast is a conversion between related types that the compiler checks for us
        // We use one here to convert the result of fileData.size() (a size_t) into std::streamsize
        // A reinterpret_cast tells the compiler to treat a pointer as if it pointed to a different type (uint8_t* as char* in this case)
        // It is resolved at compile time and does no conversion work at runtime
        // It is generally unsafe, but viewing raw bytes through a char* is one of the cases the language allows
        inputFileStream.read(reinterpret_cast<char*>(fileData.data()),
            static_cast<std::streamsize>(fileData.size()));
        inputFileStream.close();
    
        std::cout << "File data read! It is now safe to modify or delete the file!\n";
        std::cout << "Creating output file...\n";
        std::ofstream encodedDataOutputFile(filePathOutput, std::ios::binary);
    
        // Fires if an error occurs during the file creation process
        if (!encodedDataOutputFile.is_open()) {
            std::cerr << "Error opening output file\n";
            return 1;
        }
        size_t totalBits = fileData.size() * 8; // There are 8 bits in a byte
    
        std::cout << "Processing " <<
            totalBits <<
            " bits...\n";
    
        size_t bitsProcessed = 0; // Let's keep track of the bits processed here
    
        while (bitsProcessed < totalBits) {
            size_t byteIndex = bitsProcessed / 8;
            size_t bitIndex = bitsProcessed % 8; // The bit offset inside the current byte where the next six-bit group starts
            uint8_t currentByte = fileData[byteIndex];
            uint8_t nextByte = (byteIndex + 1 < fileData.size()) ? fileData[byteIndex + 1] : 0; // Six bits may span across two bytes, so we need to keep track of the next byte
            uint16_t combined = (static_cast<uint16_t>(currentByte) << 8) | nextByte;
            uint8_t sixBits = (combined >> (10 - bitIndex)) & 0x3F; // Extracting six bits
            char resultChar = base64Standard[static_cast<int>(sixBits)];
            encodedDataOutputFile << resultChar;
            std::cout << resultChar;
            bitsProcessed += 6; // Advance the loop
        }
        size_t remaining = fileData.size() /* <-- The size of the input file, in bytes */ % 3;
        size_t padCount = (3 - remaining) % 3; // yields 0, 1, or 2 (for number of padding characters we need to add)
        for (size_t i = 0; i < padCount; ++i) {
            encodedDataOutputFile << padChar;
            std::cout << padChar;
        }
        return 0;
    }
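
    For quick experiments without files, here is a string-based sketch. Note that it uses a different technique from the program above: instead of walking the input six bits at a time, it takes three bytes at a time and emits four characters per group, which handles padding in the same pass. The name encodeBase64 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch of a three-bytes-at-a-time encoder working on an in-memory string.
std::string encodeBase64(const std::string& input) {
    const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string output;
    for (std::size_t i = 0; i < input.size(); i += 3) {
        // The last group may hold only one or two bytes
        const std::size_t chunk = (input.size() - i < 3) ? input.size() - i : 3;
        std::uint32_t triple = 0;
        for (std::size_t j = 0; j < 3; ++j) {
            triple <<= 8;
            if (j < chunk) triple |= static_cast<unsigned char>(input[i + j]);
        }
        output += alphabet[(triple >> 18) & 0x3F];
        output += alphabet[(triple >> 12) & 0x3F];
        output += (chunk > 1) ? alphabet[(triple >> 6) & 0x3F] : '=';
        output += (chunk > 2) ? alphabet[triple & 0x3F] : '=';
    }
    return output;
}
```

    Encoding Hello World with it should produce SGVsbG8gV29ybGQ=, the same string the decoding article starts from.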

    Performance Testing

    I wanted to compare this implementation to [Convert]::ToBase64String on PowerShell.

    PowerShell
    $TIME = Measure-Command {[Convert]::ToBase64String([System.IO.File]::ReadAllBytes("./hello.txt"))}
    $TIME.TotalMilliseconds
    # 7.0512
    $DIY_IMPLEMENTATION = Measure-Command {echo "./hello.txt`n./res.txt`n" | ./a.exe}
    $DIY_IMPLEMENTATION.TotalMilliseconds
    # 31.5084

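    Disappointing but not at all surprising. Our implementation is written for readability, not efficiency, and it writes every character to both the console and a file one at a time (which is a lot slower than you think). The second measurement also includes the time it takes to start a.exe as a separate process.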

  • Regular Expressions in C++

    In an earlier post I made, I discussed how regular expressions could be used. Now, I will show you how to implement them in your own C++ program.

    The regex Library

    The regular expression library was added in C++11, so your compiler should support it by now. We can start with some basic boilerplate, including the regular expression library (along with some other libraries that will make our lives easier) and the standard IO library.

    C++
    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
      return 0;
    }

    Compiling a Regular Expression

    In C++, regular expressions must be compiled before they are used. When I say I am passing a regex string or a regex to a function, I am actually talking about this compiled expression.

    It is actually very easy to compile regexes, and below, we are compiling the regex <html>.+</html> and assigning the compiled expression to a variable called re.

    Then, we will do the same thing again but name the expression reg. The reason I am doing this twice is that I want to show the two ways you can write it: constructing a temporary regex and assigning it to the variable, or passing the pattern straight to the regex class’s constructor.

    C++
    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
      cout << "Compiling regex 1..." << endl;
      regex re = regex("<html>.+</html>");
      cout << "Compiled regex 1!" << endl;
      
      cout << "Compiling regex 2..." << endl;
      regex reg("<html>.+</html>");
      cout << "Compiled regex 2!" << endl;
    
      return 0;
    }
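
    One thing worth knowing about compilation is that an invalid pattern makes the regex constructor throw a std::regex_error. A minimal sketch of catching it (the function name compilesAsRegex is made up for this example):

```cpp
#include <regex>
#include <string>

// Sketch: report whether a pattern compiles instead of letting the
// std::regex_error exception escape.
bool compilesAsRegex(const std::string& pattern) {
    try {
        std::regex compiled(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

    This matters most when the pattern comes from user input, as it does in the next example.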

    Determining if a Regular Expression Matches an Entire String

    The regex_match function will determine whether an entire string is matched by a certain regex. For example, if we pass the regex hi to it and match it with the string hi, the function will return true, as the regular expression provided matches the entire target string of hi.

    However, if we kept the regex the same but changed the target string to shi, the function would return false because while shi contains hi, the regex hi does not match the entirety of shi.

    Let’s use an example. I have given one below.

    C++
    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
      string reStr;
      cout<<"Enter a regular expression to use the regex_match function on:\n>";
      cin>>reStr;
      
      string target;
      cout<<"Enter a target string to use the regex_match function on:\n>";
      cin>>target;
      
      regex reCompiled = regex(reStr); // Compiling our regex
      
      // Actual matching process
      if (regex_match(target,reCompiled)) {
        cout<<"\nRegex Matched Entirely!\n";
        return 0;
      }
      else {
        cout<<"\nRegex Did Not Match Entirely!\n";
      }
    
      return -1;
    }

    A Quick Note: Capturing Groups

    Capturing groups in regexes are denoted by parentheses, and the text each group matched is usually returned as a list alongside the full match. To make things simpler, let’s use the regex (sub)(mar)(ine). Here we can see that sub, mar, and ine each have their own capturing group.

    Now if we were to use this on the text submarinesubmarine, the regex would match on both submarines separately, so we would get two matches.

    Let’s take a closer look at the matches.

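    Each of these matches has three submatches, one per capturing group. If we were to visualize this as a hierarchy, we would get the following:

    • Match 1: submarine
      • Capturing group 1: sub
      • Capturing group 2: mar
      • Capturing group 3: ine
    • Match 2: submarine
      • Capturing group 1: sub
      • Capturing group 2: mar
      • Capturing group 3: ine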

    In C++, a single match is stored in an smatch object (you will learn about the smatch type in the next section). Index 0 of that object holds the entire match, and indices 1 and up hold the capturing groups in order. If we collect every match into a list, for example a vector<smatch> called matches, then match one, capturing group one, is retrieved with the code below:

    C++
    string m1c1 = matches[0][1].str();

    A Quick Note: The smatch Type

    The smatch type stores the result of one regex match against a string. It is sort of like a vector of submatches: for an smatch called m, m[0] is the entire match, m[1] is the first capturing group, m[2] is the second, and so on, so m.size() is the number of capturing groups plus one. Each element can be turned into a string by calling .str() on it.

    Determining if a Regular Expression Matches any Substrings

    Remember how above, I said that the regex_match function only tells you whether the entire string is matched by a regex? Well, if we want to include substrings, it can get a little more complicated (this is coming from someone with a Python background, where we are pampered with the re library).

    For this part of the guide, we will be using the regex_search function, which will tell you if any substring of the target string matches the regex.

    The regex_search function typically takes three to four arguments. Let’s look at the first method of calling it.

    For this method, the function takes three parameters and outputs a Boolean. The parameters are below.

    • Target (std::string) – This is the string you want to match the regex against
    • Match Results (std::smatch) – This is the variable of type smatch that will store match results. We will not be using it in this example
    • Regex (std::basic_regex) – This is the compiled regex that the target is being matched against

    The function will return true if any substring of the target string matches the regex, and false otherwise.

    C++
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main() {
        string s="this variable is called s";
        smatch m;
        regex e = regex("s");
        if (regex_search(s,m,e) /* Will return true */) {
            cout<<"Matched! (but not always the entire string)"<<endl;
        }
    
        return 0;
    }

    We can also call regex_search using another method, whose parameters are listed below. With this method, we not only find out whether the regex matched, but also get the match results stored for us.

    • String Begin and String End – Tells the function to only search the substring in between the string beginning and string ending
    • Match Results – This is the smatch that will store match results
    • Regex – The compiled regex that will be used to match against the target string

    The function will return true using the same conditions I stated in the previous method, but here what we care about is the fact that the match results are being stored.

    The code below finds the first match of the regex, prints its first capturing group, and then searches again starting just after that match. It keeps doing this until there are no more matches. I highly recommend you read the comments in the code below for a better understanding of what it’s doing.

    C++
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main() {
      string target = "submarine submarine submarine";
      regex re = regex("(sub)(mar)(ine)");
      smatch m;
      
      string::const_iterator searchFrom = target.cbegin();
      
      // Begin iterating
      while (regex_search(searchFrom,target.cend(),m,re)) {
        
        // We don't want to keep returning the same match every time, so the code below will exclude this match from the future iterations 
        searchFrom = m.suffix().first;
        
        // m[0] holds the entire match ("submarine"), so m[1] holds the first capturing group ("sub")
        cout<<"We have got ourselves a match! \""<<m[1].str() /* First capturing group of match */ <<"\"\n";
        
      }  
    }

    Regular Expression Find and Replace

    The regex_replace function will find and replace all sequences that match the regex.

    In the example below, we are telling it to replace every word (each run of non-space characters) with “and”. The spaces between the words are not part of the matches, so they are left as they are.

    We are also giving it three parameters.

    • Target – The text that will be replaced accordingly
    • Regex – The compiled regular expression that will be used on the target
    • Replace With – The text that each match of the regex in the target will be replaced with

    C++
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main() {
      regex re("([^ ]+)"); // Matches every word
      cout<<"ORIGINAL: this is text\n";
      cout<<regex_replace("this is text",re,"and"); // prints "and and and"
      return 0;
    }

    You can also use formatters in the replacement text to reuse parts of what was matched. They are listed in the table below.

    | Formatter | Example | Explanation |
    | --- | --- | --- |
    | $number (where “number” is replaced by any positive number less than 100) | $2 | Replaced with the text matched by the numberth capturing group, counting from 1 (so $1 gets the first capturing group, not $0). Example: replacing regex matches of “(sub)(.+)” with “2nd CG: $2” using a target string of “submarine” will yield a result of “2nd CG: marine”. |
    | $& | $& | Replaced with a copy of the entire match, regardless of capturing groups. Example: replacing regex matches of “(sub)(.+)” with “String: $&” using the same target string will result in “String: submarine”. |
    | $` | $` | Replaced with whatever came before the match. Example: with a regex of “sub” and the target string “a submarine goes underwater”, “$`” is replaced with “a ”. |
    | $' | $' | Replaced with whatever came after the match. Example: with a regex of “sub” and the target string “a submarine goes underwater”, “$'” is replaced with “marine goes underwater”. |
    | $$ | $$ | I wouldn’t call it a formatter exactly; it’s more of an escape sequence. Type “$$” when you want a literal “$” in the replacement text, so that the regex library does not mistake it for the start of a formatter. |

    For example, the code below will put the letters “t” and “e” in parentheses.

    C++
    // regex_replace example
    #include <iostream>
    #include <string>
    #include <regex>
    
    using namespace std;
    
    int main ()
    {
        regex re("([te])"); // Matches either "t" or "e"
        cout<<"ORIGINAL: thetechmaker.com\n";
        cout<<regex_replace("thetechmaker.com",re,"($&)"); // Prints "(t)h(e)(t)(e)chmak(e)r.com"
        return 0;
    }