We wrote and are publishing GPU-accelerated software to crack Master Password Hashes extracted from a Bitwarden Server. As a reminder, “cracking a hash” simply means to brute-force it: to successively hash many many potential passwords and compare the output digests against the target.

Caution: This post is code-heavy. Feel free to skip ahead to the final post that zooms back out to an overview.

It would be trivial to write a quick Python script to calculate hashes and try to crack them. In fact, here’s one for debugging:

import base64
import hashlib


def inner_hash(rounds1, email, password):
    masterkey = hashlib.pbkdf2_hmac("sha256", password, email, rounds1)
    masterpasswordhash = hashlib.pbkdf2_hmac("sha256", masterkey, password, 1)
    return masterpasswordhash


def outer_hash(password, inner_rounds, outer_rounds, email, salt):
    subkey = base64.b64encode(inner_hash(inner_rounds, email, password))
    keyHash = hashlib.pbkdf2_hmac("sha256", subkey, salt, outer_rounds)
    return keyHash


with open(DICTIONARY_FILENAME) as f:
    for password_guess in f:
        if outer_hash(password_guess[:-1].encode(), inner_rounds, outer_rounds, email, outer_salt) == target_hash:
            print(password_guess[:-1])

The glaring problem with this method is that it could only run 3.4 hash iterations per second (3.4 H/s) on our test laptop. That means that it would take well over a month to run it against the smallest real-world password dictionary, “rockyou.txt”. We can do better. And we did, with hashcat.

Hashcat is the leading password cracking software. Each password hash algorithm is implemented as an independent module, consisting of a C module to parse the hash, an OpenCL kernel to compute the hash, and a Perl test-suite. There is an existing plugin development guide, but writing a plugin was still daunting for me. Before embarking on this project, I had no experience writing a hashcat plugin and only marginally more familiarity with writing OpenCL or any accelerated code. In this post I will attempt to simplify the process for others who also have little experience in this area. Fair warning, though: because I am new to this area myself, there may be misconceptions or errors.

Note: in this entire post, we’re only talking about slow, pure (unoptimized) kernels. That sounds bad, but it’s actually normal for a modern hash algorithm. First, hashcat plugins are split into fast and slow hash types. Fast hashes are plugins that compute only one or a few iterations of a cryptographic primitive. Examples include raw MD5 and SHA256 hashes. Fast kernels can benefit from computational tricks that wouldn’t be worth it for slower hashes, like moving candidate generation from the CPU to the GPU in the kernel implementation. In a slow hash, the bottleneck is always the actual hash computation, but in a fast hash, the bottleneck could actually be in data transfer to and from the GPU with naive implementations. Second, pure kernels are kernels that can compute the hash algorithm for any input. Some hash algorithms can be implemented for specific cases that are only valid for certain inputs—for example if you can take a shortcut for short passwords or salts—and those would be optimized kernels. Again, this post is going to ignore the existence of fast and optimized kernels, so be sure to read the plugin guide if you think you might want to implement one of them.

Plugin

The goal of the rest of this post is to help others write hashcat plugins by using this plugin as an example.

Let’s start by looking at a simplified data flow when cracking a hash. This is intended to be high level and only cover the most common cases:

  1. A user has some hashed or encrypted data.
  2. If needed (depending on the hash type), the user runs a Python script from hashcat/tools/ to convert the raw data to a standardized (per hash type) ASCII hash line.
  3. The user launches a hashcat job to crack the hash. In the C module specific to the hash type, the module_hash_decode function parses the hash line and stores the hash data (commonly the digest, salt, and round count) in hashcat memory buffers.
  4. hashcat launches the hash-type-specific OpenCL kernel on the target device (typically a GPU) and copies the hash buffers to it. The following steps happen on that device for each candidate password guess:
    1. hashcat calls the kernel _init function. This function sets up the state of necessary primitives before the main loop starts. All kernel functions have access to all hash buffers.
    2. hashcat calls the kernel _loop function multiple times. It performs multiple steps of computation on the hash buffers for each loop.
    3. If needed (depending on the hash type), hashcat calls the kernel _init2 function, which cleans up after the first loop and sets up for the second loop.
    4. If _init2 was called, hashcat calls the kernel _loop2 function in a way analogous to the first loop.
    5. hashcat calls the kernel _comp function, which compares the output from the buffers to the target hash. For hash digests, this is very simple, but for schemes involving encryption, it can be complicated.
  5. Each candidate guess kernel execution returns its result to the host hashcat process.
  6. If a guess was correct (the hash was cracked), hashcat calls the C module’s module_hash_encode to re-format the hash’s data buffers into the textual hash line. hashcat prints that along with the cracked password.

Here is a tree diagram of the hashcat source code file layout, but pared down to only files relevant to our plugin:

hashcat/
├── OpenCL/
│   ├── inc_common.cl
│   ├── inc_common.h
│   ├── inc_hash_sha256.cl
│   ├── inc_hash_sha256.h
│   ├── inc_types.h
│   ├── m23400-pure.cl
│   └── m23410-pure.cl
├── src/
│   └── modules/
│       ├── module_23400.c
│       └── module_23410.c
└── tools/
    ├── bitwarden2hashcat.py
    ├── bitwardenserver2hashcat.py
    ├── test_modules/
    │   ├── m23400.pm
    │   └── m23410.pm
    ├── test.pl
    └── test.sh

The rest of this post will be walking through the bitwardenserver2hashcat.py tool, the module_23410.c module, the m23410-pure.cl kernel, and the m23410.pm test script.

Tool

Going in order of the data flow, we start with the hash line extraction and formatting tool, bitwardenserver2hashcat.py. It extracts the raw data from the server and generates synthetic (as in: I made it up) hash lines that look like:

$bitwarden-server$1*10000*10000*bm9yZXBseUBoYXNoY2F0Lm5ldA==*0ANf7tTj2zwkl3/hf5Zohg==*5ynxBxAsJJ95vgEL/uDVcXseLbyIsHRGFdTXCVvbRlo=
^                 ^ ^     ^     ^                            ^                        ^
type              | |     |     email/inner salt (base64)    outer salt (base64)      digest (base64)
                  | |     outer round count
                  | inner round count
                  version

To run the tool, you need access to both the Bitwarden server database (MSSQL) or its contents and the Data Protection key (covered in the next post) associated with the database. The intended normal setup is that you have the files for a server and you launch a MSSQL server instance at which you can point bitwardenserver2hashcat.py. I am unaware of a native Linux tool to pull data from an offline MSSQL database.

Module

The entry point of a hash type is its C module. In our case it is module_23410.c. The number in the filename is the hash mode number, an arbitrary unique number. Since the Bitwarden client hash mode is 23400, we incremented it by 10 to get 23410. The source code is here. There are many lines of code, but about half are the standardized module template.

The sections of the module are:

  1. Imports
  2. Config options as constants
  3. Functions that return a relevant constant. These functions are all part of a module struct referenced at the end
  4. Structs specific to the hash type
  5. Functions that contain all the logic of the module. These are also part of the module struct
  6. The function module_init which constructs and returns the module struct

Now we’ll go through and explain the interesting bits here with source code extracts.

typedef struct bitwarden_server_double_salt
{
  u32 inner_salt_buf[64];
  int inner_salt_len;

  u32 outer_salt_buf[64];
  int outer_salt_len;

} bitwarden_server_double_salt_t;
typedef struct bitwarden_tmp
{
  u32 ipad[8];
  u32 opad[8];

  u32 dgst[8];
  u32 out[8];

} bitwarden_tmp_t;

These custom buffer types are filled in in the C module and are accessible from the kernel.

Because the Bitwarden server hash has more than a single salt, we create a custom salt buffer (called an “esalt”) instead of using the standard salt_t structure. It stores both salts and their lengths.

The “tmp” structure is for temporary kernel data. It is initialized in the kernel _init function, stores the output of each _loop, and is what is checked against the target in the kernel _comp function. In our case, ipad and opad are part of PBKDF-SHA256 calculations (explained in the kernel section), and dgst and out are loop outputs.

int module_hash_decode (MAYBE_UNUSED const hashconfig_t *hashconfig, MAYBE_UNUSED void *digest_buf, MAYBE_UNUSED salt_t *salt, MAYBE_UNUSED void *esalt_buf, MAYBE_UNUSED void *hook_salt_buf, MAYBE_UNUSED hashinfo_t *hash_info, const char *line_buf, MAYBE_UNUSED const int line_len)
{
  u32 *digest = (u32 *) digest_buf;

  bitwarden_server_double_salt_t *bitwarden_server_double_salt = (bitwarden_server_double_salt_t *) esalt_buf;

The point of the module_hash_decode function is to take the ASCII hash line from line_buf and decode it into digest_buf for the raw hash digest, esalt_buf for the esalt structure containing both salts, and salt for the loop iteration counts.

The actual parsing code uses the hashcat tokenizer, which is fairly straightforward to use, and so I won’t go through it.

  // make salt sorter happy

  md5_ctx_t md5_ctx;

  md5_init   (&md5_ctx);
  md5_update (&md5_ctx, bitwarden_server_double_salt->inner_salt_buf, bitwarden_server_double_salt->inner_salt_len);
  md5_update (&md5_ctx, bitwarden_server_double_salt->outer_salt_buf, bitwarden_server_double_salt->outer_salt_len);
  md5_final  (&md5_ctx);

  salt->salt_buf[0] = md5_ctx.h[0];
  salt->salt_buf[1] = md5_ctx.h[1];
  salt->salt_buf[2] = md5_ctx.h[2];
  salt->salt_buf[3] = md5_ctx.h[3];

  salt->salt_len = 16;

Even though the code uses an esalt structure to pass the actual salt, hashcat still uses the value of the salt_t salt->salt_buf to keep track of digest uniqueness. We use a synthetic MD5 hash of the actual salts to make a fixed-length value that can be used to determine if salts are the same or not.

  int tmp_len = base64_decode (base64_to_int, hash_pos, hash_len, tmp_buf);

  if (tmp_len != 32) return (PARSER_HASH_LENGTH);

  memcpy (digest, tmp_buf, 32);

  digest[0] = byte_swap_32 (digest[0]);
  digest[1] = byte_swap_32 (digest[1]);
  digest[2] = byte_swap_32 (digest[2]);
  digest[3] = byte_swap_32 (digest[3]);
  digest[4] = byte_swap_32 (digest[4]);
  digest[5] = byte_swap_32 (digest[5]);
  digest[6] = byte_swap_32 (digest[6]);
  digest[7] = byte_swap_32 (digest[7]);

  return (PARSER_OK);
}

The decoding function finishes by base64-decoding the actual hash digest and copying it to digest (which is the function argument digest_buf). Notably, because GPUs store data as 32-bit registers, the OpenCL kernel is written to accept buffers of 32-bit integers. In order to convert a byte string to this format, each 32-bit integer chunk has to be byte-swapped into the proper endianness.

int module_hash_encode (MAYBE_UNUSED const hashconfig_t *hashconfig, MAYBE_UNUSED const void *digest_buf, MAYBE_UNUSED const salt_t *salt, MAYBE_UNUSED const void *esalt_buf, MAYBE_UNUSED const void *hook_salt_buf, MAYBE_UNUSED const hashinfo_t *hash_info, char *line_buf, MAYBE_UNUSED const int line_size)

When a kernel cracks a hash by computing the hash of a password candidate to be equal to the target hash digest, it needs to communicate both the password and the hash that was cracked back to the user (hashcat jobs are expected to be run against multiple hashes at a time). Instead of storing the hash lines in memory and trying to match up the results of the kernel computation to them somehow, hashcat modules take the raw digest and salt values and recompute the hash line. If you only crack a single hash at a time and don’t want to store the cracked hash in the potfile, then this function could output empty hash lines and you wouldn’t miss anything. Practically though, you need this function.

The function is mainly stuffing data into a snprintf call, so I won’t go into the details.

Kernel

The kernel is written in OpenCL, which is a C-like language that I’d never written anything in before. Fortunately, but also unfortunately, hashcat provides so many macros that kernel OpenCL code is almost unrecognizably different from OpenCL examples I found online, at least to my inexperienced eye. I was able to copy and paste from other kernels and try various things until it somehow all worked. The plugin I based mine on (23400) didn’t use vector data types. I spent a while looking at kernels that did, and I still have no clue how those are supposed to work.

Cryptographic primitives

Here is a reminder of the full algorithm the kernel needs to implement:

Server Password Hash(Master Password, Email Address):
    return PBKDF2-256(
        password=base64(PBKDF2-SHA256(
            password=PBKDF2-SHA256(
                password=Master Password,
                key=Email Address,
                rounds=600000
            ),
            key=Master Password,
            rounds=1,
        )),
        key=Email Address,
        rounds=10000
    )

Before we look at the actual kernel code, let’s review the definitions of some cryptographic functions. Hashcat doesn’t have built-in functions to compute PBKDF2-SHA256, so we’ll have to break it down into its cryptographic primitives.

Password-based key derivation function 2, or PBKDF2, is a way to take a user-submitted password and derive an arbitrary-length byte sequence suitable for use as a cryptographic key. More background information, including KDF properties, is outside the scope of this post, but can be found elsewhere. The full PBKDF2 definition is also outside the scope of this post; we will only consider PBKDF2-SHA256 (which is short for PBKDF2-HMAC-SHA256) and take the special case of a single-block output instead of the full arbitrary-length version:

PBKDF2-SHA256(Password, Salt, Rounds):
    Buffer = HMAC-SHA256(Password, Salt + big-endian-int(1))
    repeat Rounds times:
        Buffer = Buffer XOR HMAC-SHA256(Password, Buffer)
    return Buffer

But that definition still depends on another cryptographic function: HMAC-SHA256. Hashcat has functions that can help compute HMACs, but using the functions requires some knowledge of what they’re actually doing. Hash-based Method Authentication Code, or HMAC, is a way to generate a cryptographic hash of some data that also proves knowledge of a shared secret. Again, the full rationale and cryptographic properties of HMACs are outside the scope of this post, but they are suited for KDFs. HMAC-SHA256 can be calculated with:

HMAC-SHA256(Key=Password, Message=Buffer):
    ipad = 0x36 repeated 64 times
    opad = 0x5c repeated 64 times
    return SHA256((Key XOR opad) + SHA256((Key XOR ipad) + Message))

Kernel algorithm

So now we have Bitwarden’s full hash algorithm, knowledge of how it breaks down into cryptographic primitives, and a rough understanding of hashcat kernel architecture. Now we have to divide up the calculation into kernel functions.

  1. _init: Computes the first step of the innermost PBKDF2, namely calculating the special case first HMAC iteration with the constant 1.
  2. _loop: Computes the other rounds (600000-1=599999 by default) of the innermost PBKDF2, which consists of a HMAC-SHA256 computation each round.
  3. _init2: Takes the output of the first loop, computes an additional PBKDF2 (which is just a single HMAC-SHA256), base64-encodes the output of that, and computes the first step of the outermost PBKDF2.
  4. _loop2: Computes the other rounds of the outermost PBKDF2.
  5. _comp: Copies output to comparison buffer.

Code

OpenCL code involves more manual operations than normal C code. The main reason is to avoid data-dependent branching. I don’t have the background to tell you exactly way (broadly, it’s for performance), but I can talk about the effects: If-statements, for-loops, and other control flow are minimized. As much as possible, everything is done as a series of assignments with mathematical operations. For example, instead of a memcpy, you write an assignment for each 64-bit int in the buffer.

Hashcat does define helper functions for cryptographic primitives, but there is still more grunt-work than I expected. Let’s talk about it.

KERNEL_FQ void m23410_init (KERN_ATTR_TMPS_ESALT (bitwarden_tmp_t, bitwarden_server_double_salt_t))
{

This is the kernel entry point. KERN_ATTR_TMPS_ESALT is a macro that passes a tmp structure and an esalt to a function; there are other macros if you need fewer or other arguments.

  const u64 gid = get_global_id (0);

  if (gid >= GID_CNT) return;

gid is a number unique to a particular password candidate. Each instance of buffers like tmps and esalts is passed to the GPU as part of one large array of data containing all the tmps and esalts for the entire cracking job, and each kernel execution can read and modify all of it. The point of a gid is to index those arrays to only reference data allotted to a particular candidate’s execution.

  sha256_hmac_ctx_t sha256_hmac_ctx;

  sha256_hmac_init_global_swap (&sha256_hmac_ctx, pws[gid].i, pws[gid].pw_len);

  tmps[gid].ipad[0] = sha256_hmac_ctx.ipad.h[0];
  tmps[gid].ipad[1] = sha256_hmac_ctx.ipad.h[1];
  tmps[gid].ipad[2] = sha256_hmac_ctx.ipad.h[2];
  tmps[gid].ipad[3] = sha256_hmac_ctx.ipad.h[3];
  tmps[gid].ipad[4] = sha256_hmac_ctx.ipad.h[4];
  tmps[gid].ipad[5] = sha256_hmac_ctx.ipad.h[5];
  tmps[gid].ipad[6] = sha256_hmac_ctx.ipad.h[6];
  tmps[gid].ipad[7] = sha256_hmac_ctx.ipad.h[7];

  tmps[gid].opad[0] = sha256_hmac_ctx.opad.h[0];
  tmps[gid].opad[1] = sha256_hmac_ctx.opad.h[1];
  tmps[gid].opad[2] = sha256_hmac_ctx.opad.h[2];
  tmps[gid].opad[3] = sha256_hmac_ctx.opad.h[3];
  tmps[gid].opad[4] = sha256_hmac_ctx.opad.h[4];
  tmps[gid].opad[5] = sha256_hmac_ctx.opad.h[5];
  tmps[gid].opad[6] = sha256_hmac_ctx.opad.h[6];
  tmps[gid].opad[7] = sha256_hmac_ctx.opad.h[7];

Recall that the _init function is supposed to compute the first special-case HMAC to set up the _loop function to compute the successive PBKDF2 rounds. But we’re not there yet. sha256_hmac_init_global_swap is a hashcat function that copies the password into a context struct for later usage, but it doesn’t actually calculate anything yet. It also sets up ipad and opad buffers with the right constants. We save those constant values into our tmps buf to save time later. The ctx struct’s ipad and opad values get overridden when computing the result, so instead of having to call an initialization function every loop, we can just copy the saved values back in then.

  sha256_hmac_update_global_swap (&sha256_hmac_ctx, esalt_bufs[SALT_POS_HOST].inner_salt_buf, esalt_bufs[SALT_POS_HOST].inner_salt_len);

  sha256_hmac_ctx_t sha256_hmac_ctx2 = sha256_hmac_ctx;

  u32 w0[4];
  u32 w1[4];
  u32 w2[4];
  u32 w3[4];

  w0[0] = 1;
  w0[1] = 0;
  w0[2] = 0;
  w0[3] = 0;
  w1[0] = 0;
  w1[1] = 0;
  w1[2] = 0;
  w1[3] = 0;
  w2[0] = 0;
  w2[1] = 0;
  w2[2] = 0;
  w2[3] = 0;
  w3[0] = 0;
  w3[1] = 0;
  w3[2] = 0;
  w3[3] = 0;

  sha256_hmac_update_64 (&sha256_hmac_ctx2, w0, w1, w2, w3, 4);

  sha256_hmac_final (&sha256_hmac_ctx2);

  tmps[gid].dgst[0] = sha256_hmac_ctx2.opad.h[0];
  tmps[gid].dgst[1] = sha256_hmac_ctx2.opad.h[1];
  tmps[gid].dgst[2] = sha256_hmac_ctx2.opad.h[2];
  tmps[gid].dgst[3] = sha256_hmac_ctx2.opad.h[3];
  tmps[gid].dgst[4] = sha256_hmac_ctx2.opad.h[4];
  tmps[gid].dgst[5] = sha256_hmac_ctx2.opad.h[5];
  tmps[gid].dgst[6] = sha256_hmac_ctx2.opad.h[6];
  tmps[gid].dgst[7] = sha256_hmac_ctx2.opad.h[7];

  tmps[gid].out[0] = tmps[gid].dgst[0];
  tmps[gid].out[1] = tmps[gid].dgst[1];
  tmps[gid].out[2] = tmps[gid].dgst[2];
  tmps[gid].out[3] = tmps[gid].dgst[3];
  tmps[gid].out[4] = tmps[gid].dgst[4];
  tmps[gid].out[5] = tmps[gid].dgst[5];
  tmps[gid].out[6] = tmps[gid].dgst[6];
  tmps[gid].out[7] = tmps[gid].dgst[7];
}

Now we actually compute our first HMAC-SHA256. First we load the salt in with sha256_hmac_update_global_swap, then we append 0x1000000 (1 as a big-endian encoded 32-bit int stored in the w registers) with sha256_hmac_update_64, and then we finish it with sha256_hmac_final.

As we alluded to previously, the opad field of the context struct gets overridden—in fact, the hash digest is stored there. So we copy it out into our tmps buffer, in both dgst and out fields for later.

To recap the values of the tmps struct at the end of _init:

  • tmps.ipad contains the HMAC ipad constant
  • tmps.opad contains the HMAC opad constant
  • tmps.dgst contains the digest of the first round HMAC of the innermost PBKDF2
  • tmps.out contains the same thing
KERNEL_FQ void m23410_loop (KERN_ATTR_TMPS_ESALT (bitwarden_tmp_t, bitwarden_server_double_salt_t))
{

Next up is the first loop, executed right after the _init function. _loop kernel functions compute multiple steps of the main computation. The reason that they take multiple invocations, and the loops don’t just iterate through the total round count, is so that the process doesn’t hog the GPU. On a desktop system, X or whatever graphics stack still needs to execute on the GPU as well, and a bunch of long-running loops could cause screen lag. So each _loop invocation gets allocated a certain number of iterations, given by the LOOP_CNT variable. This is another area where the hashcat plugin guide wasn’t comprehensive: the guide didn’t mention LOOP_CNT at all.

  u32x ipad[8];
  u32x opad[8];

  ipad[0] = packv (tmps, ipad, gid, 0);
  ipad[1] = packv (tmps, ipad, gid, 1);
  ipad[2] = packv (tmps, ipad, gid, 2);
  ipad[3] = packv (tmps, ipad, gid, 3);
  ipad[4] = packv (tmps, ipad, gid, 4);
  ipad[5] = packv (tmps, ipad, gid, 5);
  ipad[6] = packv (tmps, ipad, gid, 6);
  ipad[7] = packv (tmps, ipad, gid, 7);

  opad[0] = packv (tmps, opad, gid, 0);
  opad[1] = packv (tmps, opad, gid, 1);
  opad[2] = packv (tmps, opad, gid, 2);
  opad[3] = packv (tmps, opad, gid, 3);
  opad[4] = packv (tmps, opad, gid, 4);
  opad[5] = packv (tmps, opad, gid, 5);
  opad[6] = packv (tmps, opad, gid, 6);
  opad[7] = packv (tmps, opad, gid, 7);

  u32x dgst[8];
  u32x out[8];

  dgst[0] = packv (tmps, dgst, gid, 0);
  dgst[1] = packv (tmps, dgst, gid, 1);
  dgst[2] = packv (tmps, dgst, gid, 2);
  dgst[3] = packv (tmps, dgst, gid, 3);
  dgst[4] = packv (tmps, dgst, gid, 4);
  dgst[5] = packv (tmps, dgst, gid, 5);
  dgst[6] = packv (tmps, dgst, gid, 6);
  dgst[7] = packv (tmps, dgst, gid, 7);

  out[0] = packv (tmps, out, gid, 0);
  out[1] = packv (tmps, out, gid, 1);
  out[2] = packv (tmps, out, gid, 2);
  out[3] = packv (tmps, out, gid, 3);
  out[4] = packv (tmps, out, gid, 4);
  out[5] = packv (tmps, out, gid, 5);
  out[6] = packv (tmps, out, gid, 6);
  out[7] = packv (tmps, out, gid, 7);

I believe that the packv/unpackv calls have something to do with vector data types, but I don’t dare touch them. The gist of this section is just copying all the buffers from tmps out to local variables.

  for (u32 j = 0; j < LOOP_CNT; j++)
  {
    u32x w0[4];
    u32x w1[4];
    u32x w2[4];
    u32x w3[4];

    w0[0] = dgst[0];
    w0[1] = dgst[1];
    w0[2] = dgst[2];
    w0[3] = dgst[3];
    w1[0] = dgst[4];
    w1[1] = dgst[5];
    w1[2] = dgst[6];
    w1[3] = dgst[7];
    w2[0] = 0x80000000;
    w2[1] = 0;
    w2[2] = 0;
    w2[3] = 0;
    w3[0] = 0;
    w3[1] = 0;
    w3[2] = 0;
    w3[3] = (64 + 32) * 8;

    hmac_sha256_run_V (w0, w1, w2, w3, ipad, opad, dgst);

    out[0] ^= dgst[0];
    out[1] ^= dgst[1];
    out[2] ^= dgst[2];
    out[3] ^= dgst[3];
    out[4] ^= dgst[4];
    out[5] ^= dgst[5];
    out[6] ^= dgst[6];
    out[7] ^= dgst[7];
  }

Finally, the first PBKDF2 loop. This is the first, innermost, PBKDF2, which, as a reminder, is calculating PBKDF2-SHA256(password=Master Password, key=Email Address, rounds=600000). And remembering the definition of PBKDF2, each iteration is calculating Buffer XOR HMAC-SHA256(Password, Buffer), where out is being used as the buffer between loops.

Being completely honest, I don’t fully understand how the actual HMAC calculation works. hmac_sha256_run_V is a helper function that was used in the original Bitwarden hash plugin, and I just copied it and its usage. I decided that I could stop at this level of abstraction, especially since it uses vector datatypes internally.

  unpackv (tmps, dgst, gid, 0, dgst[0]);
  unpackv (tmps, dgst, gid, 1, dgst[1]);
  unpackv (tmps, dgst, gid, 2, dgst[2]);
  unpackv (tmps, dgst, gid, 3, dgst[3]);
  unpackv (tmps, dgst, gid, 4, dgst[4]);
  unpackv (tmps, dgst, gid, 5, dgst[5]);
  unpackv (tmps, dgst, gid, 6, dgst[6]);
  unpackv (tmps, dgst, gid, 7, dgst[7]);

  unpackv (tmps, out, gid, 0, out[0]);
  unpackv (tmps, out, gid, 1, out[1]);
  unpackv (tmps, out, gid, 2, out[2]);
  unpackv (tmps, out, gid, 3, out[3]);
  unpackv (tmps, out, gid, 4, out[4]);
  unpackv (tmps, out, gid, 5, out[5]);
  unpackv (tmps, out, gid, 6, out[6]);
  unpackv (tmps, out, gid, 7, out[7]);
}

The end of the _loop function just copies the local values of dgst and out back to tmps, but with another function that probably has something to do with vectors. tmps->ipad and tmps->opad are still the values that _init set.

KERNEL_FQ void m23410_init2 (KERN_ATTR_TMPS_ESALT (bitwarden_tmp_t, bitwarden_server_double_salt_t))
{

The _init2 function is the most complicated—code-wise, not computation-wise—because it has to do multiple different things:

  1. Compute an additional PBKDF2 (which is just a single HMAC-SHA256). At this point we have completed calculation of a Master Password Hash (which I’ve called the inner hash in some places), which is what the existing Bitwarden plugin (mode 23400) does.
  2. Base64-encode the output of that.
  3. Compute the first step of the outermost PBKDF2 to set up _loop2.
  sha256_hmac_ctx_t sha256_hmac_ctx_inner2;

  sha256_hmac_init (&sha256_hmac_ctx_inner2, out, 32);

  u32x ipad[8];
  u32x opad[8];
  ipad[0] = sha256_hmac_ctx_inner2.ipad.h[0];
  ipad[1] = sha256_hmac_ctx_inner2.ipad.h[1];
  ipad[2] = sha256_hmac_ctx_inner2.ipad.h[2];
  ipad[3] = sha256_hmac_ctx_inner2.ipad.h[3];
  ipad[4] = sha256_hmac_ctx_inner2.ipad.h[4];
  ipad[5] = sha256_hmac_ctx_inner2.ipad.h[5];
  ipad[6] = sha256_hmac_ctx_inner2.ipad.h[6];
  ipad[7] = sha256_hmac_ctx_inner2.ipad.h[7];

  opad[0] = sha256_hmac_ctx_inner2.opad.h[0];
  opad[1] = sha256_hmac_ctx_inner2.opad.h[1];
  opad[2] = sha256_hmac_ctx_inner2.opad.h[2];
  opad[3] = sha256_hmac_ctx_inner2.opad.h[3];
  opad[4] = sha256_hmac_ctx_inner2.opad.h[4];
  opad[5] = sha256_hmac_ctx_inner2.opad.h[5];
  opad[6] = sha256_hmac_ctx_inner2.opad.h[6];
  opad[7] = sha256_hmac_ctx_inner2.opad.h[7];

  sha256_hmac_update_global_swap (&sha256_hmac_ctx_inner2, pws[gid].i, pws[gid].pw_len);

  sha256_hmac_ctx_t sha256_hmac_ctx2_inner2 = sha256_hmac_ctx_inner2;

  u32 w0[4];
  u32 w1[4];
  u32 w2[4];
  u32 w3[4];

  w0[0] = 1;
  w0[1] = 0;
  w0[2] = 0;
  w0[3] = 0;
  w1[0] = 0;
  w1[1] = 0;
  w1[2] = 0;
  w1[3] = 0;
  w2[0] = 0;
  w2[1] = 0;
  w2[2] = 0;
  w2[3] = 0;
  w3[0] = 0;
  w3[1] = 0;
  w3[2] = 0;
  w3[3] = 0;

  sha256_hmac_update_64 (&sha256_hmac_ctx2_inner2, w0, w1, w2, w3, 4);

  sha256_hmac_final (&sha256_hmac_ctx2_inner2);

  dgst[0] = sha256_hmac_ctx2_inner2.opad.h[0];
  dgst[1] = sha256_hmac_ctx2_inner2.opad.h[1];
  dgst[2] = sha256_hmac_ctx2_inner2.opad.h[2];
  dgst[3] = sha256_hmac_ctx2_inner2.opad.h[3];
  dgst[4] = sha256_hmac_ctx2_inner2.opad.h[4];
  dgst[5] = sha256_hmac_ctx2_inner2.opad.h[5];
  dgst[6] = sha256_hmac_ctx2_inner2.opad.h[6];
  dgst[7] = sha256_hmac_ctx2_inner2.opad.h[7];

  out[0] = sha256_hmac_ctx2_inner2.opad.h[0];
  out[1] = sha256_hmac_ctx2_inner2.opad.h[1];
  out[2] = sha256_hmac_ctx2_inner2.opad.h[2];
  out[3] = sha256_hmac_ctx2_inner2.opad.h[3];
  out[4] = sha256_hmac_ctx2_inner2.opad.h[4];
  out[5] = sha256_hmac_ctx2_inner2.opad.h[5];
  out[6] = sha256_hmac_ctx2_inner2.opad.h[6];
  out[7] = sha256_hmac_ctx2_inner2.opad.h[7];

This code block is an entire PBKDF2 implementation! Because it’s only 1 round, it’s really just an HMAC with a constant (1) appended.

  base64_encode_sha256 (b, out);

I pulled this function from OpenCL/m32500-pure.cl. It encodes exactly 32 bytes, the length of a SHA256 digest, into base64. Because it’s a fixed-length input, it can use bit-swizzling shortcuts instead of conditionally iterating through bytes, which makes it much faster on GPU hardware.

  sha256_hmac_init (&sha256_hmac_ctx_outer, b, 44);
/* snip */
  sha256_hmac_update_global_swap (&sha256_hmac_ctx_outer, esalt_bufs[SALT_POS_HOST].outer_salt_buf, esalt_bufs[SALT_POS_HOST].outer_salt_len);
/* snip */
}

The rest of _init2 is the same as _init, except that it operates on the base64 output of the inner hash instead of the password, and it uses the outer salt buffer instead of the inner salt.

KERNEL_FQ void m23410_loop2 (KERN_ATTR_TMPS_ESALT (bitwarden_tmp_t, bitwarden_server_double_salt_t))
{

_loop2 computes the outermost PBKDF2. It is identical to _loop.

Note: to support server hashes created by newer ASP.NET Core versions, as described in our first post, this loop would need to be changed from HMAC-SHA256 to HMAC-SHA512.

KERNEL_FQ void m23410_comp (KERN_ATTR_TMPS_ESALT (bitwarden_tmp_t, bitwarden_server_double_salt_t))
{
  const u64 gid = get_global_id (0);

  if (gid >= GID_CNT) return;

  const u32 r0 = tmps[gid].out[0];
  const u32 r1 = tmps[gid].out[1];
  const u32 r2 = tmps[gid].out[2];
  const u32 r3 = tmps[gid].out[3];

  #define il_pos 0

  #ifdef KERNEL_STATIC
  #include COMPARE_M
  #endif
}

This function uses the standard hashcat macro to compare the resultant digest to the target. Finally, we’re done!

A brief note on debugging: I found that the best practice was to implement the hash algorithm in Python (or your preferred scripting language), using the same cryptographic primitives you’re using in kernel code instead of high-level library functions, and print hex dumps of the intermediate values every step of the way. Then litter your kernel code with the same print statements, e.g.:

printf("start of _loop2 hex: %08x %08x %08x %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4], out[5], out[6], out[7]);

A high-level scripting implementation like the one at the beginning of this file wasn’t granular enough to help me fix my kernel code as I was writing it.

Tests

The final part of a hashcat plugin is the test module, written in Perl. It has a way to generate valid hash lines (module_generate_hash) and a way to verify that a cracked line (with both the hash and the recovered plaintext) is internally consistent; i.e. that the hash line is equivalent to hashing the cracked password again (module_verify_hash). One part that tripped me up for quite a while is that Crypt::PBKDF2->new(...)->PBKDF2(a, b); actually calculates a PBKDF2 with a as the key and b as the password, which is the opposite of what I expected.

Conclusion

If you remember the start of the post, our naive Python cracking implementation could calculate about 3 H/s on a laptop. Our hashcat implementation, on the other hand, could compute about 300 H/s on the same laptop using its integrated GPU. That’s a 100x speedup on a system that doesn’t even have a dedicated GPU. Our benchmarks on an actual GPU were even better, with another 80x speedup on top of that (details to come).

The code is available at https://github.com/ivision-research/hashcat.

Boy, that was a lot of obtuse code. In our next post, we’ll temporarily leave Bitwarden behind and examine ASP.NET Core’s Data Protection, which is used in our hash extractor script. Finally, stick around for the final post that might render most of this post useless anyway.