[Top] [All Lists]

XFS Filename Hash and metadump

To: xfs@xxxxxxxxxxx
Subject: XFS Filename Hash and metadump
From: Alex Elder <aelder@xxxxxxx>
Date: Tue, 05 Oct 2010 09:55:47 -0500
Reply-to: aelder@xxxxxxx
In xfs_db, the metadump creation procedure optionally allows you to
replace all file names in the dump with obfuscated versions of those
names.  This has to be done in such a way that the hash value
associated with the obfuscated name is the same as the hash assigned
to the original file.

The algorithm that creates these obfuscated file names creates them
by generating a random set of characters (using the full 8-bit
range, only disallowing '/' and '\0') up to the last five.  Along
the way, the hash for the name is computed, and it--along with the
hash for the original file--is used to determine what the last five
characters need to be in order to match.

There are some cases in this scheme, however, where files can't get
an equivalent name using generate_obfuscated_name() as currently
written (one of which Arkadiusz Miśkiewicz found).

Below I'm going to describe a bit more specifically how the hash
works, and why some cases run into trouble.  I also describe a
way to adjust the algorithm to avoid this pitfall.

XFS computes a filename hash by accumulating a 32-bit value by
shifting (rotating) and XOR'ing each character in the name.

- A 32-bit accumulator is initially zero-filled.
Then with each character in the file name:
- Rotate Left the contents of the 32-bit accumulator by 7 bits
- XOR the next byte of the file name into the least-significant byte
  of the accumulator

The algorithm for generating another filename having the same length
and same hash starts by generating a string of random bytes (other
than '\0' and '/') to replace each byte in the original name, for
all but five characters in the name.  The incremental hash for the
new file name is computed as each byte is replaced.  The last five
characters are then selected such that they contribute to the hash
exactly what is needed to produce a hash that matches the original
name's hash.  Note that no attempt is made to come up with an
alternative name for names shorter than five characters.

Arkadiusz Miśkiewicz found that a file named "R\323\257NE" caused an
attempt to create a metadump to hang.  That 5-character file name,
interpreted in hexadecimal, is: 0x52 d3 af 4e 45, and when run
through the hashing algorithm the result is 0x3a4be740.

Since there are only five characters in this name, no random
characters are generated (so the hash value before selecting the
last five characters is zero).  To determine the last five
characters, the accumulated hash (zero in this case) is XOR'd with
the complete hash for the original filename.  The resulting
bits--after appropriate shifting--define exactly the bytes to use so
the new file name has the same hash as the original.  If one of
those bytes happens to be illegal ('\0' or '/'), the process
restarts in hopes a different random string of bytes will produce
one that does work.

The problem in this case is that there are no random characters in
the name.  So if any character dictated by the original hash is
illegal, it will always be illegal and the process will repeat

That is the case here.  In the "R\323\257NE" string, one of the
characters that comes out as a necessary component of the obfuscated
name is '/', which is not allowed.  Hence the algorithm loops
without terminating.

So how do we fix this?  We can tweak the algorithm a bit, based on a
few observations, and the result will still generate filenames
having the desired properties but will also avoid the infinite loop
problem described above.

Let's look at some details.  Here is how the first (of the last
five) byte is computed:
        newname[namelen - 4] = (newhash >> 21) & 0x7f;
The hash for the problem name is 0x3a4b3740.  The result after the
shift is 0x1d2, and after the mask 0x52.  That's an OK character.

The next byte is computes like this:
        newname[namelen - 3] = (newhash >> 14) & 0x7f;
For the problem name the shift gives 0xe92c, and after the mask it's
0x2f.  That is NOT OK, it's the '/' character.  And since there was
never any randomization of characters involved, there's no chance
that retrying the algorithm will come up with any other result.

It turns out that this is just one of a whole class of file names
that will run into this problem.  I worked out an example of each of
        "\120\001\257\116\105", /* Byte 0 (first) must be zero */
        "\122\001\257\116\105", /* Byte 1 must be zero */
        "\122\323\200\116\105", /* Byte 2 must be zero */
        "\122\323\254\001\305", /* Byte 3 must be zero */
        "\122\323\247\116\005", /* Byte 4 (last) must be zero */

        "\172\057\147\116\105", /* Byte 1 must be '/' */
        "\122\320\057\116\105", /* Byte 2 must be '/' */
        "\122\323\247\057\105", /* Byte 3 must be '/' */
        "\002\323\247\116\057", /* Byte 4 (last) must be '/' */
(I didn't come up with one in which the first byte ends up having to
be a '/'.)  All of the above are 5-character names.  It may be that
with longer names the randomization of characters give the
opportunity to avoid this.  And if so, a very simple solution might
be to just extend the length of file names that are not obfuscated
from 4 to (say) 8.  (But that probably means a LOT more files' names
are put in the metadump without obfuscation.)

First observation on the algorithm.  Why is the high bit of each
computed character masked off?  The reason is that the high bit of
each byte is XOR'd with the low bit of its neighbor, and masking the
high bit off allows the low bit of the previous character to be
selected without concern about the effect of the XOR.

Second observation.  Masking off the low bit could be used instead
to achieve the same effect just described.

Third observation.  The two characters that are not allowed in a
file name are '\0' and '/'.  Their hex values are 0x00 and 0x2f.
The current algorithm clears only the high bit to produce a new file
name character, and that allows either of these two be the result of
the mask.  Clearing the low-order bit instead will never produce the
'/' character.

So if we can change the mask, we can eliminate half of the possible
characters that will cause us problems.  We are still left with the
case that an important hunk of the original file's hash is 0x01,
which produces 0x00 when the bottom bit is masked off.

To address that we can flip the low bit and flip the corresponding
high bit in the neighbor byte that it's XOR'd with, ensuring the
result preserves the desired hash value.

We can do this for any original hash value, and we can do this
without any need to ever re-try generating a random name.  I haven't
determined yet whether there still might be a case that requires an
invalid character to resolve.  If we reach that point we can simply
warn that the file is not getting obfuscated and leave the original
file name as-is.

I'll follow up later with a proposed change to implement what I have
described here.

PS  Two more observations:
    - There is really no need for the characters to be truly random.
      Making the generated name unique and different from the
      original is sufficient.  So (with the exception of the last
      five bytes) we can select the characters however we like.
      They could be a sequential series of names, for example,
      rather than computing a random value for each.
    - Similarly we could select (most of) the characters from a
      smaller subset than is currently used.  I.e., rather than
      using any of 254 possible values, we could restrict it to just
      printable characters (or even a subset of those).  The last
      five characters would be computed as above, and they would
      have to be unrestricted in order to produce the right hash.

<Prev in Thread] Current Thread [Next in Thread>