Reverse Engineering Datatypes

Amarok now reads the tags written by Foobar2000 and by mp3gain (written when you call mp3gain without the -a or -r options) from MP3 files.  However, the final part of MP3 support is trickier: the RVA2 tag in the ID3v2.4 spec.

Naturally, the specification leaves out an all-important detail: the format of the peak volume field.  It tells you which bits represent the peak volume, but not how to interpret them.

Luckily, mutagen, a Python audio metadata library, supports this tag, so its implementation can serve as a reference.  However, the authors try to be clever with their implementation, so reverse-engineering it to arrive at the format of the original data requires some work.

The documentation on the Python class implementing the RVA2 frame support says that the peak volume is a float between 0 and 1.  So 0 is silent, 1 is full volume (digital full scale).  This doesn’t seem right to me, because the replay gain specification points out that it is possible to have a peak volume over 1 in some circumstances in a compressed audio file.  But we’ll leave that aside for the moment.

Let’s start with the code.  data contains the raw bytes, the first of which is a number specifying how many bits (not bytes) of the remaining data are occupied by the number representing the peak volume.

        peak = 0
        bits = ord(data[0])
        bytes = min(4, (bits + 7) >> 3)
        # not enough frame data
        if bytes + 1 > len(data): raise ID3JunkFrameError
        shift = ((8 - (bits & 7)) & 7) + (4 - bytes) * 8
        for i in range(1, bytes+1):
            peak *= 256
            peak += ord(data[i])
        peak *= 2**shift
        return (float(peak) / (2**31-1))

Let’s start with bytes.  This is simply bits (the number of bits representing the peak volume) rounded up to the nearest multiple of 8, then divided by 8, and capped at 4.  So if bits is 8n + k (with 0 ≤ k < 8), bytes is n in the case that k = 0 and (n+1) in the case that k > 0, as long as that does not exceed 4.
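
As a quick sanity check of that rounding, here is a small Python 3 sketch.  The helper name peak_byte_count is made up for illustration; it only wraps the expression from the excerpt above.

```python
def peak_byte_count(bits):
    """Bytes read for a peak field of `bits` bits (hypothetical helper)."""
    # Same expression as in the excerpt: round up to whole bytes, cap at 4.
    return min(4, (bits + 7) >> 3)

for bits in (1, 8, 9, 16, 17, 32, 40):
    print(bits, peak_byte_count(bits))
```

Note that anything above 32 bits still reads only 4 bytes, because of the min().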

The next variable is the shift.  This is the first bit of clever magic, and it takes some time spent staring at it (preferably with a pad and paper to hand) to arrive at the following conclusion:

  • if k = 0, shift is 8(4 – n)
  • if k > 0, shift is (8 – k) + 8(4 – (n + 1))
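
A brute-force check is cheaper than staring, so here is a Python 3 sketch that evaluates the expression for every bit count from 1 to 32.  The name shift_for is hypothetical.  Note that the k > 0 case picks up an extra (8 – k) from the first half of the expression, and that with it both cases appear to collapse to the same thing: 32 – bits.

```python
def shift_for(bits):
    """Shift computed by the excerpt for a given bit count (hypothetical helper)."""
    nbytes = min(4, (bits + 7) >> 3)
    return ((8 - (bits & 7)) & 7) + (4 - nbytes) * 8

for bits in range(1, 33):
    n, k = divmod(bits, 8)  # bits = 8n + k
    if k == 0:
        expected = 8 * (4 - n)
    else:
        expected = (8 - k) + 8 * (4 - (n + 1))
    # Both cases should also equal 32 - bits while bits <= 32.
    assert shift_for(bits) == expected == 32 - bits

print("shift == 32 - bits for all bits in 1..32")
```

This only covers bits up to 32; beyond that the cap on bytes takes over and the simple formula no longer applies.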

Then we read the bytes into peak, most significant byte first.  Remember that if k > 0, (8 – k) of the bits we read are padding rather than part of the value.  Now we shift it left, since multiplying by 2^shift is a left shift (shift is never negative, because bytes is capped at 4), so that the value is scaled up to fill 32 bits (I assume here that Python is treating peak as an integer).  Then we turn peak into a float and divide it by 2^31 – 1.  This constant is a magic number, being the largest value that can be stored in a signed 32-bit integer.
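
To experiment with this, here is my own Python 3 rendering of the excerpt.  It is a sketch, not mutagen’s code: the name decode_peak is made up, ValueError stands in for ID3JunkFrameError, and int.from_bytes replaces the byte-by-byte loop (which should be equivalent, since the loop accumulates most significant byte first).

```python
def decode_peak(data: bytes) -> float:
    """Decode a peak field: one bit-count byte followed by the peak bytes."""
    bits = data[0]
    nbytes = min(4, (bits + 7) >> 3)
    if nbytes + 1 > len(data):
        # The original raises ID3JunkFrameError here.
        raise ValueError("not enough frame data")
    shift = ((8 - (bits & 7)) & 7) + (4 - nbytes) * 8
    # Big-endian read, same result as the multiply-by-256 loop.
    peak = int.from_bytes(data[1:1 + nbytes], "big")
    return (peak << shift) / (2**31 - 1)

# A 16-bit field holding 0x4000 comes out very slightly above 0.5.
print(decode_peak(bytes([16, 0x40, 0x00])))
```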

Something that might shed light on this is that, when it writes the peak volume out, it simply writes the value multiplied by 2^15 as a 16-bit unsigned integer.  This would make interpreting the value as simple as placing a “decimal” point after the first binary digit (so we get 1 digit before the point and 15 after).  Note that this does indeed allow a peak volume greater than 1 (but less than 2).
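
To make the fixed-point reading concrete, here is a Python 3 sketch of the write side as I have described it.  This is not mutagen’s actual code: encode_peak is a hypothetical name, and big-endian byte order and truncation (rather than rounding) are my assumptions.  The last line feeds the stored value back through the arithmetic of the read path.

```python
import struct

def encode_peak(peak: float) -> bytes:
    """Hypothetical writer: bit count (16), then peak * 2^15 as unsigned 16-bit."""
    raw = int(peak * 2**15)  # truncation is an assumption
    # struct.pack raises struct.error if raw does not fit (peak >= 2).
    return bytes([16]) + struct.pack(">H", raw)

field = encode_peak(1.0)
raw = struct.unpack(">H", field[1:])[0]
print(hex(raw))                    # 0x8000
print(raw / 2**15)                 # 1.0 when read as 1.15 fixed point
print((raw << 16) / (2**31 - 1))   # read path: slightly above 1.0
```

So a peak of exactly 1 written this way comes back from the reader as a hair over 1, which is what prompts the first question below.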

I’m left with two questions:

  1. Why do we divide the number by MAX_INT_32, rather than simply 2^31? (I just made up that constant name now, don’t complain that it’s wrong.)
  2. Why does mutagen scale the number to 32 bits (and cap it there) when reading, and then write a 16-bit number when it writes out RVA2 tags?

Answers on a postcard (or just in the comments).



One Response to “Reverse Engineering Datatypes”

  1. TheBlackCat Says:

    I did a quick google search and found this:

    From the last post:

    Heh, I forgot to add that there’s only a few known programs implementing RVA2 we know about in this thread so far, and that’s the normalize program mentioned above and a patch to XMMS written by the same guy.

    That code is treating the peak as the maximum decoded sample value pretty much all the time. It also always makes it a 32 bit value. Comply with that code and you comply with all known implementations.

    So, assuming that source is correct, it looks like it is just a matter of decoding the file, finding the largest value, and then storing it. Unless the file is encoded at something other than 16 bits, in which case you will have to convert it to a 16-bit value (rounding up I assume, since it seems the point of that field is to prevent clipping and so it is better to be a little higher than a little lower).
