Amarok now reads the tags written by Foobar2000 and by mp3gain (written when you call mp3gain without the -a or -r options) from MP3 files. However, the final part of MP3 support is trickier: the RVA2 tag in the ID3v2.4 spec.

Naturally, the specification leaves out an all-important detail: the format of the peak volume field. It tells you which bits represent the peak volume, but not how to interpret them.

Luckily, mutagen, a Python audio metadata library, supports this tag, so its implementation can serve as a reference. However, its authors try to be clever with their implementation, so reverse-engineering it to arrive at the format of the original data requires some work.

The documentation on the Python class implementing the RVA2 frame support says that the peak volume is a float between 0 and 1. So 0 is silent, 1 is full volume (digital full scale). This doesn’t seem right to me, because the replay gain specification points out that it is possible to have a peak volume over 1 in some circumstances in a compressed audio file. But we’ll leave that aside for the moment.

Let’s start with the code. **data** contains the raw bytes, the first of which is a number specifying how many *bits* (not bytes) of the remaining data are occupied by the number representing the peak volume.

    peak = 0
    bits = ord(data[0])
    bytes = min(4, (bits + 7) >> 3)
    # not enough frame data
    if bytes + 1 > len(data): raise ID3JunkFrameError
    shift = ((8 - (bits & 7)) & 7) + (4 - bytes) * 8
    for i in range(1, bytes+1):
        peak *= 256
        peak += ord(data[i])
    peak *= 2**shift
    return (float(peak) / (2**31-1))

Let’s start with **bytes**. This is simply **bits** (the number of bits representing the peak volume) rounded up to the nearest multiple of 8, then divided by 8, and capped at 4. So if **bits** is *8n + k* (with *0 ≤ k < 8*) and no more than 32, **bytes** is *n* in the case that *k = 0* and *(n+1)* in the case that *k > 0*.
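To make the rounding concrete, here is a small sketch of the byte count on its own. The helper name `peak_byte_count` is hypothetical and not part of mutagen; the expression inside it is copied from the code above.

```python
def peak_byte_count(bits):
    """Bytes read for a peak field of `bits` bits (hypothetical helper)."""
    # (bits + 7) >> 3 is ceil(bits / 8); min() caps the result at 4 bytes.
    return min(4, (bits + 7) >> 3)


# Print the byte count for a few field widths, including some above 32 bits.
for bits in (0, 1, 8, 9, 16, 17, 32, 33, 255):
    print(bits, peak_byte_count(bits))
```

Note that anything wider than 32 bits is still read as 4 bytes.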

The next variable is the **shift**. This is the first bit of clever magic, and it takes some time spent staring at it (preferably with a pad and paper to hand) to arrive at the following conclusion:

- if *k = 0*, **shift** is *8(4 – n)*
- if *k > 0*, **shift** is *8(4 – (n + 1)) + (8 – k)*
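One way to check the case analysis is to evaluate the expression directly. The sketch below uses a hypothetical helper, `peak_shift`, that copies the two relevant lines from the code above. Assuming **bits** is between 1 and 32, both cases come out as *32 – bits*, so the *k > 0* case includes the extra *(8 – k)* contributed by the first term of the expression.

```python
def peak_shift(bits):
    """Shift applied to the raw peak value (hypothetical helper)."""
    nbytes = min(4, (bits + 7) >> 3)
    return ((8 - (bits & 7)) & 7) + (4 - nbytes) * 8


# For 1 <= bits <= 32 the two terms add up to 32 - bits.
for bits in range(1, 33):
    assert peak_shift(bits) == 32 - bits

print(peak_shift(8), peak_shift(12), peak_shift(16), peak_shift(32))
# prints: 24 20 16 0
```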

Then we read the bytes into **peak** as a big-endian integer. Remember that if *k > 0*, the bytes we read hold *(8 – k)* more bits than the peak value itself occupies. Now we multiply by *2* to the power of **shift**, which is a left shift (**shift** is never negative, because of our constraint on **bytes** to be at most *4*), so that the value is scaled up to a 32-bit range (I assume here that Python is treating **peak** as an integer). Then we turn **peak** into a float and divide it by *2^31 – 1*. This constant is a magic number, being the largest value that can be stored in a signed 32-bit integer.
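For experimenting with the steps described so far, here is a sketch of the same logic ported to Python 3. The function name `decode_rva2_peak` is hypothetical, and `ValueError` stands in for mutagen's `ID3JunkFrameError` so that the snippet runs on its own; it is meant to mirror the arithmetic, not to be mutagen's actual code.

```python
def decode_rva2_peak(data):
    """Decode a peak field from `data` (Python 3 sketch, hypothetical name).

    `data` is a bytes object starting at the "bits representing peak" byte.
    """
    bits = data[0]                      # indexing bytes gives an int in Python 3
    nbytes = min(4, (bits + 7) >> 3)    # whole bytes to read, capped at 4
    if nbytes + 1 > len(data):
        # Stand-in for ID3JunkFrameError in the original.
        raise ValueError("not enough frame data")
    shift = ((8 - (bits & 7)) & 7) + (4 - nbytes) * 8
    peak = int.from_bytes(data[1:1 + nbytes], "big")
    peak <<= shift                      # same as peak *= 2**shift
    return peak / (2**31 - 1)


# A 16-bit field holding 0x4000 decodes to a value very slightly above 0.5.
print(decode_rva2_peak(b"\x10\x40\x00"))
```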

Something that might shed light on this is that, when mutagen writes the peak volume out, it simply writes the value multiplied by *2^15* as a 16-bit unsigned integer. This would make interpreting the value as simple as placing a “decimal” point after the first binary digit (so we get 1 digit before the point and 15 after). Note that this does indeed allow a peak volume greater than *1* (but less than *2*).
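To illustrate that reading of the 16-bit value, here is a sketch of the fixed-point interpretation. The function names are hypothetical, and both the round-to-nearest in the encoder and the big-endian byte order are assumptions; only the multiplication by *2^15* and the 16-bit unsigned width come from the description above.

```python
import struct


def encode_peak_16(peak):
    """Pack a peak in [0, 2) as a 16-bit unsigned integer (hypothetical)."""
    raw = int(round(peak * 2**15))      # rounding mode is an assumption
    return struct.pack(">H", raw)       # big-endian; struct.error if raw > 0xFFFF


def decode_peak_16(raw_bytes):
    """Read 16 bits as 1 integer bit followed by 15 fractional bits."""
    (raw,) = struct.unpack(">H", raw_bytes)
    return raw / 2**15


print(encode_peak_16(1.0).hex())        # 8000
print(decode_peak_16(b"\xff\xff"))      # largest value, just below 2
```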

I’m left with two questions:

- Why do we divide the number by MAX_INT_32, rather than simply *2^31*? (I just made up that constant name now, don’t complain that it’s wrong.)
- Why does mutagen put a 32-bit cap on the number, and then write a 16-bit number when it writes out RVA2 tags?
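The first question can be made concrete with a 16-bit field holding `0x8000`, which the fixed-point reading described earlier treats as exactly 1. This is only a sketch of the arithmetic with the two candidate divisors, not of mutagen's code path:

```python
raw = 0x8000            # 16-bit field, i.e. 1.0 in 1.15 fixed point
scaled = raw << 16      # scaled up to 32 bits, as the reader does for bits = 16

print(scaled / (2**31 - 1))   # slightly more than 1
print(scaled / 2**31)         # exactly 1.0
```

So with the *2^31 – 1* divisor, a value written as exactly full scale reads back a hair above it.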

Answers on a postcard (or just in the comments).

Tags: Amarok, ID3v2, mp3, python, replay gain

17th January 2009 at 4:07 pm

I did a quick google search and found this:

http://www.hydrogenaudio.org/forums/lofiversion/index.php/t39550.html

From the last post:

Heh, I forgot to add that there’s only a few known programs implementing RVA2 we know about in this thread so far, and that’s the normalize program mentioned above and a patch to XMMS written by the same guy.

That code is treating the peak as the maximum decoded sample value pretty much all the time. It also always makes it a 32 bit value. Comply with that code and you comply with all known implementations.

So, assuming that source is correct, it looks like it is just a matter of decoding the file, finding the largest value, and then storing it. Unless the file is encoded at something other than 16 bits, in which case you will have to convert it to a 16-bit value (rounding up I assume, since it seems the point of that field is to prevent clipping and so it is better to be a little higher than a little lower).