|
|
 |
 |
 |
 |
MD5 algorithm
I'm trying to implement the MD5 algorithm. (Note: I'm not looking for an existing MD5 library; this is an exercise). It compiles and returns something that looks like a hash superficially, but that's clearly the wrong result. Even if there were a Unicode/Ascii problem, the empty string should return the correct result (since it is always padded to one 1-bit and 511 0-bits), but it doesn't. Also, several strings (say, "" and "The") return the same incorrect hash - but by dumping the 128-bit hash in each iteration, I've found only very little in common between the intermediate steps, only the final one is completely identical. This is the function, as implemented from the pseudocode on Wikipedia. http://en.wikipedia.org/wiki/MD5#Pseudocode -------------------------------------------------------- Constants: int[4] s = { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476 }; int[64] shift= { 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21 }; int[64] k; for (int i=0;i<64;i++) { k[i]=(int)(Math.abs(Math.sin(i+1))*1073741824*4); } private static int[] processFin(int[] message) { int[] result=s.clone(); // initialize // message is already padded, just split: int blocks=message.length/16; for (int i=0;i<blocks;i++) { // for each 16-int block int[] state=result.clone(); // copy state for (int j=0;j<64;j++) { int f,g,temp; if (j<16) { f= state[1] & state[2] | ~state[1] & state[3]; g=j; } else if (j<32) { f=state[3] & state[1] | ~state[3] & state[2]; g=(5*j+1)%16; } else if (j<48) { f=state[1]^state[2]^state[3]; g=(3*j+5)%16; } else { f=state[2]^ (state[1] | ~state[3]); g=(7*j)%16; } temp=state[3]; state[3]=state[2]; state[2]=state[1]; state[1]=(state[0]+f+k[j]+message[i*16+g]) << shift[j]; state[0]=temp; } for(int j=0;j<4;j++) result[i]+=state[i]; } return result; }
--------------------------------------------- The hex string returned for both "" and "The" is 3c00a101efcdab8998badcfe10325476. I just can't see what's wrong with the code - all parts of the pseudocode description seem to be implemented exactly as they should be... unless the signed ints don't wrap modulo 2^32 as unsigned ones should. But they do for all examples I could think of, and anyway, that shouldn't cause those hashes to be exactly the same... --cb
Her i have found some intressting webpages about it. Maybe it helps. http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
sebbaer schrieb: If I had found something that helped me in the first 10 Google results, I wouldn't have had to ask here, now would I? :) The only thing I could use would be some kind of step-by-step walkthrough for the MD5 calculation of a short string, so I could check it against the intermediate steps in my own calculation --cb
On Mon, 14 May 2007 13:40:38 +0200, Christoph Burschka <christoph.bursc@rwth-aachen.de> wrote, quoted or indirectly quoted someone who said : >The only thing I could use would be some kind of step-by-step >walkthrough for the MD5 calculation of a short string, so I could check >it against the intermediate steps in my own calculation
Can you find Sun's code? Do you have an assembler level bebugger than will let you single step through the code? -- Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
On Mon, 14 May 2007 13:40:38 +0200, Christoph Burschka <christoph.bursc@rwth-aachen.de> wrote, quoted or indirectly quoted someone who said : >The only thing I could use would be some kind of step-by-step >walkthrough for the MD5 calculation of a short string, so I could check >it against the intermediate steps in my own calculation
-- Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Christoph Burschka wrote: > private static int[] processFin(int[] message) > { > int[] result=s.clone(); // initialize > // message is already padded, just split: > int blocks=message.length/16; > for (int i=0;i<blocks;i++) > { // for each 16-int block
It seems you are skipping the last 16 element in messages. Perhaps the "i < blocks" condition should have been "i <= blocks"? Regards, -- Filip Larsen
On Mon, 14 May 2007 12:38:17 +0200, Christoph Burschka
<christoph.bursc @rwth-aachen.de> wrote: >I'm trying to implement the MD5 algorithm. (Note: I'm not looking for an >existing MD5 library; this is an exercise). It compiles and returns >something that looks like a hash superficially, but that's clearly the >wrong result. Even if there were a Unicode/Ascii problem, the empty >string should return the correct result (since it is always padded to >one 1-bit and 511 0-bits), but it doesn't. >Also, several strings (say, "" and "The") return the same incorrect hash >- but by dumping the 128-bit hash in each iteration, I've found only >very little in common between the intermediate steps, only the final one >is completely identical. >This is the function, as implemented from the pseudocode on Wikipedia. >http://en.wikipedia.org/wiki/MD5#Pseudocode >--------------------------------------------------------
[snip] >private static int[] processFin(int[] message) >{ > int[] result=s.clone(); // initialize > // message is already padded, just split:
Have you implemented the padding correctly? To quote RFC 1321: "3.1 Step 1. Append Padding Bits The message is "padded" (extended) so that its length (in bits) is congruent to 448, modulo 512. That is, the message is extended so that it is just 64 bits shy of being a multiple of 512 bits long. Padding is always performed, even if the length of the message is already congruent to 448, modulo 512. Padding is performed as follows: a single "1" bit is appended to the message, and then "0" bits are appended so that the length in bits of the padded message becomes congruent to 448, modulo 512. In all, at least one bit and at most 512 bits are appended. 3.2 Step 2. Append Length A 64-bit representation of b (the length of the message before the padding bits were added) is appended to the result of the previous step. In the unlikely event that b is greater than 2^64, then only the low-order 64 bits of b are used. (These bits are appended as two 32-bit words and appended low-order word first in accordance with the previous conventions.) At this point the resulting message (after padding with bits and with b) has a length that is an exact multiple of 512 bits. Equivalently, this message has a length that is an exact multiple of 16 (32-bit) words. Let M[0 ... N-1] denote the words of the resulting message, where N is a multiple of 16." Note the endianness of the 64 bit message length field, which is not clearly explained in the Wikipedia article - two big-endian 32 bit words in little-endian order. This is not a standard byte order and you will need to implement it carefully. It is also necessary to get all endian issues resolved correctly within the algorithm itself - they can often be a source of problems when implementing cryptographic algorithms. RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a canonical implementation of MD5 in C, which might be helpful to read. It also contains a few more Test Vectors than the Wikipedia article. rossum
"Christoph Burschka" <christoph.bursc @rwth-aachen.de> wrote in message news:5ar05nF2psib9U1@mid.dfncis.de... > sebbaer schrieb: >> Her i have found some intressting webpages about it. Maybe it helps. >> http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html > If I had found something that helped me in the first 10 Google results, > I wouldn't have had to ask here, now would I? :)
You'd be surprised... - Oliver
Filip Larsen wrote: > Christoph Burschka wrote: >> private static int[] processFin(int[] message) >> { >> int[] result=s.clone(); // initialize >> // message is already padded, just split: >> int blocks=message.length/16; >> for (int i=0;i<blocks;i++) >> { // for each 16-int block > It seems you are skipping the last 16 element in messages. Perhaps the > "i < blocks" condition should have been "i <= blocks"? > Regards,
Mh... you had me there for a few moments. But no, the message is already padded to a multiple of 16, so blocks is exactly the right number of blocks (if the message has 32 elements, blocks will be 2, and the loop will go from block #0 to block #1). If i is the number of the block, the messages in the block are message[16*i .. 16*i+15]. Since blocks is exactly the message length divided by 16, the loop I have now goes from message[0] to message[(message.length / 16 -1) * 16 +15], or message[message.length-1]. If I changed the boundary condition to i<=blocks, I'd be running out of the array boundary. --cb
rossum wrote: > Note the endianness of the 64 bit message length field, which is not > clearly explained in the Wikipedia article - two big-endian 32 bit > words in little-endian order. This is not a standard byte order and > you will need to implement it carefully. > It is also necessary to get all endian issues resolved correctly > within the algorithm itself - they can often be a source of problems > when implementing cryptographic algorithms. > RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a > canonical implementation of MD5 in C, which might be helpful to read. > It also contains a few more Test Vectors than the Wikipedia article. > rossum
Whoops. I didn't get the little-endian thing. So far I just padded my int[] array to a length of 14 mod 16, then added the length value (long) as two int words in normal big-endian order. So what I should do is just swap the order of the two ints that I split the long value into, leaving the ints themselves unchanged. Mh. But that can't be my only mistake. Consider the zero-length example - "" will always be padded to 0x8000[...]00, and the unpadded length, 0, is the same regardless of endianness. Still, this message gives me the wrong result. Not even mentioning that I'm somewhat skeptic of the possibility that the order of two 32bit words at the end of the message could so badly break the hashing function as to give it the same hash for two random short messages... but I'll fix this and see what happens. --cb
Christoph Burschka wrote: > Mh... you had me there for a few moments. But no, the message is already padded > to a multiple of 16, so blocks is exactly the right number of blocks (if the > message has 32 elements, blocks will be 2, and the loop will go from block #0 to > block #1).
Ok, hard to discount that without seeing the padding code. > for(int j=0;j<4;j++) result[i]+=state[i];
Next bug candidate is this line, where it looks like the arrays should have been indexed by j and not i. Regards, -- Filip Larsen
Christoph Burschka wrote: > rossum wrote: >> Note the endianness of the 64 bit message length field, which is not >> clearly explained in the Wikipedia article - two big-endian 32 bit >> words in little-endian order. This is not a standard byte order and >> you will need to implement it carefully. >> It is also necessary to get all endian issues resolved correctly >> within the algorithm itself - they can often be a source of problems >> when implementing cryptographic algorithms. >> RFC 1321 (http://tools.ietf.org/html/rfc1321) also includes a >> canonical implementation of MD5 in C, which might be helpful to read. >> It also contains a few more Test Vectors than the Wikipedia article. >> rossum > Whoops. I didn't get the little-endian thing. So far I just padded my int[] > array to a length of 14 mod 16, then added the length value (long) as two int > words in normal big-endian order. > So what I should do is just swap the order of the two ints that I split the long > value into, leaving the ints themselves unchanged. > Mh. But that can't be my only mistake. Consider the zero-length example - "" > will always be padded to 0x8000[...]00, and the unpadded length, 0, is the same > regardless of endianness. Still, this message gives me the wrong result. > Not even mentioning that I'm somewhat skeptic of the possibility that the order > of two 32bit words at the end of the message could so badly break the hashing > function as to give it the same hash for two random short messages... but I'll > fix this and see what happens. > --cb
Whoops again. There's a lot more wrong with my padding than I thought. I appear to be adding the message length some 8 words before the actual end, and I don't even know how I managed to do that. Also, the initialization of my sine table is messed up - casting from double to int will apparently just set too-high values to INT_MAX_VALUE. But if I cast the sine to int before multiplying it by 2^32, well... it just ends up at 0, of course. I'll need to do something else here... Aha! Casting first to long to get a whole number, then to int, appears to give me something closer to what I want - the long->int conversion wraps instead of setting to INT_MAX_VALUE. Whether the result is also the correct one remains to be seen... --cb
Filip Larsen wrote: > Christoph Burschka wrote: >> Mh... you had me there for a few moments. But no, the message is >> already padded >> to a multiple of 16, so blocks is exactly the right number of blocks >> (if the >> message has 32 elements, blocks will be 2, and the loop will go from >> block #0 to >> block #1). > Ok, hard to discount that without seeing the padding code. >> for(int j=0;j<4;j++) result[i]+=state[i]; > Next bug candidate is this line, where it looks like the arrays should > have been indexed by j and not i. > Regards,
Wow... that thing is literally crawling with bugs. Unfortunately, that still didn't do the trick - now the result is different again, but the hashes are still wrong and also nearly identical: []: 0112a34159cdab8981f7dcfea9eb2476 [ ]: 0112a34159cdab8981f7dcfea9eb2476 [The]: 8112a34159cdab8981f7dcfea9eb2476 [ttr]: 8112a34159cdab8981f7dcfea9eb2476 [ t t]: 7512a34159cdab8981f7dcfea9eb2476 [The quick brown fox jumps over the lazy dog]: f3b76b417dcdab89cf30dcfe97eb2476 I think I'll sleep on it and look at the code again when I'm more awake. Thanks very much for the help! :) -- cb
Mh... several more problems fixed. For example, I missed another Endianness reversal: The words of the message itself are processed with the lowest byte first. My fix doesn't look too pretty, but I think it's the fastest it gets without heavily optimizing: private static int littleEndian(int in) { int out=0; if (in<0) { in-=Integer.MIN_VALUE; out+=128; } for (int i=0;i<4;i++) { out+=in%256*Math.pow(256,3-i); in/=256; } return out; } Also, I missed a crucial part of the algorithm: b := ((a + f + k[i] + w[g]) leftrotate r[i]) + b And finally, it appears that the bitwise shift operator isn't doing what I thought it would - no rotation; bits to the left are removed, 0s are added to the right. Too bad - I'd hoped I could use the built-in << operator; now I need a new function for it. (i=i*2+(i<0?1:0)) is reasonably elegant for this, however. Oh, and the final problem: I forgot to reverse the Endianness of the hash result after calculation. With all this, I am now finally getting the correct result for the zero-length message (d41d8cd98f00b204e9800998ecf8427e). I'm still getting incorrect results for actual messages, but they have nothing in common and I can be reasonably sure that's because of incorrect padding or character encoding, not the algorithm. -- cb
|
 |
 |
 |
 |
|