Monday, September 12, 2005

A little long division

Suppose p and q are relatively prime integers, and 0 < p < q. If q has any prime factors besides 2 and 5, then a decimal expansion of p/q will eventually start to repeat. How long will the repeating part be?

Looking at the sequence of digits is actually a very difficult way to approach this problem. It's much simpler to look at the sequence of remainders. Remember how long division works. At every step, we have a remainder r, 0 <= r < q. To get the next digit in the expansion, we figure out how many times q goes into 10*r; this gives us a digit and a new remainder. In C++ code, I might write the long division algorithm this way:

  /* Compute ndigits digits of p/q */
  r = p;
  for (i = 0; i < ndigits; ++i) {
    d[i] = (10*r) / q;
    r    = (10*r) % q;
  }
where those unfamiliar with C should remember that C's integer division gives an integer result (in C, we have 1/2 == 0, but 1.0/2 == 0.5). The operator % gives the remainder of the division. If we keep computing digits, we will eventually see one of two things: a zero remainder, or a repeat of a remainder seen in a previous step. In the former case, p/q can be written as a non-repeating decimal; in the latter case, we have a repeating fraction, where the length of the repeating part is the same as the distance between occurrences of the same remainder in the division algorithm.

The operator % is sometimes called the modulo operator, but it's actually an odd name. In much of computer science, we speak of the mod operator, and mean operator that computes the remainder in a division. By contrast, if you're working with abstract algebra or number theory and you say that two things are equivalent modulo something, you mean that there is an underlying relation that divides your structure into equivalence classes, and the two things under investigation happen to lie in the same equivalence class. The usual example of this is time: 11 pm today and 11 pm yesterday are not the same instant in time, but they are equivalent modulo the twenty four hour cycle we use for time keeping. The connection between modulo-as-operation and modulo-an-equivalence is that when one takes a remainder after dividing by q, one gets a single canonical representative of a whole equivalence class of numbers which differ from each other by some multiple of q. It's like using 11 pm today to refer to 11 pm any day; they're all equivalent modulo a 24 hour day.

Anyhow. The C code for long division computes a sequence of remainders, which look like r[k] = (p*10^k) % q. If q has prime factors other than 2 and 5, then at some point we'll encounter a remainder which is prime relative to q; at that point, we've started the repeating part of the fraction. Past that point, the remainders are all units in the ring of integers modulo q (Z_q), which is just a fancy way of saying that all the successive remainders are also prime relative to q (and therefore have inverses in Z_q). For instance, 4 and 7 are relatively prime, so 4 is a unit in Z_7; the inverse is 2, since 4*2 = 7+1 = 1 modulo 7. The set of units in Z_q is an important object in number theory, so important that it's size is given a special name: the number of natural numbers less than q which are prime relative to q is written phi(q); it is called Euler's phi function or Euler's totient function.

Hmm. I assume totient is related to quotient, but I actually don't know the origin of this word. I'll have to look that up.

One reason why phi(q) is so important is that it connects the structure of multiplication and addition in Z_q. For example, for any m which is a unit in Z_q, I have m^phi(q) = 1; for instance, phi(7) = 6, and 4^6 = 4096 = 1 modulo 7. That means, for instance, that I have an easy way to write m^(-1); it's just m^(phi(q)-1). There are all sorts of other entertaining things you can do with this sort of manipulation -- including figuring out something about the question we originally asked, which is how long the repeating part in the decimal expansion of p/q will be. Without loss of generality, I can just consider 1/q; for more general p/q, I will generate a sequence of remainders which has the same period as for 1/q. Anyhow, if I start from the remainder 1 and run long division, the remainders will generate a cyclic subgroup of the group of units in Z_q -- I'll keep generating new remainders until at some point I get the remainder 1 back. Since I'm always multiplying by 10, the cycle thus formed is sometimes called the orbit of 10 in the group of units in Z_q. The order of this cycle -- the number of elements -- is the number of digits in the repeating part of the decimal expansion. There is a wonderfully versatile little theorem, due to Lagrange, which says that the order of a subgroup must divide the order of the group in which it is contained; in this case, that means that the number of digits in the repeating part must divide phi(q).

Cool, no? So the number of digits in the repeating part of a decimal expansion of p/q must divide phi(q). Actually, if you walk through the arguments I just made, you find that this is true for any base. So, for example, take q = 47; since phi(q) = 46 = 2*23, I can take an expansion in any base and find that the length of the repeating part is either 0, 1, 2, 23, or 46. Base 2, base 10, base 16 -- they all have to generate repetitions with a period that divides 46. Which divisor of 46 you actually get depends on the base in which you do the expansion, though.

So while writing the length of the repeating part of p/q is difficult in general, we can quickly say that it divides phi(q). When q is prime, phi(q) = q-1, and so we could potentially get a repeating part as long as q-1 digits. In fact, we often do get a repeating part as long as q-1 digits... which is why writing numbers in terms of repeating decimal isn't a very good choice of representations for most q. If it takes log10(p) + log10(q) < 2*log10(q) digits to represent a rational number in the form p/q, but as many as q-1 digits to represent the number as a repeating decimal, then the repeating decimal representation is exponentially longer than the rational number representation. Not good! At least, it's not good if you want to actually do arithmetic; if you want to play with number theory on the other hand, or with finite state machines, then this is lots of fun.

As an amusing aside, you can use the development above to write a wholly impractical scheme for cracking the RSA cryptosystem. You see, the RSA system is secure because we don't have good algorithms for computing the Euler totient function for large numbers. Yes, I know you've heard that the reason that RSA is secure has something to do with factoring, but the difficulty of factoring large numbers is really just a proxy for the difficulty of computing Euler's totient. The actual connection is that if m and n are large prime numbers then there's a simple formula for the totient of the product: phi(m*n) = (m-1)*(n-1). But if we do long division in several different bases, we're almost inevitably going to run into a base in which the length of the repeating part is equal to phi(q). So the totient function is actually easy to compute, and any talented grade schooler who knows about long division and repeating decimals can be taught how to crack RSA, right? Well, not exactly. The fact that computing phi(m*n) in this way would take time on the order of m*n -- rather than on the order of log(m*n) -- means that your precocious grade schooler would take at least 130 years to crack a 32-bit key, assuming that he never slept and could get a digit a second right.

A fast computer using a more effective attack algorithm could break a 32-bit RSA key in a matter of minutes (maybe less, I'm not sure). But those algorithms tend to run into the same explosive cost increases as the long-division cracking algorithm that I just proposed, so the key doesn't have to get that much longer to be much more secure. Currently, most transactions are guarded by keys at least 128 bits long, and the keys I use for remote login have either 1024 or 2048 bits. So unless you want to go into number theory as a career, you probably will find that the easiest way to compromise my keys is to completely go around them -- by finding my office and kicking me in the shins until I'm willing to hand them over, for instance.