Composable Parallel.Foreach loops

Microsoft has an excellent library for parallel programming in .NET. However, there’s a strange problem with nested parallel loops. Consider this sample code:

      void MultiplyMatrices(int size, double[,] m1,
                            double[,] m2, double[,] result)
      {
          Parallel.For(0, size, i =>
          {
              for (int j = 0; j < size; j++)
              {
                  result[i, j] = 0;
                  for (int k = 0; k < size; k++)
                      result[i, j] += m1[i, k] * m2[k, j];
              }
          });
      }

Why aren’t all the loops turned into parallel loops? The documentation says that you should not convert the inner loops into parallel loops because it will degrade performance. This makes sense because there’s a significant cost to running each loop body through the scheduler, particularly in this example, where the body is a cheap multiply. But what happens if the work in the inner loop is expensive? Then you would want to run the inner loops in parallel, too. So it really depends on how expensive the work is relative to the cost of using Parallel.For.
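As a sketch of that tradeoff (the method name and the `expensiveBody` flag are mine, not from the library), you can imagine choosing per call whether the inner loop is worth scheduling:

```csharp
using System;
using System.Threading.Tasks;

class NestedLoops
{
    // Hypothetical: parallelize the inner loop only when the per-iteration
    // work is expensive enough to amortize the scheduler overhead.
    public static void Process(int size, Action<int, int> body, bool expensiveBody)
    {
        Parallel.For(0, size, i =>
        {
            if (expensiveBody)
                Parallel.For(0, size, j => body(i, j)); // nested parallel loop
            else
                for (int j = 0; j < size; j++)          // cheap work: stay serial
                    body(i, j);
        });
    }

    static void Main()
    {
        int[,] counts = new int[4, 4];
        Process(4, (i, j) => counts[i, j]++, expensiveBody: false);
        int total = 0;
        foreach (int c in counts) total += c;
        Console.WriteLine(total); // every (i, j) pair visited exactly once
    }
}
```

The catch, of course, is that the caller has to know which case applies, which is exactly the composability problem described below.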

I would like to write a library of matrix operations that I can reuse in my code. Therefore, this MultiplyMatrices method should take a delegate parameter and use that in the loop body. Now I won’t know whether or not I should use parallel loops. Or, what if I compose several matrix operations, each of which uses some parallel loops? Performance may actually degrade because I’m using too much parallelism.
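For instance (a sketch; the `Map` method and its signature are my own invention, not part of any library), a reusable element-wise operation would look something like this, and the library author has no way of knowing how expensive the caller’s delegate is:

```csharp
using System;
using System.Threading.Tasks;

class MatrixOps
{
    // Reusable element-wise operation: the caller supplies the per-element
    // work as a delegate, so the library cannot know its cost -- or whether
    // the delegate itself uses Parallel.For internally.
    public static void Map(int size, double[,] src, double[,] dst,
                           Func<double, double> f)
    {
        Parallel.For(0, size, i =>
        {
            for (int j = 0; j < size; j++)
                dst[i, j] = f(src[i, j]);
        });
    }

    static void Main()
    {
        var a = new double[2, 2] { { 1, 2 }, { 3, 4 } };
        var b = new double[2, 2];
        Map(2, a, b, x => x * x);
        Console.WriteLine(b[1, 1]); // 4 squared
    }
}
```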

The problem with the parallel library is that the operations aren’t composable. To be fair, I would have shipped the same parallel library because I don’t know how to solve this problem either. One crude idea is to limit parallelism to a small number based on the number of available CPUs. You maintain the nesting level of these parallel loops in the environment somehow (like nested transactions). At some point, it’s not worth doing things in parallel anymore, so convert the parallel loops into normal, serial loops. I assume the Fortran folks have already figured out some solution. I’ll post it when I find it.
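A crude sketch of that idea (the `[ThreadStatic]` counter and the cutoff of one nesting level are my own choices, not a real library feature): track the nesting depth of parallel loops and fall back to a plain serial loop once you are nested too deep.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class DepthLimitedParallel
{
    // Per-thread nesting depth of our parallel loops. A real implementation
    // might derive the cutoff from Environment.ProcessorCount.
    [ThreadStatic] static int depth;

    public static void For(int from, int to, Action<int> body)
    {
        if (depth >= 1) // already inside a parallel loop: run serially
        {
            for (int i = from; i < to; i++) body(i);
            return;
        }
        Parallel.For(from, to, i =>
        {
            depth++;          // ThreadStatic, so each worker tracks its own depth
            try { body(i); }
            finally { depth--; }
        });
    }

    static void Main()
    {
        int[] visits = new int[9];
        For(0, 3, i =>
            For(0, 3, j => Interlocked.Increment(ref visits[i * 3 + j])));
        int total = 0;
        foreach (int v in visits) total += v;
        Console.WriteLine(total); // each (i, j) pair ran exactly once
    }
}
```

Here the outer loop runs in parallel, but the inner call sees a nonzero depth and degrades to a serial loop, so composition can’t oversubscribe the machine.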



  1. Bob Adolf

    Coming from someone who has been fighting this very fight for several years, I feel comfortable claiming the Fortran people haven’t found it either (Fortran people = the HPC community in general). That isn’t to say that they haven’t made progress in that direction, but in practice, it generally falls flat. Even the very best parallel compilers (and for the moment, I’m restricting myself to those used in actual programming, not academic languages, but I’ll get back to that) cannot automatically parallelize even simple loops correctly. Which isn’t to say that they don’t do anything; on the contrary, it makes me all warm and fuzzy when my compiler takes my 30-character loop and transforms it into an elegant reduction tree by extracting the associative properties of my operations and turning the whole thing inside out. BUT they still rely heavily on the programmer to give them appropriate hints on what optimizations they can and can’t perform. The “art” of high-performance parallel coding is like a game of Jenga with the compiler, where the programmer’s job is to remove the compiler’s assumptions about what it’s not allowed to do until it latches onto something it can recognize and opens the floodgates to the reservoir of parallel optimizations it has in store.

    Basically, the root of the problem from my vantage point is like this:
    (1) Entrenchment. A surprisingly large number of computational kernels have existed for more than 30 years without significant change. Moreover, in contrast to the rest of us, HPC vendors are expected to provide something that works on existing code and are judged to some extent on how well their new tools perform in that light. The people buying the machines aren’t willing to sink time into rewriting or trying new techniques, which drives the environment along a railroad track of stagnation started back in the days when everything was written by punch card. Which brings me to my second point:
    (2) The community itself is pragmatic to a fault. A large majority is simply not interested in the computer itself. It is a tool, and time spent working on the tool instead of working on the problem with the tool is not welcome. Remember, a lot of these people are not computer scientists. They are physicists, biologists, and economists. Many wish they could press a magic button and transfer their equations into optimized code, but few are willing to actually write tools to make this happen. This is exacerbated by:
    (3) A limited application domain. A large amount of massively parallel code looks very similar regardless of the field. You have a bag of arrays over here, you have a handful of algorithms over here, and you start running one over the other until your eyes bleed. People know what to expect, and the idioms repeat themselves. You can be very successful in solving the problems by being familiar with a very small corner of the CS world. (Disclaimer: this isn’t to say that other kinds of applications don’t exist. In fact, I’d go so far as to say that the *reason* these applications aren’t explored more is partially due to the difficulty of writing experimental and “toy” problems in the supercomputing world, but that’s my opinion and certainly not a widely-held viewpoint.)

    Anyways, what this all adds up to is a group of people that is more or less content writing large, static codes that need minor re-tuning with each generation of machine that is introduced. Libraries (BLAS, LAPACK, etc.) help, but they’re still really more of the same, just on a larger scale. The vendors comply, and the whole bunch marches on doing lots of interesting science but little interesting computer science.

    At this point, I’ll take a moment because I said I’d mention academic approaches to the subject, and I don’t want to belittle some of the people I respect. There has been an enormous amount of fascinating research done on techniques for writing parallel code, much of it in academia. Moreover, I firmly believe that the reason this problem continues to go unsolved is not because it is too hard. It’s that nobody is pulling from the other end. There are some very bright individuals who are breaking their backs trying to push this boulder up the hill, but the other people, the ones actually using the tools and running the machines, are off pushing other rocks up different hills. There is no connection; no teamwork.

    In short, my opinion is that this is absolutely the right problem to be solving in terms of languages and programming tools, especially considering what looks to be the inevitable dominance of massively-parallel hardware in the long term. Sadly, I believe the HPC community neither currently has a satisfactory solution nor is close to or even interested in developing one.

    Your frustrations are mine. I wish there were more people who thought likewise.

  2. Pinku

    This line in your comment made me optimistic: “Many wish they could press a magic button and transfer their equations into optimized code”. Since “a large amount of massively parallel code looks very similar regardless of the field”, it should be easy to design a very high-level domain specific language that cross-compiles to Fortran and reuses their optimized kernels and libraries. Fortress, a language from Sun Labs, is sort of aiming to do this, but they want to run on the JVM, whereas I think Fortran or C would be better.

    If they were willing to *pay* for that magic button, I’d work on it. But if scientists are reluctant to use new tools and prefer writing Fortran code, then nothing will improve.
