The Power of Charm++ and Runtime Systems

Charmworks’ products offer opportunities for energy savings

The first generation of exascale computers will come online in the Department of Energy’s national labs over the next year or two. Capable of more than one quintillion computational operations per second, they will cost upwards of a half-billion dollars and consume incredible amounts of electricity.

Runtime systems like Charm++ are one way to reduce the power costs associated with computational simulation while delivering faster, higher-resolution results. Whether you’re running on an in-house cluster, the cloud, or a supercomputer, they can help you improve performance, reliability, and costs.

Here, Charmworks’ CEO Sanjay Kale answers some questions about runtime systems and Charmworks’ suite of tools.

How big an issue is power consumption?

In the case of supercomputers, it’s a very big issue. We’re talking about electricity bills that might run a million dollars a month. And we’re talking about machines that have a way of exceeding their original specs in terms of costs. When DOE announced their exascale program six or seven years ago, they targeted a power limit of 20 megawatts. Today, it’s very likely that systems like Aurora, El Capitan, and Frontier will require more like 30 to 40 megawatts.

It’s such a large concern that the Department of Energy Science for the Future Act that was passed by the House of Representatives in July explicitly calls for an Energy Efficient Computing Program to be established by the DOE. The bill describes it as “a program of fundamental research, development, and demonstration of energy efficient computing technologies relevant to advanced computing applications in high performance computing, artificial intelligence, and scientific machine learning,” and it says there should be partnerships among the national labs, industry, and higher-ed to co-design energy efficient hardware, software, and applications.

These are very large costs and important issues, and people are quickly realizing that they have to take them seriously. DOE national laboratories have done excellent work on these issues with traditional programming models (i.e., MPI+X); runtime systems like Charm++ provide a unique way to tackle them automatically.

But what about at a smaller scale?

At a smaller scale, it’s all about efficiency. Of course, saving on electricity bills won't hurt either. A company may not think in terms of energy costs, but it should always be thinking in terms of overall costs. That means not just electricity but the amount of time it takes a simulation to run, the amount of work programmers are doing to run that simulation, and hardware costs. The first two metrics also affect time-to-market, which is of paramount importance if the company designs products based on simulation results. The same elasticity that allows our adaptive runtime systems to minimize energy costs also allows them to improve these metrics.

With Charm++ and related software systems, you can optimize performance by way of continuous introspection — constantly and automatically assessing the performance of the computation and changing or reconfiguring it to improve that performance.

What attributes of its programming model allow Charm++ to reconfigure the application while it is running?

Charm++ has three main attributes. The first is overdecomposition, in which the programmer divides an application’s computation requirements into many relatively small objects, each representing a coarse work and/or data unit. The number of such objects greatly exceeds the number of processors. The second attribute is migratability, which is the ability to move these objects among processors. This means the user addresses their communication (i.e., messages) to the logical objects, rather than to physical processors. This gives the runtime system the ability to move these objects across nodes and processors as it sees fit. The third attribute is message-driven execution, which allows the system to select which of the objects will run next based on the availability of messages. These three attributes enable Charm++ to provide many useful features including dynamic load balancing, fault tolerance, and job malleability.
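To make the three attributes concrete, here is a toy sketch in Python. It is not the Charm++ API: the `ToyRuntime` class, its method names, and the round-robin initial placement are all hypothetical, chosen only to show messages addressed to logical objects, an object moving between processors, and execution driven by whichever messages are waiting.

```python
from collections import deque


class ToyRuntime:
    """Toy stand-in for an adaptive runtime. Hypothetical, not the Charm++ API."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.location = {}   # logical object id -> processor currently hosting it
        self.handlers = {}   # logical object id -> function invoked per message
        self.queues = [deque() for _ in range(num_procs)]  # one message queue per processor

    def create(self, obj_id, handler):
        # Overdecomposition: callers create many more objects than processors.
        # Initial placement is round-robin (an assumption made for this sketch).
        self.location[obj_id] = obj_id % self.num_procs
        self.handlers[obj_id] = handler

    def send(self, obj_id, payload):
        # The sender names the logical object; the runtime resolves the processor.
        self.queues[self.location[obj_id]].append((obj_id, payload))

    def migrate(self, obj_id, new_proc):
        # Migratability: move the object together with its undelivered messages.
        old_proc = self.location[obj_id]
        if new_proc == old_proc:
            return
        pending = [m for m in self.queues[old_proc] if m[0] == obj_id]
        self.queues[old_proc] = deque(m for m in self.queues[old_proc] if m[0] != obj_id)
        self.queues[new_proc].extend(pending)
        self.location[obj_id] = new_proc

    def run(self):
        # Message-driven execution: a processor runs an object only when a
        # message for that object is available in its queue.
        log = []
        while any(self.queues):
            for proc, queue in enumerate(self.queues):
                if queue:
                    obj_id, payload = queue.popleft()
                    self.handlers[obj_id](payload)
                    log.append((proc, obj_id))
        return log


received = []
rt = ToyRuntime(num_procs=2)
for obj_id in range(8):  # 8 objects on 2 processors
    rt.create(obj_id, lambda payload, i=obj_id: received.append((i, payload)))

rt.send(5, "hello")   # object 5 starts on processor 1
rt.migrate(5, 0)      # the runtime moves it; the sender's code does not change
rt.send(5, "again")   # still addressed to object 5, not to a processor
log = rt.run()
print(received)       # both messages reach object 5
print(log)            # both were executed on processor 0
```

Because the sender only ever names object 5, the migration is invisible to application code, which is what leaves the runtime free to remap objects.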

How does that help with runtime adaptation?

Take load balancing, for example. Any time the division of work among processors is not uniform, you have a load imbalance. If one processor takes longer than the others to complete its part, all others are held up as they wait to synchronize. This waiting leads to inefficiency and saps performance. For some applications, that inefficiency can change, and increase dramatically, as the application evolves.
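The cost of waiting can be put into numbers with a short calculation. The sketch below assumes a bulk-synchronous step in which every processor waits for the slowest one; the function name and the load values are made up for illustration.

```python
def parallel_efficiency(loads):
    """Fraction of total processor time spent on useful work in one synchronized step."""
    step_time = max(loads)  # every processor waits for the slowest one
    return sum(loads) / (len(loads) * step_time)


balanced = [10.0, 10.0, 10.0, 10.0]    # seconds of work per processor
imbalanced = [10.0, 10.0, 10.0, 20.0]  # one processor has twice the work

print(parallel_efficiency(balanced))    # 1.0
print(parallel_efficiency(imbalanced))  # 0.625
```

In the second case a single overloaded processor doubles the step time, and three of the four processors sit idle for half of it.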

Automatically assess and address that imbalance and you’re using power more efficiently, getting results faster, and/or improving the resolution of your simulation in the same amount of time.

Charm++ relies on the principle-of-persistence heuristic. This principle states that, for overdecomposed iterative applications, the task’s or object’s computation load and communication pattern tend to persist over time. The heuristic uses the application’s load statistics collected periodically by the runtime system, which provides an automatic, application-independent way of obtaining load statistics without any user input. If desired, the user can specify predicted loads and thus override system predictions. Using the collected load statistics, Charm++ periodically executes a load-balancing strategy to determine a better objects-to-processors mapping and then migrates objects to their new homes accordingly. Its suite of load balancers includes several centralized, distributed, and hierarchical strategies. Charm++ can also automate the decision of when to call the load balancer, as well as which strategy to use, based on observed application characteristics using a machine-learning model.
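As a rough illustration of how measured loads can drive a remapping, the sketch below uses a generic greedy heuristic: take the heaviest object first and give it to the currently least-loaded processor. This is a textbook strategy written for this article, not the code of any Charm++ load balancer. The function name and the sample loads are hypothetical, and communication patterns are ignored.

```python
import heapq


def greedy_rebalance(measured_loads, num_procs):
    """Map objects to processors using loads measured in earlier iterations.

    Principle of persistence: the recent past is used as the prediction
    for the next few iterations.
    """
    heap = [(0.0, proc) for proc in range(num_procs)]  # (assigned load, processor)
    heapq.heapify(heap)
    mapping = {}
    # Heaviest objects first; ties are broken by object name for determinism.
    for obj, load in sorted(measured_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        assigned, proc = heapq.heappop(heap)  # least-loaded processor so far
        mapping[obj] = proc
        heapq.heappush(heap, (assigned + load, proc))
    return mapping


# Hypothetical per-object times (seconds) gathered by runtime instrumentation.
measured = {"a": 8.0, "b": 7.0, "c": 6.0, "d": 5.0, "e": 4.0, "f": 2.0}
mapping = greedy_rebalance(measured, num_procs=2)

per_proc = [0.0, 0.0]
for obj, proc in mapping.items():
    per_proc[proc] += measured[obj]
print(mapping)
print(per_proc)  # [17.0, 15.0], against an ideal of 16.0 each
```

A centralized strategy like this needs all load data in one place; the distributed and hierarchical strategies mentioned above trade some mapping quality for lower overhead at scale.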

But notice that this whole capability was possible because the programming model supported overdecomposition and migratability.

So now, with that, how can Charm++ improve energy efficiency?

Of course, reducing execution time via load balancing reduces the energy cost of a job. But there are more direct, and at times dramatic, ways in which Charm++ can deal with thermal issues and energy efficiency.

As an example, modern processor chips support “turbo boost”, a feature that allows a chip to run at a faster clock speed if it is not getting too hot. But on many clusters and supercomputers, this feature is turned off, because it would cause an uncontrolled load imbalance, which leaves a lot of computing capability on the table. With an adaptive runtime system, you can recoup this performance by rebalancing loads in proportion to current processor speeds dynamically. This was demonstrated in a paper from my group at UIUC, by Bilge Acun, in ICS 2016.
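Rebalancing in proportion to processor speed can be sketched as follows. This is a simplified illustration and not the method of the ICS 2016 paper: the function name and the relative speeds are assumptions, and each object's run time is taken to scale inversely with the speed of the processor it runs on.

```python
def balance_by_speed(object_loads, speeds):
    """Assign each object to the processor that would finish it earliest.

    object_loads: object id -> work units. speeds: work units per second per processor.
    """
    assigned = [0.0] * len(speeds)
    mapping = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        # Finish time if this object were added to processor p.
        proc = min(range(len(speeds)), key=lambda p: (assigned[p] + load) / speeds[p])
        assigned[proc] += load
        mapping[obj] = proc
    finish_times = [work / speed for work, speed in zip(assigned, speeds)]
    return mapping, finish_times


# 12 equal objects on three processors whose clocks differ (hypothetical values).
loads = {i: 1.0 for i in range(12)}
speeds = [3.0, 2.0, 1.0]
mapping, finish_times = balance_by_speed(loads, speeds)

counts = [list(mapping.values()).count(p) for p in range(len(speeds))]
print(counts)        # [6, 4, 2]
print(finish_times)  # [2.0, 2.0, 2.0]
```

With an even split of four objects each, the slowest processor would need 4.0 time units and hold up the others; the speed-aware mapping finishes in 2.0.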

More generally, there are four metrics of importance: energy usage, power level, temperature of chips, and execution time. Typically, you want to constrain power consumption and temperature while minimizing energy usage and execution time. Charm++’s adaptive runtime system can help do this in various scenarios.

One of the earliest successes in this research was demonstrating how we could save cooling energy. In a way, it is trivial to save cooling energy: walk up to the thermostat and raise the temperature. But you can’t do that because some chips will get too hot and start failing! The point is, different chips have different tolerance for ambient temperature and yet you have to set the thermostat for the worst of them.

With Charm++, we could monitor the chip temperatures and reduce the frequency of processor chips that start to get hot. Of course, this would cause load imbalance, since other processors have to wait for the slowed-down chip. But we know how to solve this problem: just migrate some objects away from the slowed-down processor. The system then runs almost as fast as before, except for the small loss due to one processor’s reduced speed. This idea was demonstrated in a paper, aptly titled “A ‘cool’ load balancer”, by Osman Sarood from my group at Supercomputing 2011.
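The control loop described here can be sketched in a few lines. The thresholds, frequency steps, and function names below are all hypothetical and are not taken from the Supercomputing 2011 paper. Restoring the frequency once a chip cools is an added assumption, and speed is assumed proportional to clock frequency.

```python
def thermal_step(temps_c, freqs_ghz, max_temp_c=70.0, step_ghz=0.2,
                 min_ghz=1.2, max_ghz=2.4):
    """One control step: slow down chips above the limit, restore cool ones."""
    new_freqs = []
    for temp, freq in zip(temps_c, freqs_ghz):
        if temp > max_temp_c:
            freq = max(min_ghz, freq - step_ghz)   # too hot: step the clock down
        elif temp < max_temp_c - 5.0:
            freq = min(max_ghz, freq + step_ghz)   # comfortably cool: step back up
        new_freqs.append(round(freq, 1))
    return new_freqs


def load_shares(freqs_ghz):
    """Fraction of the total work each processor should hold after migration."""
    total = sum(freqs_ghz)
    return [freq / total for freq in freqs_ghz]


temps = [62.0, 75.0, 68.0, 60.0]   # hypothetical sensor readings, degrees Celsius
freqs = [2.4, 2.4, 2.4, 2.4]
new_freqs = thermal_step(temps, freqs)
shares = load_shares(new_freqs)

print(new_freqs)  # [2.4, 2.2, 2.4, 2.4]
print([round(s, 3) for s in shares])
```

Under these assumptions the aggregate clock rate drops from 9.6 to 9.4 GHz, a loss of about 2%, whereas leaving the objects in place would stretch every step by the full slowdown of the hot chip.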

Why is job malleability important, in this context?

On massive systems, you can control power consumption and costs by intelligently scheduling jobs, reallocating resources, and reconfiguring hardware. That’s what you get with job malleability. By shrinking or growing the set of nodes allocated to each job, to match available resources, you can set certain power budgets, and you can also ensure that the maximum number of jobs are running at peak efficiency at all times.
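A minimal sketch of shrinking and growing jobs under a power budget might look like the following. The scheduling policy (grant each job its minimum, then grow jobs one node at a time in round-robin order), the job names, and the per-node wattage are all assumptions made for illustration, not a description of any real resource manager.

```python
def allocate_nodes(jobs, power_budget_w, watts_per_node):
    """jobs: list of (name, min_nodes, max_nodes). Returns name -> node count."""
    node_budget = int(power_budget_w // watts_per_node)
    alloc = {}
    # First pass: admit each job at its minimum size if the budget allows.
    for name, min_nodes, _ in jobs:
        if node_budget >= min_nodes:
            alloc[name] = min_nodes
            node_budget -= min_nodes
    # Second pass: grow admitted jobs one node at a time, round-robin.
    grew = True
    while node_budget > 0 and grew:
        grew = False
        for name, _, max_nodes in jobs:
            if name in alloc and alloc[name] < max_nodes and node_budget > 0:
                alloc[name] += 1
                node_budget -= 1
                grew = True
    return alloc


jobs = [("cfd", 4, 16), ("md", 2, 8), ("climate", 8, 32)]  # hypothetical malleable jobs

normal = allocate_nodes(jobs, power_budget_w=12_000, watts_per_node=400)
capped = allocate_nodes(jobs, power_budget_w=8_000, watts_per_node=400)
print(normal)  # {'cfd': 10, 'md': 7, 'climate': 13}
print(capped)  # {'cfd': 6, 'md': 4, 'climate': 10}
```

When the budget falls from 12 kW to 8 kW, every job shrinks but keeps running, which is only possible if the application can be reconfigured onto a different number of nodes while it executes.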

Using dynamic voltage and frequency scaling, you can also throttle cores’ performance to optimize the runs. This improves the hardware’s reliability because you avoid taxing it unnecessarily and you reduce thermal variations that can negatively affect performance.

Data centers of all sizes face major challenges in reliability, power management, and user satisfaction. A runtime system like Charm++, working with the system resource manager, is a great way to address all of these. You can improve efficiency in power-constrained environments, increase performance with load balancing, better control hardware reliability, and save energy, all while delivering faster time-to-solution and/or higher-resolution simulations. This holistic approach was described in our article titled “Power, Reliability, and Performance: One System to Rule them All”, in the October 2016 issue of IEEE Computer.

So everybody’s happy?

Well, it’s a lot of hard work, but everyone is happier when you get it right. Users are focused on job performance, and they get improved job performance. System admins are focused on overall job throughput and keeping users satisfied, and they get that. Leaders are focused on costs, person-hours, and reliability, and runtime systems like Charm++ deliver that, too.

Nothing’s a silver bullet. Every improvement requires a trade-off. But the upside of adaptive runtime systems is tremendous.
