Testing Our Ruby and Haskell Implementations Side-By-Side

Mpowered’s service BEEtoolkit helps customers manage, track, and plan their B-BBEE status. At its core, the Empowerment calculation engine is responsible for calculating B-BBEE scores: it requests data from multiple user-managed sources and applies the domain rules and calculations of the B-BBEE Act issued by the South African government.

During almost ten years of continuous development, with software developers coming and going, product requirements changing, features being added, and the B-BBEE regulations evolving each year, this core piece of technology has become a maintenance and innovation bottleneck for the Mpowered development team.

The coupling between the scoring engine and the many other parts of the system has grown stronger over time. Bidirectional hidden dependencies and assumptions, database and in-memory state side effects throughout the codebase, and a history of copy-pasted markup and code together make refactoring and adding new features notoriously hard.

Porting Empowerment

In the Mpowered development team, we decided to extract and replace the Empowerment component with a new solution. Given our competencies and our trust in GHC’s ability to guide us through large refactorings, we decided to write the new service in Haskell, under the working name Chameleon, and integrate it with the existing BEEtoolkit application, which is written in Ruby on Rails. They are deployed together, as a whole, forming a polyglot BEEtoolkit application that we believe will be much less burdensome to maintain in the long run.

Extracting the calculation from what is currently a Ruby gem into a separate service with an HTTP API has forced us to pull apart tightly coupled components that shouldn’t have been coupled in the first place. This change has had an overall positive side effect on the legacy codebase, but finding where to “draw the line” is not easy. The Chameleon service still connects to the MySQL database owned by the Ruby on Rails service, for instance. We plan to cut the ties to the legacy implementation, step by step, until the Chameleon service is a self-contained system with its own database schema and user interface.
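
As a rough illustration of the shape of that HTTP API, a scoring endpoint in a Haskell service could be declared along these lines with Servant; the route and the ScoreResponse type are hypothetical stand-ins, not Chameleon’s actual API:

{-# LANGUAGE DataKinds     #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}
module Chameleon.Api where

import Data.Aeson   (ToJSON)
import GHC.Generics (Generic)
import Servant

-- Illustrative score type; the real scorecard and score
-- structures are considerably richer.
data ScoreResponse = ScoreResponse
  { totalPoints :: Double
  , level       :: Int
  } deriving (Generic, Show)

instance ToJSON ScoreResponse

-- A single endpoint: calculate the score for a given scorecard.
type ChameleonApi =
  "scorecards" :> Capture "scorecardId" Int :> "score"
               :> Get '[JSON] ScoreResponse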

I won’t cover the details of how we port the Ruby implementation to Haskell in this post, as we plan to cover that in a separate post in the future. Let’s instead focus on testing the old and new implementations.

Offline Testing with Production Data

Our primary approach to cross-testing Empowerment and Chameleon is based on anonymized production data. We have exported an anonymized subset of the MySQL database running in production, and we use this dataset for offline testing of the compatibility between our implementations. It doesn’t necessarily cover all cases, notably legacy charters, which are deprecated and will no longer be supported. It does, however, give us a broad range of inputs and corner cases that our customers have previously hit, and are likely to hit in the future.

We have written a Rake task that runs the legacy Empowerment scoring engine against the anonymized test database. It produces a golden copy JSON file that includes scorecard information and scores for all the charters we want to support in Chameleon. Using this file, a Haskell test suite runs the Chameleon scoring engine on the same scorecards (the input data) used to generate the golden copy, and compares the results.
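
In trimmed-down form, the Haskell side of this test might look something like the following sketch; the Chameleon module, its types, the calculateScore function, and the file path are hypothetical stand-ins for our actual code:

{-# LANGUAGE DeriveGeneric #-}
module GoldenSpec where

import           Data.Aeson   (FromJSON, eitherDecodeFileStrict)
import           GHC.Generics (Generic)
import           Test.Hspec

-- Stand-ins for the real Chameleon types and scoring function;
-- Scorecard and Score are assumed to have FromJSON, Eq, and
-- Show instances.
import Chameleon (Score, Scorecard, calculateScore)

-- One golden entry pairs a scorecard input with the score the
-- legacy Ruby engine produced for it.
data GoldenEntry = GoldenEntry
  { scorecard     :: Scorecard
  , expectedScore :: Score
  } deriving (Generic)

instance FromJSON GoldenEntry

spec :: Spec
spec =
  describe "Chameleon scoring engine" $
    it "reproduces the scores in the golden copy" $ do
      -- The file path is made up for this sketch.
      decoded <- eitherDecodeFileStrict "test/golden-copy.json"
      case decoded of
        Left err      -> expectationFailure err
        Right entries ->
          mapM_
            (\entry ->
               calculateScore (scorecard entry)
                 `shouldBe` expectedScore entry)
            entries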

To get quick feedback, we only export 10 scoring results per scorecard type, resulting in a golden copy of a little over 200 results, but we have run larger tests with up to 16,000 scorecards and had them all pass.

When investigating differences in results, we have found many problems with our new implementation, but also a few bugs lurking in the Ruby code. One particular call to .round() caused incorrect scores across the system, and this was not a new bug. When we plotted the ratio between the incorrect scores caused by the rounding bug and the corresponding correct scores in a histogram, it formed a nice-looking bell curve ranging from 50% to 150%.

Taking a step back, this reminds me of John Hughes’ paper Testing the Hard Stuff and Staying Sane, where he describes how you iteratively work through errors in both your model and your implementation. Although that paper focuses on generative property-based testing, the principle still applies.

The scoring results are large trees of values and metadata, and are very hard to compare by eye. The regular test assertions of Hspec would print the entire trees in a failing test, which is not helpful. To make it easier to troubleshoot differences between results, we implemented a contextual difference pretty-printer, showing us any specific difference and the surrounding context:

Scorecard
  { FinancialLongTermGenericBravo =
      { BravoManagement =
          Element
            { elementAnnotation =
                AccumulatedResult
                  { accumulatedAchievedPoints =
                      - 11.95666533360486
                      + 11.697545333604863
                  }

            , elementChildren =
                { board_participation =
                    IndicatorGroup
                      { indicatorGroupIndicators =
                          { voting_rights_of_black_people =
                              Indicator
                                { indicatorValue =
                                    ( IndicatorResult
                                        { achieved =
                                            Achieved
                                              { amount =
                                                  - 52.92
                                                  + 37.044
                                              , points =
                                                  - 1.0
                                                  + 0.74088
                                              , progress =
                                                  - 0.5292
                                                  + 0.37044
                                              }
                                        }
                                    , *
                                    )
                                }
                          }
                      }
                }
            }
      }
  }

In the preceding example, the achieved amount, points, and progress differ, as do the accumulated achieved points. Unlike the output of a Show instance, the pretty-printed record is not valid Haskell syntax; it is the printed representation of a custom tree data structure that we convert both results to before comparing.
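
To give an idea of how this works, here is a stripped-down sketch of the approach, assuming a simplified generic tree; the real printer also handles record syntax and keeps the surrounding context shown above:

-- Convert both results to a generic tree, diff the trees, and
-- render only the branches that contain a difference.
data Tree
  = Leaf String
  | Node String [Tree]
  deriving (Eq, Show)

data Diff
  = Same
  | Changed String String  -- old and new leaf values
  | Nested String [Diff]   -- a node with differing descendants
  deriving (Show)

diffTree :: Tree -> Tree -> Diff
diffTree (Leaf a) (Leaf b)
  | a == b    = Same
  | otherwise = Changed a b
diffTree (Node na as) (Node nb bs)
  | na == nb && length as == length bs =
      let ds = zipWith diffTree as bs
      in if all isSame ds then Same else Nested na ds
diffTree a b = Changed (show a) (show b)

isSame :: Diff -> Bool
isSame Same = True
isSame _    = False

-- Render a diff with "- old" / "+ new" markers, skipping
-- subtrees without differences.
render :: Int -> Diff -> [String]
render _ Same = []
render i (Changed old new) =
  [indent i ("- " <> old), indent i ("+ " <> new)]
render i (Nested name ds) =
  indent i name : concatMap (render (i + 2)) ds

indent :: Int -> String -> String
indent n = (replicate n ' ' <>)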

Online Testing

As a second way of verifying that the new implementation behaves correctly, we run both implementations side by side in our deployment. We haven’t yet enabled this in production, but we are currently testing it in our staging environment. Whenever the Ruby on Rails application requests scores for a scorecard, it triggers both the legacy calculation and an HTTP API request to Chameleon.

When both scores are successfully calculated, they are compared using the hashdiff gem. In case of a difference, it is logged and reported to Honeybadger, a service that tracks runtime exceptions in our staging and production environments. From there, we can inspect differences and investigate further why our implementations differ. When there is a difference, the legacy score is returned.
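
Conceptually, the decision logic amounts to something like the following Haskell sketch; the actual implementation lives on the Ruby side, and shadowCompare and report are made-up names:

import Control.Monad (unless)

-- A conceptual sketch of the shadow-mode decision; the real code
-- is Ruby, comparing hashes with hashdiff and reporting to
-- Honeybadger. 'report' stands in for the error-tracking call.
shadowCompare
  :: (Eq score, Show score)
  => (String -> IO ())  -- report a difference
  -> score              -- legacy Empowerment score
  -> score              -- new Chameleon score
  -> IO score
shadowCompare report legacy chameleon = do
  unless (legacy == chameleon) $
    report ("score difference: " <> show (legacy, chameleon))
  -- When the scores agree they are interchangeable; on a
  -- difference, the legacy score wins.
  pure legacy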

The main issue with this approach is the adapter layer between the data returned from the Chameleon HTTP API and the Ruby on Rails application. Empowerment has a rich object API, with a multitude of convenience methods, and even a DSL built with method_missing. HTML templates throughout the legacy application depend on these methods and the DSL to traverse the score hierarchy. The integration with Chameleon needs to fulfil the same object API, but we have no good way of automatically testing that the adapter classes do so, and no static type system to verify it with. In other words, we have to manually test all views in the web application to be confident that it’s correct. This is what we are currently focusing on in the staging environment.

When we are confident that Chameleon delivers correct results, and that the integration with the legacy Ruby on Rails web application is reliable, we will turn off Empowerment, and eventually delete the code altogether.

Summary

So, what’s the point of this whole exercise? Spending months reimplementing the scoring engine in another programming language, but still having the exact same functionality as before, seems like a huge waste. Right?

Well, at the end of this we’ll have the code supporting the central business value of this system written in a strongly and statically typed language, and have a test suite with a broad range of inputs and expected outputs. With that in place, we can refactor the Haskell code from the mechanically ported Ruby code to something we’d like to have written in the first place, something that we can read, understand, and change. New charters can be written in Haskell in the style we want. Eventually, we can move the user interface for scoring over to Chameleon, and enjoy type-safe HTML templates. We can also move the database schema ownership over to Chameleon, and regard it as a self-contained system that can be modified independently.

In addition to the benefits for maintaining and evolving the scoring engine, this work has uncovered bugs in the legacy implementation, deepened our understanding of the system and its accidental complexity, and reduced coupling between various components. In closing, our team considers this project a success on many fronts.