Box and Whiskers Plot

2021-04-24 :: racket, data visualization

The Box and Whiskers plot is a method for depicting groups of numerical data through their quartiles and it is a popular way to depict statistical information about data sets, yet the Racket plot package does not support such a plot type. In this blog post we’ll explore how to add this plot type to the plot package without having to modify the package itself, and we’ll look at some useful techniques of extending the plot package.

Sample Data

The Box-And-Whiskers plot shows information about a set of measurements and, to keep things simple, we’ll use a random sample as our data set, instead of finding some actual measurements to use.

We could construct the random sample by calling random repeatedly, but this will produce a uniform distribution of the data, where each value has an equal chance of showing up. Such a distribution does not represent real-world measurements and would produce unrealistic box-and-whiskers plots.

To model real-world measurements, we need to draw random numbers from a normal distribution. Racket provides facilities for working with distributions, and, to create a set of points following a normal distribution, we need to construct a distribution object and sample it:

(require math/distribution)

(define mean 2)
(define stddev 5)
(define the-distribution (normal-dist mean stddev))
(define the-samples (sample the-distribution 100))

The above code creates a normal distribution with a mean of 2 and a standard deviation of 5, and samples 100 random points from it. We can easily verify that the points actually follow the normal distribution using a plot:

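(require plot)

;; Probability density function of a normal distribution with mean μ and
;; standard deviation σ
(define (gaussian μ σ)
  (lambda (x)
    (define sqrt-2π (sqrt (* 2 pi)))
    (* (/ 1 (* σ sqrt-2π))
       (exp (* -1/2 (expt (/ (- x μ) σ) 2))))))
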
(parameterize ([plot-pen-color-map 'tab10]
               [plot-x-label       #f]
               [plot-y-label       #f]
               [plot-y-ticks       no-ticks]
               [plot-legend-anchor 'outside-top-left]
               [plot-legend-layout '(rows 1 compact)])
  (plot (list
         (vrule mean                      #:width 2 #:color 1 #:label "μ = 2")
         (function (gaussian mean stddev) #:width 4 #:color 0 #:label "μ = 2, σ = 5")
         (density the-samples             #:width 4 #:color 3 #:label "samples"))
        #:width 800 #:height 300 #:x-min -30 #:x-max +30))

The data points do not follow the Gaussian exactly, but this is only because we took a small sample of only 100 points — the more points are drawn from the distribution, the more closely it will resemble a normal distribution. In our case, however, a small sample is sufficient, and has the added benefit that it will produce an asymmetric box-and-whiskers plot, which will look more interesting.
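
The effect of the sample size can also be checked numerically. The snippet below is a small sketch, not part of the original code: it draws a larger sample (10000 points, an arbitrary choice) from the same distribution and computes its mean and standard deviation using mean and stddev from math/statistics. The names big-sample, sample-mean and sample-stddev are made up for this illustration.

```racket
#lang racket
(require math/distribution math/statistics)

;; Assumption: 10000 points instead of the 100 used for the plots,
;; so the estimates land close to the distribution parameters.
(define big-sample (sample (normal-dist 2 5) 10000))

;; Estimates computed from the sample itself
(define sample-mean (mean big-sample))
(define sample-stddev (stddev big-sample))

;; The exact numbers differ on every run, since the sample is random
(printf "mean ≈ ~a, stddev ≈ ~a\n" sample-mean sample-stddev)
```

The printed values should be close to 2 and 5, and repeating the experiment with only 100 points should show noticeably larger deviations.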

Quantiles

The box-and-whiskers plot requires five values from the input data set: the median (or 50% quantile), the 25% quantile (Q1), the 75% quantile (Q3), and the lower and upper whiskers (which are the lowest and highest values that are not outliers), plus a list of outlier points.

Wikipedia provides a detailed description of quantiles, but, since here we are only interested in plotting some data, we’ll just use a simplified definition: a quantile is a value from the data which is higher than a given percentage of the values (e.g. the 75% quantile is the value which is higher than 75% of all the points in the data set). We don’t even need to worry about determining these quantiles, since Racket provides a quantile function for this purpose:

(require math/distribution math/statistics)

(define the-distribution (normal-dist 2 5))
(define the-samples (sample the-distribution 100))

(quantile 0.25 < the-samples) ;; => -1.3059228859987178
(quantile 0.5 < the-samples)  ;; => 2.100636503124511
(quantile 0.75 < the-samples) ;; => 4.959056403217783

Note that, if you run the example above, it will produce different values each time, since the samples are random.
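
For a repeatable experiment, the same calls can be made on a fixed list of numbers. This is only a sketch: the list fixed-data is made up for illustration, and no expected values are shown, since the exact results depend on how quantile handles positions that fall between two data points.

```racket
#lang racket
(require math/statistics)

;; Made-up data, deliberately unsorted: quantile takes the comparison
;; function as an argument and does the ordering itself.
(define fixed-data '(7 1 9 3 5 2 8 4 6 10))

(define q1 (quantile 0.25 < fixed-data))
(define q2 (quantile 0.5 < fixed-data))   ; the median
(define q3 (quantile 0.75 < fixed-data))

(printf "Q1 = ~a, median = ~a, Q3 = ~a\n" q1 q2 q3)
```

Whatever the exact values are, they will always satisfy Q1 ≤ median ≤ Q3, and running the snippet again produces the same output.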

Determining the lower and upper whisker values is somewhat more complicated: they are supposed to be the lowest and highest values which are not outliers, so, first, we need to determine what an outlier is. According to the Wikipedia article on box plots, outliers are points which lie more than 1.5 times the interquartile range above the third quartile (Q3) or below the first quartile (Q1), where the interquartile range (IQR) is simply the difference between the third and first quartiles.

All this is easier to put in code. First, we’ll define a structure, bnw-data, to hold all the information needed for the plot, then we’ll define a function, samples->bnw-data, to calculate all the required elements. Box plots use a scaling factor of 1.5 for the interquartile range, but our function will take an argument allowing the user to change it.

Also, the lower-whisker and upper-whisker values must be values which exist in our data set. It is not sufficient to set, for example, the upper-whisker to q3 + 1.5 * iqr, as this might not be a value present in the data set. Instead, the whisker values are determined by iterating over the data set and finding the lowest and highest values which are within the lower and upper bounds defined by the interquartile range.

Finally, the outlier points are determined as the values from the data set which are outside the lower-upper whisker range:

(struct bnw-data
  (q1 median q3 lower-whisker upper-whisker outliers)
  #:transparent)

(define (samples->bnw-data vs #:iqr-scale [iqr-scale 1.5])
  (let* ([q1 (quantile 0.25 < vs)]
         [median (quantile 0.5 < vs)]
         [q3 (quantile 0.75 < vs)]
         [iqr (- q3 q1)]
         [lower-limit (- q1 (* iqr-scale iqr))]
         [upper-limit (+ q3 (* iqr-scale iqr))]
         [lower-whisker (sequence-fold
                         (lambda (a sample)
                           (if (>= sample lower-limit) (min a sample) a))
                         q1 vs)]
         [upper-whisker (sequence-fold
                         (lambda (a sample)
                           (if (<= sample upper-limit) (max a sample) a))
                         q3 vs)]
         [outliers (sequence-filter
                    (lambda (sample) (or (> sample upper-whisker)
                                         (< sample lower-whisker))) vs)])
    (bnw-data q1 median q3 lower-whisker upper-whisker (sequence->list outliers))))
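
The outlier rule is easier to follow on a tiny data set. The sketch below is a simplified variant of the code above, not a replacement for it: outlier-limits is a hypothetical helper introduced only for this example, the data is made up, and the whiskers are found with min and max over the values inside the limits rather than with sequence-fold.

```racket
#lang racket
(require math/statistics)

;; Made-up data: nine evenly spaced values plus one extreme value
(define data '(1 2 3 4 5 6 7 8 9 100))

;; Hypothetical helper: the limits beyond which a value counts as an
;; outlier, that is Q1 - scale * IQR and Q3 + scale * IQR.
(define (outlier-limits vs [iqr-scale 1.5])
  (define q1 (quantile 0.25 < vs))
  (define q3 (quantile 0.75 < vs))
  (define iqr (- q3 q1))
  (values (- q1 (* iqr-scale iqr))
          (+ q3 (* iqr-scale iqr))))

(define-values (lower-limit upper-limit) (outlier-limits data))

;; Values between the limits; the whiskers are the extremes of these,
;; so they are always values present in the data set.
(define inside
  (filter (lambda (v) (<= lower-limit v upper-limit)) data))
(define lower-whisker (apply min inside))
(define upper-whisker (apply max inside))

;; Everything else is an outlier
(define outliers
  (filter (lambda (v) (not (<= lower-limit v upper-limit))) data))

(printf "whiskers: ~a .. ~a, outliers: ~a\n"
        lower-whisker upper-whisker outliers)
```

Here the value 100 is far above the upper limit, so it is the only outlier, and the whiskers end at 1 and 9, both of which are actual data points.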

Initial Renderer

Since we won’t modify the plot package to add support for this plot type, we need to construct one out of existing components. Doing this requires two observations:

the basic elements of the box-and-whiskers plot are already supported by the plot package: they are the rectangles, lines and points renderers.
the various plot commands accept a “renderer tree”, not just a simple list of renderers.

The second point is worth clarifying a bit. Most people familiar with the plot package know that you can plot two or more data sets by putting them in a list. For example, this will plot the sine and cosine functions on the same plot:

(plot (list (function sin) (function cos)) #:x-min -5 #:x-max +5)

However, the plot function accepts arbitrarily nested lists of renderers, so, we can add a sublist with three horizontal lines to the plot:

(plot (list (function sin) (function cos)
            (list (hrule 0) (hrule 0.5) (hrule -0.5)))
      #:x-min -5 #:x-max +5)

Also, things like function and hrule are functions which produce renderer values, and they can be given names and returned from functions themselves:

(define the-sine (function sin))
(define the-cos (function cos))
(define (make-horizontal-markers low mid high)
  (list (hrule low) (hrule mid) (hrule high)))

(plot (list the-sine the-cos (make-horizontal-markers -0.5 0 0.5))
      #:x-min -5 #:x-max 5)

This means that we can write a function which returns a list of renderers, and its result can be passed to plot together with other built-in renderers.

We also need to decide on the coordinate system to use. The box-and-whiskers plot is one dimensional: the position and height of the plot depend on the values in the data set, so we will use that scale on the Y axis. However, there is no actual X value for the plot. Instead, we will use an arbitrary X value when plotting elements. This value can be specified by the user and allows placing several plots next to each other. For now, we will use 0 for the X value. While the plot is one dimensional, it does have a width, otherwise the box and whiskers would not be distinguishable from a line. We will base the width on the discrete-histogram-gap value, since the histogram renderers use a similar technique: each plot occupies a slot one unit wide, and the gap is the fraction of that slot which is left empty. This means that the user can place several box-and-whiskers plots next to each other by placing them one unit apart on the X axis.
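
The width calculation can be illustrated with plain arithmetic. In this sketch, box-extent is a hypothetical helper (the renderer computes the same values inline), and 1/8 is just an example gap value passed in explicitly.

```racket
#lang racket

;; Hypothetical helper: left and right edges of a box centered at x,
;; assuming each plot occupies a slot one unit wide and `gap` is the
;; fraction of that slot left empty.
(define (box-extent x gap)
  (define half-width (* 1/2 (- 1 gap)))
  (values (- x half-width) (+ x half-width)))

;; Two neighbouring plots, at X positions 0 and 1
(define-values (left0 right0) (box-extent 0 1/8))
(define-values (left1 right1) (box-extent 1 1/8))

;; Empty space between the two boxes
(define space-between (- left1 right0))

(printf "box 0: ~a .. ~a, box 1: ~a .. ~a, space: ~a\n"
        left0 right0 left1 right1 space-between)
```

With a gap of 1/8, each box is 7/8 wide and two boxes placed one unit apart are separated by exactly 1/8, which is why a spacing of one unit on the X axis works for any gap value between 0 and 1.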

The box-and-whiskers plot is composed of four components (the box, the median line, the whiskers and the outliers) and we’ll use a separate renderer for each of them:

the box is a rectangle drawn with rectangles
the median and whiskers are lines
the outliers are rendered using points

Here is the first version of the box-and-whiskers renderer. The function extracts the data from a bnw-data structure and returns a list of renderers to construct the plot. Perhaps the most unusual renderer is the one for the whiskers. The data series passed to lines does not need to be continuous: gaps can be introduced by adding a point with +nan.0 coordinates. Also, the line segments can go in arbitrary directions and do not need to follow a horizontal path:

(define (box-and-whiskers-0
         data
         #:x [x 0]
         #:gap [gap (discrete-histogram-gap)])
  (match-define (bnw-data q1 median q3 lower-whisker upper-whisker outliers) data)
  (define half-width (* 1/2 (- 1 gap)))
  (define quarter-width (* 1/4 (- 1 gap)))
  ;; A point with NaN coordinates introduces a gap in a line
  (define skip (vector +nan.0 +nan.0))
  (list
   ;; Box
   (rectangles
    (list (vector (ivl (- x half-width) (+ x half-width)) (ivl q1 q3))))
   ;; Median line
   (lines (list (vector (- x half-width) median)
                (vector (+ x half-width) median)))
   ;; Whiskers: two vertical stems, then the two horizontal end caps
   (lines (list (vector x lower-whisker) (vector x q1)
                skip
                (vector x q3) (vector x upper-whisker)
                skip
                (vector (- x quarter-width) lower-whisker)
                (vector (+ x quarter-width) lower-whisker)
                skip
                (vector (- x quarter-width) upper-whisker)
                (vector (+ x quarter-width) upper-whisker)))
   ;; Outliers
   (points (for/list ([o outliers]) (vector x o)))))

Here is how to use the new renderer, placing three box-and-whiskers plots next to each other:

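;; Box-and-whiskers data, calculated from the samples defined earlier
(define bnw (samples->bnw-data the-samples))

(plot-pen-color-map 'tab20)  ;; use more interesting colors...
(plot-brush-color-map 'tab20)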

(plot (list (box-and-whiskers-0 bnw #:x 0)
            (box-and-whiskers-0 bnw #:x 1)
            (box-and-whiskers-0 bnw #:x 2))
      #:x-min -1 #:x-max 3 #:y-min -10 #:y-max 25)

Horizontal Layout

The initial box-and-whiskers renderer will construct a vertical plot, but some people prefer a horizontal layout with plots stacked on top of each other. This layout can be achieved simply by swapping the coordinates of all the points in the plot, that is, changing all instances of (vector x y) to (vector y x). Of course, this needs to be a parameter for the box-and-whiskers renderer and the swap needs to be made conditionally. We could replace every instance of vector with an if conditional, but it is simpler to define a maybe-invert function, which swaps its arguments conditionally, and replace all uses of vector with this new function:

(define (box-and-whiskers-1
         data
         #:x [x 0]
         #:gap [gap (discrete-histogram-gap)]
         #:invert? [invert? #f])
  (match-define (bnw-data q1 median q3 lower-whisker upper-whisker outliers) data)
  (define half-width (* 1/2 (- 1 gap)))
  (define quarter-width (* 1/4 (- 1 gap)))
  (define skip (vector +nan.0 +nan.0))
  (define maybe-invert (if invert? (lambda (x y) (vector y x)) vector))
  (list
   ;; Box
   (rectangles
    (list (maybe-invert (ivl (- x half-width) (+ x half-width)) (ivl q1 q3))))
   ;; Median line
   (lines (list (maybe-invert (- x half-width) median)
                (maybe-invert (+ x half-width) median)))
   ;; Whiskers
   (lines (list (maybe-invert x lower-whisker) (maybe-invert x q1)
                skip
                (maybe-invert x q3) (maybe-invert x upper-whisker)
                skip
                (maybe-invert (- x quarter-width) lower-whisker)
                (maybe-invert (+ x quarter-width) lower-whisker)
                skip
                (maybe-invert (- x quarter-width) upper-whisker)
                (maybe-invert (+ x quarter-width) upper-whisker)))
   ;; outliers
   (points (for/list ([o outliers]) (maybe-invert x o)))))

Here is how the plot looks when inverted. Note that we also had to change the range of the X and Y axes since the range of the data set is now on the X axis, while the plots themselves are stacked on the Y coordinate:

(plot (list (box-and-whiskers-1 bnw #:x 0 #:invert? #t)
            (box-and-whiskers-1 bnw #:x 1 #:invert? #t)
            (box-and-whiskers-1 bnw #:x 2 #:invert? #t))
    #:y-min -1 #:y-max 3 #:x-min -12 #:x-max 25)
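
The coordinate swap can be tried in isolation, without plotting anything. In this sketch, make-point-builder is a hypothetical helper which mirrors how maybe-invert is defined inside the renderer, and the coordinates are made up.

```racket
#lang racket

;; Hypothetical helper: returns a two-argument function which builds a
;; point, swapping the coordinates when invert? is true.
(define (make-point-builder invert?)
  (if invert?
      (lambda (x y) (vector y x))
      vector))

(define upright (make-point-builder #f))
(define sideways (make-point-builder #t))

(upright 0 4.2)   ; the data value stays on the Y axis
(sideways 0 4.2)  ; the data value moves to the X axis
```

Since vector is itself a function, it can be used directly as the non-inverted case, so the rest of the renderer does not need to know which variant it is calling.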

Labels

The X axis (or the Y axis if the box plots are inverted) shows numeric positions by default, since each box plot is at a specific X position. However, these numbers have no meaning and they can be replaced with labels to better reflect the meaning of the data in each box plot.

The plot package allows configuring the labels on the axes using the axis ticks parameters and has some predefined ticks types. In addition to the default linear-ticks, there are layouts for dates, times and currencies.

We can define our own ticks to assign a label to the X or Y coordinate of a box plot, which would be used like this:

(parameterize ([plot-y-label #f]
               [plot-y-ticks (bnw-ticks '((0 . "experiment 1")
                                          (1 . "experiment 2")
                                          (2 . "experiment 3")))])
  (plot (list (box-and-whiskers-1 bnw #:x 0 #:invert? #t)
              (box-and-whiskers-1 bnw #:x 1 #:invert? #t)
              (box-and-whiskers-1 bnw #:x 2 #:invert? #t))
        #:y-min -1 #:y-max 3 #:x-min -12 #:x-max 25))

The bnw-ticks function constructs a ticks structure, which is used by the plot package to create and format labels on the plot axes. The ticks structure has two functions:

a layout function is responsible for providing a list of values where ticks should appear. It receives as arguments the “low” and “high” values for the axes (these come from the #:x-min, #:x-max, #:y-min, #:y-max parameters to the plot function)
a format function is responsible for providing a string representation for each tick in a list. The function receives a list of ticks (actually pre-tick structures) and returns a list of strings representing the labels.

(define (bnw-ticks labels)

  (define (layout low high)
    (for/list ([label (in-list labels)]
               #:when (let ([v (car label)])
                        (and (>= v low) (< v high))))
      (pre-tick (car label) #t)))

  (define (format low high pre-ticks)
    (for/list ([t (in-list pre-ticks)])
      (dict-ref labels (pre-tick-value t)
                (lambda () (~a (pre-tick-value t))))))

  (ticks layout format))

The bnw-ticks function receives an association list of axis positions and their corresponding labels and returns a ticks structure with the layout and format functions. The layout function is straightforward: when called, it constructs pre-ticks containing all the positions which are in the range from low to high. The format function will simply look up the label for each received pre-tick, and, if a label is not found, format the value using ~a. This last aspect is needed because the ticks are also used to label special positions on the axes (such as when selecting a region of the plot), and the format function cannot rely on the fact that it only needs to format pre-ticks produced by the corresponding layout function.
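
The fallback behavior of the format function can be seen on plain values, without involving the plot package. In this sketch, label-for is a hypothetical helper which performs the same dict-ref lookup as the format function, but on a bare number instead of a pre-tick.

```racket
#lang racket

(define labels '((0 . "experiment 1")
                 (1 . "experiment 2")
                 (2 . "experiment 3")))

;; Hypothetical helper: look up the label for a position, falling back
;; to the printed form of the value when there is no label for it.
(define (label-for value)
  (dict-ref labels value (lambda () (~a value))))

(label-for 1)    ; a position which has a label
(label-for 1.5)  ; no label for this position, so it is formatted with ~a
```

Keys in an association list are compared with equal?, so the exact integer 1 and the floating point number 1.0 are different keys. This works here because the layout function builds the pre-ticks from the same keys that are used for the lookup.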

Colors and Line Styles

The box-and-whiskers renderer implementation presented so far is fully functional, but it is missing options to control the look of the plot: all items are drawn using the default color and line width. All plot renderers accept optional keyword arguments to control the color, line width and other aspects of the plot itself. The number of these arguments is quite large, but they are provided as a convenience to the user.

The box-and-whiskers plot has lots of components, each with separate rendering options, so the box-and-whiskers renderer will need lots of optional keyword arguments, although all these will have convenient defaults. Adding all these arguments to the renderer function will significantly increase its size, but, since this change is just a set of trivial changes, it will not be shown here. If you are interested in the final version you can have a look at the full implementation.

Final Thoughts

Why is this not part of the plot package? Perhaps it should be, although the plot package cannot contain every possible plot type and it is sometimes useful to know some techniques which allow someone to extend the plot package without having to modify it. Even if you don’t need a Box and Whiskers plot, you might find the techniques presented here useful.