We frequently run into questions that are best answered by visualizations of part-to-whole relationships: total revenue by country or product, total incidents by turbine, genome read variants by chromosome, &c.
The best-known graphic depicting part-to-whole relationships is the pie chart. Unfortunately, the human perceptual system cannot accurately decipher pie charts, so we cannot recommend them to our clients. (For an excellent primer on why to avoid pie charts, see Stephen Few’s article.)
As an alternative to pies, we typically employ a trusty bar chart:
Proportion of total revenue by country
This chart emphasizes that it’s a part-to-whole comparison via the title (“Proportion of…”) and by the use of percentages on the axis rather than absolute values (e.g., dollars).
However, bar charts do not immediately signal “part-to-whole comparisons ahead!” as loudly as pie charts do — in a quick glance at the chart above, one could easily fail to realize that the percentages sum to 100% and assume that the graph depicts some other proportion (e.g., percentage of citizens who had a beer at lunch today).
This article introduces mosaic plots, a visualization that illustrates parts-to-whole relationships using area. While the visual comparisons within mosaic plots (area to area) are not as robust as those within bar charts (length along a common baseline), mosaic plots are useful in situations where:
- part-to-whole or part-to-part-to-whole relationships should be emphasized
- exact values can be retrieved via some other method (e.g., a table) if necessary
- space is limited and/or small multiples comparisons are appropriate.
I originally came across formal discussion of mosaic plots in 2011 when my friend Hadley Wickham published a paper describing “product plots”, a framework that encompasses mosaic plots, bar charts, treemaps, and other area-based visualizations.
We’ll first cover the (light) statistical theory behind the operation of mosaic plots, and then cover the particularly straightforward implementation of attractive, responsive mosaic plots on modern browsers via CSS flexbox.
Statistical distributions 101
Mosaic plots excel at depicting part-to-whole relationships. From a whole dataset, there tend to be many dimensions across which to split the data into parts.
Consider surveying your closest 10,000 friends: What’s the highest education level they’ve received? Have they been married? Are they in good health? Are they happy? (This article’s data is the same data used by Hadley in his original paper, from the General Social Survey — as provided by Hadley on Github.)
We can break the whole set of responses into parts, starting with a simple question: What proportion of people are happy?
|not too happy||5,629|
|pretty happy||25,874|
|very happy||14,800|
Absolute counts aren’t illuminating much here (beyond the fact that the GSS is a big survey), so let’s switch to proportions:
|not too happy||0.12|
|pretty happy||0.56|
|very happy||0.32|
This is the marginal distribution of happiness:
I’m happy to notice that the majority of people are either “pretty happy” or “very happy”!
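As a minimal sketch of the arithmetic (in Python, which is not what the article's graphics use), the marginal distribution is just each count divided by the total. The counts are the ones that appear as flex values in the markup later in this article; the function name marginal is my own.

```python
# Respondents per happiness level (the same counts used as flex values in the markup).
happy_counts = {
    "not too happy": 5629,
    "pretty happy": 25874,
    "very happy": 14800,
}


def marginal(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


f_happy = marginal(happy_counts)
for level, p in f_happy.items():
    # Rounded to two decimals: 0.12, 0.56, 0.32
    print(f"{level:>14}: {p:.2f}")
```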
Let’s look at these happiness data together with biological sex.
In the joint distribution, f(happy, sex), each cell shows the proportion of the entire survey corresponding to those values:
|happiness||male||female|
|not too happy||0.05||0.07|
|pretty happy||0.25||0.31|
|very happy||0.14||0.18|
So, for example, 5% of respondents are not too happy males. (Note that all of the cells sum to 1.) This display allows you to make observations like: “There were twice as many very happy males as not too happy females”.
Summing the values along a row or column eliminates the variable on the other dimension. For example, adding all of the values on the first row essentially means, “I don’t care about sex; I just want to know what proportion of people are not too happy”. Doing this on all of the rows and columns yields the marginals:
|not too happy||0.12|
|pretty happy||0.56|
|very happy||0.32|
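The row and column sums can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the six cell counts are taken from the nested markup later in this article, and the helper name marginalize is hypothetical.

```python
# Joint counts keyed by (happiness, sex), taken from the nested mosaic markup.
joint_counts = {
    ("not too happy", "male"): 2424,
    ("not too happy", "female"): 3205,
    ("pretty happy", "male"): 11555,
    ("pretty happy", "female"): 14319,
    ("very happy", "male"): 6378,
    ("very happy", "female"): 8422,
}

total = sum(joint_counts.values())
# Joint distribution: every cell as a proportion of the whole survey.
f_joint = {cell: n / total for cell, n in joint_counts.items()}


def marginalize(joint, axis):
    """Sum out the other variable. axis=0 keeps happiness, axis=1 keeps sex."""
    out = {}
    for cell, p in joint.items():
        out[cell[axis]] = out.get(cell[axis], 0.0) + p
    return out


f_happy = marginalize(f_joint, 0)  # "I don't care about sex"
f_sex = marginalize(f_joint, 1)    # "I don't care about happiness"
```

Here f_happy["not too happy"] rounds to 0.12, matching the marginal table, and f_sex works out to roughly 0.44 male and 0.56 female.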
The final kind of distribution we’ll see is the conditional.
The conditional distribution f(happy | sex),
|not too happy||pretty happy||very happy|
answers questions like: “What is the chance that I’m very happy, given that I’m female?” Note that the rows sum to 1.
We can also flip the variables and examine f(sex | happy),
|happiness||male||female|
|not too happy||0.43||0.57|
|pretty happy||0.45||0.55|
|very happy||0.43||0.57|
which answers questions like: “What is the chance that I’m female, given that I’m pretty happy?”
(This question may sound absurd; how is “chance” involved in an adult’s biological sex? A helpful way of thinking about chance is not as a property of the universe but rather as a property of your knowledge base. When we say that a coin has “50/50 chance”, what we really mean is that we don’t have the right measuring equipment — with a high-speed camera and computer on hand, we could reliably predict tosses of the same “50/50 chance” coin.)
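Both conditionals come from the same joint counts; the only difference is which variable's totals we divide by. Here is a rough Python sketch of that idea, again using the counts from the markup later in this article (the function name conditional is my own).

```python
# Joint counts keyed by (happiness, sex), taken from the nested mosaic markup.
joint_counts = {
    ("not too happy", "male"): 2424,
    ("not too happy", "female"): 3205,
    ("pretty happy", "male"): 11555,
    ("pretty happy", "female"): 14319,
    ("very happy", "male"): 6378,
    ("very happy", "female"): 8422,
}


def conditional(counts, given_axis):
    """Normalize each cell by the total of its 'given' level.

    given_axis=0 conditions on happiness, yielding f(sex | happy);
    given_axis=1 conditions on sex, yielding f(happy | sex).
    Keys stay in (happiness, sex) order either way.
    """
    totals = {}
    for cell, n in counts.items():
        totals[cell[given_axis]] = totals.get(cell[given_axis], 0) + n
    return {cell: n / totals[cell[given_axis]] for cell, n in counts.items()}


f_sex_given_happy = conditional(joint_counts, 0)
f_happy_given_sex = conditional(joint_counts, 1)

# "What is the chance that I'm very happy, given that I'm female?"
print(round(f_happy_given_sex[("very happy", "female")], 2))
# "What is the chance that I'm female, given that I'm pretty happy?"
print(round(f_sex_given_happy[("pretty happy", "female")], 2))
```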
These three types of distributions — joint, marginal, and conditional — support different kinds of questions about our data. Thus far, we’ve depicted each distribution as a plain ol’ table. Next we’ll see how to depict these distributions visually.
The key idea of mosaic plots is that we can map the proportions of a distribution to the areas of a graphic.
For example, take the marginal distribution of happiness,
|not too happy||0.12|
|pretty happy||0.56|
|very happy||0.32|
The same approach also works for conditional distributions like f(sex | happy):
|happiness||male||female|
|not too happy||0.43||0.57|
|pretty happy||0.45||0.55|
|very happy||0.43||0.57|
where the graphic has been evenly partitioned into three vertical spines, one for each level of the categorical variable “happiness”. Each vertical spine is then divided into two horizontal spines, corresponding to “male” and “female”.
(Note: horizontal and vertical refer to the direction in which the spines expand, not to their long axis, which depends on the aspect ratio of the mosaic plot.)
From these two mosaic plots we can visually illustrate the mathematical fact f(happy) × f(sex | happy) = f(sex, happy):
Each one of the six disjoint segments of the rightmost mosaic plot has area proportional to the corresponding joint probability:
|happiness||male||female|
|not too happy||0.05||0.07|
|pretty happy||0.25||0.31|
|very happy||0.14||0.18|
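The area claim can be checked numerically. In the sketch below (Python, with a hypothetical helper named tile_geometry), the plot is treated as a unit square: each tile's width is f(happy), its height is f(sex | happy), and the product should equal the joint proportion f(sex, happy). The counts are the ones from the nested markup later in this article.

```python
import math

# Joint counts keyed by (happiness, sex), taken from the nested mosaic markup.
joint_counts = {
    ("not too happy", "male"): 2424,
    ("not too happy", "female"): 3205,
    ("pretty happy", "male"): 11555,
    ("pretty happy", "female"): 14319,
    ("very happy", "male"): 6378,
    ("very happy", "female"): 8422,
}


def tile_geometry(counts):
    """Return (width, height, area) for each tile of a unit-square mosaic.

    width = f(happy), height = f(sex | happy), area = width * height.
    """
    total = sum(counts.values())
    happy_totals = {}
    for (happy, _sex), n in counts.items():
        happy_totals[happy] = happy_totals.get(happy, 0) + n
    tiles = {}
    for (happy, sex), n in counts.items():
        width = happy_totals[happy] / total
        height = n / happy_totals[happy]
        tiles[(happy, sex)] = (width, height, width * height)
    return tiles


tiles = tile_geometry(joint_counts)
total = sum(joint_counts.values())
for cell, (width, height, area) in tiles.items():
    # Each tile's area matches the joint proportion for that cell.
    assert math.isclose(area, joint_counts[cell] / total)
```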
On a mosaic plot the marginals can be quickly estimated by looking at a single row or color. In contrast, these same data would require six bars on a bar chart, and one would need to locate and mentally “stack” bars together to make the same comparisons.
A mosaic plot is a disjoint partition of a rectangular area. Within a mosaic plot, each rectangular sub-area expands horizontally or vertically according to its weight relative to its siblings. This is exactly the problem that CSS3’s flexbox specification solves. (Flexbox is currently a “last call working draft”, but it’s already supported by 88% of browsers.)
Consider the mosaic plot of the marginal distribution f(happy):
This rectangle has been split into three horizontal spines, the width of each corresponding to the relative weight of that level. This graphic was created with this HTML markup:
<div class="mosaic-plot spines">
  <div style="flex: 5629;" data-happy="not too happy"></div>
  <div style="flex: 25874;" data-happy="pretty happy"></div>
  <div style="flex: 14800;" data-happy="very happy"></div>
</div>
The inline style flex value is simply that level’s marginal count (e.g., 5629 respondents reported being “not too happy”).
The flexbox layout engine takes care of all the scaling for us.
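To make the scaling concrete, here is a rough Python sketch of the arithmetic the layout engine performs in this simple case. It assumes the spines are empty elements with no padding, borders, or minimum content size, so a unitless flex: N (grow factor N with a zero basis) splits the container in proportion to N. The function name spine_sizes is hypothetical; the 200px container size comes from the styles below.

```python
def spine_sizes(flex_values, container_px):
    """Split container_px among spines in proportion to their flex values."""
    total = sum(flex_values.values())
    return {level: container_px * n / total for level, n in flex_values.items()}


widths = spine_sizes(
    {"not too happy": 5629, "pretty happy": 25874, "very happy": 14800},
    container_px=200,
)
for level, px in widths.items():
    # Roughly 24.3px, 111.8px, and 63.9px respectively.
    print(f"{level:>14}: {px:.1f}px")
```

Because only the ratios matter, passing raw counts or pre-computed proportions as flex values produces the same layout.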
The required styling (in SASS notation) is minimal:
.mosaic-plot
  display: flex
  height: 200px
  width: 200px

.spines > div
  display: flex
  position: relative
  align-items: stretch
Switching to vertical spines rather than horizontal ones requires a single change in the container class, from spines to vspines, and the corresponding styles:
.vspines
  flex-direction: column

.vspines > div
  display: flex
  position: relative
  align-items: stretch
Rather than using classes or inline styles to color each rectangle, I’m using data attributes:
<div style="flex: 5629;" data-happy="not too happy"></div>
These have two advantages over class-based selectors:
- data attributes admit the exact level name from the original data — no need to convert spaces to underscores or dashes to make a valid class name
- data attributes enforce the disjoint semantics of our distribution: an element can have multiple data-* attributes corresponding to multiple data dimensions, but there is no way to accidentally mark an element as corresponding to multiple levels of the same dimension.
Attribute selectors can then target these attributes to add the background colors:
[data-happy="not too happy"]
  background-color: rgb(255, 131, 73)

[data-happy="pretty happy"]
  background-color: rgb(253, 255, 185)

[data-happy="very happy"]
  background-color: rgb(102, 183, 73)
Because the spines themselves have the display: flex property, they can be nested.
The joint distribution consists of vertical spines nested within horizontal spines:
<div class="mosaic-plot spines">
  <div style="flex:5629;" class="vspines" data-happy="not too happy">
    <div style="flex:2424;" data-sex="male"></div>
    <div style="flex:3205;" data-sex="female"></div>
  </div>
  <div style="flex:25874;" class="vspines" data-happy="pretty happy">
    <div style="flex:11555;" data-sex="male"></div>
    <div style="flex:14319;" data-sex="female"></div>
  </div>
  <div style="flex:14800;" class="vspines" data-happy="very happy">
    <div style="flex:6378;" data-sex="male"></div>
    <div style="flex:8422;" data-sex="female"></div>
  </div>
</div>
(Though notice that we have to assign flex on both the inner vertical spines and the parent horizontal spines.)
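Writing this nested markup by hand gets tedious, so it is natural to generate it from the counts. The following Python sketch is one way to do that, not the method used to build this article's graphics; the function name nested_mosaic is my own. It sets flex on both levels: the summed count on each outer spine and the cell count on each inner spine.

```python
from html import escape


def nested_mosaic(joint_counts):
    """Build nested mosaic markup: one outer spine per happiness level,
    each holding one inner vertical spine per sex."""
    by_happy = {}
    for (happy, sex), n in joint_counts.items():
        by_happy.setdefault(happy, {})[sex] = n

    lines = ['<div class="mosaic-plot spines">']
    for happy, by_sex in by_happy.items():
        outer = sum(by_sex.values())  # flex for the parent spine
        lines.append(
            f'  <div style="flex:{outer};" class="vspines" '
            f'data-happy="{escape(happy)}">'
        )
        for sex, n in by_sex.items():
            lines.append(f'    <div style="flex:{n};" data-sex="{escape(sex)}"></div>')
        lines.append("  </div>")
    lines.append("</div>")
    return "\n".join(lines)


markup = nested_mosaic({
    ("not too happy", "male"): 2424,
    ("not too happy", "female"): 3205,
    ("pretty happy", "male"): 11555,
    ("pretty happy", "female"): 14319,
    ("very happy", "male"): 6378,
    ("very happy", "female"): 8422,
})
print(markup)
```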
Finally, note that
.mosaic-plot
  display: flex
  height: 200px
  width: 200px
Labels and other display considerations
As with most statistical graphics, mosaic plots are useless without proper labeling. A color legend:
|not too happy|
|pretty happy|
|very happy|
consists of straightforward markup:
<table class="legend">
  <tr><td>
    <span class="color-box" data-happy="not too happy"></span> not too happy
  </td></tr>
  <tr><td>
    <span class="color-box" data-happy="pretty happy"></span> pretty happy
  </td></tr>
  <tr><td>
    <span class="color-box" data-happy="very happy"></span> very happy
  </td></tr>
</table>
table.legend
  width: 10em
  margin-left: 4em

span.color-box
  display: inline-block
  width: 1em
  height: 1em
  margin: 0 0.2em
  vertical-align: middle
  border: 1px solid gray
with the colors assigned by the same data-happy selector that colors the mosaic plot.
In this case, the legend itself is gratuitous — we can make a graphic that is both more concise and more readable:
<div class="mosaic-plot vspines">
  <div style="flex:5629" data-happy="not too happy">
    <label class="left">not too happy</label>
  </div>
  <div style="flex:25874" data-happy="pretty happy">
    <label class="left">pretty happy</label>
  </div>
  <div style="flex:14800" data-happy="very happy">
    <label class="left">very happy</label>
  </div>
</div>
and with styles:
label
  width: 100%
  height: 100%
  &.left
    text-align: right
    padding-right: 1em
    position: absolute
    transform: translate(-100%, 0)
The transform CSS property (90.6% browser support, with vendor prefixes) allows us to keep the labels vertically aligned with their parent spine but shift them outside of the mosaic plot.
In this particular graphic, we also have enough vertical space to keep the labels inside the spines:
label.within
  text-align: center
  color: black
  position: absolute
In both cases, the labels are positioned in part according to the data, so care must be taken to make sure the graphic is large enough to prevent the labels from colliding.
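One rough way to check "large enough" ahead of time is to compute how tall the plot must be for the thinnest spine to fit one line of label text. The Python sketch below assumes vertical spines (heights proportional to counts), single-line labels, and a 16px line height; both the line height and the function name min_plot_height are hypothetical choices for illustration.

```python
def min_plot_height(counts, label_px):
    """Smallest plot height (px) at which the thinnest spine is label_px tall."""
    total = sum(counts.values())
    smallest_share = min(counts.values()) / total
    return label_px / smallest_share


happy_counts = {"not too happy": 5629, "pretty happy": 25874, "very happy": 14800}

needed = min_plot_height(happy_counts, label_px=16)
print(f"plot must be at least {needed:.0f}px tall")
```

With these counts the threshold is a little under 132px, so the 200px plot used in this article leaves the "not too happy" spine (about 24px tall) with room for its label.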
The final display consideration is the ordering of categorical variables.
A variable’s levels within a mosaic plot can be placed under a partial ordering via flexbox’s order property. This property overrides the markup order, making it particularly easy to customize ordering to call attention to certain facets of the data.
If the variable is truly categorical (there is no natural ordering of the levels), then a good choice is to order by proportion, with either the smallest or largest value coming first. (Alphabetical ordering supports fast lookup, but if that’s the primary use case then you should use an exact table rather than a visualization.)
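Ordering by proportion is easy to automate. As a hedged sketch (Python, with a hypothetical helper named order_rules), the function below ranks levels by count and emits one order rule per level in the indented Sass style used in this article. The example uses the sex totals implied by the joint counts (20,357 male and 25,946 female), since sex has no natural ordering.

```python
def order_rules(counts, attribute, largest_first=True):
    """Emit indented-Sass rules assigning flexbox order by count."""
    ranked = sorted(counts, key=counts.get, reverse=largest_first)
    rules = []
    for position, level in enumerate(ranked, start=1):
        rules.append(f'[{attribute}="{level}"]\n  order: {position}')
    return "\n\n".join(rules)


rules = order_rules({"male": 20357, "female": 25946}, "data-sex")
print(rules)
```

With largest_first=True, the female level gets order: 1 and the male level gets order: 2; passing largest_first=False reverses that.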
In our survey of happiness, we really have an ordinal variable: The levels are naturally ordered from least to most happy, which we can enforce with these styles:
[data-happy="not too happy"]
  background-color: rgb(255, 131, 73)
  order: 1

[data-happy="pretty happy"]
  background-color: rgb(253, 255, 185)
  order: 2

[data-happy="very happy"]
  background-color: rgb(102, 183, 73)
  order: 3
Mosaic plots are an excellent alternative to bar charts in situations where part-to-whole relationships should be emphasized or where physical space is limited. Although mosaic plots can be drawn “recursively” to depict joint distributions, such graphics quickly become incomprehensible — it’s best to use mosaic plots to display simpler marginal and conditional distributions. (If joint distributions must be shown, your best bet is to draw several complementary mosaic plots and bar charts — Stephen Few discusses this topic at length: Are mosaic plots worthwhile?)
For more details on mosaic plots (including their relationship with bar charts), see Hadley’s original paper.
If you’d like help implementing mosaic plots or designing analytics systems for your business, shoot us an email: firstname.lastname@example.org.
Thanks to Ryan Lucas for suggesting additional context/motivation. Thanks to Dan Luu for suggesting a clearer, parallel sentence structure and transposing one of the conditional tables. Thanks to Nicki Vance for suggesting smoother section transitions and linking tables+plots. Thanks to Hadley Wickham for suggesting exposition on the terms mosaic plots, bar plots, and product plots; also for discovering some CSS issues in Safari.