COVID19 data in the REPL
Disclaimer: I’m not an epidemiologist. This article was written in early March 2020 — before the quarantine across northern Italy, before the flood of high-quality coronavirus data journalism. It is a demonstration of my own workflow as I worked with open data for my own edification in the absence of the visualizations I wanted. The visualizations reflect the data at the time.
I wanted to better understand COVID-19, so I cloned Johns Hopkins’ daily-updated dataset, fired up a Clojure REPL, and started massaging the data into a visualization using the Vega grammar. Mapping the cases in Germany helped me internalize the numbers I was seeing in the news:
From this choropleth we can tell that the virus has hit North Rhine-Westphalia and the two southern states the hardest, but there is a trickle of cases across the country. Perhaps unsurprisingly, the urban centers Berlin and Bremen have a higher concentration than the rural areas surrounding them.
Comparing the coronavirus situation in China to that in Germany using the same geographic visualization is ineffective because the disparity between Hubei and other regions drowns out any province-level shading. We can still use the case data to make a visually appealing map:
The red is eye-catching, we have differentiation between provinces, and we can truthfully say that the map reflects the data. But nevertheless this visualization is tremendously misleading. I cribbed this approach from Kenneth Field’s intentionally-deceptive example in Mapping coronavirus, responsibly, and it is seriously deceptive in two ways. First, why red?
People like red maps. Well that may be true, and they’re certainly attention-grabbing but consider the dataset. We’re mapping a human health tragedy that may get way worse before it subsides. Do we really want the map to be screaming bright red? Red is a very emotive colour. It has meaning.
The map is also specious in that it uses the absolute number of cases in a particular province, without adjusting for population. Our earlier map of Germany shades according to confirmed cases per 100,000 inhabitants. This avoids an unfortunately common pitfall:
There are very very few golden rules in cartography but this is one of them: you cannot map totals using a choropleth thematic mapping technique. The reason is simple. Each of the areas on the map is a different size, and has a different number of people in it.
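In code, that per-capita adjustment is a one-liner. A minimal sketch, with placeholder numbers rather than the article’s actual dataset:

;; Confirmed cases per 100,000 inhabitants. Both arguments below are
;; illustrative, not figures from the article's data.
(defn per-100k [cases population]
  (double (/ cases (/ population 100000))))

(per-100k 484 17900000) ;; hypothetical count for North Rhine-Westphalia
;; => ~2.7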
A better way to produce a compelling geographic visualization of COVID-19 in China is to apply a logarithmic scale, so that differences in shade represent not absolute difference but difference across orders of magnitude:
This is effective without the pitfalls of the red version. A log scale does require some numerical literacy of the reader, which might be a valid expectation in a data science context but perhaps not the right trade-off for popular journalism. Another valid approach, which would give a better feel for the data, would be to create some sort of land-skewing cartogram, though that would take a bit more experimentation. But it’s important to remember that data often speaks best through simple charts. Here’s one comparing COVID-19 cases in Chinese provinces and German states:
Here’s that same chart, removing only a single datapoint – the tremendous outlier that is Hubei province, in which lies the city Wuhan.
I find the second chart more interesting, but it only speaks truth after one has contemplated the first. Note the line with all of Germany’s cases combined.
There are lots of takeaways from just these few visualizations, and more insight from all the intermediate maps and graphs I made on the way. It helped me to visualize the same data multiple ways, so I could get an intuitive sense of the scales involved.
How the sausage was made
Some folks at the recent SciCloj meetup for data science in Clojure remarked that it would be good for people in the community to explain their process, so here’s an example of the tools we have been using lately at Applied.
Within Clojure, I ended up using a minimum of libraries: at first clojure.data.csv and later Nils Grunwald’s meta-csv for parsing CSVs, Metosin’s blazing-fast jsonista for outputting to JSON, and a lib to handle turning Clojure forms into Vega-Lite specs and sending them to a live-reloading browser view: at first Metasoarous’s all-purpose Vega-Lite toolkit Oz; later, after I was sure my use case was narrow enough, the similar but more narrowly-scoped Waqi. If the source data had been a nested bramble I would’ve reached for specter, but it never came to that, because Clojure’s sequence and collection APIs give so much data manipulation power.
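For reference, a hypothetical ns declaration pulling those pieces in (the namespace name and aliases are my own, and I’ve omitted the Oz/Waqi require since I switched between them):

;; Hypothetical namespace; aliases are assumptions, not the article's code.
(ns covid19.explore
  (:require [clojure.data.csv :as csv]
            [clojure.set :refer [rename-keys]]
            [clojure.string :as string]
            [jsonista.core :as json]
            [meta-csv.core :as mcsv]))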
A few times I found the Vega docs incomplete or ambiguous, and struggled to get one of its built-in data-rearranging incantations to work. Rearranging the data on the Clojure side instead was a breeze because the language was built from the ground up to excel at that kind of task. There’s a lesson here: leave visualization to the visual grammar and manipulation of data structures to the functional language designed around practical immutability.
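To make “leave visualization to the visual grammar” concrete: in this workflow a Vega-Lite spec is just a Clojure map handed to Oz or Waqi. A minimal sketch with placeholder values, not the article’s actual spec:

;; A bar chart as plain data; with Oz, (oz.core/view! spec) sends it to
;; the live-reloading browser view. The case counts are placeholders.
(def spec
  {:data {:values [{:state-or-province "Anhui" :cases 990}
                   {:state-or-province "(All German federal states)" :cases 262}]}
   :mark "bar"
   :encoding {:x {:field "cases" :type "quantitative"}
              :y {:field "state-or-province" :type "nominal"}}})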
Another example of such fluidity was the snippet which provides values for the two bar charts above. Its first incarnation was like so:
(->> covid19-cases-csv
     rest
     ;; grab only province/state, country, and latest report of total cases:
     (map (juxt first second last))
     ;; restrict to countries we’re interested in:
     (filter (comp #{"Mainland China" "Germany"} second))
     (reduce (fn [acc [province country current-cases]]
               (if (string/blank? province)
                 ;; put the summary of Germany first
                 (concat [{:state-or-province "(All German federal states)"
                           :cases (Integer/parseInt current-cases)}]
                         acc)
                 ;; otherwise just add the datapoint to the list
                 (conj acc {:state-or-province province
                            :cases (Integer/parseInt current-cases)})))
             [])
     (concat (sort-by :state-or-province (vals deutschland/bundeslaender-data)))
     (remove (comp #{"Hubei"} :state-or-province)))
For reference, covid19-cases-csv is a sequence. Taking the first two elements, we get these two vectors:
(["Province/State" "Country/Region" "Lat" "Long" "1/22/20" "1/23/20" "1/24/20" "1/25/20" "1/26/20" "1/27/20" "1/28/20" "1/29/20" "1/30/20" "1/31/20" "2/1/20" "2/2/20" "2/3/20" "2/4/20" "2/5/20" "2/6/20" "2/7/20" "2/8/20" "2/9/20" "2/10/20" "2/11/20" "2/12/20" "2/13/20" "2/14/20" "2/15/20" "2/16/20" "2/17/20" "2/18/20" "2/19/20" "2/20/20" "2/21/20" "2/22/20" "2/23/20" "2/24/20" "2/25/20" "2/26/20" "2/27/20" "2/28/20" "2/29/20" "3/1/20" "3/2/20" "3/3/20" "3/4/20"]
["Anhui" "Mainland China" "31.8257" "117.2264" "1" "9" "15" "39" "60" "70" "106" "152" "200" "237" "297" "340" "408" "480" "530" "591" "665" "733" "779" "830" "860" "889" "910" "934" "950" "962" "973" "982" "986" "987" "988" "989" "989" "989" "989" "989" "989" "990" "990" "990" "990" "990" "990"])
Our code snippet is a thread, so we work from top to bottom, passing the result of each step as the last argument to the function call in the next step. First we throw away the header and focus on the rest of the data. Then we restrict ourselves to the first and second elements in each row and the last reported number of cases. juxt lets us apply these three functions independently across the rows.
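To see that step in isolation, here it is applied to the Anhui row from above (middle columns elided):

((juxt first second last)
 ["Anhui" "Mainland China" "31.8257" "117.2264" "1" "9" "990"])
;; => ["Anhui" "Mainland China" "990"]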
The next step is to narrow our focus to regions we’re interested in, which we do with one of my favorite idioms, well-described by Arne Brasseur over at Lambda Island: filter with the composition of a set and one or two selector functions – in this case, second, which by REPL experiment we can tell points to the country. I like to imagine comp visually, as a river flowing from the source data on the right (which in this case is implicit because we’re using the thread-last macro ->>, making it the result of the immediately previous map) to the sequence function on the left: the data runs in my mind’s eye from filter’s rightmost coll argument to second, then gets compared to the set #{"Mainland China" "Germany"}, and finally any matches pass through the filter. In other words, we ignore rows whose second element isn’t in that set.
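The idiom is easiest to see on toy data: a set called as a function returns its argument on membership and nil otherwise, so composed with second it becomes a row predicate (case counts below are placeholders):

(#{"Mainland China" "Germany"} "Germany") ;; => "Germany"
(#{"Mainland China" "Germany"} "Italy")   ;; => nil

(filter (comp #{"Mainland China" "Germany"} second)
        [["Anhui" "Mainland China" "990"]
         ["Bayern" "Germany" "48"]
         ["Lombardia" "Italy" "1520"]])
;; => (["Anhui" "Mainland China" "990"] ["Bayern" "Germany" "48"])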
Treating sets as functions, these little function-composing idioms, and other functional programming gizmos aren’t immediately obvious to beginners, and I know that higher-order functions like juxt can take people like me a while to wrap their heads around. But they don’t take much effort to learn and they’re super handy – especially in data exploration or shaping – once you recognize them. The effort to stretch your brain is worth it. Remember Alan Perlis: “A language that doesn’t affect the way you think about programming, is not worth knowing.”
From there in our data pipeline, we reduce to change the shape of the data. The most interesting part here is that we make sure that the datapoint describing All German federal states is the first element, so that when we concat it to the elsewhere-prepared German data in our next thread-step it forms the dividing point between Chinese provinces and German states.
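Run on two sample rows (the blank province marks Johns Hopkins’ country-level row; the German count is a placeholder), the reduce behaves like so:

(reduce (fn [acc [province country current-cases]]
          (if (string/blank? province)
            (concat [{:state-or-province "(All German federal states)"
                      :cases (Integer/parseInt current-cases)}]
                    acc)
            (conj acc {:state-or-province province
                       :cases (Integer/parseInt current-cases)})))
        []
        [["Anhui" "Mainland China" "990"]
         ["" "Germany" "262"]])
;; => ({:state-or-province "(All German federal states)", :cases 262}
;;     {:state-or-province "Anhui", :cases 990})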
With that we are at the end, with another opportunity to use our favorite (comp #{} ...) idiom, this time using a keyword as the function that selects which part of the threaded data we compare to our set. This remove is the sexp we toggle on and off while livecoding to see how much of an outlier Hubei is.
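The toggle itself is one reader macro: prefixing the step with #_ discards the form at read time, and deleting the #_ brings it back. A toy demonstration with placeholder counts:

(->> [{:state-or-province "Hubei" :cases 67332}
      {:state-or-province "Guangdong" :cases 1350}]
     #_(remove (comp #{"Hubei"} :state-or-province)))
;; => both maps pass through; delete the #_ and Hubei drops out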
Refactoring
This pipeline became even simpler after switching our CSV parser to meta-csv. This library does the automatic parsing that you almost always want. Taking the first two elements of its output, we see it gives us a sequence of maps with values coerced to correct types, rather than the sequence of vectors we saw before:
({:province-state nil, :country-region "Afghanistan", :lat 33.0, :long 65.0, "2020-01-22" 0, "2020-01-23" 0, "2020-01-24" 0, "2020-01-25" 0, "2020-01-26" 0, "2020-01-27" 0, "2020-01-28" 0, "2020-01-29" 0, "2020-01-30" 0, "2020-01-31" 0, "2020-02-01" 0, "2020-02-02" 0, "2020-02-03" 0, "2020-02-04" 0, "2020-02-05" 0, "2020-02-06" 0, "2020-02-07" 0, "2020-02-08" 0, "2020-02-09" 0, "2020-02-10" 0, "2020-02-11" 0, "2020-02-12" 0, "2020-02-13" 0, "2020-02-14" 0, "2020-02-15" 0, "2020-02-16" 0, "2020-02-17" 0, "2020-02-18" 0, "2020-02-19" 0, "2020-02-20" 0, "2020-02-21" 0, "2020-02-22" 0, "2020-02-23" 0, "2020-02-24" 1, "2020-02-25" 1, "2020-02-26" 1, "2020-02-27" 1, "2020-02-28" 1, "2020-02-29" 1, "2020-03-01" 1, "2020-03-02" 1, "2020-03-03" 1, "2020-03-04" 1, "2020-03-05" 1, "2020-03-06" 1, "2020-03-07" 1, "2020-03-08" 4, "2020-03-09" 4, "2020-03-10" 5, "2020-03-11" 7, "2020-03-12" 7, "2020-03-13" 7, "2020-03-14" 11, "2020-03-15" 16, "2020-03-16" 21, "2020-03-17" 22, "2020-03-18" 22, "2020-03-19" 22, "2020-03-20" 24, "2020-03-21" 24, "2020-03-22" 40, "2020-03-23" 40, "2020-03-24" 74, "2020-03-25" 84, "2020-03-26" 94, "2020-03-27" 110, "2020-03-28" 110}
{:province-state nil, :country-region "Albania", :lat 41.1533, :long 20.1683, "2020-01-22" 0, "2020-01-23" 0, "2020-01-24" 0, "2020-01-25" 0, "2020-01-26" 0, "2020-01-27" 0, "2020-01-28" 0, "2020-01-29" 0, "2020-01-30" 0, "2020-01-31" 0, "2020-02-01" 0, "2020-02-02" 0, "2020-02-03" 0, "2020-02-04" 0, "2020-02-05" 0, "2020-02-06" 0, "2020-02-07" 0, "2020-02-08" 0, "2020-02-09" 0, "2020-02-10" 0, "2020-02-11" 0, "2020-02-12" 0, "2020-02-13" 0, "2020-02-14" 0, "2020-02-15" 0, "2020-02-16" 0, "2020-02-17" 0, "2020-02-18" 0, "2020-02-19" 0, "2020-02-20" 0, "2020-02-21" 0, "2020-02-22" 0, "2020-02-23" 0, "2020-02-24" 0, "2020-02-25" 0, "2020-02-26" 0, "2020-02-27" 0, "2020-02-28" 0, "2020-02-29" 0, "2020-03-01" 0, "2020-03-02" 0, "2020-03-03" 0, "2020-03-04" 0, "2020-03-05" 0, "2020-03-06" 0, "2020-03-07" 0, "2020-03-08" 0, "2020-03-09" 2, "2020-03-10" 10, "2020-03-11" 12, "2020-03-12" 23, "2020-03-13" 33, "2020-03-14" 38, "2020-03-15" 42, "2020-03-16" 51, "2020-03-17" 55, "2020-03-18" 59, "2020-03-19" 64, "2020-03-20" 70, "2020-03-21" 76, "2020-03-22" 89, "2020-03-23" 104, "2020-03-24" 123, "2020-03-25" 146, "2020-03-26" 174, "2020-03-27" 186, "2020-03-28" 197})
One can do this with clojure.data.csv as well of course, but it requires some manual effort. A little "just do the right thing" magic is nice in a small, focused lib like meta-csv, as long as it provides ergonomic options to override its defaults.
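For comparison, a sketch of the manual route with clojure.data.csv: zipmap each row onto the header and handle type coercion yourself (the file path and helper name are hypothetical):

(require '[clojure.data.csv :as csv])

;; Hypothetical helper: turn raw rows into maps keyed by the header row.
(defn rows->maps [[header & rows]]
  (map #(zipmap header %) rows))

(->> (csv/read-csv (slurp "data/confirmed.csv")) ;; illustrative path
     rows->maps
     first)
;; => a map of header string -> string; numeric coercion is still on us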
Getting our data as a sequence of maps makes our data pipeline even more concise and clear. Here’s the current version:
(let [date "2020-03-04"]
  (->> johns-hopkins/confirmed ;; our data source has changed names
       (filter (comp #{"China"} :country-region))
       (map #(select-keys % [:province-state :country-region date]))
       (concat (map #(assoc % :country "Germany")
                    deutschland/legacy-cases)
               [{:state "(All of Germany)" :country "Germany"
                 :cases (apply + (map :cases deutschland/legacy-cases))}])
       (map #(rename-keys % {:state :province-state
                             date :cases
                             :country-region :country}))
       (remove (comp #{"Hubei"} :province-state))))
In the first iteration, I was working with the most current data. Now, I’m reaching back in time, which means wrapping our thread in a let to bind the date, and later using a different data source for cases in deutschland (Germany). (The Robert Koch Institute, our normal data source, omits historical data from its daily report.)
Working with a sequence of maps means we can dispense with some place-oriented programming (like second and last in our juxt above) and refer to data by its named keys. This improves readability: we ignore maps that don’t have a value for :country-region equal to "China". (Our (comp #{} :key) isn’t strictly necessary in this version because the set has only one value, but changes in the source data mean we’re often modifying that line to include multiple values.) Then we ignore non-essential key/value pairs in each map with select-keys so we have each province’s case data for the date relevant to us.
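For instance, trimming an Anhui-style map down to its three relevant entries (other dates elided):

(select-keys {:province-state "Anhui" :country-region "China"
              :lat 31.8257 :long 117.2264 "2020-03-04" 990}
             [:province-state :country-region "2020-03-04"])
;; => {:province-state "Anhui", :country-region "China", "2020-03-04" 990}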
We then concatenate those maps to our state-level data for Germany using – what else? – concat. While we’re at it we add a :country key and a single cumulative value for "All of Germany". After that our only remaining data-wrangling task is to standardize our keys and values with rename-keys. (This function lives in the clojure.set namespace but I always :refer it so I can pretend it’s part of Clojure’s core API.) With that, we’re back to the remove-toggle over Hubei.
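rename-keys itself is simple; here it is on an illustrative German row:

(require '[clojure.set :refer [rename-keys]])

(rename-keys {:state "Bayern" :country "Germany" :cases 48} ;; placeholder count
             {:state :province-state})
;; => {:province-state "Bayern", :country "Germany", :cases 48}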
As the data changed I found it helpful to refactor the visualization as well, adding color, a new sort order, and tooltips:
The original version of this code snippet was fluid to write and worked fine, but the refactored version is tighter and more obvious. Clojure made it straightforward to work with both versions, and the REPL-given ability to probe values made it especially satisfying to rework the code as the data changed underfoot.
Data sources are mostly in comments in the code. I got population data from Wikipedia, and state-by-state COVID-19 case data for Germany from the Robert Koch Institute. Historical RKI data is from Wikipedia.
On workflow
It was somewhat awkward showing you those charts with and without Hubei in article form. For a single example this static format works all right, but I much prefer the experience of creating or modifying these dynamically. Removing Hubei is a one-liner I can toggle with a comment, but there are a dozen other facets in the data that don’t translate well to being examined in this medium. I find I develop intuition for a phenomenon best by getting elbows-deep into the data’s guts. My particular interactive workflow is a REPL-connected Emacs buffer, but what’s important is the immediacy of the feedback loop, so that the code feels tangibly alive.
Whatever tools you use, writing and evaluating code in one screen and directly seeing the new visualization in the other is so much more powerful than being the client of someone else’s words or pictures. By laying the data out on our workbench we can poke and probe it, forming questions and then answers (or more questions) at nearly the speed of thought.
To those of you for whom this REPL evangelism is old hat, I apologize. I know I’m always harping on the importance of shortening feedback loops. But I will keep saying it because some folks haven’t heard, or have forgotten.
— Dave Liepmann, 06 March 2020