Adamistics: March 2016

16 March 2016

Simplify, simplify, simplify

I recently read a post on the Variance Explained blog, How to replace a pie chart. It referenced a series of 6 pie charts presented in a Wall Street Journal article, What Data Scientists Do All Day At Work.

These 6 pie charts are overkill, as was the collection of bar plots shared on Variance Explained. What do folks really want to know from the survey results? How much time data scientists spend on the various tasks. Can you glean that information from the pie charts? Not easily.

Instead, a single bar chart could be used to show the average number of hours per day that the respondents spend on various tasks.

# percentage of respondents in each time category (one row per task)
dat <- data.frame(
  v1 = c(11, 19, 34, 23, 27, 43),
  v2 = c(32, 42, 29, 41, 47, 32),
  v3 = c(46, 31, 27, 29, 20, 20),
  v4 = c(12, 7, 10, 7, 6, 5),
  row.names = c("Basic exploratory data analysis",
    "Data cleaning", "Machine learning/statistics",
    "Creating visualizations", "Presenting analysis",
    "Extract/transform/load"))
# column labels: hours spent on the task
names(dat) <-
  c("< 1 a week", "1-4 a week", "1-3 a day", ">4 a day")

# convert the categories to approximate no. hours per day
hrsperday <- c(0.1, 0.4, 2.5, 6)
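The code that computes the averages and draws the single bar chart is not shown above. A minimal sketch of one way to do it follows, assuming the average is a weighted mean of the approximate hours per day, with the survey percentages as weights. The name avghrs and the plot settings are illustrative choices, not taken from the original post.

```r
# Survey percentages as in the post (rows = tasks, columns = time categories)
dat <- data.frame(
  v1 = c(11, 19, 34, 23, 27, 43),
  v2 = c(32, 42, 29, 41, 47, 32),
  v3 = c(46, 31, 27, 29, 20, 20),
  v4 = c(12, 7, 10, 7, 6, 5),
  row.names = c("Basic exploratory data analysis",
    "Data cleaning", "Machine learning/statistics",
    "Creating visualizations", "Presenting analysis",
    "Extract/transform/load"))
names(dat) <- c("< 1 a week", "1-4 a week", "1-3 a day", ">4 a day")

# approximate no. hours per day for each category (values from the post)
hrsperday <- c(0.1, 0.4, 2.5, 6)

# Assumption: weighted average hours per day for each task. Dividing by the
# row totals (rather than 100) allows for rows that do not sum to exactly 100.
avghrs <- drop(as.matrix(dat) %*% hrsperday) / rowSums(dat)

# Horizontal bar chart; sorting ascending puts the longest bar at the top
op <- par(mar = c(4, 14, 2, 1))
barplot(sort(avghrs), horiz = TRUE, las = 1,
  xlab = "Approximate average hours per day")
par(op)
```

With these conversions, basic exploratory data analysis comes out on top and extract/transform/load at the bottom.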

If, in fact, there is some interest in the variability among the respondents (not just the averages), then a stacked bar chart could be used. Centering the bars on the midline better illustrates which tasks were most common.

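library(HH)
# `totals` is not defined in the code shown in this post; as an assumption,
# order the tasks by their approximate average hours per day
totals <- as.matrix(dat) %*% hrsperday / 100
likert(dat[rev(order(totals)), ],
  xlab="Survey respondents (%)",
  main="Time (hours) Spent on Tasks")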

08 March 2016

Interactive heatmap of correlation matrix

I saw this tweet yesterday afternoon.

A quick start guide to correlation with R: https://t.co/0RuevkuK2j pic.twitter.com/N8BT8qhdsI
— DataStories (@LindaRegber) March 8, 2016

Earlier that same morning I had been perusing a presentation that Karl Broman gave at JSM2015, Interactive graphics for high-dimensional genetic data. The talk included an interactive heatmap of a correlation matrix (slide 7) that seemed like it would be useful to many folks, not just those working with genetics data.

It was time to give it a try.

It couldn't have been much simpler. I had to install the R package qtlcharts, then use the function iplotCorr().

install.packages("qtlcharts")
library(qtlcharts)
iplotCorr(mat=mtcars, group=mtcars$cyl, reorder=TRUE)
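For a non-interactive look at the same numbers, the correlation matrix can also be computed and drawn with base R alone. This is only a sketch for comparison: it uses cor() and heatmap() from base R, and it is not a description of what qtlcharts does internally.

```r
# Pairwise Pearson correlations among the 11 mtcars variables
cmat <- cor(mtcars)

# Static heatmap of the correlation matrix; symm = TRUE uses the same
# clustering order for rows and columns, scale = "none" leaves the
# correlations unscaled
heatmap(cmat, symm = TRUE, scale = "none", margins = c(6, 6))
```

The static version loses the linked scatterplots, which are the main attraction of iplotCorr().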

From URL images to animated GIF

I wanted to create an animated GIF from images I found on the internet: maps of the reported cases of Lyme disease in the United States from 2001 to 2014. The images are on the CDC website for Lyme disease, named map5 through map18.

My original intention was to explore the capabilities of R to do this. However, the solutions I found seemed to rely on another software package, ImageMagick, which I didn't want to install. So, I punted on using R.

Next I tried GIMP (which I already had installed), but I didn't quickly find a way to open/import several images from URLs as layers. So, I punted on using GIMP as well.

Finally, I decided to try GIPHY. GIPHY had an option to create a slideshow where I could pretty quickly (but still, one at a time) copy and paste each of the image URLs to create an animated GIF. I copied each of the images once, except for the last year (2014), which I copied a few times, so that when the GIF goes through its continuous loop, it pauses for the last year of mapped data.

I'm pleased with the results, but not with the process. I was looking for a solution with code, and came up short.
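One possible code-based route is sketched below. It was not tried in this post and rests on several assumptions. It uses the magick package, which is an R interface to ImageMagick, so it may not satisfy the wish to avoid ImageMagick altogether. The base URL and the .jpg extension are placeholders (the actual CDC addresses are not given here), and make_gif() is a hypothetical helper name.

```r
# Hypothetical location and file type; the post does not give the actual URLs
base_url <- "https://example.org/lyme/"   # placeholder, not the CDC address
urls <- paste0(base_url, "map", 5:18, ".jpg")

# Repeat the last frame (2014) a few times so the loop appears to pause there
frames <- c(urls, rep(urls[length(urls)], 3))

# Hypothetical helper: read each image, stack them, animate, and write a GIF.
# Requires the magick package to be installed.
make_gif <- function(frames, outfile = "lyme.gif") {
  imgs <- magick::image_join(lapply(frames, magick::image_read))
  anim <- magick::image_animate(imgs, fps = 1)   # one frame per second
  magick::image_write(anim, outfile)
}

# Not run here, because the URLs above are placeholders:
# make_gif(frames)
```

Repeating the final URL mirrors the manual trick of pasting the 2014 image several times in GIPHY.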

CDC data via GIPHY

I shared the GIF on Twitter ...

Reported cases of Lyme disease in US 2001-2014.
Data by @CDC. Animation by @giphy. https://t.co/9RFdb3ZcbO
— Jean Adams (@JeanVAdams) March 7, 2016

02 March 2016

Reporting simple linear regression results in R markdown

I recently wrote an R markdown document that incorporated results from a simple linear regression. I wanted the report to be reproducible (should the data change), so I included references to the summary statistics in the text. I was unsure at first how to put the numerator and denominator degrees of freedom for the F statistic as subscripts. But I found a handy page on math notation in R markdown that provided the solution I needed. The R markdown text and its result are shown below.

A few things to note.

I defined a function, myprint(), to ensure that the numbers I reported in the text had the specified number of decimal places. Simply using round() won't always do this.
I calculated the P value from the summary of the fitted model object.
I defined a character scalar, statement, to insert the appropriate verbiage in the text regarding significance.
I used math notation to incorporate the numerator and denominator degrees of freedom for the F statistic as subscripts.
Finally, I noted that the subscripts appeared as expected when viewed in Word or in Firefox, but not in Chrome. Not sure why.
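A short illustration of the first note, using the same myprint() definition that appears in the R markdown document: round() returns a number, so any trailing zeros are lost when it is pasted into text, whereas sprintf() returns a string with a fixed number of decimals. The value 1.5 is just an example.

```r
# same definition as in the R markdown document
myprint <- function(x, d=2) {
  sprintf(paste0("%.", d, "f"), round(x, d))
}

x <- 1.5
as.character(round(x, 2))  # "1.5"   (trailing zero dropped)
myprint(x)                 # "1.50"  (always two decimal places)
myprint(pi, 3)             # "3.142"
```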

---
title: "Simple Linear Regression"
output:
  html_document: default
---

```{r} 
# define function to easily paste numbers into text
myprint <- function(x, d=2) {
  sprintf(paste0("%.", d, "f"), round(x, d))
}

# fake data for simple linear regression
n <- 100
x <- 1:n
y <- rnorm(n)

# fit the regression, save the F statistic and P value
fit <- lm(y ~ x)
fstat <- summary(fit)$fstatistic
pval <- pf(fstat[1], fstat[2], fstat[3], lower.tail=FALSE)

# text regarding significance
statement <- ifelse(pval < 0.05, "was", "was not")
```

We conducted a simple linear regression of y on x; 
y `r statement` significantly related to x 
($F_{`r fstat[2]`,`r fstat[3]`}$ = `r myprint(fstat[1])`, 
*P* = `r myprint(pval)`).

01 March 2016

Generating combinations of levels

I have a linear model with a 4-level factor in it. I wanted to generate all possible ways of combining the levels of this factor, that is, every way of collapsing the original levels into a smaller set of new levels. I couldn't find a function to help me do this in R, so I created my own.

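combLevel <- function(n) {
  # start with the first original level, assigned to new level 1
  B <- matrix(1)
  # seq_len(n)[-1] is empty when n = 1, so the loop is skipped (2:n would count down)
  for(i in seq_len(n)[-1]) {
    # each row can reuse any new level seen so far, or open one more
    maxB <- apply(B, 1, max) + 1
    # repeat each row once per option; drop=FALSE keeps B a matrix
    B <- B[rep(1:nrow(B), maxB), , drop=FALSE]
    B <- cbind(B, unlist(lapply(maxB, seq, 1, -1)))
  }
  dimnames(B) <- list(NULL, NULL)
  B
}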

With 4 levels, this led to a total of 15 combinations, 1 with 4 levels, 6 with 3 levels, 7 with 2 levels, and 1 with 1 level.

> combLevel(4)
      [,1] [,2] [,3] [,4]
 [1,]    1    2    3    4
 [2,]    1    2    3    3
 [3,]    1    2    3    2
 [4,]    1    2    3    1
 [5,]    1    2    2    3
 [6,]    1    2    2    2
 [7,]    1    2    2    1
 [8,]    1    2    1    3
 [9,]    1    2    1    2
[10,]    1    2    1    1
[11,]    1    1    2    3
[12,]    1    1    2    2
[13,]    1    1    2    1
[14,]    1    1    1    2
[15,]    1    1    1    1
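The counts quoted above can be checked by tabulating the number of distinct new levels in each row. The sketch below repeats combLevel() so that it runs on its own; the names combs, nlev, and counts are illustrative.

```r
# combLevel() repeated from the post so this block is self-contained
combLevel <- function(n) {
  B <- matrix(1)
  for(i in 2:n) {
    maxB <- apply(B, 1, max) + 1
    B <- B[rep(1:nrow(B), maxB), ]
    B <- cbind(B, unlist(lapply(maxB, seq, 1, -1)))
  }
  dimnames(B) <- list(NULL, NULL)
  B
}

combs <- combLevel(4)

# number of distinct new levels used in each combination (row)
nlev <- apply(combs, 1, function(r) length(unique(r)))

# how many combinations use 1, 2, 3, and 4 levels
counts <- table(nlev)
counts
```

Each row is one way of partitioning the 4 original levels into groups, so the total of 15 is the number of set partitions of 4 items.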

How did I come up with the function? I pictured the problem as a branching tree, assigning each of the original levels to a new level, one at a time. At each step in the tree, the next level would either be the same as one of the previous levels, or different from all of them (a new level).

The code for producing this diagram is shown below.

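# written against the DiagrammeR API available in early 2016; later releases
# renamed create_nodes() and create_edges() to create_node_df() and create_edge_df()
library(DiagrammeR)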

nodes <- create_nodes(
  nodes = 1:23,
  type = "number",
  label = c(1,  2:1,  3:1, 2:1,  4:1, 3:1, 3:1, 3:1, 2:1))

edges <- create_edges(
  from = rep(1:8, c(2, 3, 2, 4, 3, 3, 3, 2)),
  to =   2:23,
  rel = "related")

graph <- create_graph(
  nodes_df = nodes,
  edges_df = edges,
  graph_attrs = "layout = dot")

render_graph(graph)