According to some, ‘big data’ will transform everything, infiltrating every aspect of our work, play and comings and goings. But what are the implications for epidemiologists? What exactly is ‘big’? What exactly is ‘transform’? What’s next for us?

Daniel Westreich and Maya Petersen addressed these questions in the Society for Epidemiologic Research’s digital conference today. For epidemiologists, the consensus (at least among those keen to type responses in the chat box) was that big data may not be as revolutionary as the popular imagination suggests. However, to take full advantage of it, we may need new methods, revised training, more collaboration with programmers and, ultimately, better PR. Below is a full summary of the talks.

So what is ‘big’? It depends.

Daniel Westreich quoted others in saying ‘big’ is a moving target: what is big today was not big many years ago (think of your first CD compared to your current iPod). The summary I liked best: ‘big’ is anything that cannot fit on conventional devices. For example, I only discovered my dataset was ‘big’ when I tried to read it into R, the program froze, and my computer crashed. That’s big data (or a bad computer, but anyway, that’s the idea).

And could ‘big data’ transform epidemiology? Sort of.

First, unfortunately, simply having more data does not guarantee that causal assumptions are met. For example, Dr Westreich explained how scraping big data from Twitter would yield huge amounts of highly biased data, because the site is used by only a non-random 16% of Americans. At the opposite extreme, we may end up over-confident in highly precise yet biased results. Big data could instead contribute more to prediction models. But Maya Petersen cautioned that even in these models, our implicit interest is often still causal: how often are we interested in knowing the probability of an event without even hazarding a guess as to why it occurs?

At the same time, we would need to move beyond classic model selection procedures to use it. Imagine thousands of possible covariates, interactions and functional forms. According to Dr Petersen, the way to arrive at a logical estimator may be to move away from using our own logic: take humans out of it. She gave examples using UC Berkeley’s signature SuperLearner in combination with Targeted Maximum Likelihood Estimation. Essentially, the first amounts to entering the covariates into a type of black box that attempts to find the best combination of candidate algorithms. Of course, the ‘best combination’ depends on the question at hand, hence the combined use with Targeted Maximum Likelihood Estimation. Though this is just one example, we can expect the use of such computer-intensive methods to grow alongside the use of big data in epidemiology.
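To make the ‘take humans out of it’ idea concrete, here is a minimal toy sketch of the *discrete* Super Learner step: several candidate learners compete, and cross-validation, rather than the analyst’s judgment, picks the one with the lowest estimated risk. This is not the actual UC Berkeley SuperLearner package (which also computes an optimal weighted combination of candidates and is typically used in R); the two candidate learners and all names below are illustrative assumptions.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: ignore x, always predict the mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate 2: ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_risk(fitter, xs, ys, k=5):
    """Mean squared error of a candidate, estimated by k-fold cross-validation."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]
    total, count = 0.0, 0
    for fold in folds:
        held = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fitter(train_x, train_y)
        for i in fold:
            total += (model(xs[i]) - ys[i]) ** 2
            count += 1
    return total / count

def discrete_super_learner(candidates, xs, ys):
    """Return the name of the candidate with the lowest cross-validated risk."""
    risks = {name: cv_risk(f, xs, ys) for name, f in candidates.items()}
    best = min(risks, key=risks.get)
    return best, risks

# Simulated data with a clear linear signal plus small noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 0.5 * x + random.gauss(0, 0.1) for x in xs]

best, risks = discrete_super_learner({"mean": fit_mean, "ols": fit_linear}, xs, ys)
print(best)  # the linear candidate wins on this clearly linear data
```

The point of the sketch is the design choice: the analyst supplies a library of candidates, but the data, via cross-validated risk, make the final selection. The full method extends this by weighting candidates, and TMLE then targets the fit toward the specific (often causal) parameter of interest.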

Finally, what’s next for us? Training, collaboration, PR.

1) Revised training: Using these more computer-intensive methods requires developing more advanced programming skills. But both speakers commented on the existing intensity of epidemiology PhD training. In fact, we are perhaps the only discipline where students enter the PhD with zero previous epidemiology courses. There is a lot to learn. At the same time, we cannot place the onus entirely on students to self-teach. A better solution may be more optional courses.

2) Better collaboration: Rather than all of us going back to complete bachelor’s degrees in computer science, we could simply make friends with programmers. In fact, there are plenty of them. Dr Petersen argued that teaching collaboration with computer scientists is more feasible than teaching computer science itself. Part of that involves knowing the kinds of questions we need to ask programmers.

3) More PR: Epidemiology’s public relations are nearly non-existent compared with some other fields (e.g. economics). If we think we can benefit from big data to answer questions relevant to population health, we need to get ourselves invited to the bigger discussions on the topic. For example, epidemiologists should be involved in discussions of what data need to be collected. But the status quo generally excludes us.

More information: Daniel Westreich / Maya Petersen / Big Data in Epi Commentary