Exciting news: Unique data on social policies is now available

The social and economic conditions that surround us can affect our health. This is not a new idea. If the notion was not broadly appreciated by the time it was formalized in the Lalonde report, the point was hammered home more thoroughly in the Marmot review. However, with little evidence quantifying the health impact of policies meant to address these conditions, the idea has largely stayed just that – an idea. Because of their complexity, we have rarely been able to answer questions like, ‘What exactly would happen to people’s health if we passed policy x, y, or z?’ or ‘How many fewer people would get sick?’

Part of the challenge is the lack of analyzable policy information. The other part is the inherent difficulty of using conventional epidemiologic methods to answer such questions. Taking on both challenges, the McGill-based MachEquity project (2010-) has been building databases and applying robust methods to produce causal evidence on the effects of social policies on health in low- and middle-income countries. Below is a summary of their work and of the recent launch of their data for public use.

How has MachEquity built ‘robust’ evidence?

First, there is the stringently coded policy data. Focusing on national social policies, staff reviewed full legislation documents in each country, along with amendments and repeals, or secondary sources if original documentation was not available. Two researchers then quantified the legislation into something meaningful (for example, the months of paid maternity leave legislated), ultimately resulting in longitudinal datasets with policy information from 1995 to 2013.

Second, the group is making use of quasi-experimental methods that attempt to mimic random assignment. Random assignment is the ‘gold standard’ for evaluating the impact of anything (a policy, an intervention, a drug…) because it eliminates all other potential explanations for any differences we see in people’s outcomes (e.g. their health status post-experiment). Of course, perfectly controlled randomized experiments are usually impossible when we are dealing with social determinants of health (can we randomize people to have more education?). Enter: quasi-experimental methods. There are entire books and courses on this (literally – look here), but the basic idea is that we can mimic randomization by accounting for other sources of variation – over time, across different types of people, or across countries – in specific ways.
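
To make the idea concrete, below is a minimal sketch of one common quasi-experimental approach: a difference-in-differences style model with country and year fixed effects, applied to a made-up country-year panel. The data, variable names and the statsmodels setup are illustrative assumptions, not MachEquity’s actual code.

```python
# Illustrative difference-in-differences sketch (not MachEquity's actual code).
# Assumes a country-year panel where `policy` switches from 0 to 1 in the year
# a country adopts the legislation of interest.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Build a toy panel: 20 hypothetical countries observed from 1995 to 2013.
rows = []
for c in [f"country_{i}" for i in range(20)]:
    adoption_year = rng.integers(2000, 2010)  # year this country adopts the policy
    for y in range(1995, 2014):
        policy = int(y >= adoption_year)
        # True effect of the policy is +2 outcome points, plus a common trend and noise.
        outcome = 50 + 0.3 * (y - 1995) + 2 * policy + rng.normal(0, 1)
        rows.append({"country": c, "year": y, "policy": policy, "outcome": outcome})
panel = pd.DataFrame(rows)

# Two-way fixed effects: country dummies absorb stable between-country differences,
# year dummies absorb shocks common to all countries in a given year.
model = smf.ols("outcome ~ policy + C(country) + C(year)", data=panel)
result = model.fit(cov_type="cluster", cov_kwds={"groups": panel["country"]})
print(result.params["policy"])  # should recover an effect close to 2
```

The coefficient on policy is the quasi-experimental effect estimate; in a real analysis, assumptions like parallel trends and choices about clustering deserve far more attention than this toy example gives them.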

So what have they done so far?

Built policy databases. Published. Presented. A lot. See here. But not only that: researchers and staff on this project work in close collaboration with partners at the Department of Global Affairs and non-governmental organizations such as CARE, and ‘package’ their work for policy-maker audiences. After all, the actual policy-making is in their hands, which are often far out of reach of academic research. Specific research topics have included the effects of removing tuition fees and health service user fees, of maternity leave legislation, and of minimum age-of-marriage laws on outcomes like vaccination uptake, child mortality, adolescent birth rates and nutrition.

Where is the policy data and how can I use it?

Policy datasets on maternity leave, breastfeeding breaks at work, child marriage and minimum wage are now available for download here! For each determinant, longitudinal data on low- and middle-income countries’ policies are available. You can therefore work with quantified social policy information and changes in legislation over time, and with the many possible analyses that such data lend themselves to.
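
As a quick, hedged illustration of what working with these datasets might look like, the sketch below merges a downloaded policy panel with an outcome dataset of your own. The file names and column names are assumptions made for the example; check the actual codebooks once you download the data.

```python
# Hypothetical example of combining a downloaded policy panel with outcome data.
# File and column names are assumptions; adjust them to the real codebooks.
import pandas as pd

policy = pd.read_csv("maternity_leave_policy.csv")   # e.g. country, year, weeks_paid_leave
outcomes = pd.read_csv("child_health_outcomes.csv")  # e.g. country, year, vaccination_rate

merged = policy.merge(outcomes, on=["country", "year"], how="inner")

# Flag the first year each country legislated any paid leave (if it ever did).
first_year = (
    merged.loc[merged["weeks_paid_leave"] > 0]
    .groupby("country")["year"]
    .min()
    .rename("first_policy_year")
    .reset_index()
)
merged = merged.merge(first_year, on="country", how="left")
merged["post_policy"] = merged["year"] >= merged["first_policy_year"]  # False where never adopted
print(merged.head())
```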

What do we do with all this data?

According to some, ‘big data’ will transform everything, infiltrating every aspect of our work, play and comings and goings.  But what are the implications for epidemiologists? What exactly is ‘big’? What exactly is ‘transform’?  What’s next for us?

Daniel Westreich and Maya Petersen addressed these questions in the Society for Epidemiologic Research’s digital conference today. For epidemiologists, the consensus (from those keen to type responses in the chat box) was that big data may not be as revolutionary as popular imagination suggests. However, to take full advantage of it, we may need new methods, more training, closer collaboration with programmers and, ultimately, better PR. Below is a full summary of the talks.

So what is ‘big’?  It depends.

Daniel Westreich quoted others in saying ‘big’ is a moving target: what is big today was not big many years ago (think of your first CD compared to your current iPod).  The summary I liked best: ‘big’ is anything that cannot fit on conventional devices.  For example, I only discovered my dataset was ‘big’ when I tried to read it into R, the program froze, and my computer crashed.  That’s big data (or a bad computer, but anyway, that’s the idea).
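
For what it’s worth, ‘too big to fit’ often just means ‘too big to load all at once’. A small sketch of one workaround, with a made-up file and column name: stream the file in chunks and keep only the summaries you need.

```python
# Illustrative only: process a file too large for memory in chunks,
# accumulating summary statistics instead of holding all rows at once.
import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):  # hypothetical file
    total += chunk["outcome"].sum()
    count += len(chunk)

print("mean outcome:", total / count)
```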

And could ‘big data’ transform epidemiology? Sort of.

First, unfortunately, simply having more data does not guarantee that causal assumptions are met. For example, Dr Westreich explained how scraping big data from Twitter would yield huge amounts of highly biased data because the site is used by a non-random 16% of Americans. Worse, the sheer volume may leave us over-confident in highly precise yet biased results. Big data could instead contribute more to prediction models. But Maya Petersen cautioned that even in these models, our implicit interest is often still causal – how often are we interested in knowing the probability of an event without even taking guesses as to why it occurs?

At the same time, we would need to move beyond classic model selection procedures to use it. Imagine thousands of possible covariates, interactions and functional forms. According to Dr. Petersen, the way to arrive at a logical estimator is to move away from using our own logic: take humans out of it. She gave examples using UC Berkeley’s signature SuperLearner in combination with Targeted Maximum Likelihood Estimation. Essentially, the first amounts to entering the covariates into a type of black box that attempts to find the best combination. Obviously the ‘best combination’ depends on the question at hand, hence the combined use with Targeted Maximum Likelihood Estimation. Though this is just one example, we can expect the use of such computer-intensive methods to increase alongside the use of big data in epidemiology.
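
For readers curious about that ‘black box’: the sketch below shows the general stacking idea behind the Super Learner, using scikit-learn’s StackingRegressor as a rough stand-in rather than the actual SuperLearner or TMLE software, on simulated data with arbitrarily chosen candidate learners.

```python
# Rough analogue of the Super Learner idea: cross-validated stacking of several
# candidate learners (this is NOT the SuperLearner or TMLE packages themselves).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)

candidates = [
    ("ols", LinearRegression()),
    ("lasso", LassoCV()),
    ("forest", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("boost", GradientBoostingRegressor(random_state=0)),
]

# Out-of-fold predictions from each candidate are combined by a meta-learner,
# so the ensemble is weighted by cross-validated (not in-sample) performance.
ensemble = StackingRegressor(estimators=candidates, final_estimator=LinearRegression(), cv=5)

scores = cross_val_score(ensemble, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```

In the framing from the talk, a Super Learner of this kind handles the prediction step, and Targeted Maximum Likelihood Estimation then uses those fits to target a specific causal parameter; that second step is not shown here.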

Finally, what’s next for us? Training, collaboration, PR.

1) Revised training: Using these more computer-intensive methods requires developing more advanced programming skills. But both speakers commented on the already intense nature of epidemiology PhD training. In fact, we are perhaps the only discipline where students enter the PhD with exactly zero previous epidemiology courses. There is a lot to learn. At the same time, we cannot place the onus entirely on students to self-teach. A better solution may be more optional courses.

2) Better collaboration: Rather than all of us going back to complete Bachelor’s degrees in Computer Science, we could simply become friends with programmers. In fact, there are lots of them. Dr Petersen discussed how teaching collaboration with computer scientists is a more feasible approach than teaching computer science itself. Part of that involves knowing the kinds of questions we need to ask programmers.

3) More PR: Epidemiology’s public relations are close to non-existent compared to those of other fields (economics, for example). If we think big data can help us answer population-health questions, we need to get ourselves invited to the bigger discussions on the topic. For example, epidemiologists should be involved in deciding what data need to be collected in the first place. But the status quo generally excludes us.

More information: Daniel Westreich / Maya Petersen / Big Data in Epi Commentary