Economists as data scientists
7 June 2018
Hal Varian, the chief economist at Google, is widely credited with suggesting that the use of statistics in data science would be the ‘sexiest job in the 21st century’, and a number of prominent members of the data science community have economics backgrounds (John Akred from Silicon Valley Data Science, and Jenny Bryan, now at RStudio, are two who come to mind). However, within the broader universe of data scientists, economists appear to be somewhat underrepresented.
One reason is that many economists don’t identify themselves as data scientists. They think of themselves primarily as economists and don’t realise that they share many tools and areas of interest with data science. Until recently, standard economics training also did not include subjects such as working with ‘big data sets’ or the tools required to extract insight from them. This is a shame, since economists are a natural fit for many data science problems and could help to develop better models.
At the core of the economics discipline is the attempt to understand how people interact, either individually or collectively as ‘the market’. The growth of behavioural economics, which draws on psychology, neuroscience and other disciplines, has given rise to the view that most economists are no longer focused on the role of ‘super-rational’ individuals, but rather acknowledge that each of us has our own glitches and biases, many of which follow recognisable patterns. Economists also spend a lot of time thinking about prices, which may be paid in money or in time, and which signal conditions relating to supply and demand in a particular market. Indeed, many data science problems involve aspects that relate to the functioning of markets, and with the use of large human-centric data sets we are often able to provide a more accurate description of the role that these prices play.
Furthermore, economists already have many of the common tools that data scientists use. Regression analysis is a standard empirical tool used by economists, and Kaggle’s 2017 Data Science survey indicates that logistic regression was the most common data science tool (https://www.kaggle.com/surveys/2017), used by more than 60% of data scientists. Economists are also well acquainted with ‘simpler’ but no less important tools: cross-tabulations, data visualisations and t-tests. They often have a good understanding of the data generation process as well, and of how things like sample selection can influence the composition of the data set, which may lead to bias in the results. They have some useful tools for dealing with this, including Heckman two-stage regressions and control function approaches. Randomised Controlled Trials (RCTs) are also analogous to A/B testing, albeit often on a much smaller scale.
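To make the RCT and A/B testing analogy concrete, here is a minimal sketch of a two-proportion z-test in Python, using only the standard library. The visitor and conversion counts are invented for illustration, and the function name `two_proportion_z_test` is our own; a real analysis would typically also consider sample size planning and multiple comparisons.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Uses the pooled proportion under the null hypothesis that both
    groups share the same underlying conversion rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: control (A) versus treatment (B).
z_stat, p_val = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z_stat:.2f}, two-sided p-value = {p_val:.3f}")
```

An economist would recognise this as the same comparison of treatment and control means that underlies the analysis of a simple randomised trial.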
So what are economists missing? In our experience there are three essential areas where economists are comparatively weak. The first is the ability to work in a scripting language like R or Python. Traditionally, economists have used Excel, Stata and SPSS as their tools for empirical analysis, and these are largely menu-driven. The ability of scripting languages to automate many routine procedures in a predictable way makes them indispensable in the data science field. Thankfully, the cost of learning these languages has been significantly reduced by the online material that is available on Udemy, Datacamp, Lynda, Coursera, etc. In addition, it would be worthwhile for the aspiring data scientist to acquire a working knowledge of Git, which facilitates version control and collaborative work.
The second is an understanding of overfitting and of the effects of splitting a sample into training and testing subsets. Economists generally use all the data they have when conducting an analysis. What this means is that the model often ‘overfits the data’: it does a good job of explaining the patterns specific to the sample data, but these may not generalise to a broader sample. In data science we are often interested in building models that do generalise. The standard way to guard against overfitting is to split the sample, fit the model to the training data, and then see how well it performs on the testing data set. In economics one would often call this an ‘out-of-sample evaluation’.
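The sketch below illustrates the idea in Python with the standard library only. It is a toy example on synthetic data, not a recommended workflow: the data-generating process, the helper names and the 140/60 split are all assumptions made for illustration. A simple regression is compared with a model that merely memorises the training data (a one-nearest-neighbour rule), which fits the training sample perfectly but has no such guarantee out of sample.

```python
import random


def train_test_split(rows, n_test, seed=0):
    """Shuffle a copy of the rows and hold back n_test of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def fit_ols(rows):
    """One-regressor OLS; returns a function x -> intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memoriser(rows):
    """1-nearest-neighbour: predict the y of the closest training x."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]


def mse(model, rows):
    """Mean squared prediction error of a model on a set of (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2 + 3 * x + rng.gauss(0, 2)))

train, test = train_test_split(data, n_test=60)
ols = fit_ols(train)
memo = fit_memoriser(train)

for name, model in [("OLS", ols), ("memoriser", memo)]:
    print(f"{name}: in-sample MSE = {mse(model, train):.2f}, "
          f"out-of-sample MSE = {mse(model, test):.2f}")
```

The memoriser's in-sample error is zero by construction, which is exactly why in-sample fit alone is a poor guide to how a model will perform on new data.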
The third area covers a broader set of skills. These include more sophisticated machine learning approaches, like Neural Networks and Random Forests, and aspects of Bayesian statistics. Knowing when these are appropriate and when a simple regression would be more effective is especially useful. Economists also usually know relatively little about the use and structure of databases and how data should be stored. Knowing a bit more about this, and about how to extract data from databases and join tables, for example through SQL queries, is a useful skill.
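As a small illustration of joining tables with SQL, the following sketch uses Python's built-in `sqlite3` module and an in-memory database. The `households` and `purchases` tables, their columns and their contents are hypothetical; the point is only to show the shape of a join followed by an aggregation.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE households (household_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE purchases (
    purchase_id INTEGER PRIMARY KEY,
    household_id INTEGER,
    amount REAL
);
INSERT INTO households VALUES (1, 'North'), (2, 'South'), (3, 'East');
INSERT INTO purchases VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0);
""")

# LEFT JOIN keeps households with no purchases; COALESCE turns the
# resulting NULL sum into zero.
query = """
SELECT h.region,
       COUNT(p.purchase_id) AS n_purchases,
       COALESCE(SUM(p.amount), 0) AS total_spend
FROM households AS h
LEFT JOIN purchases AS p ON p.household_id = h.household_id
GROUP BY h.region
ORDER BY h.region
"""
rows = conn.execute(query).fetchall()
for region, n_purchases, total_spend in rows:
    print(region, n_purchases, total_spend)
conn.close()
```

Using a left join rather than an inner join is a deliberate choice here: dropping households that made no purchases would be a small instance of the sample selection problem that economists already know to watch for.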
With these skills, as well as soft skills like being able to work in a team and present your results in an accessible way, it is unlikely that as an economist you’ll be unemployed in the near future. After all, Amazon currently has over 20 job openings for economists (https://www.amazon.jobs/jobs-category/economics).