Since finishing my PhD, I’ve been dividing my time between preparing three articles for publication in mathematics journals and learning various data science and analysis techniques. It’s been an enjoyable, challenging and rewarding experience that is currently culminating in a series of job interviews. This post is about the process of learning data science, from the perspective of someone with a PhD in pure mathematics.
I remember that when I decided to leave academia and start looking for “data science” jobs, the whole process seemed completely impossible to me. I asked a few friends for advice and they gave me a standard list of things to learn: how to write SQL queries, how to talk about supervised vs unsupervised learning, how to use scikit-learn, what MapReduce is, and so on. All of it sounded completely foreign to me. But if you take it slowly and accept that it’s going to take time and effort, it’s actually a lot of fun.
I never took computer science courses in my undergraduate degree, so my first step was to become a competent programmer. I chose Python as my language because it’s something I’ve always been somewhat comfortable with. Since I already had a sound knowledge of the basics of Python programming, I spent some time on small personal projects. I wrote a series of Python scripts to automate searching for concert tickets on Gumtree.com.au, as well as an automated scraper for Google Trends. Having actual projects of my own to work on was great, and I was very satisfied when I finished (even though the finished products still need a lot of work). My feeling is that working on a project you’ve come up with yourself forces you to learn how to code and solve problems in a practical way: you know the steps you need to take to finish. You’re also more motivated because it’s a project you know you’re interested in.
Early on I learnt the utility of git and wished I’d known about it while I was writing my PhD thesis (my LaTeX files contain thousands of commented-out lines, left there in the vague hope of one day being uncommented back to life). I also discovered Kaggle, which introduced me to the “greatest hits” of basic machine learning algorithms. Although I did quite a few of the “for knowledge” Kaggle competitions, I never got particularly far with the ones that offer prize money.
Coursera was, as for most people doing what I’m doing, an invaluable resource. Tim Roughgarden’s Algorithms courses were great for someone like me who has a very solid mathematical background but never took formal courses in algorithm design. A lot of the courses in the Coursera Data Science Specialisation were helpful too; in particular, my R programming was a little rusty, so I took some of the courses that used R. Before I learnt anything about algorithms, I had a job interview that went abysmally because I was asked a question whose answer required basic knowledge of binary search trees. Later in the process, once I’d learnt the basics, I ended up having to code MergeSort in a job interview.
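For readers who haven’t met it, here is a minimal Python sketch of merge sort. It is only an illustration of the standard algorithm, not the code from the interview, and the function name `merge_sort` is just a choice made for this example.

```python
def merge_sort(items):
    """Return a new sorted list; the input list is left unchanged."""
    # A list of zero or one elements is already sorted.
    if len(items) <= 1:
        return list(items)

    # Split in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves by repeatedly taking the smaller head.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # "<=" keeps equal elements in their original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1

    # One half is exhausted; whatever remains in the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The recursion halves the list each time and the merge step is linear, which is where the usual O(n log n) running time comes from.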
Most data science (who knows what that means anyway) jobs require knowledge of SQL. I am not the biggest fan of SQL but I learnt how to write queries using a bunch of online tutorials. I also found a great app that teaches you SQL queries using a dummy database of stars in the galaxy. Nothing really prepares you for the horribleness of writing SQL queries on a whiteboard in a job interview though.
Aside from these technical skills, most of which are still nascent for me and will be honed on the job, there was the less tangible requirement that I change my mode of thinking. As a pure mathematician, you are trained as a purist, an aesthete. You learn that there are only two possible outcomes for what you are doing: correct for eternity, or useless and irrefutably wrong. In all other disciplines there is some leeway. Your code is allowed to be sloppy, particularly if it does what it’s designed to do, and even more so if what it’s designed to do is just basic exploratory data analysis. You’re also allowed, and expected, to learn things on the go. This is subtly different to mathematics, where research builds on top of a complete understanding of existing knowledge.
Finally, I learnt the value of recording what I’m doing. As with learning a language or doing a PhD, learning a whole new field of knowledge is a process in which it is valuable to have markers of your progress along the way. Progress is hard to notice day by day, but if you write everything down you can look back and see how far you’ve come. I started by keeping things in an Evernote file, and now these Evernote snippets get turned into posts on this website.