I joined the UCSD Cognitive Science PhD program with the aim of investigating multi-agent systems. A few years in, I joined a project investigating the interactions of bottlenose dolphins. The research group had a massive collection of audio and video recordings that was too large to handle without computational techniques, and I came on board to provide the computational support they needed. During this process, I discovered that working with big data is motivating in its own right and that I wanted to pursue the data scientist path in lieu of academia.
I don’t see this choice as abandoning the traditional academic path (PhDs and Postdocs who leave academia often harbor feelings of regret and worry about being failures). The number of open positions available to PhDs and Postdocs is slim compared to the number of applicants.
This isn’t the only reason though. While academia has historically tackled interesting and challenging data problems, some of the most important discoveries and insights can now only be reached by companies that have invested billions of dollars to accumulate the necessary data. I can think of no better place to be than at the forefront of these endeavors.
These past few years have been filled with insights. I’ve been using data to investigate the lives of a very social mammal species, the bottlenose dolphin. I’m now setting my sights on humans, to find insights that can help us understand our own social world and help guide our individual (and corporate) decisions.
Learning to be a Data Scientist
Early on, I applied to software engineering jobs and even reached the last stage of the interview process at Google (prep material here). After a few interviews with other companies, I realized that being a software engineer was neither the best fit for my background nor in line with my interests. I connected with a friend of a friend who was a data science practitioner to learn more about the role and the process. He had a lot of great advice and I recommend reading the article. Since then I have been going through the interview process for several data science positions and have had some successes. Recently, however, I put my job search on hold so that I could attend the Insight Data Science program as a Fellow for the 2015 Fall Silicon Valley session. Adventure awaits!
During this transition phase, I’ve tried to amass a large amount of relevant material for graduate students (particularly my friends in the department) who might want to pursue non-academic routes. While there is a growing number of data science programs across the globe, it is possible to pick up the skills while pursuing other STEM degrees. Most of these items are on my Twitter feed, but here are the highlights:
- Preparation for Insight
- Preparation for a transition to data science
- The Open Source Data Science Masters
- MSR NYC Data Science Summer School 2015
- The Data Science Handbook
- 66 job interview questions for data scientists
- Data Analysis
- Data Science Trello Board
- 16 Free Data Science Books
- Free Data Science Books
- Data Science Resources
- Randy Olson’s blog
- 38 Seminal Articles Every Data Scientist Should Read
- Statistical Learning @Stanford
- Computational Statistics in Python @Duke
- Time Series Analysis @MIT
- How to Become a Data Scientist
- Comparison of data analysis packages
- Python for Data Mining
- Data Science IPython Notebooks
- An example machine learning notebook
- Data Science Toolkit
- Learning data science for Python
- MOOCs on Data Science
- 11 most popular data science presentations on Slideshare
- Data, Tech, and Science Podcasts
- Learning Python for Social Scientists
In addition to the large datasets that you might encounter during your PhD (for example: Neuroscience Data by Ben Cipollini), there are plenty of free large datasets available to those interested:
- Great github list of public data sets
- Data.gov
- Google Public Data
- 30 Places to Find Open Data on the Web
- 20 Free Big Data Sources Everyone Should Know
- Data Sources for Cool Data Science Projects
- Big data sets available for free
Data Science Courses at UCSD
The Data Science Student Society at UCSD has put together a great list of courses available at the undergraduate level. Below is a listing of the graduate courses at UCSD that are relevant to Data Science (as of August 2015). I included courses focused on audio and video analysis as they also teach the skills data scientists need to tackle large amounts of noisy data.
As a graduate student, you sometimes have the option of taking a free course through UCSD Extension. UCSD Extension offers a Data Mining Certificate, and you can take some of the courses that count toward it. The Computer Science & Engineering Department and the San Diego Supercomputer Center now offer a Data Science & Engineering Masters degree.
If you feel I have missed a course, feel free to e-mail me and I’ll add it to the list.
UCSD Courses I took that were relevant to Data Science:
- CSE 250A. Artificial Intelligence: Search & Reasoning
- COGS 202. Foundations: Computational Modeling of Cognition
- COGS 225. Visual Computing
- COGS 260. Seminar on Special Topics (Sometimes on AI)
- COGS 200. Cognitive Science Seminar (I took Cognition under Uncertainty)
- COGS 220. Information Visualization
- ECE 272A. Stochastic Processes in Dynamic Systems (Dynamical Systems Under Uncertainty)
- MATH 285. Stochastic Processes
- MATH 280A. Probability Theory
- PSYC 232. Probabilistic Models of Cognition
Other UCSD courses that I have not taken but that are relevant to Data Science:
- MATH 280BC. Probability Theory
- MATH 281ABC. Mathematical Statistics
- MATH 282AB. Applied Statistics
- MATH 287A. Time Series Analysis
- MATH 287B. Multivariate Analysis
- MATH 287C. Advanced Time Series Analysis
- MATH 289A. Topics in Probability and Statistics
- MATH 289B. Further Topics in Probability and Statistics
- MATH 289C. Data Analysis and Inference
- PSYC 201ABC. Quantitative Methods
- PSYC 206. Mathematical Modeling
- PSYC 231. Data Analysis in Matlab
- CSE 250B. Principles of Artificial Intelligence: Learning Algorithms
- CSE 250C. Machine Learning Theory
- CSE 252AB. Computer Vision
- CSE 253. Neural Networks/Pattern Recognition
- CSE 255. Data Mining and Predictive Analytics
- CSE 256. Statistical Natural Language Processing
- CSE 258A. Cognitive Modeling
- CSE 259. Seminar in Artificial Intelligence
- CSE 259C. Topics/Seminar in Machine Learning
- COGS 219. Programming for Behavioral Sciences
- COGS 230. Topics in Human-Computer Interaction
- COGS 243. Statistical Inference and Data Analysis
- ECE 250. Random Processes
- ECE 251AB. Digital Signal Processing
- ECE 252B. Speech Recognition
- ECE 253. Fundamentals of Digital Image Processing
- ECE 271AB. Statistical Learning
References
- Taylor, M., Martin, B., & Wilsdon, J. (2010). The scientific century: Securing our future prosperity. The Royal Society.
- Schillebeeckx, M., Maricque, B., & Lewis, C. (2013). The missing piece to changing the university culture. Nature Biotechnology, 31(10), 938–941.