Transitioning from Academic Machine Learning to AI in Industry

by Jeremy Karnowski and Emmanuel Ameisen

Beyond Implementing Papers

Getting a job in the modern AI industry requires more than taking online courses or being able to implement papers. After speaking with over 50 top Applied AI teams across the Bay Area and New York who hire Applied AI practitioners, we have distilled our conversations into the set of actionable items outlined below. If you want to make yourself competitive and break into AI, you must not only understand the fundamentals of ML and statistics, but also restructure your ML workflow and adopt software engineering best practices. This means becoming comfortable with system design, ML module implementation, software testing, integration with data infrastructure, and model serving.

Why do people implement papers?

Frequent advice for people trying to break into ML or deep learning roles is to pick up the required skills by taking online courses which provide some of the basic elements (e.g. Google’s Deep Learning course, Fei-Fei Li’s Computer Vision course, or Richard Socher’s NLP course). Many of these courses teach participants in the same way students learn: through lectures and structured coursework.

While these core concepts of machine learning and deep learning are essential for Applied AI roles in industry, the experience of grappling with a real, messy problem is a critical piece for anyone seeking an industry role in this space. Because of this, Andrew Ng recommends that people prepare for this transition by implementing research papers. This forces one to work through some of the same issues that Applied AI professionals face when moving research to production. Through this process, newcomers to the field also learn the many tips and tricks that allow researchers to debug algorithmic issues and iterate more rapidly for better performance.

However, through our many conversations with 50+ teams across industry, we consistently hear that just learning concepts or implementing papers is not enough.

Top 5 Skills you need to acquire before transitioning to Applied AI

1. System Design

When machine learning and deep learning are employed to solve business problems, you must design systems that account for the overall business operations. The system’s components should be architected in a modular way, follow a clear logic, and have extensive documentation for others.

Questions to ask for each project:

  • How do you efficiently train your model without impacting day-to-day production?
  • Where do you store and backup models?
  • How do you run quick inference?
  • What are concrete metrics you can relate to your model?
  • If needed, how do you integrate a human feedback loop?
  • Do you need deep learning to solve the problem, and if so, why?
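To make the storage and backup question concrete, here is a minimal sketch of model versioning. The `save_model`/`load_latest` helpers and the registry layout are invented for illustration (real teams often reach for object storage or a tool like MLflow); the point is that timestamped, immutable versions let you roll back a bad model without disturbing training or serving.

```python
import os
import pickle
import tempfile
from datetime import datetime, timezone

def save_model(model, registry_dir):
    """Persist a model under a timestamped version name and return the path."""
    os.makedirs(registry_dir, exist_ok=True)
    version = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    path = os.path.join(registry_dir, f"model-{version}.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_latest(registry_dir):
    """Load the most recently saved model.

    Lexicographic sort matches chronological order because the
    timestamp format is fixed-width and zero-padded."""
    versions = sorted(p for p in os.listdir(registry_dir) if p.endswith(".pkl"))
    with open(os.path.join(registry_dir, versions[-1]), "rb") as f:
        return pickle.load(f)

# Example: a dict stands in for real model weights.
registry = tempfile.mkdtemp()
save_model({"weights": [0.1, 0.2]}, registry)
save_model({"weights": [0.3, 0.4]}, registry)
latest = load_latest(registry)
```

Serving code can always call `load_latest`, while a rollback is just deleting (or renaming away) the newest version.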

 

2. Structured ML Modules

Jupyter notebooks, while wildly popular for rapidly prototyping deep learning models, are not meant to be deployed in production. For this reason, academics should push themselves to build structured ML modules that follow best practices and demonstrate that they can build solutions others can use.

Action Items:

  • Take a look at this GitHub repo (and related blog) from an Insight AI Fellow that, in addition to having some exploratory notebooks, converts these ideas into a well-structured repo that can be called from the command line.
  • Read up on Tensor2Tensor (repo) and work to expose your model’s training and inference through an elegant API.
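As a minimal illustration of what “callable from the command line” means, here is a hedged sketch of a module skeleton. The `train`/`predict` functions, the `--data` flag, and the JSON toy dataset are our own inventions, not the structure of the repos above; a real module would fit an actual model and persist its weights.

```python
import argparse
import json
import os
import tempfile

def train(data_path):
    """Toy "training": compute the mean of numbers stored as JSON."""
    with open(data_path) as f:
        values = json.load(f)
    return {"mean": sum(values) / len(values)}

def predict(model, x):
    """Toy "inference": score an input against the stored statistic."""
    return x - model["mean"]

def main(argv=None):
    """Command-line entry point; a real module would end with
    `if __name__ == "__main__": main()`."""
    parser = argparse.ArgumentParser(description="Minimal train CLI")
    sub = parser.add_subparsers(dest="command", required=True)
    train_cmd = sub.add_parser("train", help="fit on a JSON list of numbers")
    train_cmd.add_argument("--data", required=True)
    args = parser.parse_args(argv)
    if args.command == "train":
        return train(args.data)

# Simulate running `python mymodule.py train --data data.json`:
data_path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(data_path, "w") as f:
    json.dump([1.0, 2.0, 3.0], f)
model = main(["train", "--data", data_path])
```

The key habit this builds is separating reusable functions (`train`, `predict`) from the thin CLI layer, which is exactly what a notebook makes hard.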

 

3. Software Testing

Academics often run code to find and eliminate errors in an ad hoc manner, but building AI products requires a shift towards using a testing framework to systematically check if systems are functioning correctly. Using services like Travis CI or Jenkins for automatic code testing is a great first step to showing you can work in a company’s production environment.

Action Items:

  • Check out a good starter blog post on testing by Alex Gude.
  • Read Thoughtful Machine Learning, which goes more in depth on how to test machine learning systems.
  • Read this paper on the tests and monitoring that companies care about for production-ready ML.
  • Work through how you would test machine learning algorithms. For example, design tests that ensure a piece of your ML system is modifying data in the way you assumed (e.g. correctly preprocessing image data by making it the correct size for the model to use).
  • Check out testing options in Python.
  • Read this article on the differences between different continuous integration services. We recommend trying out Travis CI as a quick intro into this industry-standard practice.
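The preprocessing test suggested above can be written with plain assert statements in the pytest style. `preprocess_image` here is a toy stand-in we invented for the sketch (it crops/pads nested lists); in a real pipeline you would exercise your actual PIL or OpenCV resizing code with the same pattern.

```python
def preprocess_image(image, target_size=(32, 32)):
    """Toy preprocessing: crop or zero-pad a nested-list "image" to target_size."""
    rows, cols = target_size
    padded = [row[:cols] + [0] * max(0, cols - len(row)) for row in image[:rows]]
    while len(padded) < rows:
        padded.append([0] * cols)
    return padded

def test_preprocess_output_shape():
    # The model expects 32x32 input regardless of the raw image size.
    image = [[1] * 50 for _ in range(40)]
    out = preprocess_image(image)
    assert len(out) == 32
    assert all(len(row) == 32 for row in out)

def test_preprocess_pads_small_images():
    # Undersized inputs must be padded, not silently passed through.
    out = preprocess_image([[1, 2]], target_size=(3, 3))
    assert out == [[1, 2, 0], [0, 0, 0], [0, 0, 0]]
```

Running `pytest` on a file like this gives you the systematic, repeatable checks that ad hoc notebook runs cannot, and is exactly what a CI service like Travis CI would execute on every commit.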

 

4. Integrating with data infrastructure

No matter what company you join, you will have to access their often large data stores to provide the training and testing data you need for your experiments and model building. To show that you would be able to contribute on day one, demonstrate that you are able to interface with structured data records.

Academics typically experience a world where all the data they use can be stored locally, which is rarely the case in industry. Similarly, many competitions and research problems are structured so that academics only need to work with a folder of images.

To demonstrate industry know-how, academics should show that they can (1) query from large datasets and (2) construct more efficient datasets for deep learning training.
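To make (1) and (2) concrete, here is a hedged sketch that uses SQLite as a stand-in for a company data warehouse: a SQL query pulls labeled records, and a generator yields fixed-size batches so training never needs the full table in memory. The `training_data` table name and schema are invented for the example.

```python
import sqlite3

def fetch_examples(conn, min_label=0):
    """(1) Query structured records instead of reading a local folder of files."""
    cur = conn.execute(
        "SELECT features, label FROM training_data WHERE label >= ? ORDER BY id",
        (min_label,),
    )
    return cur.fetchall()

def batches(rows, batch_size):
    """(2) Yield fixed-size batches so training streams over a large dataset."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

# In-memory database standing in for the company's data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_data (id INTEGER PRIMARY KEY, features TEXT, label INTEGER)"
)
conn.executemany(
    "INSERT INTO training_data (features, label) VALUES (?, ?)",
    [("a", 1), ("b", 0), ("c", 1), ("d", 1)],
)
rows = fetch_examples(conn, min_label=1)
all_batches = list(batches(rows, batch_size=2))
```

In practice the connection would point at Postgres, Hive, or BigQuery rather than SQLite, but the habit of thinking in queries and streamed batches rather than local folders is the transferable skill.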

Action Items:

 

5. Model Serving

It’s one thing to have built a solid ML or deep learning model with excellent accuracy. It’s another to turn that model into a package that can be incorporated into products and services. While many academics using ML are very familiar with model metrics (e.g. accuracy, precision, recall, F1 score), they need to become familiar with the metrics companies care about when it comes to fast, reliable, and robust ML services.
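To give a flavor of what serving involves, here is a hedged sketch of a wrapper that exposes a stable predict interface and records per-request latency, one of the service metrics companies monitor alongside accuracy. The `ModelService` class is our own invention for illustration; production teams typically use something like TensorFlow Serving, a Flask app, or a managed endpoint.

```python
import time

class ModelService:
    """Wrap a trained model behind a serving interface that tracks latency."""

    def __init__(self, model_fn):
        self.model_fn = model_fn      # any callable: input -> prediction
        self.latencies_ms = []        # per-request latency, for p95/p99 monitoring

    def predict(self, payload):
        """Serve one request and record how long inference took."""
        start = time.perf_counter()
        result = self.model_fn(payload)
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        return {"prediction": result}

    def p95_latency_ms(self):
        """The kind of service-level metric companies track alongside accuracy."""
        ordered = sorted(self.latencies_ms)
        return ordered[int(0.95 * (len(ordered) - 1))]

# A trivial stand-in model: classify by sign.
service = ModelService(lambda x: "positive" if x >= 0 else "negative")
for x in [-2.0, 0.5, 3.1]:
    response = service.predict(x)
```

The design choice worth noticing is that the model is injected as a plain callable, so the serving layer, with its monitoring and interface guarantees, stays unchanged when the model behind it is retrained or swapped.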

Action Items:

 

Accelerate your transition

Iterating rapidly on modeling and deployment, and learning from those experiences, is the best way to quickly get up to speed. Because of this, individuals looking to make the transition to applied AI roles need to take advantage of GPU compute to accelerate their progress. There are a wide range of options available for experimenting with GPUs.

 

Keeping up to date

AI is an exciting, ever-changing field. The demand for Machine Learning Engineers is strong, and it is easy to get overwhelmed by the amount of news surrounding the topic. We recommend following a few serious sources and newsletters to separate PR and abstract research from innovations that are immediately relevant to the field. Here are some sources to help out:

  • Hacker News: Hacker News is a social news website focusing on computer science, data science, and entrepreneurship. It is run by Y Combinator, a well-known startup incubator. Don’t be thrown off by the name! The original usage of the term “hacker” has nothing to do with cyber criminals, but rather someone who comes up with clever solutions to problems through their programming skills.
  • Import AI: Great newsletter by Jack Clark of OpenAI that stays on top of the most important advances in the field.
  • Insight Blog: They maintain a pretty active pace. Some posts on their AI blog discuss past projects and can serve as good inspiration for interesting problems to tackle. They also regularly send AI content to their mailing list; sign up here.

Preparing for the Transition to Applied Artificial Intelligence

by Jeremy Karnowski and Emmanuel Ameisen

Applied AI roles involve a combination of software engineering and machine learning and are arguably some of the most difficult roles to break into. These roles, focusing on advanced algorithms and leveraging new research results for sophisticated products, are a core component in many organizations, from large corporations to early-stage startups.

While at Insight, we worked with top AI teams in the Bay Area and New York to help Fellows make the transition to Applied AI. Fellows from our first AI session joined teams like Salesforce R&D, Microsoft AI Research, Google, Quora ML, and Arterys (which has the first FDA-approved deep learning medical intervention).

Because of our unique position in this space, we want to share with a wider audience some industry insights, perspectives on how companies are building their teams, and skills to prepare for the transition, whether you are coming from academia or industry.

The Industry Perspective

After speaking with 50+ Applied AI teams in industry, including those building new products using advanced NLP, architecting deep learning infrastructure, and developing autonomous vehicles, the one thing we consistently see is that there is a spectrum for Applied AI practitioners — ranging from research to production.

  • Research: On one end of the spectrum, teams are developing new ideas, doing R&D, developing prototypes, and primarily looking to produce papers.
  • Production-level ML: On the other end of the spectrum, teams are taking current ideas and producing fast, efficient, and robust systems that can be embedded in products.

While there are roles in R&D labs and on teams doing deep learning research, the majority of roles exist in the middle of this spectrum, on teams that aim to simultaneously stay current with research and embed the best advances into products. Teams often have a mix of members working to achieve this, but you’ll be most competitive if you can position yourself to add value in both areas: being able to read and digest current research and then implement it in production.

Our Approach

Building a pipeline from research to production requires companies to structure their teams in a way that blends the benefits of both worlds — academic research and software engineering. Accordingly, the Insight AI program was structured in a similar fashion, bringing together academics doing ML and deep learning research and software engineers with experience in ML.

While all the AI Fellows have strong coding abilities and experience with deep learning, academics and software engineers have different strengths, so our advice for how to succeed for these two groups is different. In addition to Insight’s resources on getting prepared for Data Science, we’ve gathered additional resources that target the transition to Applied AI.

These two guides are a distillation of many conversations we’ve had with top teams in the Bay Area and in New York, who hire AI practitioners poised to tackle their technical challenges and accelerate their expansion into Applied AI.

 

(this post was originally hosted on Medium)


How AI Careers Fit into the Data Landscape

AI vs. Data Science vs. Data Engineering

Jeremy Karnowski & Ross Fadely

The landscape of technical professions is constantly changing, and the resurgence of work in Artificial Intelligence has opened up new opportunities that differ from traditional Data Engineering and Data Science positions. Data Engineers build data pipelines and infrastructure to ensure a constant availability of transformed data. Data Scientists analyze and build models from these data to develop new product features or drive the bottom line of the business. The goal of newly-formed AI teams is to build intelligent systems, focused on quite specific tasks, that can be integrated into the scalable data transformations of Data Engineering work and the data products and business decisions of Data Science work.

The differences between Artificial Intelligence, Data Science, and Data Engineering can vary considerably among companies and teams. We previously posted on the differences between Data Science and Data Engineering roles, and because these new AI roles differ from both, here we outline the roles for contrast.

AI Professionals

Artificial Intelligence, or AI, focuses on understanding core human abilities such as vision, speech, language, decision making, and other complex tasks, and designing machines and software to emulate these processes. AI has a long and rich history, and while many of the tools and techniques have been around for decades (e.g. multi-layer perceptrons, convolutional neural networks, reinforcement learning), recent advances in high performance computing, the development of distributed methods, and the availability of large labeled datasets have accelerated its adoption in industry. Due to this accelerated growth and success, there is an unprecedented need for AI practitioners across a range of industries.

The types of AI roles vary from company to company and industry to industry. While AI professionals come with many titles (e.g. Deep Learning Engineer, Computer Vision Researcher, Machine Learning Engineer, and NLP Scientist), they all share the same focus: building complex, state-of-the-art models that tackle specific problems.

Building these systems requires strong knowledge of engineering and machine learning principles, and depending on the team or product, some roles may weigh heavier on specific skills. For instance, some AI roles are more research focused and concentrate on finding the right model to solve the task, while others are more focused on training, monitoring, and deploying AI systems in production. Projects often center around questions like: How can we encapsulate subject matter expertise in order to augment or replace complex time-consuming decision making tasks? How can we make our automated customer interactions more natural and human-like? How can we uncover subtle patterns and make decisions that involve complicated new types of streaming data?

While there is a spectrum of AI focused work, almost all practitioners regularly prototype new AI system architectures, including building end-to-end pipelines. This means they need to stay current with the latest (often academic) advancements in AI. They actively monitor the performance and training of systems, help scale them up for production, and iterate on systems given shifts in data and/or model performance. AI Professionals typically have a good working knowledge of the Python data science stack (e.g., NumPy, SciPy, pandas, scikit-learn, Statsmodels, etc), employ one or more deep learning frameworks (e.g., TensorFlow, Theano, Torch, Deeplearning4j, etc), and sometimes leverage distributed data tools (e.g., Hadoop, Spark, Flink, etc).

Data Scientists

Roles of Data Scientists in industry vary considerably. Organizationally, Data Scientists often interface with internal (and sometimes external) teams to help direct decisions that drive the business. This involves answering questions like: How do we better understand and serve our customers? In what ways can we optimize our operations and product? Why should we roll out a new feature or product?

Frequently, Data Scientists are also directly involved in building data products. These products are everywhere, integrated into the websites and apps we use. Classic examples include Facebook’s custom user-based news feed and LinkedIn’s “People you may know” feature.

The day-to-day for data scientists may involve cleaning and manipulating lots of data, scoping and testing out high ROI projects, building out customized algorithms, and communicating results to the team and company clients. Data Scientists typically use Python or R, make heavy use of SQL queries, do analyses in Jupyter notebooks, and often have some data visualization experience in one or more frameworks.

Data Engineers

The amount of data businesses are ingesting and serving is trending in one direction: UP! In order to enable the functionality of data teams and products, engineers must design robust and scalable data architectures. Engineers who construct these systems have to think carefully about the current and future demands the business will need. What compromises need to be made for different ingestion/serve rates? Do we need to provide real-time streams to customers? How do we efficiently query the data? How do we keep track of vital metrics?

Along with these increasingly demanding needs is a landscape of data tools which is constantly and rapidly evolving. Deciding between tools like Spark, Flink, Kafka, Cassandra, Redshift, and ElasticSearch is a difficult and challenging task. Skilled Data Engineers not only know the pros and cons of using one tool over another, but also know how to implement them in production systems.

As a result, the main daily activities of Data Engineers involve building out data systems to solve new project challenges, improving and maintaining existing architectures, integrating systems with new/better tools, and syncing with team members to ensure quality work and product flow.

Industry roles in Data Science, Data Engineering, and Artificial Intelligence typically have different objectives. While each role can benefit from all the skills above, some roles emphasize some skills more than others. And at the heart of every role is the availability of data and an emphasis on being data-driven.

Overlap

Distinguishing between specific instances of the roles is often not clear cut. Many Data Science roles require modern Data Engineering tools to get the data they need, while more and more Data Engineering roles perform significant analyses and incorporate machine learning into their pipelines. There is also significant overlap between the work AI professionals do and that of Data Scientists and Engineers. The core ideas and knowledge that drive the analyses of Data Scientists underpin the skills needed to build AI models. These models typically require very large datasets, so the efficient manipulation of large amounts of data, a fundamental aspect of Data Engineering work, is also crucial for state-of-the-art AI systems.

In our view, the fundamental difference between AI and Data Science/Engineering is in the nature of AI’s primary goal: to build intelligent systems that generate their own features and knowledge of a domain, and deliver performance on tasks which is near, at, or above human expert level. This aim naturally requires a different and/or additional set of skills, focused on models and techniques which marry the cutting edge of AI research and practice.

(This post was originally hosted on Medium)


From Cognitive Science to Data Science

by Rose Hendricks (originally found here)

While finishing his dissertation, Jeremy became an Insight Data Science Fellow, participating in an intensive post-doctoral program to prepare academics for careers in data science. Because he was such an all-star during the program, Jeremy is now a Program Director and Data Scientist for Insight Data Science. In this podcast, I talk to him about his experiences at Insight Data Science. He shares some of the features of his PhD training that have helped him in his data science quest and other skills and modes of thinking he’s had to develop since beginning to work in industry.

Enjoy!

Podcast available here (transcript below)

— — — — –

Rose: Hello! And welcome to UCSD Cog Sci’s very first podcast. I’m Rose Hendricks, and I’m here today with Jeremy Karnowski, who is finishing up his PhD in our department, and is here to talk to us about some of the things he’s been doing in the recent past with Insight Data Fellows, and the path he’s taken from PhD student into the working world. Jeremy, thanks so much for joining us!

Jeremy: Thanks for having me.

Rose: First, I’m hoping you can tell us: You completed a program called Insight Data Fellowship. The one sentence I got about that from their website is that it’s an intensive, seven-week post-doctoral training fellowship that bridges the gap between academia and a career in data science. Can you explain to us a little bit about that?

Jeremy: Yeah. As you mentioned, it’s a seven-week program. For people that are finishing up their PhDs or post-docs and have been doing a lot of quantitative work in physics, or bio-statistics, or computational social sciences, it’s a way for them to make that last little leap into industry. The idea is they’ve already been doing data science in academia, but they’re trying to find roles doing the same thing for companies, either in the Bay Area, New York, or Boston.

Rose: Cool. I feel like “data science” is one of those phrases we all toss around, and of course we’re all doing science of some form and dealing with data of some form, but what do people really mean when they talk about data science?

Jeremy: Data science can take a couple of different forms. This is part of the problem for people that are coming from academia and trying to transition into industry, because the field is so broad that data science could mean many different things. Especially, say, in the Bay Area. You could have things that are more from a business intelligence standpoint, or you could have people that do analytics where they’re trying to understand some of the insights that are in data at a company or doing some data mining, some people focus on machine learning, some people really focus around building data products, where you’re making some product for consumers or someone inside of a business, but there’s a lot of data in the back end making that possible. One really simple example would be “people you might know” on LinkedIn — there’s a lot of interesting graph relationships that are going on, it’s very data-driven, but then there’s a product on the front end that people can just use. They come to it, they find some way to connect with someone, and it’s something that’s real, but there’s data in the back end.

Rose: Sure. So if you project yourself back one year, you were in San Diego, working on your PhD. Maybe it wasn’t just one year ago, but what got you interested in data science in the first place? What was your path to finding Insight Data Fellowship?

Jeremy: I was in one lab in the cognitive science department, and it was very computational. And I switched and I was still doing something computational, but I was shifting my focus. I joined a lab that really had a large amount of audio and video data that no one was diving into. At that time, I was getting the sense that trends were going toward large data and machine learning, lots of different ways of dealing with massive amounts of data. I was like, well, this is probably going to be not only great for studying something in the department, but also building up skills for entering a different sort of work force. I would say — remind me again, what was the second part?

Rose: Just thinking about your transition, or, it seems like your back and forth with one foot in academia. How did you start? Certainly you needed new skills, I’m guessing.

Jeremy: You’d be surprised! Just — a lot of people that come out of PhDs or post-docs that go into Insight Data Science, they have already been doing data science, usually large-scale data analysis, in their PhD, so that sort of fits the bill. Often times they might not be using the exact toolkit that is in industry. Some people that come from physics are using a variant of C++ that does different things and a lot of physics packages are built into that. A lot of people, as you know, in neuroscience and computational social science use Matlab, so that’s not really industry standard, so making sure you’re up to speed with Python or R, and using the toolkits in those packages, is always a big plus. And I think other things that I jumped on board with… I really wasn’t clear about data science when I started my journey. I was thinking that I should probably go into software engineering or something, so it was really unclear. And those interviews were very different, and I was learning a lot about algorithms and data structures and doing coding problems. I went to a couple different interviews and it just didn’t seem right. It actually was really fortuitous because I talked with a friend. It was basically a friend of a friend. I talked with him and he had done a math PhD and looked into software engineering jobs, and then also had explored the data science space. And he was chatting with me about the role I was looking for and what I wanted to do in industry, and he was like, you should definitely check out data science, that’s where it’s at. And immediately as I started looking at those jobs, those were the sort of businesses that called me back and were interested because that’s the skill-set you have when you’re doing quantitative social science.

Rose: That leads into my next question, which is about some of the skills you developed during your PhD. Maybe not even tangible ones, maybe soft skills, that have helped you make this transition now into data science.

Jeremy: So I think, one of the things that’s very different between industry and academia is the time scale: things in academia could take months or years, and it’s really about perfecting something in a very small niche. And you’re spending a lot of effort to get there. And in industry, the timeline’s very different. Everyone’s working on several-week timelines to push something out. Or maybe you need this analysis in the next few days to make something happen. It doesn’t have to be 100% correct. Oftentimes you really need to get 80 or 90 percent of the way there really quickly just so you can get a sense of how you move a business in one way or another. That’s a very different skill than how academics do things. I think another thing that’s really important is trying to speak to the consumers or anyone using the product or different people in the organization. Academics are very used to talking with other academics. There’s lots of lectures and lots of talks and everything’s at a high level. It’s really important to be able to speak about really complicated topics at a high level and explain them to different parts of the company. I always found the cognitive science department really good for that because we have a lot of interesting and very in-depth research, say, neuroscience or psychology, or more computer science labs, and they all have to explain their research to each other, and having that skill is very important. And thirdly, I think another important thing is a lot of the data in data science organizations, they are fundamentally about people. There are people using the products. There are people interacting in the businesses. And a lot of the things you want to capture about how your product is being used is about human behavior.
I think that is something that’s also a very different skill to learn because even though you’re doing the same sort of analysis, like say you might be doing something in bioinformatics or physics, you start thinking about the data in a different way.

Rose: That’s really interesting. It sounds like you’re saying maybe there are some ways of thinking that are valued in academia, but aren’t in the same way in industry, and at the same time, so many of those skills you cultivated over the course of your PhD are still really meaningful. Is that sort of along the lines…?

Jeremy: Yeah. There are definitely differences. The kinds of data you deal with could be very instrumental to… Say if you’re doing quantitative social science, or something where you’re studying people. That could be very helpful. If you’re doing things with — Even people who are studying, say, for instance, EEG signals: you’re collecting data from lots of different small devices and trying to analyze that signal for some reason, and that’s still very data-science-like. But that data’s very different from the data you’d be collecting in industry. So sometimes you need to change up how you think about the data, sometimes you don’t. Thinking about the data might lend itself in different ways to what you might do down the line. I think the skills that are always really important for doing things in PhDs — You learn really quickly you hit your head against problems that are really frustrating for a long amount of time. You’re coming up with the right kinds of questions to ask, you’re testing it, you’re really driving home on things week after week. I think that is something that’s really hard to get if you’re trying to dive into data science and you haven’t had the experience of doing something like this for many years. So that’s a general PhD-level skill that I think is very transferable. A lot of the very academic things, staying really in the details or taking a really long time to explain something or using a lot of jargon doesn’t work very well when you’re talking with someone, say, from the marketing department. Or maybe you’re doing some data science for marketing purposes and you have to explain to whole teams why what you’re doing matters and how it’s going to change the business. You can’t just use technical terms all the time.

Rose: So you did this seven-week program with Insight Data Science and you’re still at that company, is that correct?

Jeremy: Yeah.

Rose: So can you tell us a bit about what you do now? Maybe at both the high level and the more detailed aspect?

Jeremy: Yeah, sure. So at Insight, after the fellows go through the Insight Data Fellows program, they get call-backs from different companies about places they might want to be placed. And we work really hard to make sure every fellow is getting a job, so we’re with them until the end. And I think I had a lot of call-backs from different places, but at the same time I had sort of had a really great experience with them. And they invited me onto the team. So this was not standard, but it very much fit in line with my career goals. A lot of the things we do at Insight as program directors and data scientists is what you would expect from someone who’s doing a more senior data science role or a head of data science role. So that was really appealing to me. In my role, there are a couple different things. There’s work that’s going on on the back side that’s more about the data that we have and doing more technical work, but really a lot of my job is working with the fellows who come through the program, providing technical guidance, helping them think through their problems. So it’s a mixture of making sure they can get their projects done on time, talking with companies in the Bay area, and also helping provide — Making sure that no one is blocked on anything, that fellows can move forward in their ambitions and projects.

Rose: Wow, so you’re almost paying it forward and getting paid to do so, it sounds like.

Jeremy: Yes, it’s a great job. I think, I like the idea of being in academia, I like the idea of being in industry. I get to be a little bit of both. I get to be in data science and help people push their careers forward as well.

Rose: That’s cool. So I’m wondering now, I asked you earlier to project yourself backward one year, but if you go back even a few more, to maybe when you were like in your second, third, fourth year at UCSD, are there any things you wish you could go back and tell yourself?

Jeremy: Yeah! I think one thing that I would imagine is really trying to understand the landscape of different tools and technologies to tackle a problem. It’s very easy to start from scratch and have an adviser or other grad students tell you these are the ways to do these different things, but then really trying to sample all the stuff that’s out there to understand, if I get this sort of problem, what are all the tools that I could bring to bear to tackle that problem? I don’t know if everyone does that enough. I think another thing that I would suggest, I really wish I would have taken this data mining course that was in the computer science department. At that point I didn’t have to take any more classes and I think that it would have been a really great class to take because that’s really fundamental in a lot of the analytics and things that are going on in the Bay area or in data science in general. I think just taking more courses where you do projects and you’re really trying to get deep into, “What can I do with this data to get to a goal?” I think would be important. Other than things I wish I had done, I actually did get a lot of really great advice. I was in my third year and I was talking with Seana Coulson and Marta Kutas, and I had been transitioning labs. I was going from modeling human decision making and I was switching into modeling multi-agent decision making. So lots of artificial agents making decisions with each other on some sort of task, so distributed artificial intelligence. The advice I got was, this is great, and it’s modeling, and there’s these scenarios that you’re creating, but it would be better if you were able to actually study real-world systems and understand what is actually going on in the real world with systems and agents interacting with each other. 
And that’s actually how I jumped on the project that did have a large amount of audio and video data because it was all about dolphins interacting with each other and trying to understand their activity. And that was like a real-world system. And I think that was really fundamental in how I started shifting from one career to the next, because it made it much more relevant. And I think if people are in the clouds too much, trying to focus in on real phenomena and understanding them and explaining them is a really important thing to learn.

Rose: I have one more question for you before we finish up because I’ve asked you to project yourself back a few times. So now, naturally, I’d like to go the other way. I’m curious, what are some of the things on your mind for the future? Whether it’s short term, you can tell me what you’re going to do this weekend, or longer term, like things you’re excited about, either in your own life or the whole world. Tell me everything.

Jeremy: I have been really excited about the trends of development. Everyone’s releasing open toolkits for deep learning or artificial intelligence. That’s definitely something I’ve been exploring and trying to get more of a grasp on. Companies have a very large amount of data, and so they’re comfortable with releasing the tools to deal with that because the data is the gold that they’re using. So I’ve been really interested in exploring that space. Another goal I had being out here in the Bay Area is I really wanted to see how companies run, see how companies start and form, so that’s interesting as well. And also, getting to the point where I’m leading a team of data scientists at a company and moving a business forward. One thing that always surprised me at Insight is that, this was sort of my five-year goal. And I would meet a lot of alumni that would do Insight Data Science, and they were like, “Oh, I did Insight two years ago, and now I’m the head of this data science team, or I lead the data science team at this company.” And I was like, that’s two years! I was expecting it to be five years. So I think anything I say is probably going to happen much quicker than I expect. That’s just how industry works compared to academia. Things are moving so quickly and I don’t even quite know what’s going to happen in the next year or two.

Rose: Well, that’s exciting!

Jeremy: I enjoy it. I try to maximize for the largest number of possible options in the cool and interesting areas, and just keep doing that; that way everything’s on your plate and open to you.

Rose: Is there anything else that you were hoping to share that I haven’t yet asked you about today?

Jeremy: I would say, other than what we’ve been talking about a little with Insight Data Science in Silicon Valley over in the Bay Area, we also have lots of programs in New York and Boston, and there’s a remote program. We also have a data engineering program in the Bay Area and New York. And we’ve had a health data science program in Boston, and our first health data science program in Silicon Valley is going to start in September.

Rose: Wow, so it’s really exploding.

Jeremy: Yeah, it’s pretty great. We have a lot of great companies that we partner with. And there’s lots of very exciting domains that are branching out. It’s a very exciting time.

Rose: So Jeremy, if people want to find you on the internet, do you have a twitter handle we can share?

Jeremy: I do! It’s @mwakanosya, which sounds really strange. It’s M-W-A-K-A-N-O-S-Y-A. And there’s a whole other story to that! Someday I can tell you if they want to hear it.

Rose: Thanks so much for chatting with me. And I’m sure this will be a pleasure for all the listeners.

Jeremy: Yeah, definitely connect with me on Twitter, I’m happy to chat on LinkedIn. I’d love to chat more and help people out in any way I can.

Rose: Thanks Jeremy!

Jeremy: Thanks very much!

From Cognitive Science to Data Science

Inputting Image Data into TensorFlow for Unsupervised Deep Learning

Everyone is looking towards TensorFlow to begin their deep learning journey. One issue that arises for aspiring deep learners is that it is unclear how to use their own datasets. Tutorials go into great detail about network topology and training (rightly so), but they typically begin with, and never stop using, the MNIST dataset. Even new models people create with TensorFlow, like this Variational Autoencoder, remain fixated on MNIST. While MNIST is interesting, people new to the field already have their sights set on more interesting problems with more exciting data (e.g. Learning Clothing Styles).

In order to help people more rapidly leverage their own data and the wealth of unsupervised models that are being created with TensorFlow, I developed a solution that (1) translates image datasets into a file structured similarly to the MNIST datasets (github repo) and (2) loads these datasets for use in new models.

To solve the first part, I modified existing solutions that demonstrate how to decode the MNIST binary file into a csv file, and allowed for the additional possibility of saving the data as images in a directory (which also worked well for testing the decoding and encoding process).

I then reversed the process, making sure to pay attention to how these files were originally constructed (here, at the end) and encoding the information as big endian.

The TensorFlow tutorials allow you to import the MNIST data with a simple function call. I modified the existing input_data function to support the loading of training and testing data that have been created using the above procedure: input_any_data_unsupervised. When using this loading function with previously constructed models, remember that it does not return labels and those lines should be modified (e.g. the call to next_batch in line 5 of this model).

A system for supervised learning would require another encoding function for the labels and a modified input function that uses the gzipped labels.

I hope this can help bridge the gap between starting out with TensorFlow tutorials with MNIST and diving into your own problems with new datasets. Keep deeply learning!


Visualizing deaf-friendly businesses

In order to prepare for the Insight Data Science program, I have been spending some time on acquiring/cleaning data, learning to use a database (MySQL) to store that data, and trying to find patterns. It is uncommon in academia to search for patterns in data in order to improve a company’s business, so I thought I should get some practice putting myself in that mindset. I thought an interesting idea would be to visualize the rated organizations from deaffriendly.com on a map of the U.S. to identify patterns and provide some insights for the Deaf community.

This could be useful for a variety of reasons:

  • We could get a sense of where in the U.S. the website is being used.
  • We could identify cities that receive low ratings, either because businesses are unaware of how to improve or because the residents of that city have different rating thresholds. This could help improve the ability to calibrate reviews across the country.
  • We could identify regions in cities that do not receive high reviews to target those areas for outreach.
  • Further work to provide a visual version of the website could allow users to find businesses on the map in order to initiate the review process.

In the above image, I plotted reviews from deaffriendly.com and highlighted some interesting patterns. While qualitatively these statements seem true, the next step would be to do a more in-depth study. Also, a future version of this map could look similar to Yelp’s Wordmap or the Health Facility Map, which uses the TZ Open Data Portal and OpenStreetMap.
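As a first step toward the calibration idea above, one could aggregate ratings per city and flag the low ones. A minimal sketch with made-up data and function names (the real reviews would first need to be parsed into (city, rating) pairs):

```python
from collections import defaultdict

def average_ratings_by_city(reviews):
    """reviews: iterable of (city, rating) pairs -> {city: mean rating}."""
    totals = defaultdict(lambda: [0.0, 0])
    for city, rating in reviews:
        totals[city][0] += rating
        totals[city][1] += 1
    return {city: s / n for city, (s, n) in totals.items()}

def low_rated_cities(reviews, threshold=3.0):
    """Cities whose mean rating falls below the threshold, sorted by name."""
    averages = average_ratings_by_city(reviews)
    return sorted(city for city, avg in averages.items() if avg < threshold)
```

Comparing each city’s distribution against the national one, rather than a fixed threshold, would get closer to the calibration goal.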

Why was this of interest to me?

As a PhD student at UCSD, I am friends with many people who are in the Center for Research in Language and work with Carol Padden studying sign language. I participated as a Fellow in the recent CARTA Symposium on How Language Evolves and was paired with David Perlmutter (PhD advisor to Carol Padden) who presented on sign language. While I have not studied sign language in my research, the Deaf community is one that interests me and I thought I would help out if I could.

In June of this year, I was the chair and organizer for the inter-Science of Learning Conference, an academic conference that brings together graduate students and postdoctoral fellows from the six NSF-Sponsored Science of Learning Centers. One of these centers is VL2 which is associated with Gallaudet University. As part of the conference, I organized interpreter services and used CLIP Interpreting as they were highly recommended by many groups (and VL2 students said they were the best interpreters they had ever had). At the end of the conference CLIP Interpreting told the VL2 students and the iSLC organizers about deaffriendly.com and encouraged us to contribute to the website. This was my chance to pay it forward. I highly recommend using deaffriendly.com and helping them expand their impact.


Find your Dream Job!

A few months ago Andrej Karpathy wrote an excellent introductory article on recurrent neural networks, The Unreasonable Effectiveness of Recurrent Neural Networks. With this article, he released some code (and a larger version) that allows someone to train character-level language models. While RNNs have been around for a long time (Jeff Elman from UCSD Cognitive Science did pioneering work in this field), the current trend is to use deep learning techniques to implement architecturally different networks that attain higher performance, such as Long Short-Term Memory (LSTM) networks. Andrej demonstrated the model’s ability to learn the writing styles of Paul Graham and Shakespeare. He also demonstrated that this model could learn the structure of documents, allowing the model to learn and then produce Wikipedia articles, LaTeX documents, and Linux source code.

Others used this tutorial to produce some pretty cool projects, modeling audio and music sequences (Eminem lyrics, Obama speeches, Irish folk music, and music in ABC notation) as well as learning and producing text that resembles biblical texts (RNN Bible and Learning Holiness).

Tom Brewe’s project to learn and generate cooking recipes, along with Karpathy’s demonstration that the network can learn basic document syntax, inspired me to do the same with job postings. Once we’ve learned a model, we can see what dream jobs come out of its internal workings.

To do this, I performed the following:

  1. Obtain training data from indeed.com:
  • Create a function that takes city, state, and job title and provides indeed.com results
  • Gather the job posting results, scrape the html from each, and clean up the html
  • Save each simplified html file to disk
  • Gather all the simplified html files and compile one text file
  2. Use a recurrent neural network to learn the structure of the job postings
  3. Use the learned network to generate imaginary job postings

Obtaining Training Data

In order to obtain my training data, I scraped job postings for several major U.S. cities (San Francisco Bay Area, Seattle, New York, and Chicago) from the popular indeed.com. The code used to scrape the website came from this great tutorial by Jesse Steinweg-Woods. My modified code, available here, explicitly checked that a posting was hosted on indeed.com (and not another website, since the job posting structure differed elsewhere) and stripped each page down to a bare-bones structure. I thought this more uniform structure would help reduce the training time for the recurrent neural network. Putting these 1001 jobs into one text document gives us a 4.2MB text file, or about 4 million characters.
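The bare-bones stripping step can be sketched with the standard library alone. This is a simplified stand-in for the actual scraper (which also fetched the pages and checked their host), not the code I used:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip a job-posting page down to bare text, dropping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_posting(html):
    """Return the visible text of a posting, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```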

Training the Recurrent Neural Network

Training the RNN was pretty straightforward. I used Karpathy’s code and the text document generated from all the job postings. I set up the network in the same manner as the network Karpathy outlined for the writings of Shakespeare:

th train.lua -data_dir data/indeed/ -rnn_size 512 -num_layers 3 -dropout 0.5

I trained this network overnight on my machine that has a Titan Z GPU (here is more info on acquiring a GPU for academic use).

Imaginary Job Postings

The training procedure produces a set of files that represent checkpoints in training. Let’s take a look at the loss over training:

It looks like the model achieved pretty good results around epoch 19. After this, the validation loss rose before coming back down again. Let’s use the checkpoint that had the lowest validation loss (epoch 19) and the last checkpoint (epoch 50) to produce samples from the model. These samples will demonstrate some of the relationships that the model has learned. While none of these jobs actually exist, the model produces valid html code that represents imaginary dream job postings.
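Assuming char-rnn’s checkpoint naming scheme, which embeds the validation loss in each filename, picking the lowest-loss checkpoint can be automated with a small helper like this hypothetical one:

```python
import re

def best_checkpoint(filenames):
    """Pick the checkpoint with the lowest validation loss, assuming
    char-rnn-style names such as lm_lstm_epoch19.00_1.2000.t7,
    where the last number is the validation loss."""
    pattern = re.compile(r"epoch([\d.]+)_([\d.]+)\.t7$")
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    # min() compares by validation loss first
    return min(scored)[1]
```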

Below is one of the jobs that was produced when sampling from the saved model at epoch 19. It’s for a job at Manager Persons Inc. and it’s located in San Francisco, CA. It looks like there is a need for the applicant to be “pizza friendly” and the job requires purchasing and data entry. Not too shabby. Here it is as a web page.

At epoch 50, the model has learned a few more things and the job postings are typically much longer. Here is a job for Facetionteal Agency (as a website). As you can see, more training can be done to improve the language model (“Traininging calls”) but it does a pretty good job of looking like a job posting. Some items are fun to note, like that the job requires an education of Mountain View, CA.

Below is another longer one (and as a website). Turns out the model wants to provide jobs that pay $1.50 an hour. The person needs to be a team player.

Conclusion

This was a pretty fun experiment! We could keep going with this as well. There are several knobs to turn to get different performance and to see what kind of results this model could produce. I encourage you to grab an interesting dataset and see what kind of fun things you can do with recurrent neural networks!



Robust Principal Component Analysis via ADMM in Python

Principal Component Analysis (PCA) is an effective tool for dimensionality reduction, transforming high dimensional data into a representation that has fewer dimensions (although these dimensions are not from the original set of dimensions). This new set of dimensions captures the variation within the high dimensional dataset. How do you find this space? PCA is equivalent to determining the decomposition M = L + E, where L is a low-rank matrix (spanned by a small number of linearly independent vectors, our new dimensions) and E is a matrix of errors (corruption in the data) whose contribution is minimized. One assumption in this optimization problem, though, is that the corruption E is characterized by Gaussian noise [1].
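Under the Gaussian-noise assumption, the optimal L is just the best rank-k approximation of M, which the SVD gives directly (the Eckart–Young theorem). A minimal numpy sketch, ignoring the mean-centering step for brevity:

```python
import numpy as np

def pca_low_rank(M, k):
    """Best rank-k approximation of M in the least-squares sense
    (Eckart-Young): L captures the top-k directions of variation,
    and E = M - L is the minimized error term."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    L = (U[:, :k] * s[:k]) @ Vt[:k]  # keep only the k largest singular values
    return L, M - L
```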

Sparse but large errors affect the recovered low-dimensional space.

Introduction to RPCA

Robust PCA [2] is a way to deal with this problem when the corruption may be arbitrarily large, but the errors are sparse. The idea is to find the breakdown M = L + S + E, where we now have S, a matrix of sparse errors, in addition to L and E. This has been shown to be an effective way for extracting the background of images (the static portion that has a lower dimensionality) from the foreground (the sparse errors). I’ve used this in my own work to extract foreground dolphins from a pool background (I used the ALM method which only provides L and S).
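Concretely, Robust PCA is usually posed as Principal Component Pursuit [2], a convex program that trades off rank against sparsity:

```latex
\begin{aligned}
\min_{L,\,S} \quad & \|L\|_{*} + \lambda \,\|S\|_{1} \\
\text{subject to} \quad & M = L + S,
\end{aligned}
\qquad \lambda = \frac{1}{\sqrt{\max(m,n)}}
```

where \(\|L\|_{*}\) is the nuclear norm (the sum of singular values), \(\|S\|_{1}\) is the elementwise l1 norm, and M is m-by-n; the stable version adds the dense error term E with a Frobenius-norm penalty.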

Robust PCA can be solved exactly as a convex optimization problem, but the computational constraints associated with high dimensional data make exact solutions impractical. Instead, there exist fast approximation algorithms. These techniques include the inexact augmented Lagrange multiplier (ALM) method [1] and the alternating direction method of multipliers (ADMM) method [3].

For the most part, these approximation algorithms exist in MATLAB. There have been a few translations of these algorithms into Python. Shriphani Palakodety (code) and Deep Ganguli (code) both translated the Robust Principal Component Pursuit algorithm, Nick Birnie (code) translated the ALM method, and Kyle Kastner (code) translated the Inexact ALM method. There also exist Python bindings for an RPCA implementation in Elemental, a linear algebra and optimization library.

Since the ADMM method is supposed to be faster than the ALM method (here on slide 47), I thought it would be a good idea to translate this code to a working Python version. My translation, Robust Principal Component Analysis via ADMM in Python, implements the MATLAB code found here (along with [a] and [b]). In addition to porting this code to Python, I altered the update equations to use multiple threads to speed up processing.
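For the curious, the core iteration is short. Below is a stripped-down, single-threaded sketch of ADMM on the Principal Component Pursuit objective, using a common step-size heuristic; my actual translation differs in its parameter handling and uses multiple threads:

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca_admm(M, max_iter=500, tol=1e-7):
    """Decompose M into low-rank L plus sparse S by ADMM on
    min ||L||_* + lam*||S||_1 subject to M = L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))        # standard PCP weight [2]
    mu = m * n / (4.0 * np.abs(M).sum())  # step-size heuristic from [1]
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # dual variable
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S                     # primal residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S
```

On a synthetic low-rank-plus-sparse matrix this recovers both components to within a few percent; the SVD in every iteration is the bottleneck, which is what the multithreaded updates target.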

References

  1. Lin, Chen, & Ma (2010). The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices.
  2. Candès, Li, Ma, & Wright (2011). Robust Principal Component Analysis?
  3. Boyd (2014). ADMM.

Cognitive Science to Data Science

I joined the UCSD Cognitive Science PhD program with the aim to investigate multi-agent systems. A few years in I joined a project to investigate the interactions of bottlenose dolphins. The research group had a massive amount of audio and video recordings that was too big to handle without computational techniques. I joined the group to provide the computational support that they needed. During this process, I discovered that working with big data is motivating in its own right and that I wanted to pursue the data scientist path in lieu of academia.

I don’t see this choice as abandoning the traditional academic path (though PhDs and postdocs who leave academia traditionally harbor feelings of regret and worry about being failures). The number of open positions available to PhDs and postdocs is slim compared to the number of applicants:

Only 0.45% of STEM graduates in the UK will become professors. [1]

“Across all scientific fields, NSF data suggest that only about 23 percent of Ph.D.s land tenure or tenure-track positions at academic institutions within three to five years of finishing grad school.” [2]

This isn’t the only reason though. While academia has historically tackled interesting and challenging data problems, some of the most important discoveries and insights can now only be pursued by companies that have invested billions of dollars to accumulate the necessary data. I can think of no better place to be than at the forefront of these endeavors.

These past few years have been filled with insights. I’ve been using data to investigate the lives of a very social mammal species, the bottlenose dolphin. I’m now setting my sights on humans, to find insights that can help us understand our own social world and help guide our individual (and corporate) decisions.

Learning to be a Data Scientist

Early on, I applied to software engineering jobs and even reached the last stage of the interview process at Google (prep material here). After a few interviews with other companies I realized that being a software engineer was not the optimal choice for my background or in line with my interests. I connected with a friend of a friend who was a data science practitioner to learn more about the role and the process. He had a lot of great advice and I recommend reading the article. Since then I have been going through the interview process for several data science positions and have had some successes. Recently, however, I put my job search on hold so that I could attend the Insight Data Science program as a Fellow for the 2015 Fall Silicon Valley session. Adventure awaits!

During this transition phase, I’ve tried to amass a large amount of relevant material for graduate students (particularly my friends in the department) who might want to pursue non-academic routes. While there are a growing number of data science programs across the globe, it is possible to pick up the skills while pursuing other STEM degrees. Most of these items are on my Twitter feed, but here are the highlights:

In addition to the large datasets that you might encounter during your PhD (for example: Neuroscience Data by Ben Cipollini), there are plenty of free large datasets available to those interested:

Data Science Courses at UCSD

The Data Science Student Society at UCSD has put together a great list of courses available at the undergraduate level. Below is a listing of the graduate courses at UCSD that are relevant to Data Science (as of August 2015). I included courses focused on audio and video analysis as they also teach the skills data scientists need to tackle large amounts of noisy data.

As graduate students, you sometimes have the option of taking a free course through UCSD Extension. UCSD Extension offers a Data Mining Certificate from which you can take some of their courses. The Computer Science & Engineering Department and San Diego Super Computer Center now offer a Data Science & Engineering Masters degree.

If you feel I have missed a course, feel free to e-mail me and I’ll add it to the list.

UCSD Courses I took that were relevant to Data Science:

  • CSE 250A. Artificial Intelligence: Search & Reason
  • COGS 202. Foundations: Computational Modeling of Cognition
  • COGS 225. Visual Computing
  • COGS 260. Seminar on Special Topics (Sometimes on AI)
  • COGS 200. Cognitive Science Seminar (I took Cognition under Uncertainty)
  • COGS 220. Information Visualization
  • ECE 272A. Stochastic Processes in Dynamic Systems (Dynamical Systems Under Uncertainty)
  • MATH 285. Stochastic Processes
  • MATH 280A. Probability Theory
  • PSYC 232. Probabilistic Models of Cognition

Other UCSD courses I have not taken but are relevant to Data Science:

  • MATH 280BC. Probability Theory
  • MATH 281ABC. Mathematical Statistics
  • MATH 282AB. Applied Statistics
  • MATH 287A. Time Series Analysis
  • MATH 287B. Multivariate Analysis
  • MATH 287C. Advanced Time Series Analysis
  • MATH 289A. Topics in Probability and Statistics
  • MATH 289B. Further Topics in Probability and Statistics
  • MATH 289C. Data Analysis and Inference
  • PSYC 201ABC. Quantitative Methods
  • PSYC 206. Mathematical Modeling
  • PSYC 231. Data Analysis in Matlab
  • CSE 250B. Principles of Artificial Intelligence: Learning Algorithms
  • CSE 250C. Machine Learning Theory
  • CSE 252AB. Computer Vision
  • CSE 253. Neural Networks/Pattern Recognition
  • CSE 255. Data Mining and Predictive Analytics
  • CSE 256. Statistical Natural Language Processing
  • CSE 258A. Cognitive Modeling
  • CSE 259. Seminar in Artificial Intelligence
  • CSE 259C. Topics/Seminar in Machine Learning
  • COGS 219. Programming for Behavioral Sciences
  • COGS 230. Topics in Human-Computer Interaction
  • COGS 243. Statistical Inference and Data Analysis
  • ECE 250. Random Processes
  • ECE 251AB. Digital Signal Processing
  • ECE 252B. Speech Recognition
  • ECE 253. Fundamentals of Digital Image Processing
  • ECE 271AB. Statistical Learning

References

  1. Taylor, M., Martin, B., & Wilsdon, J. (2010). The scientific century: securing our future prosperity. The Royal Society.
  2. Schillebeeckx, M., Maricque, B., & Lewis, C. (2013). The missing piece to changing the university culture. Nature Biotechnology, 31(10), 938–941.