Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu

Switching to Data Science

Many are aware that data scientist, which Harvard Business Review called the sexiest job of the 21st century, is among the best jobs of this decade. Glassdoor also ranked data scientist as the number one job in America for 2016. For this reason, many people want to become data scientists. Many message me asking how they can switch careers to data science and how they can become good data scientists. Hoping to help some data science aspirants, I decided to write this article, based mainly on my own experience.



In a nutshell, data science requires good programming skills, statistics and probability theory, math, and domain knowledge. Many define a data scientist as someone who is better at statistics than any software engineer and better at software engineering than any statistician.

I am an atmospheric physicist turned data scientist, and I can share my experience of the transition. Because people come from different backgrounds, the effort required to switch to data science varies greatly. It is relatively easier for mathematics/statistics, computer science, physics, engineering and economics graduates. However, anyone with passion and stamina, from any background ranging from biology or philosophy to nuclear physics, can become a successful data scientist.

Three tools are commonly used in data science: Python, R and SQL. There is a long-running, unresolved debate between R fans and Python fans over which is the better data science tool. I personally believe data scientists should eventually learn both. However, for someone new to the field, learning SQL plus either Python or R is enough; they can pick up the other language once they join industry. If you already have a background in one of them, build on that. Both are easy to learn, though in my experience Python is the easier of the two. Neither is strictly better; each has its strengths. R is excellent for statistics, predictive modeling and visualization, while Python is better for data munging and software development. Further, you can call R from Python and Python from R, so knowing both lets you leverage the unique strengths of each. Moreover, some industries ask for R while others look for Python users, and knowing both helps you collaborate with data scientists who use either one. If you have the time and enthusiasm, you can learn them in parallel.

What about SQL? SQL is used to retrieve data from databases and to manipulate it. It is widely used and easy to learn, so data science aspirants should learn it. Further, SQL can help you learn R and Python, because there are packages that let you run SQL from within either language. If you master SQL, you can perform many data manipulation tasks in Python or R using your SQL skills.
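As a minimal sketch of running SQL from Python, the example below uses the `sqlite3` module from Python's standard library to load a few rows into an in-memory database and aggregate them with a SQL query. The `orders` table, its columns, the sample rows and the `top_customers` helper are all hypothetical, made up for illustration; they are not from any course listed here.

```python
import sqlite3


def top_customers(rows, min_total):
    """Load (customer, amount) rows into an in-memory SQLite table
    and return customers whose summed amount is at least min_total."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            HAVING SUM(amount) >= ?
            ORDER BY total DESC
            """,
            (min_total,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data: (customer, order amount)
sample_orders = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 20.0),
    ("carol", 45.0),
    ("bob", 5.0),
]

result = top_customers(sample_orders, 40.0)
print(result)
```

The grouping and filtering are done entirely in SQL; Python only passes the data in and reads the result back, which is why SQL skills transfer directly.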

Tools alone cannot make you a data scientist. You also need a solid understanding of statistics and probability, linear algebra and calculus. If you took these courses in college but are rusty in some areas, you can use online courses or books to review them and then focus on the areas that are new to you.

Below I have listed online courses that can help you switch to data science. You can focus on the areas where you are weak. I have included courses that use Python and courses that use R; if you want to learn only one of them, take the courses that use that tool. It seems that more of the data science courses on edX and Coursera use R, but there are courses in Python as well, and there are also great books that can help you study Python if you prefer to start there. Here I list courses from edX, Coursera and Udacity, but other platforms also offer online courses.

Novice

Learn basic statistics and probability, math, linear algebra and programming

Beginner

Use the skills learned in the novice stage and take them to the next stage

Getting and cleaning data from Johns Hopkins University

Data analysis with R from Udacity

Exploratory data analysis from Johns Hopkins University

Intro to data science from Udacity

Intro to machine learning from Udacity

Data visualization and communication with Tableau from Duke University

Multivariate Calculus from MIT

Managing big data with MySQL from Duke University - this course is great for learning SQL. It has nice labs and quizzes where you get the chance to practice various SQL commands.

Intermediate

Learn machine learning and how to develop data products

Regression models from Johns Hopkins University

Machine Learning from Stanford - this course uses MATLAB/Octave, but it is among the best online machine learning courses, so you will benefit a lot from taking it.

The Analytics Edge from MIT - among the best online courses for learning R and analytics. It is nicely organized and covers both supervised and unsupervised machine learning techniques with real-world data. It also has a nice chapter on data visualization.

Developing data products from Johns Hopkins University

If you have made it this far, congratulations! You are already a data scientist. However, keep in mind that data science is about constant learning and self-development. So, next, learn big data tools and techniques. But what is big data, and how do the tools used for processing, analyzing and visualizing it differ from R or Python?

The volume, variety and complexity of the data generated every day are increasing tremendously. More than 95% of the world's data was collected in the last couple of years. Billions of devices collect data of different kinds, sizes and shapes. Massive amounts of data come from social media (e.g., Facebook, Twitter), weather instruments (e.g., rainfall radar), machines and sensors, online shopping websites and so on. To store and analyze the massive amount of data collected daily, big data infrastructures and platforms were created, and they differ fundamentally from tools that run on a single machine.

Hadoop is an open source framework for distributed storage and processing that runs on clusters and is scalable, cost-efficient and fault-tolerant. With Hadoop, tens, hundreds or thousands of machines are used to store, process, analyze and visualize big data by organizations in healthcare, finance, manufacturing and telecom, to name a few. The Hadoop ecosystem includes many tools, and the one I recommend for big data analytics is Spark. Spark is a fast and general engine for large-scale data processing. It has become a lingua franca for big data processing and analytics, in both batch and streaming mode, and for machine learning. Spark supports multiple languages, including Python, R and Scala, so once you know R or Python, learning Spark is not a big challenge.
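To give a feel for the idea behind distributed processing, here is a minimal single-machine sketch in plain Python. It is not Spark or Hadoop code and uses none of their APIs. The data is split into partitions (stand-ins for blocks stored on different machines), each partition is processed independently, and the partial results are merged. The function names and sample lines are hypothetical, chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(lines, num_partitions):
    """Deal lines round-robin into partitions, standing in for
    blocks of data that would live on different machines."""
    return [lines[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count words in one partition, independently of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


# Hypothetical sample input
lines = [
    "big data tools",
    "spark for big data",
    "python and r",
    "spark and scala",
]

partitions = split_into_partitions(lines, 2)
partial_counts = [count_words(p) for p in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(3))
```

In a real cluster the map step would run on many machines at once and the framework would handle data movement and machine failures; the sketch only shows the shape of the computation.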

Advanced: learn big data tools and techniques

Data Science and Engineering with Spark, which includes the following courses:

• Introduction to Apache Spark

• Distributed Machine Learning with Apache Spark

• Big Data Analysis with Apache Spark

• Advanced Apache Spark for Data Science and Data Engineering

• Advanced Distributed Machine Learning with Apache Spark

Optional: Spark is written in Scala, and most features become available in Scala first. Scala code also generally runs faster than Python. Further, because Spark itself is written in Scala, knowing Scala helps you debug problems by reading the source code. Therefore, consider learning Scala and big data analysis with Scala and Spark. This specialization from École Polytechnique Fédérale de Lausanne could be useful. I am currently enrolled in it.

Summary

There are different ways to switch to data science. Some people do it by taking online courses; others learn from books and blogs. Still others make the switch within industry by working on projects that involve data analytics. Whether you learn from online courses or books, do not forget to open a GitHub account and put your side projects there. Moreover, follow data scientists on LinkedIn, GitHub and Twitter. Pick a few data science blogs to read regularly and learn from, and remember that you can easily build a personal website and host it on GitHub (see this). The final and most important point: do as many side projects as you can, and work with data of different kinds and sizes. You can get public data sets to work with from this or other websites.
