My First Exploratory Data Analysis (EDA) on Coursera Dataset
Introduction
As someone who loves to polish her skills and knowledge through modern methods, I have been a fond user of platforms like Coursera and Duolingo for some time. For my very first EDA project, I got hold of a Coursera dataset that someone scraped from the web and decided to analyze it to see if I could make this data tell stories for everyone.
What is Coursera?
Coursera is an American massive open online course provider founded in 2012 by Stanford University’s computer science professors Andrew Ng and Daphne Koller. Coursera works with universities and other organizations to offer online courses, certifications, and degrees in a variety of subjects.
Data Preparation & Understanding the Data
Let’s first import all the libraries needed, as without them we essentially cannot do anything.
Next, it is time to check the quality of our data and do some processing. There are no duplicate rows and no null values, and checking the shape, info, and so on gave me a good feel for the data.
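For readers who want to repeat these checks, here is a minimal sketch with pandas. The tiny DataFrame and its column names are assumptions standing in for the real dataset, which would normally be loaded from a file.

```python
import pandas as pd

# Toy stand-in for the Coursera data; the rows and column names are assumptions.
df = pd.DataFrame({
    "course_title": ["Machine Learning", "Python for Everybody", "Data Science"],
    "course_rating": [4.9, 4.8, 4.6],
    "course_students_enrolled": ["3.2m", "1.5m", "480k"],
})

n_rows, n_cols = df.shape                   # size of the table
n_duplicates = int(df.duplicated().sum())   # rows that repeat an earlier row exactly
null_counts = df.isnull().sum()             # missing values per column

print(f"{n_rows} rows x {n_cols} columns")
print(f"duplicate rows: {n_duplicates}")
print(null_counts)
df.info()                                   # dtypes and non-null counts per column
```

If `n_duplicates` is 0 and every entry of `null_counts` is 0, the data passes the same checks described above.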
Improving Data Usability
One major issue with the original dataset is that the number of students enrolled is recorded as text, with ‘k’ and ‘m’ standing for thousands and millions. Machines are smart, but they are not smart enough to interpret this without people telling them. So I converted these strings into usable numbers. As you can see in the table, a new column of usable floats called new_course_students_enrolled is added.
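The conversion could look something like the sketch below. The function name `parse_enrollment` is hypothetical, and the exact string format (a number followed by an optional ‘k’ or ‘m’) is an assumption based on the description above.

```python
def parse_enrollment(text: str) -> float:
    """Turn strings such as '5.3k' or '3.2m' into plain floats.

    Assumes a number followed by an optional 'k' (thousands) or
    'm' (millions) suffix; anything else raises ValueError.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    suffix = cleaned[-1] if cleaned else ""
    if suffix in multipliers:
        # Drop the suffix, then scale the numeric part.
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)


for raw in ["5.3k", "480k", "3.2m", "1200"]:
    print(raw, "->", parse_enrollment(raw))
```

With pandas, the new column can then be built with something like `df["new_course_students_enrolled"] = df["course_students_enrolled"].map(parse_enrollment)`, where `course_students_enrolled` is my assumed name for the original column.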
Data Visualization
1. What are the most popular courses on Coursera? Answering this lets us see which subjects are trending in online learning.
Wow, based on this bar chart, it seems like topics such as Machine Learning, Python, and Programming are among the most popular choices! No wonder data science is so hot these days!
2. What are the organizations with the most online courses on Coursera?
The bar chart shows us that universities and tech companies are the major providers of courses. It also tells us that several prestigious universities such as UPenn, which has the most courses, have abundant resources on Coursera. Maybe we do not necessarily need to go to those schools to access part of their education!
3. What are the organizations with the most students on Coursera?
I used Tableau:
As you can see, the results don’t differ much from the previous one. Organizations with more courses have more students enrolled.
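For anyone who prefers to stay in Python instead of Tableau, the same comparison can be sketched with a pandas groupby. The organizations and numbers below are made up for illustration, and `course_organization` is an assumed column name.

```python
import pandas as pd

# Made-up rows for illustration; "course_organization" is an assumed column name.
df = pd.DataFrame({
    "course_organization": ["Org A", "Org A", "Org A", "Org B", "Org B", "Org C"],
    "new_course_students_enrolled": [120000.0, 80000.0, 50000.0,
                                     300000.0, 20000.0, 15000.0],
})

# One row per organization: how many courses it offers and its total enrollment.
per_org = (
    df.groupby("course_organization")["new_course_students_enrolled"]
      .agg(courses="count", students="sum")
      .sort_values("students", ascending=False)
)
print(per_org)
```

Putting the `courses` and `students` columns side by side makes it easy to check whether the organizations with the most courses are also the ones with the most students.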
4. What’s the distribution of courses by certificate type? There are three different certificates on Coursera, which are Specialization, Course, and Professional Certificate.
This pie chart shows us that the majority of the courses offered are Specializations, which means they are the easiest to complete. As the depth and length of the course go up, there are fewer choices. Great news for online-learning beginners who want to start with easier courses!
5. What’s the distribution of courses by difficulty level? There are four levels of difficulty in this dataset, Beginner, Intermediate, Mixed and Advanced.
6. What’s the distribution of course difficulty in each category of course certificates?
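One way to tabulate this in pandas is a cross-tabulation. This is only a sketch on made-up rows, and the column names `course_Certificate_type` and `course_difficulty` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; both column names are assumptions.
df = pd.DataFrame({
    "course_Certificate_type": ["COURSE", "COURSE", "COURSE",
                                "SPECIALIZATION", "SPECIALIZATION",
                                "PROFESSIONAL CERTIFICATE"],
    "course_difficulty": ["Beginner", "Beginner", "Mixed",
                          "Intermediate", "Beginner", "Beginner"],
})

# Number of courses for each (certificate type, difficulty) pair.
counts = pd.crosstab(df["course_Certificate_type"], df["course_difficulty"])

# The same table as shares within each certificate type (each row sums to 1).
shares = pd.crosstab(df["course_Certificate_type"], df["course_difficulty"],
                     normalize="index")

print(counts)
print(shares.round(2))
```

Calling `counts.plot(kind="bar", stacked=True)` would turn the table into a stacked bar chart, provided matplotlib is installed.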
Analysis for Factors Affecting Ratings (visualization & regression)
7. Is there a correlation between ratings and the number of students enrolled in that class?
It seems like there is no significant correlation between course ratings and the number of enrolled students. However, it’s worth noting that popular courses indicated by the size of the bubbles in this scatterplot all have pretty good ratings.
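To put a number on what the scatterplot shows, the Pearson correlation coefficient can be computed directly with pandas. The values below are made up, and `course_rating` is an assumed column name. Since enrollment counts tend to be heavily skewed, the sketch also checks the correlation against the log of enrollment, which is my own addition rather than something done in the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; "course_rating" is an assumed column name.
df = pd.DataFrame({
    "course_rating": [4.9, 4.7, 4.8, 4.5, 4.6, 4.8],
    "new_course_students_enrolled": [3200000.0, 54000.0, 480000.0,
                                     12000.0, 150000.0, 9000.0],
})

# Pearson correlation between rating and raw enrollment (pandas default method).
r_raw = df["course_rating"].corr(df["new_course_students_enrolled"])

# Same correlation against log10 of enrollment, to damp a few huge courses.
log_students = np.log10(df["new_course_students_enrolled"])
r_log = df["course_rating"].corr(log_students)

print(f"raw enrollment: r = {r_raw:.3f}")
print(f"log enrollment: r = {r_log:.3f}")
```

Values of r close to 0 would support the reading that ratings and enrollment are not strongly related.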
8. Is there a correlation between ratings and difficulty level of the course?
To explore this question, I used Tableau:
With a highlight table, we can see that Advanced courses have lower ratings than the other levels. The reasons might be that people have higher expectations for Advanced courses, or that they find them too difficult to have an enjoyable learning experience.
9. Is there a correlation between ratings and the certificate type of the course?
We can see that there is no obvious correlation between ratings and the certificate type of the course.
I also used Python with pandas to run a regression analysis to test the above two relationships, and each time I got a very low R-squared value (0.067 and 0.021, respectively). Therefore, it is safe to say that neither the difficulty level nor the certificate type of the course can, on its own, explain the variation in ratings in this dataset.
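For readers curious about how an R-squared for a categorical predictor can be obtained, here is a minimal sketch. It assumes ordinary least squares on one-hot encoded categories, solved with NumPy; it is not necessarily the exact method used above. The function name `r_squared_for_category`, the column names, and the toy rows are all hypothetical.

```python
import numpy as np
import pandas as pd


def r_squared_for_category(df: pd.DataFrame, target: str, category: str) -> float:
    """R-squared of an OLS fit of `target` on one-hot encoded `category`."""
    # One dummy column per level, dropping the first to avoid collinearity
    # with the intercept column added below.
    dummies = pd.get_dummies(df[category], drop_first=True).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(df)), dummies])
    y = df[target].to_numpy(dtype=float)

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    residuals = y - X @ coef
    ss_res = float(residuals @ residuals)          # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation
    return 1.0 - ss_res / ss_tot


# Made-up rows for illustration; column names are assumptions.
toy = pd.DataFrame({
    "course_difficulty": ["Beginner", "Beginner", "Advanced", "Advanced"],
    "course_rating": [4.8, 4.6, 4.2, 4.4],
})
r2 = r_squared_for_category(toy, "course_rating", "course_difficulty")
print(f"R-squared: {r2:.3f}")
```

An R-squared near 0 means the category explains almost none of the variation in ratings, which is how the low values reported above should be read.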
Conclusion
It was a fun journey doing data analysis on this dataset. I learned about many aspects of the courses on Coursera through my analysis, and potentially found a relationship between course difficulty level and course ratings. Overall, I would say the ratings for Coursera courses are pretty homogeneous, which does not leave me much room to build a prediction model.