Saturday 29 July 2017

Big Data Specialization E-Learning


Three areas that support the Internet of Things (IoT) are Big Data Analytics, Artificial Intelligence (AI) and Cyber Security. Being enthusiastic about IoT, I took the first step of acquiring some foundation knowledge in these 3 areas. Because I already had knowledge and experience in data extraction and manipulation using relational databases (e.g. Oracle, MS SQL, Postgres), Big Data Analytics naturally became the first item on my learning roadmap.

As IoT involves linking multiple devices together over a network so that they can communicate with one another, a huge volume of data is generated in the process. This data must be analyzed in order to make further decisions with little or no human intervention, which is the intended purpose of IoT. Another objective of Big Data Analytics is to discover hidden patterns or trends in huge volumes of raw data, in order to make decisions that improve business processes or generate more revenue.

Due to limited time, the best way for me to acquire foundation knowledge in these 3 areas is through Massive Open Online Courses (MOOCs). MOOCs offer several advantages: a flexible schedule (study at your own convenience and pace), low cost (most courses are below USD 100) and a pay-as-you-learn model that lets learners pay to attend courses and stop whenever they wish without spending a large sum of money. The best part is that everything the learner has accomplished is saved, so he/she can pay to resume later from where he/she stopped. These are major plus points for busy working professionals with full-time jobs and family commitments. With MOOCs, we can enroll in a course to gauge our interest and suitability before committing a huge amount of time and money to more in-depth courses in the same area.

The major disadvantage of MOOCs is the lack of recognition by employers and educational institutions, because it is difficult to ensure that learners submit their own work and attempt the online assessments without assistance. Nevertheless, MOOC certificates show initiative to learn, which is something employers look out for. If you expect to be qualified and eventually employed in a particular profession after completing a course, you should enroll in a non-MOOC course.

Since I had already completed an online course through Coursera (the one mandated by my institution for all staff to attend), I used the same MOOC provider to search for and pursue a data analytics course. After 3 months of intensive online study, I completed the Big Data Specialization offered on Coursera. This is the first specialization I have completed with this provider.

Here are some key points which I wish to share about this specialization.

About this Specialization

  • The objective is to gain an understanding of what insights big data can provide, through hands-on experience with the tools and systems used by big data scientists and engineers.
  • Consists of 6 courses created by the University of California San Diego and offered through Coursera.
  • No prior programming or big data experience is required, but it is advantageous to understand SQL and how to work with relational database management systems.
  • Subscription to this specialization costs approximately USD 50 per month. Learners are expected to take about 7 months on average to complete all 6 courses.
  • Modes of learning include video lectures, quizzes (theory & hands-on), peer-graded assignments and discussion forums.
  • As this specialization requires the Cloudera Virtual Machine (VM) and some open-source tools to be installed, there are hardware and OS requirements to be met.


List of Courses
 

  1. Introduction to Big Data 
  2. Big Data Modeling and Management Systems 
  3. Big Data Integration and Processing 
  4. Machine Learning With Big Data 
  5. Graph Analytics for Big Data 
  6. Big Data - Capstone Project


About the Capstone Project


The Capstone project is about analyzing the data set for a game and making recommendations to improve the game or generate more revenue from it.

The name of the game is “Catch the Pink Flamingo” and its key details are as follows.
  • Online game created by Eglence Inc. (an imaginary company).
  • Multi-user and multi-level game where players can choose to join or form a team.
  • The objective of the game is to catch as many Pink Flamingos as possible. These Pink Flamingos randomly pop up on a gridded world map based on missions that change in real time. The levels get more complicated in mission speed and map complexity as users or teams move from level to level.
  • Provides chat boards for the teams to keep in touch.
  • Users are allowed to purchase items to be used in the game. This is a major source of revenue for the company.
  • Another form of revenue is advertisements shown in the game. Users’ clicks on advertisements are recorded.


Tools used in the Capstone Project

  1. Splunk - Tool for analyzing machine-generated big data.
  2. KNIME - Open source data analytics, reporting and integration platform.
  3. Apache Spark - Open-source distributed computing framework.
  4. Neo4j - Graph database management system.


Processes used in the Capstone Project


Part 1 (Aggregation & Filtering using Splunk)
  1. Review the data sets (in CSV format) and the Entity Relationship Diagram provided.
  2. Perform aggregation on the items purchased and revenue generated.
  3. Perform filtering on the total amount of money spent by the top ten users (ranked by how much money they spent).
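
As a rough illustration of the aggregation and filtering steps above, here is a small sketch in plain Python. It is not the Splunk queries used in the capstone: the column names (`userId`, `item`, `price`) and the sample rows are made up for this example, and a top 2 stands in for the top ten.

```python
import csv
import io
from collections import defaultdict

# Hypothetical purchase records; the real capstone CSV files have their own columns.
SAMPLE_CSV = """userId,item,price
u1,hat,3.0
u2,boots,20.0
u1,boots,20.0
u3,hat,3.0
u2,hat,3.0
u1,net,5.0
"""

def aggregate_purchases(csv_text):
    """Return (purchase count per item, revenue per item, total spent per user)."""
    item_counts = defaultdict(int)
    item_revenue = defaultdict(float)
    user_spend = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        price = float(row["price"])
        item_counts[row["item"]] += 1
        item_revenue[row["item"]] += price
        user_spend[row["userId"]] += price
    return dict(item_counts), dict(item_revenue), dict(user_spend)

def top_spenders(user_spend, top_n=10):
    """Rank users by total spend (highest first) and keep the first top_n."""
    ranked = sorted(user_spend.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

item_counts, item_revenue, user_spend = aggregate_purchases(SAMPLE_CSV)
top_two = top_spenders(user_spend, top_n=2)
print(item_counts)
print(item_revenue)
print(top_two)
```

The aggregation is a single pass that keeps running totals per item and per user; the filtering is a sort by total spend followed by a cut-off.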

Part 2 (Classification using KNIME)
  1. Perform classification to predict which users are likely to purchase big-ticket items (i.e. items costing more than $5).
  2. Generate the decision tree and confusion matrix. The decision tree shows the number of users predicted for each category, and the confusion matrix shows the number of correct and incorrect predictions.
  3. Conclude the analysis and make recommendations.
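
To make the confusion matrix in step 2 more concrete, here is a minimal sketch in plain Python. KNIME produces this table for you; the code below only shows how the counts are derived. The label names and the six sample predictions are invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual labels, columns predicted."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Share of predictions on the diagonal, i.e. predicted label equals actual label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels: whether each user bought a big-ticket item (more than $5).
LABELS = ["big_spender", "small_spender"]
actual = ["big_spender", "big_spender", "small_spender",
          "small_spender", "big_spender", "small_spender"]
predicted = ["big_spender", "small_spender", "small_spender",
             "small_spender", "big_spender", "big_spender"]

matrix = confusion_matrix(actual, predicted, LABELS)
print(matrix)
print(accuracy(matrix))
```

The diagonal cells hold the correct predictions and the off-diagonal cells hold the incorrect ones, which is what the analysis in step 3 is based on.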
     

Part 3 (Clustering using Spark)
  1. Select attributes from the CSV files provided and aggregate them.
  2. Perform clustering. This step may need to be repeated, as the attributes selected may not reveal any significant differences between the clusters. The results also vary according to the number of clusters generated.
  3. Recommend actions to help improve the company’s business.
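
The point in step 2 that results change with the number of clusters can be seen in a small k-means sketch. This is plain Python rather than Spark, with a naive "first k points" initialization chosen only to keep the example deterministic; the two attributes (ad clicks and total spend per user) and the sample values are assumptions for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: returns (centroids, cluster index for each point)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical per-user attributes: (ad clicks, total spend).
POINTS = [(1.0, 2.0), (9.0, 40.0), (2.0, 1.0), (8.0, 42.0), (1.0, 3.0), (9.5, 38.0)]

centroids_2, clusters_2 = kmeans(POINTS, k=2)
centroids_3, clusters_3 = kmeans(POINTS, k=3)
print(clusters_2)
print(clusters_3)
```

With 2 clusters the users split into low spenders and high spenders; with 3 clusters the low-spending group is divided further, so the recommendations drawn from the clusters depend on the chosen k.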

Part 4 (Analyzing graphs using Neo4j)
  1. Load all the CSV files containing chat data for the game to create the graph database.
  2. Query the graph database created to find useful information.
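
The kind of questions asked of the chat graph can be sketched in plain Python as well. This is a stand-in for the graph database, not the queries used in the capstone: the edge list, the meaning of an edge (sender replied to receiver) and the helper names are all assumptions made for this example.

```python
from collections import deque

# Hypothetical chat interactions: (sender, receiver) means sender replied to receiver.
CHAT_EDGES = [
    ("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
    ("u3", "u1"), ("u4", "u1"), ("u1", "u4"),
]

def build_graph(edges):
    """Directed graph as a dict: sender -> set of receivers."""
    graph = {}
    for sender, receiver in edges:
        graph.setdefault(sender, set()).add(receiver)
        graph.setdefault(receiver, set())  # users who only receive still get a node
    return graph

def out_degrees(graph):
    """Users ranked by how many distinct users they replied to (ties by name)."""
    degrees = ((user, len(targets)) for user, targets in graph.items())
    return sorted(degrees, key=lambda pair: (-pair[1], pair[0]))

def hops_between(graph, start, goal):
    """Breadth-first search: fewest chat links from start to goal, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        user, hops = queue.popleft()
        if user == goal:
            return hops
        for nxt in graph[user]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

graph = build_graph(CHAT_EDGES)
print(out_degrees(graph))
print(hops_between(graph, "u4", "u3"))
```

Ranking by out-degree finds the chattiest users, and the breadth-first search measures how closely two users are connected through chat.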

My Opinions on this Specialization


Pros
  • Suitable for beginners who have no experience in data analytics.
  • Flexible schedule that allows learners to learn at their own pace and availability.
  • Subscription fees are inexpensive, which lets learners get an idea of Big Data Analytics before deciding whether to commit more time and money to more in-depth courses in this area.

Cons
  • Without proper assessments, the certifications obtained may not be recognized by employers or academic institutions.
  • Need to install the VM and tools for the hands-on exercises. There are hardware and OS requirements to be met. The VM and tools consume a huge amount of memory and disk space, which slows down the computer.
  • When difficulties with tool installation or the course syllabus are encountered, the only source of help is the discussion forums. After submitting a post in the forum, the learner can only wait for a reply from the instructors or fellow learners.
  • Certain chapters contain complex mathematical formulas which are intimidating to some learners.
  • Course materials are not up to standard (e.g. the theory quizzes are too easy, and the instructions are unclear and contain mistakes).

A week after completing this specialization, I received an email from the Coursera Community inviting me to be part of their Beta Tester team. This is a great opportunity for me to preview new courses in areas I am interested in before their launch. I hope to give constructive feedback to the course instructors to help ensure high standards in the learning materials.


1 comment:

  1. You have knack for explaining/giving instructions that are clear, precise & easy to comprehend. I find that the curriculum towards a better understanding of a qualification towards a better means of supply and demand via a MOOC or non-MOOC course is a very exciting step in someone's career path. This is a great concept to learn at own pace & pay as you go & it not being age limited is a plus. I enjoyed reading your blog, thank you for sharing, & I wish you wealth in knowledge & fortune in the coming years :)
