Big Data and Data Science

No discussion of Big Data is complete without discussing Data Science and its relation to Big Data.

Data Science can be considered the extraction of knowledge from large volumes of data, whether structured (e.g. RDBMS tables, Excel sheets) or unstructured (e.g. emails, videos, photos, social media, and other user-generated content). It may also be considered a continuation of the fields of data mining and predictive analytics.

Data that scales to Big Data is of particular interest in data science, although the discipline is not generally considered to be restricted to such data. Data science employs techniques and theories drawn from many fields, such as nanotechnologies, physics, robotics, mathematics, statistics, information theory and information technology.

Data scientists are people with the persistence to dig through large amounts of information and the technical skills to write algorithms that extract insights from these mountains of information. They apply expertise in data preparation, statistics, and machine learning to investigate complex problems in various domains, such as marketing optimization, fraud detection, and setting public policy.

While some see no distinction between data science and statistics, others consider it a distinct field with its own skill sets, training techniques and goals. For the purpose of this note, we will assume that Data Science is more than just statistics.

 

Three Facets of Data Science

Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson describes three facets of Data Science: coding, statistics and domain knowledge. It also discusses the Data Science Venn Diagram.

  • Statistics is the mathematical knowledge or training (e.g. probability) that helps in generating the right results.

  • Domain knowledge is knowledge of the domain in which the research is done (e.g. marketing), and is very important for proper research. According to researchers such as Svetlana Sicular of Gartner, it is easier to teach domain experts Hadoop than to teach Hadoop experts the domain.

  • Coding knowledge, even a little, can be handy in many areas, such as exploring and manipulating data sets and transforming data from various sources into a common format before processing. Knowledge of coding also encourages the algorithmic thinking needed to work through a problem.

 

Another version of the Venn diagram that I could find describes the three facets as:

  • Math and statistics knowledge (statistics)

  • Substantive expertise (domain knowledge)

  • Hacking skills (coding).

You can read more from the reference links.

 

Combination of different facets of Data Science

According to the Venn diagram, each combination of skills has its own significance:

  • Statistics and domain knowledge together are what traditional researchers often possess.

  • Statistics and coding together can result in machine learning based research and applications. An email spam filter is an example.

  • Domain knowledge and coding without statistics is considered a danger zone, as you are very unlikely to reach sound conclusions without statistics.

  • Finally, the combination of statistics, domain knowledge and coding is what can be called Data Science.
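
The combinations listed above can be expressed as a small lookup. The following Python sketch is only an illustration: the function name `venn_region`, the skill names and the label strings are assumptions made for this note, chosen to mirror the list above rather than taken from the diagram itself.

```python
# Hypothetical helper: maps a set of skills to the region of the
# Data Science Venn Diagram as summarized in the list above.
SKILLS = {"statistics", "domain", "coding"}

REGIONS = {
    frozenset({"statistics", "domain"}): "traditional research",
    frozenset({"statistics", "coding"}): "machine learning",
    frozenset({"domain", "coding"}): "danger zone",
    frozenset({"statistics", "domain", "coding"}): "data science",
}


def venn_region(skills):
    """Return the region label for a combination of two or three skills."""
    chosen = frozenset(skills)
    unknown = chosen - SKILLS
    if unknown:
        raise ValueError(f"unknown skills: {sorted(unknown)}")
    # A single skill (or none) is not given a label in the list above.
    return REGIONS.get(chosen, "no label in the summary")


print(venn_region({"domain", "coding"}))  # prints "danger zone"
```

Using a `frozenset` as the dictionary key makes the lookup independent of the order in which the skills are given.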

 

Types and Skills of Data Scientists

Almost everyone talks about "data science," "big data," and "analytics." However, there is a lack of clarity around the skill sets and capabilities of their practitioners, and this has frequently led to missed opportunities.

To address this issue, the authors of the book “Analyzing the Analyzers” surveyed several hundred practitioners via the Web to explore the variety of skills, experiences, and viewpoints in the emerging data science community, and documented the results in the book. Here is a quick summary:

Data scientists were classified into four categories, with subtypes:

  1. Data Developer

    • Developer, Engineer

  2. Data Researcher

    • Researcher, Scientist, Statistician

  3. Data Creative

    • Jack of all trades, Artist, Hacker

  4. Data Businessperson

    • Leader, Businessperson, Entrepreneur

 

The book also classified the skill sets into five categories, each with sub-skills:

  1. Business

    • Product Development, Business

  2. ML/Big Data

    • Unstructured data, Structured Data, Machine Learning, Big and Distributed Data

  3. Math/OR

    • Optimization, Math, Graphical Models, Bayesian/Monte Carlo statistics, Algorithms, Simulation

  4. Programming

    • System Administration, Back End Programming, Front End Programming.

  5. Statistics

    • Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics.

Finally, the book reports how strongly each skill category is represented in each data scientist category.

Every data scientist category had some knowledge from all skill categories, but the distribution varied from one data scientist category to another. For instance, the Data Businessperson had a high percentage of Business skills, and the Data Developer had high percentages of ML/Big Data and Math/OR skills.

You can find the distribution as per the research in “Chapter 3: A Survey of, and About, Professionals”, under the heading “Combining Skills and Self-ID”.

 

Data Science without Big Data

According to Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson, the three facets of Data Science (coding, statistics and domain knowledge) apply even when the three Vs of Big Data (Volume, Velocity and Variety) are not all present at the same time:

  • Only Volume – When there is a lot of static data.

  • Only Velocity – When streaming data comes in and only a small window is analyzed at a time.

  • Only Variety – When the data is static but complex, as in face recognition, data visualization etc.
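
To make the "only Velocity" case more concrete, here is a minimal Python sketch of analyzing a stream one small window at a time. The function name `windowed_means`, the window size and the sample readings are all assumptions made for illustration; the course does not prescribe this particular technique.

```python
from collections import deque


def windowed_means(stream, window_size):
    """Yield the mean of the most recent `window_size` readings."""
    # A deque with maxlen drops the oldest reading automatically,
    # so only a small window is ever held in memory.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical sample readings standing in for a live stream.
readings = [10, 20, 30, 40, 50]
print(list(windowed_means(readings, window_size=3)))
# prints [10.0, 15.0, 20.0, 30.0, 40.0]
```

Because the function is a generator, it can consume an unbounded stream without storing more than `window_size` values.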

 

Big Data without Data Science

Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson describes the following valid cases:

  1. Big Data with only Coding and Statistics

    • This is where Machine Learning fits

    • E.g. spam filter, facial recognition

  2. Big Data with Coding and Domain Knowledge

    • E.g. Word Count, Natural Language Processing.

However,

  1. Big Data with only Statistics and Domain Knowledge (with no knowledge of coding at all) is not possible.

  2. Big Data is also not possible with only one of Coding, Statistics and Domain Knowledge.
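
Word Count, named above as an example that needs coding and domain knowledge, can be sketched in a few lines of Python. This is a single-machine illustration only: the function name `word_count`, the tokenizing rule and the sample lines are assumptions, and a real Big Data job would typically distribute the same counting across many machines.

```python
import re
from collections import Counter


def word_count(lines):
    """Count how often each word occurs across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumed tokenizing rule: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


# Hypothetical sample input.
sample = ["Big Data is big", "data science uses data"]
print(word_count(sample).most_common(2))
# prints [('data', 3), ('big', 2)]
```

Deciding what counts as a "word" (case, punctuation, apostrophes) is where knowledge of the text domain comes in, while no statistics beyond counting is involved.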

 

References and Notes

  1. This is not a blog post, but a JavaJEE note post (Read more @ http://javajee.com/javajee-note-posts).

  2. https://en.wikipedia.org/wiki/Data_science

  3. https://www.facebook.com/dan.ariely/posts/904383595868

  4. https://en.wikipedia.org/wiki/Machine_learning

  5. Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson

  6. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  7. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work, by Harlan Harris, Sean Murphy and Marck Vaisman.
