Data in Big Data - Generation, Structure, Storing and Challenges

According to the 3Vs concept, Big Data is data that may be very large (Volume), may arrive very fast for processing, such as continuously streaming data (Velocity), and may be very diverse, such as structured, unstructured and NoSQL database data (Variety). Data is the most important part of Big Data; if there is no data, there is no Big Data. So we will discuss how data is generated, the types of data, where data is stored, and the various challenges in managing and processing it.


Data Generation

The data in Big Data may be generated by humans or by machines.

Human Generated Data

Humans may generate data either knowingly or unknowingly.

  • Intentional Data

    • Intentional data generated by humans includes photos, videos and text, shares and likes in social media etc.

  • Metadata

    • Metadata is data about data, and often accompanies the data contents without being noticed by the end user. This data is usually machine-readable, as it typically follows some protocol.

    • Examples of metadata include:

      • Photograph Exif metadata contains additional info such as the location and time at which the image was taken.

      • Cellphone metadata contains the location and time details of a call.

      • Email metadata contains many additional fields such as To, Cc and From.

      • A Twitter tweet contains a lot of metadata, often much bigger than the tweet content itself.
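As a small illustration of how metadata travels alongside content, the sketch below builds an email with Python's standard `email` library; the header fields (From, To, Cc, Subject) are metadata, while the body is the content the end user actually reads (the addresses are made up):

```python
from email.message import EmailMessage

# Build a message: the headers are metadata, the body is the content.
msg = EmailMessage()
msg["From"] = "alice@example.com"   # hypothetical addresses
msg["To"] = "bob@example.com"
msg["Cc"] = "carol@example.com"
msg["Subject"] = "Quarterly report"
msg.set_content("Please find the report attached.")

# The metadata is machine-readable without touching the body at all.
headers = dict(msg.items())
print(headers["To"])       # bob@example.com
print(headers["Subject"])  # Quarterly report
```

Because the headers follow a protocol (RFC 5322), any mail server or analysis tool can read them without understanding the message body.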


Machine-Generated Data

A lot of data is generated by machines or devices in the course of other processing.

  • Machine-generated data usually follows some protocol, and hence can be read and analyzed more easily than human-generated data.

  • Example sources for machine generated data are:

    • Cell phones exchanging data with the towers they connect to

    • Readings from medical devices

    • Web crawlers and spam bots.

  • There are many uses for this automatically generated data:

    • Monitoring production lines

    • Identifying and monitoring pets

    • Infrastructure management

    • Energy management
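Because machine-generated data follows a fixed format, it can be parsed and analyzed programmatically. The sketch below parses a hypothetical log line from a medical device using Python's standard `re` module; the field layout (timestamp|device id|metric|value) is an assumption for illustration:

```python
import re

# Hypothetical log line from a medical device; the fixed, protocol-like
# format (timestamp|device id|metric|value) is what makes it machine-parseable.
line = "2016-03-01T10:15:00|pump-42|flow_ml_min|12.5"

pattern = re.compile(
    r"(?P<ts>[^|]+)\|(?P<device>[^|]+)\|(?P<metric>[^|]+)\|(?P<value>[\d.]+)")
m = pattern.match(line)
reading = {
    "ts": m.group("ts"),
    "device": m.group("device"),
    "metric": m.group("metric"),
    "value": float(m.group("value")),  # numeric, ready for analysis
}
print(reading["device"], reading["value"])  # pump-42 12.5
```

Human-generated free text, by contrast, has no such fixed layout, which is why it is harder to analyze automatically.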


Structure of the Data

Data in Big Data may be structured or unstructured, in contrast to typical relational data, which is mostly structured.

Structured Data

  • Data is said to be structured if it is placed in files with fixed fields or variables.

  • Examples include Relational Database Management Systems (RDBMS) and Excel Spreadsheets, which arrange data into tabular form with fixed rows or columns.
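A minimal sketch of structured data using Python's standard `csv` module: every record carries the same fixed fields, just as rows in an RDBMS table or spreadsheet do (the field names and values are made up):

```python
import csv
import io

# Structured data: every record has the same fixed columns.
raw = "name,department,salary\nAlice,Engineering,70000\nBob,Sales,55000\n"

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], rows[0]["salary"])  # Alice 70000
print(len(rows))                           # 2
```

Because the columns are fixed and known in advance, tools can query and aggregate such data easily.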


Unstructured Data

  • Data is said to be completely unstructured if it is not mapped to any fields or variables.

  • Examples of unstructured data include text, presentations, images etc.

  • The majority of business data is unstructured.


Semi-structured data

  • Semi-structured data is data that is neither fully structured like an RDBMS nor completely unstructured. Semi-structured data may have fields and variables that can be marked, making the data identifiable; but these fields and variables are not fixed as in the case of an RDBMS.

  • Such data can still be arranged hierarchically and nested, without a fixed structure of fields and variables.

  • Examples of semi-structured data include XML and JSON documents that do not follow any predefined schema.
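A small sketch of semi-structured data using Python's standard `json` module: the fields are labeled and can nest, but there is no fixed schema, so the two records below carry different sets of keys (the names and values are made up):

```python
import json

# Semi-structured: labeled, nestable fields, but no fixed schema --
# the two records carry different sets of keys.
doc = """
[
  {"name": "Alice", "salary": 70000, "skills": ["java", "hadoop"]},
  {"name": "Bob", "city": "Kochi"}
]
"""
people = json.loads(doc)
print(sorted(people[0].keys()))  # ['name', 'salary', 'skills']
print(sorted(people[1].keys()))  # ['city', 'name']
```

The field labels make each value identifiable, even though no schema dictates which fields must be present.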


NoSQL databases

  • NoSQL databases follow a semi-structured format, and usually use formats such as JSON to store data. Data is stored in key-value pairs, but different rows may contain different keys.

    • For instance, the first row for a person may contain name, salary etc., and the second row may contain name, city etc.

  • MongoDB is a popular NoSQL database that uses JSON format for saving data.

  • HBase is a NoSQL database popularly used with Hadoop.

  • NoSQL databases are not yet as widely used as RDBMSs today, and they do not have a standard query language like SQL for RDBMSs. NoSQL databases and RDBMSs are actually suited for different scenarios.

  • If the data is not relational, NoSQL is often the better choice.
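To make the document model concrete, here is a toy in-memory sketch (not a real database, and not MongoDB's API) of how a document store holds key-value records where different documents may carry different keys:

```python
# Toy document-store sketch: a "collection" maps _id -> document,
# and documents in the same collection may have different keys.
store = {}

def insert(doc):
    """Store a document under its _id, like a document database would."""
    store[doc["_id"]] = doc

def find(**criteria):
    """Return documents whose fields match all given criteria."""
    return [d for d in store.values()
            if all(d.get(k) == v for k, v in criteria.items())]

insert({"_id": 1, "name": "Alice", "salary": 70000})
insert({"_id": 2, "name": "Bob", "city": "Kochi"})  # different keys

print(find(name="Bob")[0]["city"])  # Kochi
```

A real document database such as MongoDB adds persistence, indexing and a query language on top of this basic idea.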


Storing Data

Since the volume of data in Big Data is very large, the data may need to be stored across many disks.

  • Traditional systems for storing distributed data include SAN and NAS.

    • A storage area network (SAN) is a dedicated network that provides access to consolidated, block-level data storage. A SAN may consist of a large and expensive collection of disk drives on racks.

    • Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network providing data access to a heterogeneous group of clients.

    • NAS provides both storage and a file system. This is often contrasted with SAN (Storage Area Network), which provides only block-based storage and leaves file system concerns on the "client" side.

    • Both SAN and NAS may be expensive solutions and may also require expertise to maintain them.

  • HDFS, the Hadoop Distributed File System, can store data across many systems in a reliable and scalable way.

  • Hadoop is more affordable, as it is designed to run on regular commodity hardware, and it also requires less expertise than managing a SAN or NAS.


Cloud Storage

  1. Storing data in the cloud has many advantages, including:

    1. Scalability (and cost) – Storage can be increased as needed, paying only for what you use.

    2. Redundancy – Data may be stored redundantly to avoid data loss and to provide high availability.

    3. Speed – Data is often stored on high-speed SSD drives.

  2. Amazon S3 is an example.

  3. Hadoop can be easily installed on these cloud servers, thus gaining the benefits of cloud computing along with the benefits of HDFS.


Cloud DaaS

Apart from the standard cloud services such as IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service), we can also have DaaS (Data as a Service).

  • DaaS provides fast and reliable access to data.

    • You can buy access to data for research or other data analytics.



Challenges with Anonymity

  1. Privacy is an important concern when processing data from various sources.

    1. Known or unknown violations of privacy can lead to serious issues.

  2. One possible solution is to anonymize the data set.

    1. However, the dataset might still be de-anonymized.

  3. Another practice followed to protect public records is to remove HIPAA-protected info.
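A naive anonymization step can be sketched with Python's standard `hashlib`: direct identifiers are replaced with a salted hash. As noted above, this alone does not guarantee anonymity, since the remaining fields can still allow de-anonymization by linking against other datasets (the salt, field names and values here are hypothetical):

```python
import hashlib

SALT = b"keep-this-secret"  # hypothetical salt; must itself stay protected

def pseudonymize(record):
    """Replace the direct identifier with an opaque salted-hash token.

    Note: quasi-identifiers left in the record (zip, birth date, ...)
    can still be linked with other data to re-identify the person.
    """
    out = dict(record)
    out["name"] = hashlib.sha256(SALT + record["name"].encode()).hexdigest()[:12]
    return out

rec = pseudonymize({"name": "Alice", "zip": "68001", "diagnosis": "flu"})
print(rec["name"])  # an opaque token instead of the real name
```

This is pseudonymization rather than true anonymization, which is exactly why de-anonymization remains a risk.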


Challenges with Confidentiality

  1. While anonymity concerns relate to identifying a person from public records, confidentiality deals with not making non-public info available to the public.

  2. Protecting confidentiality helps in maintaining trust.

  3. Some of the possible solutions or precautions are:

    1. Protect data and restrict access to it

    2. May purchase data breach insurance

    3. Store only required data

Challenges with Data Quality

  1. Input data can have errors or problems and bad input may in turn lead to bad results.

  2. Possible errors or problems in input data are:

    1. Incomplete or corrupted data records.

    2. These may lead to null pointer errors, divide-by-zero errors etc.

    3. Duplicate records

    4. Typographical errors

    5. Data that is missing context or missing measurement information like unit of measurement.

    6. Incomplete or incorrect transformations of data.

  3. Though these problems apply to small data as well, they are more of a concern with Big Data and Hadoop because:

    1. Normal methods for checking the accuracy of data may not be applicable to Big Data.

    2. Data in Hadoop often stays raw without being converted; and if data is not converted, it may not be examined for errors.
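The checks listed above can be sketched as a simple validation pass; this is an illustrative filter (field names and records are made up), not a production data-quality pipeline:

```python
# Sketch of basic input validation for the problems listed above:
# incomplete records, duplicates, and missing measurement info.
raw = [
    {"id": 1, "temp": "36.6", "unit": "C"},
    {"id": 2, "temp": None,   "unit": "C"},  # incomplete record
    {"id": 1, "temp": "36.6", "unit": "C"},  # duplicate of id 1
    {"id": 3, "temp": "98.2", "unit": ""},   # missing unit of measurement
]

seen, clean, rejected = set(), [], []
for rec in raw:
    if rec["id"] in seen:
        rejected.append((rec, "duplicate"))
    elif rec["temp"] is None:
        rejected.append((rec, "incomplete"))
    elif not rec["unit"]:
        rejected.append((rec, "missing unit"))
    else:
        seen.add(rec["id"])
        clean.append(rec)

print(len(clean), len(rejected))  # 1 3
```

Rejected records are kept with a reason rather than silently dropped, so the bad input can be examined later instead of propagating into results.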


Special Challenges with Big Data

  1. Responsibilities are spread out.

  2. Different people or groups prepare, analyze and apply the data.

  3. There is a greater need to think about quality.

  4. There is a greater need to think about the meaning of the project.


References and Notes

  1. This is not a blog post, but a JavaJEE note post.

  2. Based on the course Techniques and Concepts of Big Data with Barton Poulson.




