What Are the Types of Big Data?

By Stephan Miller - Guest Contributor

Understanding the types of big data can better prepare you to handle large data sets

Many of the advances in AI, machine learning, and business analytics are possible because of big data. Data powers the algorithms that make cars self-driving, suggest the next movie we should watch, and tell business leaders how to increase revenue.

But not all data is created the same. 

To effectively classify, organize, and analyze the data generated by a business and its customers, a business analyst or data scientist needs to know what type of data they're working with. 

What is big data?

Big data refers to high-volume, high-velocity, or high-variety information that needs sophisticated processing and analysis. The data alone isn't helpful; analyzing it is the key to improving business processes. Businesses use several techniques to analyze big data, such as data mining, which highlights patterns in the data. As an example, companies can mine data to learn what sales offers will appeal to particular consumers. When companies handle big data correctly, it facilitates better decisions and helps them deliver better customer service and better products.

Let's dive into the characteristics and main types of big data.

Big data characteristics: The 5 Vs

While big data is a general term that applies to many types of data, there are five characteristics typically used to define big data (also known as the 5 Vs or the features of big data).

1. Volume

This characteristic is in the name: Big data is big. The definition of big is relative and changes depending on the technology available at the time. For example, a three-gigabyte hard drive was once considered huge, while now a laptop with a terabyte of storage is standard.

2. Velocity

Big data is generated quickly. Sensors on IoT devices send messages multiple times per second. Website analytics monitor every mouse movement visitors make to gain insights into their browsing habits. Applications that use this data often need to process it as close to real time as possible.

3. Variety

Variety is the main topic of this article (so keep reading for more!). There is significant variety in big data; every organization that collects data does so from multiple sources and in multiple formats. To turn this data into useful information, data from diverse sources has to be combined.

4. Veracity

Veracity is a characteristic that defines data quality. Not all collected data is complete; it may be inaccurate or contain corrupted data points. Messy big data can do more harm than good; data may need to be cleaned or discarded to provide accurate insights.
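As a rough illustration of cleaning, the sketch below filters a handful of made-up sensor readings. The field names (`sensor`, `temp_c`) and the plausible temperature range are assumptions chosen for the example, not rules from any particular system.

```python
# Made-up readings: two usable, three with typical quality problems.
readings = [
    {"sensor": "a1", "temp_c": 21.5},
    {"sensor": "a2", "temp_c": None},    # missing value
    {"sensor": "a3", "temp_c": "21,9"},  # corrupted: text instead of a number
    {"sensor": "a4", "temp_c": 960.0},   # implausible for a room sensor
    {"sensor": "a5", "temp_c": 22.1},
]


def is_valid(reading, low=-50.0, high=60.0):
    """Accept a reading only if temp_c is a number inside an assumed range."""
    value = reading.get("temp_c")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return low <= value <= high


clean = [r for r in readings if is_valid(r)]
rejected = [r for r in readings if not is_valid(r)]

print("kept:", [r["sensor"] for r in clean])
print("discarded:", [r["sensor"] for r in rejected])
```

Whether a rejected record should be repaired or discarded depends on the data source; this sketch only separates the two groups.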

5. Value

A business simply having a lot of data doesn't mean that all of its data is useful. Another defining characteristic of big data is that it will provide value in the form of insights.

3 main types of big data

While we could create an endless number of categories for the different types of big data, it's much simpler to sort big data into three main types: structured, unstructured, and semi-structured.

1. Structured data

Structured big data is data stored in a fixed schema. Most commonly, this means it is stored in a relational database management system (RDBMS). This data is stored in tables where each record has a fixed set of properties, and each property has a fixed data type.

One example is user records in a database:

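| ID | Email | Name | City | State | ZIP code |
|----|-------|------|------|-------|----------|
| 1 | [email protected] | Bob | Kansas City | MO | 64030 |
| 2 | [email protected] | Sara | Chicago | IL | 60007 |
| 3 | [email protected] | Sam | New York | NY | 10001 |
| 4 | [email protected] | Rick | Los Angeles | CA | 90001 |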

Every record in this table has the same structure, and each property has a specific type. For example, the State column is limited to two uppercase letters, and the ID and ZIP code columns are limited to integers. If you attempt to insert a record that does not fit this structure, the database rejects it and returns an error.
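A minimal sketch of this behavior, using Python's built-in `sqlite3` module and an in-memory database. The table name, column names, `CHECK` rules, and `example.com` addresses are assumptions made for the illustration; a production RDBMS would express the same constraints in its own dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        name  TEXT NOT NULL,
        city  TEXT NOT NULL,
        -- Assumed rule: exactly two uppercase letters.
        state TEXT NOT NULL CHECK (state GLOB '[A-Z][A-Z]'),
        -- Assumed rule: the stored value must be an integer.
        zip   INTEGER NOT NULL CHECK (typeof(zip) = 'integer')
    )
    """
)


def try_insert(row):
    """Return True if the row is accepted, False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (id, email, name, city, state, zip) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


good = try_insert((1, "bob@example.com", "Bob", "Kansas City", "MO", 64030))
bad_state = try_insert((2, "sara@example.com", "Sara", "Chicago", "Illinois", 60007))
bad_zip = try_insert((3, "sam@example.com", "Sam", "New York", "NY", "unknown"))
stored = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(good, bad_state, bad_zip, stored)
```

Only the first row fits the schema, so it is the only one stored; the other two raise a constraint error instead of being saved in a malformed state.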

Structured big data is typically relational. This means that a record in one table, such as the user table above, can be linked to a record or records in another table. Let's say the user table is for a shopping cart, and each user has orders.

| ID | User_ID | Item | Total |
|----|---------|------|-------|
| 1 | 1 | Cup | 2.00 |
| 2 | 2 | Bowl | 4.00 |
| 3 | 2 | Plate | 3.00 |
| 4 | 4 | Spoon | 1.00 |

The User_ID property of the order table above links orders to the IDs in the user table. We can see that Sara has two orders, and Sam hasn't ordered yet.

This type of static structure makes the data consistent and easy to enter, query, and organize. The language used to query database tables like these is SQL (Structured Query Language). Using SQL, developers can write queries that join the records in database tables in endless combinations based on their relationships.
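To make the join concrete, here is a small sketch that loads the two example tables into an in-memory SQLite database and counts orders per user. The table and column names are assumptions based on the tables above, and only the columns needed for the query are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        item    TEXT NOT NULL,
        total   REAL NOT NULL
    );
    INSERT INTO users (id, name) VALUES
        (1, 'Bob'), (2, 'Sara'), (3, 'Sam'), (4, 'Rick');
    INSERT INTO orders (id, user_id, item, total) VALUES
        (1, 1, 'Cup', 2.00), (2, 2, 'Bowl', 4.00),
        (3, 2, 'Plate', 3.00), (4, 4, 'Spoon', 1.00);
    """
)

# LEFT JOIN keeps users who have no orders, so Sam still appears.
rows = conn.execute(
    """
    SELECT u.name,
           COUNT(o.id)               AS order_count,
           COALESCE(SUM(o.total), 0) AS spent
    FROM users AS u
    LEFT JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.id
    """
).fetchall()

for name, order_count, spent in rows:
    print(f"{name}: {order_count} order(s), {spent:.2f} total")
```

An inner join would drop Sam from the result entirely; the left join is what makes "no orders yet" visible.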

The disadvantage of structured data is that updating the structure of a table can be a complex process. A lot of thought must be put into table structures before you even begin using the database. This type of big data is not as flexible as semi-structured data.

2. Unstructured data

According to some estimates, 80-90% of data is unstructured.[1] But just what is unstructured big data? Any data that doesn't fit into the other two categories here counts as unstructured.

Everything that is stored digitally is data. Unstructured data includes text, email, video, audio, server logs, webpages, and on and on. Unlike structured and semi-structured data that can be queried and searched in a consistent manner, unstructured data doesn't follow a consistent data model.

This means that instead of simply using queries to turn this data into useful information, a more complex process must be used, depending on the data source. This is where machine learning, artificial intelligence, natural language processing, and optical character recognition (OCR) can be useful.

One example of unstructured data is scanned receipts that are stored for expense reports. In their native image format, the data is essentially useless. Here, OCR software can turn the images into structured data that can then be inserted into a database.

The disadvantage of unstructured big data is that it is hard to process, and each data source needs a custom processor. The advantage is that so many kinds of unstructured data exist, and the insights gathered from them often can't be found in any other data source.
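As a sketch of what a "custom processor" can look like, the example below turns raw server log text into structured records with a regular expression. The log format and the sample lines are invented for the illustration; a real log source would need its own pattern.

```python
import re

# Invented sample: two lines match the assumed format, one does not.
RAW_LOG = """\
2024-05-01 12:00:03 ERROR payment failed for order 17
garbage line without a timestamp
2024-05-01 12:00:09 INFO user 2 logged in
"""

# Assumed layout: date, time, uppercase level, then free-text message.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.+)$"
)


def parse_log(text):
    """Split raw text into parsed records and lines that did not match."""
    records, rejected = [], []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
        elif line.strip():
            rejected.append(line)
    return records, rejected


records, rejected = parse_log(RAW_LOG)
print(records)
print(rejected)
```

The parsed records have a fixed set of fields and could be inserted into a database table, while the rejected lines show why each source tends to need its own handling.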

3. Semi-structured data

Semi-structured big data fits somewhere between structured and unstructured data. A common source of semi-structured data is NoSQL databases. The data in a NoSQL database is organized, but it isn't relational and doesn't follow a consistent schema.

For example, a user record in a NoSQL database may look like this:

{
  _id: ObjectId("5effaa5662679b5af2c57829"),
  email: "[email protected]",
  name: "Sam",
  address: "101 Main Street",
  city: "Independence",
  state: "Iowa"
}

Here, users access the data they need by the keys in the record. And while it looks similar to the records in the structured data example above, it isn't in a consistent table format. 

Instead, it's in JSON format, which is used to store and transmit data objects. While this one record in the database may have this set of attributes, it doesn't mean the rest of the records will have the same structure. The next record may lack a street address but have a ZIP code instead.

An advantage of semi-structured data stored in a NoSQL database is that it is very flexible. If you need to add more data to a record, simply add it with a new key. This can also be a disadvantage if you need data to be consistent.
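The sketch below shows both sides of that flexibility with Python's standard `json` module. The two records, the `example.com` addresses, and the `phone` key are made up for the illustration; the second record deliberately has a ZIP code but no street address.

```python
import json

RAW = """
[
  {"email": "sam@example.com", "name": "Sam",
   "address": "101 Main Street", "city": "Independence", "state": "Iowa"},
  {"email": "sara@example.com", "name": "Sara",
   "city": "Chicago", "state": "IL", "zip": "60007"}
]
"""

users = json.loads(RAW)

# Flexibility: adding a field is just adding a key, with no schema change.
users[0]["phone"] = "555-0100"

# Inconsistency: collect every key seen, then list what each record lacks.
all_keys = sorted(set().union(*(u.keys() for u in users)))
missing = {u["name"]: [k for k in all_keys if k not in u] for u in users}

# Code that reads the data has to tolerate absent keys.
street = users[1].get("address", "(no street address)")

print(all_keys)
print(missing)
print(street)
```

Nothing stops the records from drifting apart, so any consistency check like the one above has to live in application code rather than in the database schema.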

But NoSQL data isn't the only type of semi-structured big data. XML and YAML are two other flexible data formats that applications use to transfer and store data. Email can also be considered semi-structured data since parts of it can be parsed consistently, such as email addresses, time sent, and IP addresses, while the body is unstructured data.
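The email case can be sketched with Python's standard `email` package. The message below is invented for the example; the headers parse into predictable fields, while the body comes back as free text that would need further processing.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Invented message: structured headers, blank line, unstructured body.
RAW_EMAIL = """\
From: Sam <sam@example.com>
To: support@example.com
Subject: Order question
Date: Wed, 01 May 2024 12:00:00 +0000

Hi, my plate arrived chipped. Can I get a replacement?
"""

msg = message_from_string(RAW_EMAIL)

# Structured part: headers can be read by name and parsed consistently.
sender_name, sender_address = parseaddr(msg["From"])
sent_at = parsedate_to_datetime(msg["Date"])

# Unstructured part: the body is just text.
body = msg.get_payload()

print(sender_name, sender_address, sent_at.isoformat())
print(body)
```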

Comparing structured, semi-structured, and unstructured data

This table better illustrates the differences between these three types of big data: 

 

| | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Format | Most commonly data from relational databases where the data is arranged in structured tables and has specific types such as integer, float, and text. | Most commonly data from NoSQL databases and transferred in a data serialization language such as JSON, XML, or YAML. | Unstructured data doesn't follow any schema and can take the form of log files, raw text, images, videos, and more. |
| Querying | Can be queried quickly with SQL in a structured and consistent way. | This data can be queried, but due to its semi-structured nature, records may not be consistent. | The raw data must be parsed and processed with custom code in many cases. |
| Transactions | Databases support transactions to ensure dependent data is updated. | Transactions are partially supported in NoSQL databases. | Transactions are not possible with unstructured data. |
| Flexibility | Structured data sets have a complex update process and are not very flexible. | NoSQL databases are flexible because data schemas can be updated dynamically. | Unstructured data is the most flexible but also the hardest to process. |

Evaluate your data sources to get started with big data 

A good first step in any big data project is taking an inventory of all data sources available to you and your business and categorizing them by type. This allows you to begin processing and compiling data to provide useful insights. 


About the Author

Stephan Miller is a freelance writer and software developer specializing in software and programming. He has written two books for Packt Publishing.
