Big Data Integration, Data Lakes, Data Warehouses and Mining
- Overview
Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency. Big data has one or more of the following characteristics: high volume, high velocity or high variety. Artificial intelligence (AI), mobile, social and the Internet of Things (IoT) are driving data complexity through new forms and sources of data.
For example, big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media — much of it generated in real time and at a very large scale.
A data warehouse is a large repository for data drawn from many sources, and is a critical element in any big data strategy. Just as a warehouse is a large building for the storage of goods, a data warehouse is a repository where large amounts of data can be collected.
Data mining is the process of discovering previously unknown patterns in large data sets, whereas data warehousing is the process of pooling all the relevant data together. In other words, data mining is a way of analyzing data, while a data warehouse is a technique for collecting and managing it.
Data warehousing has been a common term for the last 10-20 years, whereas big data has been a hot trend for the last 5-10 years. Both hold large amounts of data, are used for reporting, and are managed on electronic storage. A common assumption is therefore that big data will soon replace data warehousing. However, the two are not interchangeable, because they are used for different purposes.
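To make the contrast concrete, the following is a minimal sketch of one simple mining task: counting which items tend to appear together in the same record. The purchase records, the function name `frequent_pairs`, and the `min_support` threshold are all hypothetical and chosen only for illustration; real mining tools use far more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records, as they might be pulled from a warehouse table.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always counted under the same ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

Here the warehouse role is played by the collected `transactions` list, and the mining step is the function that finds a pattern nobody entered explicitly: bread and milk are bought together in three of the four records.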
A data repository is a data library or archive. It may refer to large database management systems or several databases that collect, manage, and store sensitive data sets for data analysis, sharing, and reporting.
- Data Warehouses
A data warehouse (DW), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. The system’s logical design facilitates the integration of data sources and allows the generation of new, additional valuable data sources without significant structural adjustment.
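As a rough illustration of integrating disparate sources, the sketch below loads two differently shaped inputs into one shared table and then runs a reporting query over it. It uses Python's built-in `sqlite3` module as a stand-in for a warehouse engine; the source names, field names, and the `sales` table are assumptions made up for this example.

```python
import sqlite3

# Two hypothetical source systems with different field names and units.
crm_rows = [("C1", "2024-01-05", 120.0), ("C2", "2024-01-06", 80.0)]  # amounts in dollars
web_orders = [{"cust": "C1", "day": "2024-01-07", "cents": 4550}]      # amounts in cents

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer_id TEXT, sale_date TEXT, amount REAL, source TEXT)"
)

# Load each source into the shared schema, normalizing names and units on the way in.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, 'crm')", crm_rows)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, 'web')",
    [(o["cust"], o["day"], o["cents"] / 100) for o in web_orders],
)

# A reporting query over the integrated data.
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM sales "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(totals)
```

Because both sources land in the same logical design, adding a third source would only require another load step, not a change to the reporting query.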
Each organization has distinct operational practices and business models, which result in a variety of data generation platforms. Ultimately, a data warehouse should be greater than the sum of its data, and serve as an ongoing intelligent resource for use by multiple members of an organization, large or small. For that to happen, data warehouse technologies require data virtualization, processing, and transformation methods.
There are several delivery models, including physical appliances, such as dedicated traditional storage subsystems built to support analytics and business intelligence (BI). BI is an umbrella term that includes the applications, infrastructure, tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.
With the ongoing evolution of the cloud, cloud-based solutions, seen as agile and less capital-intensive, aim to simplify both the hosting and the analysis of data in an increasingly complicated environment.
In addition to the explosive growth in the amount of data and data sources we’ve seen in recent years, another motivation for creating even more sophisticated data warehousing systems is the ever-increasing need for customizable business intelligence and analytics.
- Data Lakes
A data lake is a centralized repository that stores, processes, and secures large amounts of data. It can store data in its native format and process any variety of it, without a predefined limit on size.
Data lakes are used for analytics applications, big data analytics, machine learning, reporting, visualization, and advanced analytics.
Data lakes are different from traditional data warehouses, which store data in hierarchical dimensions and tables. Data lakes use a flat architecture to store data, primarily in files or object storage.
Data lakes can include raw copies of source system data, sensor data, social data, and transformed data.
A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). Data lakes can be used to explore and analyze petabytes of data. One petabyte of data is equivalent to 1 million gigabytes.
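The idea of keeping each format as-is and interpreting it only when it is read can be sketched as follows. This is a toy model under stated assumptions: a temporary local directory stands in for object storage, and the file names and the `read_object` helper are hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for object storage; keys are flat file names.
lake = Path(tempfile.mkdtemp())

# Land each object unchanged, in its native format.
(lake / "orders-2024-01.csv").write_text("id,total\n1,9.50\n2,20.00\n")
(lake / "clicks-2024-01.json").write_text(json.dumps([{"page": "/home", "ms": 132}]))
(lake / "logo.png").write_bytes(b"\x89PNG\r\n\x1a\n")  # binary data is stored as-is

def read_object(path):
    """Apply a structure only when reading, chosen by file extension."""
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(path.read_text())))
    if path.suffix == ".json":
        return json.loads(path.read_text())
    return path.read_bytes()  # binary or unknown formats come back raw

catalog = {p.name: read_object(p) for p in sorted(lake.iterdir())}
for name, content in catalog.items():
    print(name, type(content).__name__)
```

Nothing is converted at write time, so the same stored objects can later be read by different tools with different interpretations, which is the main practical difference from loading data into warehouse tables.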
- Data Repository
A data repository is a centralized place to store, manage, and share data. It can also be called a data library or archive. Data repositories are commonly used in scientific research and business, and can be large database management systems or multiple databases.
Data repositories can have many benefits, including:
- Data management: Data repositories can help you organize and deposit data, and make it easier to find and use.
- Data preservation: Data repositories can help preserve data for long-term use.
- Data discovery: Data repositories can make data more valuable for research by making it easier to discover.
- Data citation: Data repositories can provide persistent identifiers, such as Digital Object Identifiers (DOIs), that allow you to cite your data.
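As a small illustration of the data citation point, a DOI can be turned into a resolvable link by prefixing it with `https://doi.org/`. The sketch below assumes a simple citation layout and uses a made-up DOI, author, and repository name; the regular expression only checks the common overall shape of a DOI and is not a full validator.

```python
import re

# Common shape of a DOI: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def doi_url(doi):
    """Turn a DOI string into a link served by the doi.org resolver."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI-shaped string: {doi!r}")
    return f"https://doi.org/{doi}"

def cite(authors, year, title, repository, doi):
    """Format a data set citation (hypothetical layout, for illustration only)."""
    return f"{authors} ({year}). {title} [Data set]. {repository}. {doi_url(doi)}"

# All values below are invented examples, not a real data set.
citation = cite("Doe, J.", 2024, "Example measurements",
                "Example Repository", "10.1234/example.5678")
print(citation)
```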
When choosing a data repository, you can consider things like:
- Preservation plan: Whether the repository has a plan to ensure the data is preserved.
- FAIR Data Principles: Whether the repository supports these principles (Findable, Accessible, Interoperable, Reusable).
- Repository finder tools: Whether there are tools that can help you find repositories that meet your needs.
Some examples of data repositories include:
- Dryad: A general-purpose repository that makes scientific publication data discoverable, reusable, and citable.
- Google Dataset Search: A search engine that allows you to search across thousands of online data repositories.