In the early days of modern computing, computers collected data from physical media (punched cards, keyboard input and tapes) and processed it in memory, that is, in a virtual environment. In the late 1970s, computers started to group data for processing within worksheet-like structures, the grandfather of Databases.
Databases followed, and the logic behind them was for data to be entered in a structured manner according to a prior human classification. So although raw data (in the sense that it had not been previously treated) went into the Database tables, the decision of what would fit where, and why, was based on human criteria. Databases are in fact a limited tool, since each one serves a specific context: the Database for Accounting Data must be a distinct entity from the Database for Logistics Data (although the two may need to connect and reference each other).
Most of the Systems and Tools in operation today still have their own Database structures, which interconnect and exchange data to create added-value information. The Database evolved into larger clusters of similar data, the Data Warehouses, and although the process may differ somewhat, Data is still grouped in the same manner, within a given common context.
With the advent of Big Data (for Data tends to grow exponentially), plus today’s innovation in data handling and processing tools, we have reached a stage where large flows of several types of Data may be channeled into extremely large data repositories (Data Lakes) in an ad-hoc manner, which nevertheless allows correlations to be established under pre-defined “schemas” from which meaningful information can be extracted.
The main advantages of Data Lakes lie in the following characteristics:
- While Data Warehouses store structured, already processed Data, Data Lakes also store semi-structured or unstructured Data, as well as completely raw and (apparently) unrelated Data. And no, it’s not a waste of time and space, since relationships are most often found only after processing.
- Data Warehouse architecture makes large data volumes expensive to store, whereas Data Lakes are specifically designed for low-cost storage.
- Data Warehouses follow a pre-defined structure in which data is stored (pre-defined data classes and families), while Data Lakes offer great flexibility for dynamic structural configuration, which is closer to the way humans take in information and think.
- Data which goes into Data Warehouses is mature (it has a clearly defined context), yet Data Lakes store Data that is still maturing (still finding its place in the overall context).
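The contrast in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the sample records and the helper functions (`load_into_warehouse`, `load_into_lake`, `read_from_lake`) are hypothetical, and plain lists stand in for real storage. The warehouse-style loader keeps only records that already fit its fixed columns, while the lake-style loader keeps everything and defers the question of structure to read time.

```python
import json

# Hypothetical fixed warehouse layout: a record must have exactly these fields.
WAREHOUSE_COLUMNS = ("order_id", "amount")


def load_into_warehouse(records):
    """Schema-on-write: keep only records that already fit the fixed columns."""
    table = []
    for record in records:
        if isinstance(record, dict) and set(record) == set(WAREHOUSE_COLUMNS):
            table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))
    return table


def load_into_lake(records):
    """Store every record as raw JSON text; no structure is imposed on write."""
    return [json.dumps(record) for record in records]


def read_from_lake(lake, wanted_fields):
    """Schema-on-read: decide at query time which fields are relevant."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if isinstance(record, dict) and all(f in record for f in wanted_fields):
            rows.append({f: record[f] for f in wanted_fields})
    return rows


# Made-up sample input mixing structured, semi-structured and raw data.
incoming = [
    {"order_id": 1, "amount": 40.0},
    {"order_id": 2, "amount": 15.5, "device": "sensor-7"},
    {"device": "sensor-7", "temperature": 21.3},
    "free-text log line",
]

warehouse = load_into_warehouse(incoming)      # only the first record fits
lake = load_into_lake(incoming)                # all four records are kept
device_rows = read_from_lake(lake, ["device"]) # structure chosen when reading
```

In this sketch, the records mentioning a device never reach the warehouse table, yet they remain available in the lake for a question nobody had asked when the data arrived.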
Without Data Lakes, the IoT or even AI would be heavy, slow or even impossible to implement in a cost-effective manner.
Think of it as an actual lake, into which data flows from several streams (M2M, log files, real-time data collection, CTI, etc.), and on which you can then run the appropriate analytics to extract added-value information.
Accessing data from different sources, such as DB2-based systems, SAP or any other IT System, requires both specific tools and licensing, which costs time and money. Extracting meaningful, new added-value information out of those distinct IT environments means creating a detailed map of WHAT data is relevant to WHICH requirements, and then developing code that can gather such Data in the appropriate sequence to produce information that shows new meaningful content.
Data Lakes are an inevitable consequence of widespread access and integration, and social and tech trends like Social Media or IoT will only make them more common. A Data Lake needs to provide the capabilities embedded in the ISASA concept, which specifically represents the ability to:
- Ingest data from several distinct flow streams through appropriate APIs or Batch processes.
- Store amounts of Data whose size cannot be foreseen on scalable repositories (the Lake) through all necessary protocols (NFS, CIFS, FTP, HDFS and others).
- Analyze the Data by finding the relevant correlations according to your needs and expectations.
- Surface relevant information in a user-friendly manner that conveys what you need to see in the most straightforward and effective way.
- Act in the most efficient and cost-effective manner, leading you to reach the intended goals.
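The five ISASA steps above can be read as the stages of a small pipeline. The Python sketch below is a toy illustration, not a reference implementation: the stream names, the sample events, the "error" keyword rule and the escalation threshold are all assumptions made for the example, and an in-memory list stands in for the scalable repository.

```python
from collections import Counter


def ingest(streams):
    """Ingest: merge events from several sources, tagging each with its origin."""
    for source, events in streams.items():
        for event in events:
            yield {"source": source, "payload": event}


def store(lake, events):
    """Store: append raw events to the lake (a list stands in for real storage)."""
    lake.extend(events)
    return lake


def analyze(lake):
    """Analyze: count events per source whose payload mentions an error."""
    return Counter(e["source"] for e in lake if "error" in e["payload"].lower())


def surface(error_counts):
    """Surface: turn the analysis into short, readable lines."""
    return [f"{source}: {count} error(s)"
            for source, count in sorted(error_counts.items())]


def act(error_counts, threshold=2):
    """Act: select the sources whose error count reaches the threshold."""
    return sorted(s for s, c in error_counts.items() if c >= threshold)


# Hypothetical input streams, loosely modelled on M2M traffic and log files.
streams = {
    "m2m": ["temp ok", "ERROR sensor offline", "error timeout"],
    "logs": ["user login", "Error disk full"],
}

lake = store([], ingest(streams))
counts = analyze(lake)
report = surface(counts)
to_escalate = act(counts)
```

Keeping each stage as its own function mirrors the idea that the Lake itself stays raw: the analysis, the presentation and the resulting action can each be swapped without touching what was stored.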
The entire concept of capturing raw data and applying distinct Schemas to it (logical, circumstantial processing) is a very powerful one and, in fact, it is precisely what our brain does. We gather raw data and, based on our knowledge (rules gathered through past experience and learning), we are able to “issue” a new set of Data that adds value to what we have collected and stored as “knowledge”.
And this … has everything to do with Artificial Intelligence.
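The idea of applying distinct Schemas to the same raw data can be made concrete with a short Python sketch. Everything in it is hypothetical: the sample events, the field names and the two "views" are invented for illustration, borrowing the Accounting and Logistics contexts mentioned earlier. Each schema is simply a function that reads the same raw event and keeps what matters for its own context.

```python
# Made-up raw events, stored once with no particular context in mind.
RAW_EVENTS = [
    {"sku": "A1", "qty": 3, "unit_price": 10.0, "warehouse": "Lisbon"},
    {"sku": "B2", "qty": 1, "unit_price": 99.5, "warehouse": "Porto"},
    {"sku": "A1", "qty": 2, "unit_price": 10.0, "warehouse": "Porto"},
]


def accounting_view(event):
    """Accounting schema: what was sold and how much revenue it produced."""
    return {"sku": event["sku"], "revenue": event["qty"] * event["unit_price"]}


def logistics_view(event):
    """Logistics schema: how many units moved through which warehouse."""
    return {"warehouse": event["warehouse"], "units": event["qty"]}


def apply_schema(events, schema):
    """Read the raw events through one schema, leaving the raw data untouched."""
    return [schema(event) for event in events]


def total_by(rows, key, value):
    """Sum the `value` field of the rows, grouped by the `key` field."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


revenue_by_sku = total_by(
    apply_schema(RAW_EVENTS, accounting_view), "sku", "revenue")
units_by_warehouse = total_by(
    apply_schema(RAW_EVENTS, logistics_view), "warehouse", "units")
```

The raw events are never modified, so a third schema could be added later without re-collecting or re-storing anything.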
The above is an extract. Read the full article on www.tenfold.com