Ever since humans began to systematize knowledge, they have needed to classify and define the reality around them. To do so, they introduced quantitative methods to describe everything around them.
A database is a software tool whose purpose is to make it easy and efficient not only to store descriptions of the realities of interest, but above all to retrieve data in a correlated way in order to extract information.
Let's take a simple example. Products on sale in a supermarket could be described by brand, trade name, description and price. A regular customer may hold a "loyalty card" associated with some personal data, the composition of their household and their educational qualifications. Each supermarket receipt connects the customer with the products purchased on a certain date and at a certain time. These are the data.
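To make the example concrete, here is a minimal sketch of how these data could be cataloged in SQL; all table and column names are illustrative assumptions, not part of any real system:

```sql
-- Hypothetical supermarket schema: one table per entity of interest.
CREATE TABLE product (
  product_id  INTEGER PRIMARY KEY,
  brand       TEXT,
  trade_name  TEXT,
  description TEXT,
  price       NUMERIC
);

CREATE TABLE customer (
  card_id         INTEGER PRIMARY KEY, -- loyalty card number
  household_size  INTEGER,             -- composition of the household
  education_level TEXT                 -- educational qualification
);

-- Each receipt line connects a customer to a purchased product
-- on a certain date and at a certain time.
CREATE TABLE receipt_line (
  receipt_id INTEGER,
  card_id    INTEGER REFERENCES customer(card_id),
  product_id INTEGER REFERENCES product(product_id),
  quantity   INTEGER,
  sold_at    TIMESTAMP
);
```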
But what information can we extract? For example, on which day of the week is the greatest quantity of beer sold, or what type of food is purchased by families in which at least one member has a university degree. A query of the first kind is sketched below.
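As a sketch, the first question could be expressed against the hypothetical tables above; EXTRACT(DOW ...) and LIMIT are PostgreSQL syntax, and other systems offer equivalents:

```sql
-- On which day of the week is the greatest quantity of beer sold?
SELECT EXTRACT(DOW FROM r.sold_at) AS day_of_week,
       SUM(r.quantity)             AS units_sold
FROM   receipt_line r
JOIN   product p ON p.product_id = r.product_id
WHERE  p.description LIKE '%beer%'   -- fragile: relies on free text
GROUP  BY EXTRACT(DOW FROM r.sold_at)
ORDER  BY units_sold DESC
LIMIT  1;
```

Note how identifying "beer" through the free-text description already requires interpreting the content, which anticipates the atomicity problem discussed below.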
The more details we store about a given reality, the greater the possibility of interrogating the system to extract valuable information. In English, the verb used to express the concept of questioning the system is to query, which actually has a Latin root, the verb quaero (to ask in order to obtain something). The secret to creating a database that can potentially serve a wide range of queries is to represent the descriptive data of a reality in an atomic, non-aggregated form.
In the previous example, the free-text product description is not a good place for details such as the weight of a product or the number of packages bundled together. Data that are not explicitly cataloged are very complex to process, because they require an interpretation of the content.
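A minimal sketch of the difference, reusing the hypothetical product table above; the added columns are illustrative:

```sql
-- Non-atomic: details buried in free text, e.g.
--   description = 'Pilsner beer, 0.5 l, pack of 6'
-- Atomic: each detail explicitly cataloged in its own column.
ALTER TABLE product ADD COLUMN category      TEXT;    -- e.g. 'beer'
ALTER TABLE product ADD COLUMN unit_weight_g INTEGER; -- weight of one unit
ALTER TABLE product ADD COLUMN pack_size     INTEGER; -- packages bundled together

-- The beer query can now filter without interpreting text:
--   WHERE p.category = 'beer'
```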
To make data cataloging intuitive, the table construct is used, in which an entity of interest is represented by rows (e.g., a product on sale, a patient, etc.) and its characteristics by columns (e.g., brand, product name, price, etc.). A table can also represent relationships between different entities. For example, to represent the ownership of a vehicle, all I need is a table in which each row contains the owner's tax code and the vehicle's license plate, i.e. the two characteristics that uniquely identify the owner and the vehicle.
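A possible sketch of such a relationship table (names are illustrative):

```sql
-- Ownership relationship: each row pairs the identifiers
-- of the two entities involved.
CREATE TABLE ownership (
  tax_code      TEXT, -- uniquely identifies the owner
  license_plate TEXT, -- uniquely identifies the vehicle
  PRIMARY KEY (tax_code, license_plate)
);
```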
What has been described so far is the relational data model proposed by Edgar Codd (IBM) [1], which still represents the standard for data representation, also thanks to the simplicity of the language developed to query the system (SQL, sometimes read as an acronym for Structured Query Language, although in reality this is the name given to the standardized language to distinguish it from IBM's original commercial name, SEQUEL).
The availability of a large amount of detailed information stored in a relational database makes it possible to extract useful information for the monitoring, management and strategic planning of an organization. For example, aggregating the individual receipts of a store, or the exam grades of a student, allows us to study the overall trend of sales (by time slot of the day, by type of customer, etc.) or of student careers (exam results by semester, by course of study, etc.), respectively. These operations are carried out in a Data Warehouse, an archive where data is stored in aggregate form. The data analysis tools used in a data warehouse go under the name of Business Intelligence and include several statistical techniques and machine learning algorithms. In the past, the term Data Mining was used, suggesting that data is a mine from which to extract value.
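As a sketch, the kind of aggregation a data warehouse stores could be computed from the hypothetical receipt table above (EXTRACT(HOUR ...) is PostgreSQL syntax):

```sql
-- Aggregate individual receipt lines into overall sales per time slot.
SELECT EXTRACT(HOUR FROM sold_at)  AS hour_of_day,
       COUNT(DISTINCT receipt_id) AS receipts,
       SUM(quantity)              AS units_sold
FROM   receipt_line
GROUP  BY EXTRACT(HOUR FROM sold_at)
ORDER  BY hour_of_day;
```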
The relational model allows data to be stored efficiently and different types of correlations to be performed, but with an intrinsic processing slowness due to the separation of the information into distinct tables, which requires several read operations from the storage system to produce the result. Over the last twenty years, NoSQL (Not only SQL) models [2] have been spreading, specialized in storing aggregates. An example from the e-commerce sector can clarify the concept. When we search for a product and compare different alternatives, we have the possibility to select the desired characteristics. This is a typical functionality of relational models, which store product characteristics in a structured way. For example, for a television we can select the size of the screen, the resolution, the presence or absence of certain connection ports, etc. When we proceed with the purchase, we may use a discount code or an offer of the day. All these details are stored in a single element of a NoSQL database, which represents the equivalent of the receipt or invoice. Indeed, it would be onerous to store the history of the various versions of the products sold, of the promotional campaigns and of the discount codes in the relational model.
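One possible shape of such an aggregate is a JSON document, the format used by many NoSQL document stores; every field name and value here is purely illustrative:

```json
{
  "order_id": "2024-000123",
  "placed_at": "2024-03-15T18:42:00",
  "customer": { "card_id": 4711 },
  "items": [
    {
      "trade_name": "55-inch TV",
      "resolution": "4K",
      "hdmi_ports": 3,
      "list_price": 499.00
    }
  ],
  "discount_code": "SPRING10",
  "total_paid": 449.10
}
```

The document freezes the product characteristics, price and discount as they were at the moment of purchase, so no correlation with other tables is needed to reconstruct the invoice later.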
In a modern organization we therefore find different databases, each specialized for a specific purpose: relational databases to store all the details of a specific reality, and NoSQL databases to represent aggregates that are easy to retrieve without having to correlate information each time. We therefore often speak of polyglot persistence [3].
The need to store data and then process it has grown dramatically today thanks to the development of different types of sensors, which we often refer to with the generic term Internet of Things. In our daily lives we probably use a wearable device to monitor some parameters of our activity. Many vehicles (cars, scooters, bikes, etc.) allow the route taken and the wear of some components to be recorded. In our homes there are smart domestic utility meters, so called because they communicate, at constant time intervals, information on the consumption of electricity, gas or water. These data are relevant for instantly identifying anomalous or dangerous situations, but above all they are useful, once aggregated, for identifying trends and habits.
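A sketch of how such readings could be stored and aggregated, with hypothetical names:

```sql
-- Readings sent by a smart meter at constant time intervals.
CREATE TABLE meter_reading (
  meter_id INTEGER,
  read_at  TIMESTAMP,
  kwh      NUMERIC -- electricity consumed since the previous reading
);

-- Aggregation to study trends: total daily consumption per meter.
SELECT meter_id,
       CAST(read_at AS DATE) AS day,
       SUM(kwh)              AS daily_kwh
FROM   meter_reading
GROUP  BY meter_id, CAST(read_at AS DATE)
ORDER  BY meter_id, day;
```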
This abundance of data, structured and unstructured, managed with different models and technologies (and often collected in what is called a data lake), has given rise in recent years to the professional figure of the data scientist: the specialist in the whole data chain, from production, through filtering, cleaning and aggregation, to the query techniques that allow information to be extracted.
We live in the so-called information society [4], where the storage, representation and correlation of data constitute the true wealth. It is no coincidence that the saying "Data is the new oil" [5] became famous in 2006, and more recently that "artificial intelligence is the new electricity" [6].
[1] https://www.ibm.com/history/edgar-codd
[2] https://sheinin.github.io/nosql-database.org/
[3] https://martinfowler.com/articles/nosqlKeyPoints.html
[4] https://www.manuelcastells.info/en/
[5] https://www.sheffield.ac.uk/cs/people/academic-visitors/clive-humby
[6] https://www.gsb.stanford.edu/insights/andrew-ng-why-ai-new-electricity