Many companies face the problem that data which may become important for new applications years later have long since been deleted by the time they are needed, or their structure has been changed several times in the meantime. Furthermore, data are often selected, aggregated or transformed before they are first saved, so they are no longer complete when they are to be used later.
Especially for data-intensive projects in data science or AI, suitable data must therefore first be collected anew, which causes significant delays in the planned projects.
How can data lakes help?
A data lake is an architectural pattern that aims to make data from various applications available in a centralized ecosystem over the long term. Where possible, data from every segment and department of a company are stored in a central location. Unlike in traditional data warehouses, however, the raw data are always stored as well, often in an object storage service such as S3.
The advantage of this approach is that the information is available in its entirety, without being reduced or transformed when it is first stored, as it would be in a traditional data warehouse. Consequently, the central data pool does not have a structure tailored to specific user requirements; instead, the consumers have to derive the meaning of the data themselves.
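As a minimal sketch of this idea (the bucket name, key layout and event structure below are assumptions for illustration, not part of an existing setup), raw application data could be written untransformed into an S3-backed raw zone using boto3:

```python
import json
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical raw event, taken straight from an application and stored untransformed.
event = {"order_id": "42-1337", "customer": "ACME", "amount": 99.90, "currency": "EUR"}

# Date-based key prefixes keep the raw zone organized and make later filtering cheaper.
now = datetime.now(timezone.utc)
key = (
    f"raw/orders/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
    f"{event['order_id']}.json"
)

s3.put_object(
    Bucket="example-data-lake",  # assumed bucket name
    Key=key,
    Body=json.dumps(event).encode("utf-8"),
)
```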
To exploit the advantages of a data lake efficiently, it should be provided at a cross-departmental level, so the data can be retrieved wherever they are needed.
The data can be stored in different zones, which allows access at different levels of abstraction. Data scientists, for example, use low-level tools such as Athena to gain in-depth, detailed insight into the data pool, whereas more specialized data marts are preferable for technical departments.
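One possible zone layout, purely illustrative and with assumed prefix names rather than a fixed standard, could map these abstraction levels to S3 prefixes:

```python
# Illustrative zone layout within one data lake bucket; the names are assumptions.
# Consumers choose the abstraction level they need.
ZONES = {
    "raw": "s3://example-data-lake/raw/",            # unmodified source data
    "cleansed": "s3://example-data-lake/cleansed/",  # validated and deduplicated
    "curated": "s3://example-data-lake/curated/",    # aggregated views / data marts
}

def zone_uri(zone: str, dataset: str) -> str:
    """Return the S3 prefix under which a dataset is stored in a given zone."""
    return f"{ZONES[zone]}{dataset}/"

print(zone_uri("raw", "orders"))      # queried directly, e.g. via Athena
print(zone_uri("curated", "orders"))  # consumed by technical departments
```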
What does Amazon Athena offer?
Amazon Athena allows SQL queries to be executed directly on (semi-)structured data in S3 buckets, without requiring a database with a predefined schema. Preparatory ETL (Extract, Transform, Load) processes, as known from traditional data warehouses, are not required to work with the raw data either.
As Amazon Athena is a serverless service, no infrastructure has to be provisioned by the user; this happens automatically in the background and is transparent to the user. On the one hand, this reduces the required effort and specialist knowledge; on the other hand, the service only incurs costs per gigabyte of data read from S3.
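As a sketch of what this looks like in practice (the database, table and bucket names below are assumptions for illustration), a query can be submitted via Athena's boto3 API and run directly against the files in S3, with no cluster to manage:

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical query over JSON files in the raw zone; database, table and
# result bucket are assumed names for this sketch.
query = """
    SELECT customer, SUM(amount) AS total
    FROM raw_orders
    GROUP BY customer
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Athena runs the query serverlessly; the client simply polls until it finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```

Since the client only submits the SQL statement and polls for the result, no infrastructure knowledge is needed; the amount of data the query reads from S3 determines its cost.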
Lecture at online campus event (German only)
The following video of our first online campus event provides more detailed insight into the technical background as well as the possibilities for application and optimization. It includes a discussion of practical experiences and a brief live demonstration in the AWS console.