Apache Iceberg example

12/12/2023

Here is a very basic diagram of the different files that are created during a CTAS (create table as select). All engines that want to interact with the table first get the latest “pointer” from the metastore, then start working with the Iceberg metadata files from there.
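The pointer lookup described above can be sketched with a toy, in-memory model. This is only an illustration of the idea, not Iceberg's actual file format or API: the table name, the paths, and the field names are all made up, and real metadata files carry far more information.

```python
# Toy model of the files a CTAS leaves behind.
# ASSUMPTION: every path and field name here is hypothetical and simplified.

# "Object storage": path -> file content.
storage = {
    # Table metadata file: knows the current snapshot and its manifest list.
    "warehouse/orders/metadata/v1.metadata.json": {
        "current-snapshot-id": 1,
        "snapshots": {1: "warehouse/orders/metadata/snap-1.avro"},
    },
    # Manifest list: one entry per manifest file in the snapshot.
    "warehouse/orders/metadata/snap-1.avro": [
        "warehouse/orders/metadata/manifest-a.avro",
    ],
    # Manifest: the data files written by the CTAS.
    "warehouse/orders/metadata/manifest-a.avro": [
        "warehouse/orders/data/part-0.parquet",
        "warehouse/orders/data/part-1.parquet",
    ],
}

# "Metastore": table name -> path of the latest metadata file.
metastore = {"orders": "warehouse/orders/metadata/v1.metadata.json"}


def resolve_data_files(table):
    """Follow pointer -> metadata -> manifest list -> manifests -> data files."""
    metadata = storage[metastore[table]]
    manifest_list_path = metadata["snapshots"][metadata["current-snapshot-id"]]
    data_files = []
    for manifest_path in storage[manifest_list_path]:
        data_files.extend(storage[manifest_path])
    return data_files


print(resolve_data_files("orders"))
```

In this sketch the metastore holds nothing but a path. Everything else an engine needs to plan a query comes from files that sit next to the data.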

Iceberg is a layer of metadata over your object storage. It provides a transaction log per table, very similar to a traditional database. This log keeps track of the current state of the table, including any modifications. Every time a modification to an Iceberg table is performed (insert, update, delete, etc.), a new snapshot of the table is created.

A Hive-compatible metastore is used to “point” to the latest metadata file, which holds the current state of the table. Under the covers, Iceberg uses a set of metadata files, including Avro-based manifests, to keep track of this state. It also keeps a current “snapshot” of the files that belong to the table, along with statistics about them, which greatly reduces the amount of data that must be read during queries and improves performance. When an Iceberg client (Trino, let’s say) wants to query a table, it reads the latest snapshot and then the files that “belong” to that snapshot. This makes a very powerful feature called time travel available: the table at any given point contains a set of snapshots over time, which can be queried with the proper syntax.

Partitioning is performed on any column, and end users query Iceberg tables just like they would a database. Since Iceberg stores the table state in a snapshot, the engine simply needs to read the metadata in that snapshot and then start retrieving the data from storage, which saves valuable time and reduces cloud object store retrieval costs. Much like a database, Iceberg supports full schema evolution, including columns and even partitions.

Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to adhere to use cases and compliance requirements such as GDPR. Iceberg metadata is always available to all engines. So you can guarantee consistency, even with multiple writers.

The data that Iceberg and these engines work on is YOUR data in YOUR account, which avoids data lock-in. This offers the ultimate flexibility to own your own data and choose the engine that fits your use cases.
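To make the snapshot idea more concrete, here is a minimal Python sketch of a table that creates a new snapshot on every change, lets a reader query an older snapshot (time travel), and uses per-file min/max statistics to skip files. It is a simplified model under stated assumptions, not how Iceberg is implemented: the `Table`, `Snapshot`, and `DataFile` classes, the `id` column, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    min_id: int  # smallest value of the hypothetical "id" column in this file
    max_id: int  # largest value of that column in this file


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # the complete set of DataFile objects visible in this snapshot


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Create a new snapshot; older snapshots are left untouched."""
        removed = set(remove)
        current = self.snapshots[-1].files if self.snapshots else ()
        files = tuple(f for f in current if f.path not in removed) + tuple(add)
        snapshot = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def scan(self, id_value, snapshot_id=None):
        """Return paths that may contain id_value, as of the given snapshot."""
        if snapshot_id is None:
            snapshot = self.snapshots[-1]  # default: latest state of the table
        else:
            snapshot = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        # Statistics-based pruning: skip files whose id range cannot match.
        return [f.path for f in snapshot.files if f.min_id <= id_value <= f.max_id]


table = Table()
first = table.commit(add=[DataFile("part-0.parquet", 1, 100),
                          DataFile("part-1.parquet", 101, 200)])   # the CTAS
second = table.commit(add=[DataFile("part-2.parquet", 201, 300)])  # an insert
third = table.commit(remove=["part-0.parquet"])                    # a delete

print(table.scan(150))                    # ['part-1.parquet']
print(table.scan(50, snapshot_id=first))  # ['part-0.parquet']
print(table.scan(50))                     # []
```

The second query still finds `part-0.parquet` because the first snapshot was never rewritten. The delete only produced a newer snapshot that no longer lists the file.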
Here is a list of the many features Iceberg provides:

Choose your engine

As you can see from the diagram above, there are many engines that support Iceberg. It has been growing in popularity, not only because of how useful it is but also because it is a truly open source table format: many companies have contributed and helped improve the specification, making it a true community-based effort. With more and more technologies jumping on board, Iceberg isn’t a passing fad. As you can see, the work that each engine has done is a great indicator of the popularity and usefulness of this exciting technology.

One of the best things about Iceberg is the vast adoption by many different engines. In the diagram below, you can see that many different technologies can work on the same set of data as long as they use the open-source Iceberg API.

The excitement around Iceberg began last year and has greatly increased in 2022. Most of the customers and prospects I speak with on a weekly basis are either considering migrating their existing Hive tables to it or have already started. They are excited that a true open source table format has been created, with many engines, both open source and proprietary, jumping on board.

Iceberg allows organizations to finally build true data lakehouses in an open architecture, avoiding vendor and technology lock-in.

What is Apache Iceberg?

Apache Iceberg is a table format, originally created by Netflix, that provides database-type functionality on top of object stores such as Amazon S3.

This allows an organization to take advantage of low-cost, high-performing cloud storage while providing data warehouse features and experience to its end users without being locked into a single vendor.

Apache Iceberg is an open source table format that brings database functionality to object storage such as S3, Azure’s ADLS, Google Cloud Storage and MinIO.

This post is part of the Iceberg blog series:

Introduction to Apache Iceberg in Trino
Iceberg Partitioning and Performance Optimizations in Trino
Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
Apache Iceberg Schema Evolution in Trino
Apache Iceberg Time Travel & Rollbacks in Trino
Automated maintenance for Apache Iceberg tables in Starburst Galaxy
Improving performance with Iceberg sorted tables
How to migrate your Hive tables to Apache Iceberg
