Engineering Lakehouses with Open Table Formats

Build scalable and efficient lakehouses with Apache Iceberg, Apache Hudi, and Delta Lake

Paperback, English, 2025

739 kr

Special-order item. Ships within 5-8 business days. Free shipping for members on purchases of at least 249 kr.

Jump-start your journey toward mastering open data architectural patterns by learning the fundamentals and applications of open table formats

Key Features

Build lakehouses with open table formats using compute engines such as Apache Spark, Flink, Trino, and Python
Optimize lakehouses with techniques such as pruning, partitioning, compaction, indexing, and clustering
Find out how to enable seamless integration, data management, and interoperability using Apache XTable
Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Engineering Lakehouses with Open Table Formats provides detailed insights into lakehouse concepts, and dives deep into the practical implementation of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake.

You’ll explore the internals of a table format and learn in detail about the transactional capabilities of lakehouses. You’ll also get hands-on with each table format through exercises using popular compute engines, such as Apache Spark, Flink, Trino, and Python-based tools. The book addresses advanced topics, including performance optimization techniques and interoperability among different formats, equipping you to build production-ready lakehouses.

With step-by-step explanations, you’ll get to grips with the key components of lakehouse architecture and learn how to build, maintain, and optimize them. By the end of this book, you’ll be proficient in evaluating and implementing open table formats, optimizing lakehouse performance, and applying these concepts to real-world scenarios, ensuring you make informed decisions in selecting the right architecture for your organization’s data needs.

What you will learn

Explore lakehouse fundamentals, such as table formats, file formats, compute engines, and catalogs
Gain a complete understanding of data lifecycle management in lakehouses
Learn how to systematically evaluate and choose the right lakehouse table format
Optimize performance with sorting, clustering, and indexing techniques
Use open table format data with ML frameworks like TensorFlow and MLflow
Interoperate across different table formats with Apache XTable and UniForm
Secure your lakehouse with access controls and ensure regulatory compliance

Who this book is for

This book is for data engineers, software engineers, and data architects who want to deepen their understanding of open table formats, such as Apache Iceberg, Apache Hudi, and Delta Lake, and see how they are used to build lakehouses. It is also valuable for professionals working with traditional data warehouses, relational databases, and data lakes who wish to transition to an open data architectural pattern. Basic knowledge of databases, Python, Apache Spark, Java, and SQL is recommended for a smooth learning experience.

Product information

Publication date: 2025-12-26
Dimensions: 191 x 235 x 22 mm
Weight: 773 g
Format: Paperback
Language: English
Number of pages: 416
Publisher: Packt Publishing Limited
Contributor: Chao Sun
ISBN: 9781836207238

Belongs to the following categories

Databases in Computing and IT
Networking and communication in Computing and IT
Human-computer interaction in Computing and IT

Dipankar Mazumdar is currently the Director of Developer Advocacy at Cloudera, where he leads global developer initiatives focused on lakehouse architectures and generative AI. Previously, he held developer advocacy roles at Dremio, Onehouse, and Qlik, contributing to open source projects such as Apache Iceberg, Apache Hudi, and XTable. For most of his career, Dipankar has worked at the intersection of data engineering and AI. He has also contributed to O'Reilly's Apache Iceberg: The Definitive Guide and has spoken at numerous conferences, including Databricks Data + AI, Netflix Engineering, ApacheCon, Scale By the Bay, and Data Day Texas.

Vinoth Govindarajan is a seasoned data expert and staff software engineer at Apple Inc., where he spearheads data platforms using open source technologies like Iceberg, Spark, Trino, and Flink. Before this, he worked on designing incremental ETL frameworks for real-time data processing at Uber. He is a dedicated contributor to the open source community in projects such as Apache Hudi and dbt-spark. As a thought leader, Vinoth has shared his expertise through speaking engagements at conferences such as dbt Coalesce and Hudi OSS community meet-ups. He has published numerous blogs on building open lakehouses. Holding a bachelor's degree in information technology, Vinoth has also authored multiple research papers published in journals like IEEE.

Table of Contents

Open Data Lakehouse: A New Architectural Paradigm
Transactional Capabilities of the Lakehouse
Apache Iceberg Deep Dive
Apache Hudi Deep Dive
Delta Lake Deep Dive
Catalog and Metadata Management
Interoperability in Lakehouses
Performance Optimization and Tuning in a Lakehouse
Data Governance and Security in Lakehouses
Evaluating and Selecting Open Table Formats
Real-World Applications and Learnings