【数据中台】开源项目(5)-Amoro

news/2024/7/10 19:59:53 标签: 开源, 数据中台, Amoro

介绍

        Amoro is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and self-managed features for Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture。
        Amoro定位是一个搭建在 Apache Iceberg之上的流式湖仓服务,流式强调向实时能力的拓展,服务则强调管理、标准化度量,以及其他可以抽象到基础软件中的湖仓一体能力。
通过 Amoro,用户可以在 Flink、Spark、Trino 等引擎上实现更加优化的 CDC、流式更新、OLAP 等功能, 结合数据湖高效的离线处理能力,Arctic 能够服务于更多流批混用的场景;同时,Arctic 的结构自优化、并发冲突解决以及标准化的湖仓管理功能,将有效减少用户在数据湖管理和优化上的负担。
        开源地址: GitHub - NetEase/amoro: Amoro is a Lakehouse management system built on open data lake formats.

Amoro架构

The architecture of Amoro is as follows:
The core components of Amoro include:
  • AMS: Amoro Management Service provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all computing engines, which can also be combined with existing metadata services.
  • Plugins: Amoro provides a wide selection of external plugins to meet different scenarios.
  • Optimizers: The self-optimizing execution engine plugin asynchronously performs merging, sorting, deduplication, layout optimization, and other operations on all type table format tables.
  • Terminal: SQL command-line tools, provide various implementations like local Spark and Kyuubi.
  • LogStore: Provide millisecond to second level SLAs for real-time data processing based on message queues like Kafka and Pulsar.

支持的格式

Amoro can manage tables of different table formats, similar to how MySQL/ClickHouse can choose different storage engines. Amoro meets diverse user needs by using different table formats. Currently, Amoro supports three table formats:
  • Iceberg format: means using the native table format of the Apache Iceberg, which has all the features and characteristics of Iceberg.
  • Mixed-Iceberg format: built on top of Iceberg format, which can accelerate data processing using LogStore and provides more efficient query performance and streaming read capability in CDC scenarios.
  • Mixed-Hive format: has the same features as the Mixed-Iceberg tables but is compatible with a Hive table. Support upgrading Hive tables to Mixed-Hive tables, and allow Hive’s native read and write methods after upgrading.

支持的引擎

Iceberg format

Iceberg format tables use the engine integration method provided by the Iceberg community. For details, please refer to: Iceberg Docs.

Paimon format

Paimon format tables use the engine integration method provided by the Paimon community. For details, please refer to: Paimon Docs.

Mixed format

Amoro support multiple processing engines for Mixed format as below:
Processing Engine
Version
Batch Read
Batch Write
Batch Overwrite
Streaming Read
Streaming Write
Create Table
Alter Table
Flink
1.15.x, 1.16.x and 1.17.x
Spark
3.1, 3.2, 3.3
Hive
2.x, 3.x
Trino
406

应用场景

Self-managed streaming Lakehouse

Amoro makes it easier for users to handle the challenges of writing to a real-time data lake, such as ingesting append-only event logs or CDC data from databases. In these scenarios, the rapid increase of fragment and redundant files cannot be ignored. To address this issue, Amoro provides a pluggable streaming data self-optimizing mechanism that automatically compacts fragment files and removes expired data, ensuring high-quality table queries while reducing system costs.

Stream-and-batch-fused data pipeline

Whether in the AI or BI business field , the requirement for real-time analysis is becoming increasingly high. The traditional approach of using one streaming job to complete all data processing from the source to the end is no longer applicable. There is an increasing demand for layered construction of streaming data pipeline, and the traditional layered construction approach based on message queues can cause a inconsistency problem between the streaming and batch data processing. Building a unified stream-and-batch-fused pipeline based on new data lake formats is the future direction for solving these problems. Amoro fully leverages the characteristics of the new data lake table formats about unified streaming and batch processing, not only ensuring the quality of data in the streaming pileline but also enhancing critical features such as incremental reading for CDC data and streaming dimension table association, helping users to build a stream-and-batch-fused data pipeline.

Cloud-native Lakehouse

Currently, most data platforms and products are tightly coupled with their underlying infrastructure(such as the storage layer). The migration of infrastructure, such as switching to cloud-native OSS, may require extensive adaptation efforts or even be impossible. However, Amoro provides an infra-decoupled, lake-native architecture built on top of the infrastructure. This allows products based on Amoro to interact with the underlying infrastructure through a unified interface (Amoro Catalog service), protecting upper-layer products from the impact of infrastructure switch.

http://www.niftyadmin.cn/n/5233131.html

相关文章

经典文献阅读之--Traversability Analysis for Autonomous Driving...(Lidar复杂环境中的可通行分析)

0. 简介 对于自动驾驶来说,复杂环境的可通行是最需要关注的任务。《Traversability Analysis for Autonomous Driving in Complex Environment: A LiDAR-based Terrain Modeling Approach》一文提出了用激光雷达完成建图的工作,其可以输出稳定、完整和精…

LeetCode的几道题

一、捡石头 292 思路就是: 谁面对4块石头的时候,谁就输(因为每次就是1-3块石头,如果剩下4块石头,你怎么拿,我都能把剩下的拿走,所以你就要想尽办法让对面面对4块石头的倍数, 比如有…

_WorldSpaceLightPos0的含义 UNITY SHADER

_WorldSpaceLightPos0 为当前平行光的方向,方向是从光源到照射的方向。 因此,如果要算发现和平行光之间的夹角, 则需要首先将归一化的_WorldSpaceLightPos0去负数。这样才能继续去计算。 也就是: fixed3 reflectdirnormalize…

C++ | unique_ptr

unique_ptr 一个unique_ptr独占它所指向的对象。当unique_ptr被销毁时&#xff0c;它所指向的对象也被销毁。 初始化unique_ptr时只能使用直接初始化的方式&#xff0c;不能使用普通的拷贝或赋值操作。 unique_ptr<string> p1; //正确 unique_ptr<int> p2(…

学习TypeScrip3(接口和对象类型)

对象的类型 在typescript中&#xff0c;我们定义对象的方式要用关键字interface&#xff08;接口&#xff09;&#xff0c;我的理解是使用interface来定义一种约束&#xff0c;让数据的结构满足约束的格式。定义方式如下&#xff1a; 使用接口约束的时候不能多一个属性也不能少…

Python标准库:math库【侯小啾python领航班系列(十六)】

Python标准库:math库【侯小啾python领航班系列(十六)】 大家好,我是博主侯小啾, 🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ🌹꧔ꦿ…

Linux中的UDEV机制与守护进程

Linux中的UDEV守护进程 udev简介守护进程守护进程概念守护进程程序设计守护进程的应用守护进程和后台进程的区别 UDEV的配置文件自动挂载U盘 udev简介 udev是一个设备管理工具&#xff0c;udev以守护进程的形式运行&#xff0c;通过侦听内核发出来的uevent来管理/dev目录下的设…

一篇带你走进线性表之顺序表(C语言阐述)——逐行解释代码

目录哇 1. 引言- 顺序表在数据结构中的地位和作用- 概述本文将讨论的内容和结构 2. 顺序表的基本概念- 定义&#xff1a;什么是顺序表&#xff1f;- 结构&#xff1a;顺序表的内部结构和特点 3. 实现一个基本的顺序表***需要用到的头文件******定义顺序表的基本结构和属性*****…