
Row and column storage in SAP HANA

Relational databases typically use row-based data storage. SAP HANA supports both row-based and column-based data storage:

  • The row storage: Records are stored as a sequence of rows
  • The column storage: The entries of a column are stored in contiguous memory locations

Before getting into an SAP HANA-specific discussion, let's try to understand how column storage differs from row storage. Column-oriented database systems (in our case, SAP HANA) perform better than traditional row-oriented database systems on analytical tasks, in areas such as data warehouses, decision support, predictive analysis, and business intelligence applications.

The major reason behind this performance difference in these areas is that column stores are more I/O efficient for read-only queries as they only have to read the attributes accessed by a query from the disk or memory.

Let's see a few factors that optimize performance in the column storage:

  • Compression: Data stored in columns is more compressible than data stored in rows. Compression algorithms perform better on data with low information entropy (high data value locality). A good example is an employee table (which includes the employee name, number, and e-mail address). Storing data in columns allows all of the names to be stored together, all of the employee numbers together, and so on. Certainly, employee numbers are more similar to each other than to surrounding text fields such as names. Further, if the data is sorted by one of the columns, that column will be super-compressible.
  • Block iteration and support for parallelism: In the row storage, while processing a series of tuples, the system first iterates through each tuple, and then extracts the needed attributes from these tuples through a tuple representation interface. In column storage, by contrast, blocks of values from the same column are sent to an operator in a single function call. No attribute extraction is needed, and if the column is fixed-width, these values can be iterated through directly as an array. Operating on data as an array not only minimizes per-tuple overhead, but also exploits the potential for parallelism on CPUs.
  • Late materialization: In a column store, information about a logical entity is stored in multiple locations on disk, whereas in a row store such information is usually collocated in a single row of a table. However, most queries access more than one attribute from a particular entity. So, at some point in most query plans, data from multiple columns must be combined together into rows of information about an entity. Consequently, this join-like materialization of tuples (also called "tuple construction") is an extremely common operation in a column store.

    There are a couple of advantages of late materialization:

    • Selection and aggregation operators tend to render the construction of some tuples unnecessary (if the executor waits long enough before constructing a tuple, it might be able to avoid constructing it altogether)
    • If data is compressed using a column-oriented compression method, it must be decompressed before its values can be combined with values from other columns; delaying materialization lets operators work on compressed data for longer
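The compression point above can be sketched in a few lines of Python. This is a simplified illustration (not SAP HANA's actual algorithms): dictionary encoding replaces repeated column values with small integer codes, and run-length encoding collapses runs in a sorted column. All names here are invented for the example.

```python
from itertools import groupby

# A small "employee" table, first in row layout.
rows = [
    ("Alice", "Sales"), ("Bob", "Sales"), ("Carol", "Sales"),
    ("Dave", "HR"), ("Erin", "HR"), ("Frank", "IT"),
]

# Column layout: one list per attribute, so similar values sit together.
names = [r[0] for r in rows]
depts = [r[1] for r in rows]

def dictionary_encode(column):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = sorted(set(column))
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in column]

def run_length_encode(column):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(column)]

dept_dict, dept_codes = dictionary_encode(depts)
print(dept_dict)                        # ['HR', 'IT', 'Sales']
print(dept_codes)                       # [2, 2, 2, 0, 0, 1]
print(run_length_encode(dept_codes))    # [(2, 3), (0, 2), (1, 1)]
```

Because the department values cluster together in the column layout, the run-length output is far shorter than the original column; in a row layout the department values are interleaved with names, so neither technique works nearly as well.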

Let's take a diagrammatic approach to see how the row and column representations of a table look:

  • Logical representation of a table:
  • Traditionally, data is stored in a database as rows and can be represented as follows:
  • If we use column-based data storage, the representation of the same table looks as follows:

Having seen the row and column representations, we can clearly relate them to the previously mentioned points of compression, block iteration, parallel processing, and late materialization.
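The block-iteration point can also be made concrete with a small Python sketch (illustrative only; the table contents and function names are invented). A row scan must extract the wanted attribute from every tuple, whereas a column scan iterates a contiguous fixed-width array directly:

```python
from array import array

# Row layout: a list of tuples; every scan touches whole records.
row_table = [(i, i * 10, f"name{i}") for i in range(1000)]

# Column layout: the scanned attribute is one contiguous fixed-width array.
salary_col = array("i", (i * 10 for i in range(1000)))

def sum_salaries_rows(table):
    # Per-tuple attribute extraction: index into each tuple to get salary.
    return sum(t[1] for t in table)

def sum_salaries_column(col):
    # Block iteration: operate on the whole column as a single array;
    # no per-tuple extraction, and easily vectorizable on real hardware.
    return sum(col)

assert sum_salaries_rows(row_table) == sum_salaries_column(salary_col)
print(sum_salaries_column(salary_col))   # 4995000
```

Both functions compute the same aggregate, but the column version never touches the name or ID attributes at all, which is exactly the I/O and CPU saving described above.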

Choosing the right table

It is up to a data modeler in the SAP HANA database to specify whether a table is to be stored column-wise or row-wise. One can also alter an existing column-based table to row-based, and vice versa.

The following figure shows some of the scenarios for using the row and column store:


As illustrated in the figure, each store has an advantage in different scenarios, and SAP HANA lets you choose, table by table, whether to store your data row-wise or column-wise as per your need. The SAP HANA database also allows a column table to be joined with a row table.
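Joining a column table with a row table can be pictured with a toy Python hash join (a conceptual sketch, not HANA's executor; the table names and data are invented). The row table is a list of tuples, the column table a dict of per-attribute lists, and the joined attribute is materialized only when a match is found:

```python
# Row-store table: one tuple per record (order_id, product_id, quantity).
orders_rows = [
    (1, "A-100", 2), (2, "B-200", 1), (3, "A-100", 5),
]

# Column-store table: one list per attribute.
products_cols = {
    "product_id": ["A-100", "B-200"],
    "price":      [9.99,    25.0],
}

def join_row_with_column(rows, cols, row_key_idx, col_key, payload_col):
    # Build a hash index on the column table's join key.
    index = {k: i for i, k in enumerate(cols[col_key])}
    result = []
    for row in rows:
        pos = index.get(row[row_key_idx])
        if pos is not None:
            # Late materialization: the price is pulled out of its
            # column only for rows that actually join.
            result.append(row + (cols[payload_col][pos],))
    return result

print(join_row_with_column(orders_rows, products_cols, 1,
                           "product_id", "price"))
# [(1, 'A-100', 2, 9.99), (2, 'B-200', 1, 25.0), (3, 'A-100', 5, 9.99)]
```

The point of the sketch is that the two storage layouts meet naturally at the join: the row side is probed tuple by tuple, while the column side is accessed one attribute at a time.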

Other features of SAP HANA that we should also be aware of are:

  • Partitioning: In the SAP HANA database, tables can be partitioned horizontally into disjunctive subtables or partitions. Partitioning supports the creation of very large tables by decomposing them into smaller manageable chunks.

    Two of the most widely used cases for partitioning are as follows:

    • Parallelization: Operations are parallelized by using several execution threads per table.
    • Load balancing: Using partitioning, the individual partitions might be distributed over the landscape. This way, a query on a table is not processed by a single server but by all the servers that host partitions that are relevant for processing.
  • Insert only on Delta: Updating and inserting data into a sorted column-store table is expensive, because the sort order of the whole table would have to be reorganized on every change. SAP has tackled this challenge by separating these tables into main stores (read-optimized, sorted columns) and delta stores (write-optimized, nonsorted columns or rows).

There is a regular database activity that merges the Delta stores into the main store. This activity is called Delta Merge.


Insert on Delta - Process

The merge process can be understood in three phases: the operations performed before the merge, during the merge, and after the merge (the steps are illustrated in the preceding diagram).

  1. Before the merge operation: All write operations go to storage Delta1, and read operations read from storages Main1 and Delta1.
  2. During the merge operation: While the merge operation is running, all changes go to the second delta storage, Delta2. Read operations read from the original main storage, Main1, and from both delta storages, Delta1 and Delta2. Uncommitted changes in Delta1 are copied to Delta2. The content of Main1 and the committed entries in Delta1 are merged into the new main storage, Main2.
  3. After the merge operation: Once the merge process is complete, the Main1 and Delta1 storages are deleted.
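The main/delta design and the merge can be modeled with a toy Python class (a deliberately simplified sketch; real HANA delta merges run asynchronously, column by column, with the Delta2 hand-off described above, none of which is modeled here):

```python
class ColumnTable:
    """Toy model of a main/delta column store (illustration only)."""

    def __init__(self):
        self.main = []    # read-optimized: kept sorted
        self.delta = []   # write-optimized: appended in arrival order

    def insert(self, value):
        # Insert only on delta: writes never touch the sorted main store.
        self.delta.append(value)

    def read_all(self):
        # Reads combine the sorted main store with the unsorted delta.
        return sorted(self.main + self.delta)

    def delta_merge(self):
        # Merge the delta into a new sorted main store, then discard it
        # (standing in for "Main1 + Delta1 -> Main2, delete Delta1").
        self.main = sorted(self.main + self.delta)
        self.delta = []

t = ColumnTable()
for v in [5, 1, 4]:
    t.insert(v)          # all writes land in the delta store
t.delta_merge()          # sorted main store rebuilt; delta emptied
t.insert(3)              # new writes go to the (fresh) delta again
print(t.main, t.delta)   # [1, 4, 5] [3]
print(t.read_all())      # [1, 3, 4, 5]
```

The sketch shows why reads stay correct throughout: a query always sees the union of the sorted main store and whatever has accumulated in the delta, before and after the merge.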