- Hands-On Data Science with SQL Server 2017
- Marek Chmel, Vladimír Mužný
Installing and configuring
PolyBase can be installed as part of the SQL Server installation. You can choose whether PolyBase will be configured as a scale-out group. If you would like to use a scale-out group for maximum performance and throughput, you need to make sure that the following requirements are met:
- All nodes are in the same domain
- All nodes use the same account and password for the PolyBase service
- All nodes can communicate over the network
- All nodes are running the same SQL Server version
To install PolyBase, you will need a Java Runtime Environment (JRE) available on your SQL Server, which is not part of the installation media. During setup, you'll be provided with a link to download the JRE installation files.
You can confirm that PolyBase is installed on your server with the following query, which returns 1 when the feature is present and 0 when it is not:
SELECT SERVERPROPERTY('IsPolybaseInstalled') AS IsPolybaseInstalled;
Also, when PolyBase is installed, you'll see three new databases available on your SQL Server:
- DWConfiguration
- DWDiagnostics
- DWQueue
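If you want to double-check that these databases are present, a simple query against sys.databases will do (a minimal sketch; the database names come from the list above):
-- List the PolyBase support databases
SELECT name
FROM sys.databases
WHERE name IN ('DWConfiguration', 'DWDiagnostics', 'DWQueue');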
Once you have installed PolyBase, you need to enable the PolyBase feature in SQL Server by running the sp_configure stored procedure with the proper parameters:
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 1;
GO
RECONFIGURE
GO
For the change to take effect, you also need to restart SQL Server, which restarts the PolyBase Engine and PolyBase Data Movement services as well.
The config value passed to this procedure determines which external data sources are available. As of now, there are eight possible values, ranging from 0 to 7. All available config values and their corresponding data sources are as follows:
- Option 0: Disable Hadoop connectivity
- Option 1: Hortonworks HDP 1.3 on Windows Server, and Azure blob storage (WASB[S])
- Option 2: Hortonworks HDP 1.3 on Linux
- Option 3: Cloudera CDH 4.3 on Linux
- Option 4: Hortonworks HDP 2.0 on Windows Server, and Azure blob storage (WASB[S])
- Option 5: Hortonworks HDP 2.0 on Linux
- Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, 5.5, 5.9, 5.10, 5.11, 5.12, and 5.13 on Linux
- Option 7: Hortonworks 2.1, 2.2, 2.3, 2.4, 2.5, and 2.6 on Linux, Hortonworks 2.1, 2.2, and 2.3 on Windows Server, and Azure blob storage (WASB[S])
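To verify which connectivity option is currently active, you can query the sys.configurations catalog view; the value_in_use column reflects the running value:
-- Check the currently configured Hadoop connectivity option
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';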
Once you have enabled Hadoop connectivity, you need to create several more objects inside your user database. When we're using PolyBase, we're connecting to a data source that usually stores numerous files containing the required information. When you want to access such files from SQL Server via the PolyBase feature, you have to define the following:
- An external data source
- An external data file format
- An external table
The external data source defines the location of the source together with the access protocol, such as hdfs or wasb. The data source can be of several different types; we will most frequently work with the Hadoop or blob storage types. Other available types are RDBMS and SHARD_MAP_MANAGER, where the first is used for cross-database queries with Elastic Query on Azure SQL Database, and the second for sharding on Azure SQL Database.
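For illustration, a data source of the RDBMS type used by Elastic Query on Azure SQL Database could look like the following sketch; the server, database, and credential names here are placeholders, and the database-scoped credential is assumed to exist already:
-- Hypothetical RDBMS data source for Elastic Query on Azure SQL Database
CREATE EXTERNAL DATA SOURCE RemoteCatalog
WITH (
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'RemoteDb',
    CREDENTIAL = RemoteDbCredential
);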
To create a data source for a Hadoop file system running on the internal network with a defined IP address, you can use the following query:
CREATE EXTERNAL DATA SOURCE Hadoop
WITH (
    TYPE = HADOOP,                          -- Hadoop/HDFS data source type
    LOCATION = 'hdfs://192.168.1.100:8050'  -- name node address and port
);
This example presumes that you have HDFS storage running at the IP address 192.168.1.100, configured with the default port, 8050.
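Similarly, if your files lived in Azure blob storage instead, the data source would use the wasbs:// protocol together with a database-scoped credential. A minimal sketch, assuming a database master key already exists and using placeholder names for the storage account, container, and key:
-- Hypothetical credential holding the storage account key
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- Hypothetical data source pointing at a blob container
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);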
Once we have defined the external data source, we need to define the format of the data stored in that data source. This is handled via the definition of an EXTERNAL FILE FORMAT. This file format has several possible types:
- Delimited text: Simple text files with a predefined delimiter
- RCFile: Record columnar file format used with Hive
- ORC: Optimized row columnar file, which is commonly used with Hadoop and offers reasonable performance and compression
- Parquet: A columnar binary format widely used across the Hadoop ecosystem
To configure the file format as a plain text file, where columns are delimited with the | character, we can use the following code:
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);
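FORMAT_OPTIONS accepts more than just the field terminator; you can also set a string delimiter and a date format, and the file format itself can declare a compression codec. A sketch of such a definition follows (the options shown are standard CREATE EXTERNAL FILE FORMAT arguments; the name CompressedTextFile is illustrative):
CREATE EXTERNAL FILE FORMAT CompressedTextFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',      -- column delimiter
        STRING_DELIMITER = '"',      -- quotes around string values
        DATE_FORMAT = 'yyyy-MM-dd',  -- how date columns are encoded in the files
        USE_TYPE_DEFAULT = TRUE      -- replace missing values with column type defaults
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);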
Once the data source and the structure of the file are defined, we can create an external table in the database. This external table defines the columns for the data stored in the files and works as a metadata wrapper for accessing the external data stored in the HDFS filesystem. We can either query that external data directly, or load it into new tables in SQL Server. We will work with a sample file containing sales records separated by the | character.
To create the table, we will use the following code:
CREATE EXTERNAL TABLE dbo.FactResellerSalesArchiveExternal (
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[DueDateKey] [int] NOT NULL,
[ShipDateKey] [int] NOT NULL,
[ResellerKey] [int] NOT NULL,
[EmployeeKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[SalesTerritoryKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [tinyint] NOT NULL,
[RevisionNumber] [tinyint] NULL,
[OrderQuantity] [smallint] NULL,
[UnitPrice] [money] NULL,
[ExtendedAmount] [money] NULL,
[UnitPriceDiscountPct] [float] NULL,
[DiscountAmount] [float] NULL,
[ProductStandardCost] [money] NULL,
[TotalProductCost] [money] NULL,
[SalesAmount] [money] NULL,
[TaxAmt] [money] NULL,
[Freight] [money] NULL,
[CarrierTrackingNumber] [nvarchar](25) NULL,
[CustomerPONumber] [nvarchar](25) NULL,
[OrderDate] [datetime] NULL,
[DueDate] [datetime] NULL,
[ShipDate] [datetime] NULL
)
WITH (
LOCATION='/SalesData/',
DATA_SOURCE=Hadoop,
FILE_FORMAT=TextFile
);
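The WITH clause can also include reject options, which control how many rows that fail type conversion are tolerated before the query fails. A minimal sketch with a hypothetical two-column table:
-- Hypothetical external table demonstrating reject options
CREATE EXTERNAL TABLE dbo.SalesLinesExternal (
    [ProductKey] [int] NOT NULL,
    [SalesAmount] [money] NULL
)
WITH (
    LOCATION = '/SalesData/',
    DATA_SOURCE = Hadoop,
    FILE_FORMAT = TextFile,
    REJECT_TYPE = VALUE,  -- count rejected rows as an absolute number
    REJECT_VALUE = 10     -- fail the query once 10 rows cannot be converted
);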
Once the table is defined, we can query it as if it were a regular table in our database, and the PolyBase service will access the external data and handle the data movement for us automatically.
So, a query accessing the data stored in the text files on the Hadoop HDFS storage would look like this:
SELECT * FROM dbo.FactResellerSalesArchiveExternal;
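As mentioned earlier, we can also load the external data into a new local table instead of querying it in place; a simple way to do that is SELECT ... INTO (the target table name here is illustrative):
-- Materialize the external data into a new local table
SELECT *
INTO dbo.FactResellerSalesArchive
FROM dbo.FactResellerSalesArchiveExternal;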
If we're working in a small environment, we can use just one node to access the external data; such a configuration combines the compute and head node roles on a single server. In complex scenarios, we can use the scale-out feature of PolyBase to distribute the workload among the compute nodes and run the query in parallel. Such a scale-out group requires a more complex configuration and the Enterprise edition on the head node.
The head node is a SQL Server instance that accepts the queries from the application or the user and distributes the work to the data movement service on the compute nodes for execution. Once processing is complete, all the results are sent back to the head node and returned to the client. In this manner, SQL Server can be used to process large datasets and access big data sources for complex processing.
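Joining a compute node to an existing scale-out group is done on the compute node with the sp_polybase_join_group procedure. A sketch, assuming HEADNODE is the head node's machine name, 16450 is the default data movement control channel port, and MSSQLSERVER is the head node's instance name:
-- Run on the compute node to join the head node's scale-out group
EXEC sp_polybase_join_group
    @head_node_address = N'HEADNODE',
    @dms_control_channel_port = 16450,
    @head_node_sql_server_instance_name = N'MSSQLSERVER';
After running it, the PolyBase services on the compute node need to be restarted for the node to appear in the group.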
