- Hands-On Data Science with SQL Server 2017
- Marek Chmel, Vladimír Mužný
Installing and configuring
PolyBase can be installed as part of the SQL Server installation. You can choose whether PolyBase will be configured as a scale-out group. If you would like to use a scale-out group for maximum performance and throughput, you need to make sure that the following requirements are met:
- All nodes are in the same domain
- All nodes use the same account and password for the PolyBase service
- All nodes can communicate over the network
- All nodes are running the same SQL Server version
To install PolyBase, you will need a Java Runtime Environment (JRE) available on your SQL Server, which is not part of the installation media. During setup, you'll be provided with a link to download the JRE installation files.
You can confirm that PolyBase is installed on your server with the following query, which returns 1 when the feature is present and 0 when it is not:
SELECT SERVERPROPERTY('IsPolybaseInstalled') AS IsPolybaseInstalled;
Also, when PolyBase is installed, you'll see three new databases available on your SQL Server:
- DWConfiguration
- DWDiagnostics
- DWQueue
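If you want to double-check that these databases are present, a simple query against sys.databases will do (a minimal sketch; the database names come from the list above):
-- List the PolyBase support databases
SELECT name
FROM sys.databases
WHERE name IN ('DWConfiguration', 'DWDiagnostics', 'DWQueue');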
Once you have installed PolyBase, you need to enable the PolyBase feature in SQL Server by running the sp_configure stored procedure with the proper parameters:
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 1;
GO
RECONFIGURE
GO
For the change to take effect, you also need to restart SQL Server, which restarts the PolyBase Engine and PolyBase Data Movement services as well.
The config value passed to this procedure determines which external data sources are available. As of now, there are eight possible values, ranging from 0 to 7. All available config values and their corresponding data sources are as follows:
- Option 0: Disable Hadoop connectivity
- Option 1: Hortonworks HDP 1.3 on Windows Server, and Azure blob storage (WASB[S])
- Option 2: Hortonworks HDP 1.3 on Linux
- Option 3: Cloudera CDH 4.3 on Linux
- Option 4: Hortonworks HDP 2.0 on Windows Server, and Azure blob storage (WASB[S])
- Option 5: Hortonworks HDP 2.0 on Linux
- Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, 5.5, 5.9, 5.10, 5.11, 5.12, and 5.13 on Linux
- Option 7: Hortonworks 2.1, 2.2, 2.3, 2.4, 2.5, and 2.6 on Linux, Hortonworks 2.1, 2.2, and 2.3 on Windows Server, and Azure blob storage (WASB[S])
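To verify which connectivity option is currently active, you can query the sys.configurations catalog view; the value_in_use column reflects the running value:
-- Check the currently configured Hadoop connectivity option
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';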
Once you have enabled Hadoop connectivity, you need to create several more objects inside your user database. When we're using PolyBase, we're connecting to a data source that usually stores numerous files containing the required information. When you want to access such files from SQL Server via the PolyBase feature, you have to define the following:
- An external data source
- An external data file format
- An external table
The external data source defines the location of the source together with the access protocol, such as hdfs or wasb. The data source can be of several different types; we will most frequently work with the Hadoop or blob storage types. Other available types are RDBMS and SHARD_MAP_MANAGER, where the first is used for cross-database queries with Elastic Query on Azure SQL Database, and the second for sharding on Azure SQL Database.
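For illustration, a data source of the RDBMS type used by Elastic Query on Azure SQL Database could look like the following sketch; the server, database, and credential names here are placeholders, and the database-scoped credential is assumed to exist already:
-- Hypothetical RDBMS data source for Elastic Query on Azure SQL Database
CREATE EXTERNAL DATA SOURCE RemoteCatalog
WITH (
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'RemoteDb',
    CREDENTIAL = RemoteDbCredential
);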
To create a data source for a Hadoop file system running on the internal network with a defined IP address, you can use the following query:
CREATE EXTERNAL DATA SOURCE Hadoop
WITH (
    TYPE = HADOOP,                          -- Hadoop/HDFS data source type
    LOCATION = 'hdfs://192.168.1.100:8050'  -- name node address and port
);
This example presumes that you have HDFS storage running at the IP address 192.168.1.100, configured with the default port, 8050.
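Similarly, if your files lived in Azure blob storage instead, the data source would use the wasbs:// protocol together with a database-scoped credential. A minimal sketch, assuming a database master key already exists and using placeholder names for the storage account, container, and key:
-- Hypothetical credential holding the storage account key
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- Hypothetical data source pointing at a blob container
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);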
Once we have defined the external data source, we need to define the format of the data stored in that data source. This is handled via the definition of an EXTERNAL FILE FORMAT. This file format has several possible types:
- Delimited text: Simple text files with a predefined delimiter
- RCFile: Record columnar file format used with Hive
- ORC: Optimized row columnar file, which is commonly used with Hadoop and offers reasonable performance and compression
- Parquet: A columnar binary format widely used across the Hadoop ecosystem
To configure the file format as a plain text file, where columns are delimited with the | character, we can use the following code:
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);
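FORMAT_OPTIONS accepts more than just the field terminator; you can also set a string delimiter and a date format, and the file format itself can declare a compression codec. A sketch of such a definition follows (the options shown are standard CREATE EXTERNAL FILE FORMAT arguments; the name CompressedTextFile is illustrative):
CREATE EXTERNAL FILE FORMAT CompressedTextFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',      -- column delimiter
        STRING_DELIMITER = '"',      -- quotes around string values
        DATE_FORMAT = 'yyyy-MM-dd',  -- how date columns are encoded in the files
        USE_TYPE_DEFAULT = TRUE      -- replace missing values with column type defaults
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);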
Once the data source and the structure of the file are defined, we can create an external table in the database. This external table defines the columns for the data stored in the files and works as a metadata wrapper for accessing the external data stored in the HDFS filesystem. We can either query that external data directly, or load it into new tables in SQL Server. We will work with a sample file containing sales records separated by the | character.
To create the table, we will use the following code:
CREATE EXTERNAL TABLE dbo.FactResellerSalesArchiveExternal (
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[DueDateKey] [int] NOT NULL,
[ShipDateKey] [int] NOT NULL,
[ResellerKey] [int] NOT NULL,
[EmployeeKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[SalesTerritoryKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [tinyint] NOT NULL,
[RevisionNumber] [tinyint] NULL,
[OrderQuantity] [smallint] NULL,
[UnitPrice] [money] NULL,
[ExtendedAmount] [money] NULL,
[UnitPriceDiscountPct] [float] NULL,
[DiscountAmount] [float] NULL,
[ProductStandardCost] [money] NULL,
[TotalProductCost] [money] NULL,
[SalesAmount] [money] NULL,
[TaxAmt] [money] NULL,
[Freight] [money] NULL,
[CarrierTrackingNumber] [nvarchar](25) NULL,
[CustomerPONumber] [nvarchar](25) NULL,
[OrderDate] [datetime] NULL,
[DueDate] [datetime] NULL,
[ShipDate] [datetime] NULL
)
WITH (
LOCATION='/SalesData/',
DATA_SOURCE=Hadoop,
FILE_FORMAT=TextFile
);
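The WITH clause can also include reject options, which control how many rows that fail type conversion are tolerated before the query fails. A minimal sketch with a hypothetical two-column table:
-- Hypothetical external table demonstrating reject options
CREATE EXTERNAL TABLE dbo.SalesLinesExternal (
    [ProductKey] [int] NOT NULL,
    [SalesAmount] [money] NULL
)
WITH (
    LOCATION = '/SalesData/',
    DATA_SOURCE = Hadoop,
    FILE_FORMAT = TextFile,
    REJECT_TYPE = VALUE,  -- count rejected rows as an absolute number
    REJECT_VALUE = 10     -- fail the query once 10 rows cannot be converted
);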
Once the table is defined, we can query it as if it were a regular table in our database, and the PolyBase service will access the external data and handle the data movement for us automatically.
So, a query accessing the data stored in the text files on the Hadoop HDFS storage would look like this:
SELECT * FROM dbo.FactResellerSalesArchiveExternal;
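As mentioned earlier, we can also load the external data into a new local table instead of querying it in place; a simple way to do that is SELECT ... INTO (the target table name here is illustrative):
-- Materialize the external data into a new local table
SELECT *
INTO dbo.FactResellerSalesArchive
FROM dbo.FactResellerSalesArchiveExternal;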
If we're working in a small environment, we can use just one node to access the external data; such a configuration combines the compute and head node roles on a single server. In complex scenarios, we can use the scale-out feature of PolyBase to distribute the workload among the compute nodes and run the query in parallel. Such a scale-out group requires a more complex configuration and the Enterprise edition on the head node.
The head node is a SQL Server instance that accepts the queries from the application or the user and distributes the work to the data movement service on the compute nodes for execution. Once processing is complete, all the results are sent back to the head node and returned to the client. In this manner, SQL Server can be used to process large datasets and access big data sources for complex processing.
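Joining a compute node to an existing scale-out group is done on the compute node with the sp_polybase_join_group procedure. A sketch, assuming HEADNODE is the head node's machine name, 16450 is the default data movement control channel port, and MSSQLSERVER is the head node's instance name:
-- Run on the compute node to join the head node's scale-out group
EXEC sp_polybase_join_group
    @head_node_address = N'HEADNODE',
    @dms_control_channel_port = 16450,
    @head_node_sql_server_instance_name = N'MSSQLSERVER';
After running it, the PolyBase services on the compute node need to be restarted for the node to appear in the group.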
