
Installing and configuring

PolyBase can be installed as a part of SQL Server installation. You can choose whether PolyBase will be configured as a scale-out group. If you would like to use a scale-out group for maximum performance and throughput, you need to make sure that the following conditions are met:

  • All nodes are in the same domain
  • All nodes use the same account and password for the PolyBase service
  • All nodes can communicate over the network
  • All nodes are running the same SQL Server version

To install PolyBase, you will need a Java Runtime Environment available on your SQL Server, which is not part of the installation media. During setup, you'll be provided with a link to download the JRE installation files.

You can confirm that PolyBase is installed on your server with the following:

SELECT SERVERPROPERTY ('IsPolybaseInstalled') AS IsPolybaseInstalled; 

Also, when PolyBase is installed, you'll see three new databases available on your SQL Server:

  • DWConfiguration
  • DWDiagnostics
  • DWQueue
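
A quick way to confirm these databases exist (a simple check against sys.databases, using the names listed above) is the following query:

SELECT name FROM sys.databases
WHERE name IN ('DWConfiguration', 'DWDiagnostics', 'DWQueue');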

Once you have installed PolyBase, you need to enable the PolyBase feature in SQL Server by running the sp_configure stored procedure with the proper parameters:

sp_configure @configname = 'hadoop connectivity', @configvalue = 1; 
GO
RECONFIGURE
GO

The config value, passed as the parameter to this procedure, determines which external data sources will be available. As of now, there are eight possible values, ranging from 0 to 7. All available config values and the data sources they enable are as follows:

  • Option 0: Disable Hadoop connectivity
  • Option 1: Hortonworks HDP 1.3 on Windows Server, Azure blob storage (WASB[S])
  • Option 2: Hortonworks HDP 1.3 on Linux
  • Option 3: Cloudera CDH 4.3 on Linux
  • Option 4: Hortonworks HDP 2.0 on Windows Server, Azure blob storage (WASB[S])
  • Option 5: Hortonworks HDP 2.0 on Linux
  • Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, 5.5, 5.9, 5.10, 5.11, 5.12, and 5.13 on Linux
  • Option 7: Hortonworks 2.1, 2.2, 2.3, 2.4, 2.5, and 2.6 on Linux, Hortonworks 2.1, 2.2, and 2.3 on Windows Server, Azure blob storage (WASB[S])

Unlike many other configuration parameters that you can change with the sp_configure procedure, after configuring Hadoop connectivity you have to explicitly restart the SQL Server service; otherwise, the setting won't take effect. To update this configuration value, you have to be a member of the sysadmin fixed server role or have the ALTER SETTINGS server-level permission assigned to your login.
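
To verify which value is currently configured and which value is actually in use, you can query the standard sys.configurations catalog view:

SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';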

Once you have enabled Hadoop connectivity, you need to create several more objects inside your user database. When we're using PolyBase, we're connecting to a data source that usually stores numerous files containing the required information. When you want to access such files from SQL Server via the PolyBase feature, you have to define the following:

  • An external data source
  • An external data file format
  • An external table

The external data source defines the location of the source together with the access protocol, such as hdfs or wasb. The data source type can take several different values; we will most frequently work with the HADOOP or BLOB_STORAGE types. Other available types are RDBMS and SHARD_MAP_MANAGER, where the first is used for cross-database queries with Elastic Database Query on Azure SQL Database, and the second for sharding on Azure SQL Database.

To create a data source for a Hadoop file system running on the internal network with a defined IP address, you can use the following query:

CREATE EXTERNAL DATA SOURCE Hadoop
WITH (
TYPE = HADOOP,
LOCATION = 'hdfs://192.168.1.100:8050'
)

This example presumes that you have HDFS storage running at the IP address 192.168.1.100 and listening on the default port, 8050.

To get hands-on experience with Hadoop and hdfs, you can run a sandbox in a virtual machine, such as the Hortonworks Data Platform or Cloudera CDH, both of which are available as quick-start virtual machines for numerous virtualization platforms.
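
If your data lives in Azure blob storage rather than HDFS, the external data source additionally references a database scoped credential that stores the storage account key. The following is only a sketch; the credential name, container, and storage account (AzureStorageCredential, mycontainer, mystorageaccount) are placeholders you would replace with your own values:

-- A master key is required before a database scoped credential can be created
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- For storage key authentication, the identity string is not used; only the secret matters
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<azure storage account key>';

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);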

Once we have defined the external data source, we need to define the format of the data stored in such a data source. This is handled via the definition of the EXTERNAL FILE FORMAT. This file format has several possible types:

  • Delimited text: Simple text files with a predefined delimiter
  • RCfile
  • ORC: Optimized row columnar file, which is commonly used with Hadoop and offers reasonable performance and compression
  • Parquet

To configure the file format as a plain text file, where columns are delimited with the | character, we can use the following code:

CREATE EXTERNAL FILE FORMAT TextFile
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);
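
If the files are stored in the ORC format rather than as delimited text, the definition is even simpler. This is a minimal sketch; the format name OrcFile is arbitrary, and the Snappy codec shown is an assumption about how the files were compressed:

CREATE EXTERNAL FILE FORMAT OrcFile
WITH (
FORMAT_TYPE = ORC,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);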

Once the data source and the structure of the file are defined, we can create an external table in the database. This external table will define the columns for the data stored in the files and will act as a metadata wrapper for accessing the external data stored in the hdfs filesystem. We can either query that external data directly, or load the data into new tables in SQL Server. We will work with a sample file containing sales records separated by the | character.

To create the table, we will use the following code:

CREATE EXTERNAL TABLE dbo.FactResellerSalesArchiveExternal (
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[DueDateKey] [int] NOT NULL,
[ShipDateKey] [int] NOT NULL,
[ResellerKey] [int] NOT NULL,
[EmployeeKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[SalesTerritoryKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [tinyint] NOT NULL,
[RevisionNumber] [tinyint] NULL,
[OrderQuantity] [smallint] NULL,
[UnitPrice] [money] NULL,
[ExtendedAmount] [money] NULL,
[UnitPriceDiscountPct] [float] NULL,
[DiscountAmount] [float] NULL,
[ProductStandardCost] [money] NULL,
[TotalProductCost] [money] NULL,
[SalesAmount] [money] NULL,
[TaxAmt] [money] NULL,
[Freight] [money] NULL,
[CarrierTrackingNumber] [nvarchar](25) NULL,
[CustomerPONumber] [nvarchar](25) NULL,
[OrderDate] [datetime] NULL,
[DueDate] [datetime] NULL,
[ShipDate] [datetime] NULL
)
WITH (
LOCATION='/SalesData/',
DATA_SOURCE=Hadoop,
FILE_FORMAT=TextFile
);
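
After these statements have completed, you can verify that the external objects exist by querying the corresponding catalog views:

SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_file_formats;
SELECT * FROM sys.external_tables;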

Once the table is defined, we can query it as if it were a regular table in our database, and the PolyBase service will access the external data and handle the data movement for us automatically.

So, a query accessing the data stored in the text files on the Hadoop hdfs storage would look like this:

SELECT * FROM dbo.FactResellerSalesArchiveExternal
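
Since the external table behaves like any other table in a query, you can also materialize the external data into a local table, for example with SELECT ... INTO. This is only a sketch; the target table name dbo.FactResellerSalesArchive is an illustrative choice:

SELECT *
INTO dbo.FactResellerSalesArchive
FROM dbo.FactResellerSalesArchiveExternal;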

If we're working in a small environment, we can use just one node to access the external data, and such a configuration combines the compute and head node on a single server. In complex scenarios, we can use PolyBase to distribute the workload among multiple compute nodes and run queries in parallel. Such a scale-out group requires a more complex configuration and the Enterprise edition on the head node.

The head node is the SQL Server instance that accepts queries from the application or the user and distributes the work to the data movement service on the compute nodes for execution. Once processing is complete, the results are sent back to the head node and returned to the client. In this manner, SQL Server can be used to process large datasets and access big data sources for complex processing.
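
The nodes that participate in PolyBase processing, together with their roles, can be inspected through the PolyBase dynamic management views, for example:

SELECT compute_node_id, type, name, address
FROM sys.dm_exec_compute_nodes;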
