- Splunk 7.x Quick Start Guide
- James H. Baxter
Data collection – data inputs
You can obtain an estimate of the daily ingestion volume you will need to handle by polling your application and other data source teams for this information; the findings will affect the number of nodes and the amount of disk storage involved in sizing the indexing tier of the Splunk solution.
You can collect this information with a spreadsheet or web form that is distributed to your application and other data source teams to complete and return to you for calculating an initial sizing; the same form can be used to submit requests to ingest new data inputs down the road. For application and web server log files, for example, the data fields collected should include the following (a sketch of such a record appears after this list):
- Environment: For example, there is typically a development environment, a test environment, and one or more production environments.
- Application name: This is the full name of the application for reference purposes.
- Application ID: There may be a short alphanumeric ID for each application in use at your company.
- Host: The DNS-resolvable (fully qualified domain name) hostname and/or IP address.
- OS: Windows, Linux, or other OS. This is helpful if you plan to install any of the Splunk apps to collect metrics from specific operating systems.
- Middleware: Any middleware used by this host, such as Apache Tomcat, IIS, JBoss, WebSphere, WebLogic, and so on. This is used to determine the Splunk sourcetype that will be applied for parsing the log file contents when ingesting the data.
- Log location path: This is the full path to the directory containing the log file(s) that will be ingested—this is needed for configuring the inputs on universal forwarders.
- Log filename: There may be more than one log file type and naming scheme in a logging directory; specify each naming scheme individually.
- Daily log size (MB): This can be obtained by inspecting a historical list of log file sizes (before compression and archiving) and calculating the average per-day logging volume. It is obviously important that this number is as accurate as possible.
- PCI/PII data in the log file: This (yes/no) field indicates the presence of payment card information (credit card numbers) or personally identifiable information (names, addresses, phone numbers, email addresses, and so on) in the log data. If such data is present, the affected fields need to be masked and/or the data stored in restricted-access indexes to prevent exposing confidential data to unauthorized personnel.
- Data retention period (days): The period of time the data should be stored. Typical periods are 7-30 days for dev/test, and 30-90 days, or even a year or longer, for production, depending on the nature of the data, business reporting needs, and/or any regulatory requirements. This affects indexer disk storage requirements, so it should be considered carefully.
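To make the collected fields concrete, here is a minimal sketch of one form record as a Python dataclass; the field names are illustrative choices for this example, not part of any Splunk schema:

```python
from dataclasses import dataclass

@dataclass
class DataInputRequest:
    """One row of the data inputs collection form (field names are illustrative)."""
    environment: str               # e.g. "dev", "test", or "prod"
    application_name: str          # full application name, for reference
    application_id: str            # short alphanumeric ID, if your company uses one
    host: str                      # DNS-resolvable FQDN and/or IP address
    os: str                        # "Windows", "Linux", or other
    middleware: str                # e.g. "Apache Tomcat"; helps determine the sourcetype
    log_location_path: str         # full path to the directory holding the log file(s)
    log_filename: str              # one naming scheme per record
    avg_daily_log_size_mb: float   # average per-day volume, before compression/archiving
    pci_pii_data: bool             # True => mask fields and/or use restricted indexes
    retention_days: int            # drives indexer disk storage requirements
```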
An example of a data inputs collection spreadsheet for application log files is depicted in Figure 2.1.
Note that this spreadsheet includes cells on the right-hand side that calculate the total daily ingestion volume and the total data volume for the specified retention period. These values are useful for calculating indexer sizing, and they help your user community develop an appreciation for the logging volume they impose on the Splunk environment (and, hopefully, log only what is truly needed). The formulas for these cells are as follows:
- Total daily ingestion volume MB = (Avg daily log size MB) × (# Hosts)
- Total data retention volume MB = (Total daily ingestion volume MB) × (Data retention days)
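As a quick sanity check, these two formulas can be expressed as small Python helpers (a sketch; the host count and log sizes in the example are made up):

```python
def total_daily_ingestion_mb(avg_daily_log_size_mb: float, num_hosts: int) -> float:
    # Total daily ingestion volume MB = (Avg daily log size MB) x (# Hosts)
    return avg_daily_log_size_mb * num_hosts

def total_retention_volume_mb(daily_ingestion_mb: float, retention_days: int) -> float:
    # Total data retention volume MB = (Total daily ingestion volume MB) x (Data retention days)
    return daily_ingestion_mb * retention_days

# Example: 10 hosts, each averaging 250 MB of logs per day, retained for 90 days
daily = total_daily_ingestion_mb(250.0, 10)   # 2,500 MB/day
total = total_retention_volume_mb(daily, 90)  # 225,000 MB (~220 GB)
print(f"{daily:,.0f} MB/day; {total:,.0f} MB over the retention period")
```

Summing these totals across all submitted forms gives the aggregate figure used when sizing the indexing tier.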