
Modeling

Data is the lifeblood of any scientist, and the selection of data providers is critical when developing or evaluating any statistical inference or machine learning algorithm.

What is a model?

We briefly introduced the concept of a model in the Model categorization section in Chapter 1, Getting Started .

What constitutes a model? Wikipedia provides a reasonably good definition of a model as understood by scientists [2:1]:

A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

Models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate and gain intuition about the entity, phenomenon or process being represented.

In statistics and probability theory, a model describes the data one might observe from a system, capturing any form of uncertainty and noise. A model allows us to infer rules, make predictions, and learn from data.

A model is composed of features, also known as attributes or variables, and a set of relations between those features. For instance, the model represented by the function f(x,y) = x.sin(2y) has two features, x and y, and a relation f. Those two features are assumed to be independent. If the model is subject to a constraint, such as f(x, y) < 20, for example, then the two features are no longer independent.
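The model above can be written directly in Scala; the following is a minimal sketch in which the constraint is expressed as a partial function (the names are illustrative):

```scala
// The relation f over the two independent features x and y
val f: (Double, Double) => Double = (x, y) => x * math.sin(2.0 * y)

// The constraint f(x, y) < 20 couples the two features; model the
// constrained relation as a partial function defined only where the
// constraint holds.
val constrained: PartialFunction[(Double, Double), Double] = {
  case (x, y) if f(x, y) < 20.0 => f(x, y)
}

println(f(2.0, math.Pi / 4))                            // 2.sin(Pi/2) = 2.0
println(constrained.isDefinedAt((100.0, math.Pi / 4)))  // false: 100 >= 20
```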

An astute Scala programmer would associate a model with a monoid whose underlying set is the set of observations and whose operator is the function implementing the model.
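This association can be sketched in Scala with a minimal Monoid trait whose elements are observations, here represented as vectors of Double, and whose operator aggregates them (the names and the aggregation operator are illustrative):

```scala
// Minimal monoid: an associative binary operator with an identity element
trait Monoid[T] {
  def zero: T
  def op(a: T, b: T): T
}

// Observations as vectors of feature values
type Obs = Vector[Double]

// Illustrative monoid over observations: element-wise aggregation,
// with the empty vector as identity
val obsMonoid: Monoid[Obs] = new Monoid[Obs] {
  val zero: Obs = Vector.empty
  def op(a: Obs, b: Obs): Obs =
    if (a.isEmpty) b
    else if (b.isEmpty) a
    else a.zip(b).map { case (x, y) => x + y }
}
```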

Models come in a variety of shapes and forms:

  • Parametric: This consists of functions and equations (for example, y = sin(2t+w))
  • Differential: This consists of ordinary and partial differential equations (for example, dy = 2x.dx)
  • Probabilistic: This consists of probability distributions (for example, the Poisson distribution p(x|λ) = exp(x.log λ – λ)/x!)
  • Graphical: This consists of graphs that abstract out the conditional independence between variables (for example, p(x,y|c) = p(x|c).p(y|c))
  • Directed graphs: This consists of temporal and spatial relationships (for example, a scheduler)
  • Numerical method: This consists of computational methods such as finite differences, finite elements, or Newton-Raphson
  • Chemistry: This consists of formulas and components (for example, H2O, 2Fe + 3Cl2 = 2FeCl3)
  • Taxonomy: This consists of a semantic definition and a relationship of concepts (for example, APG/Eudicots/Rosids/Huaceae/Malvales)
  • Grammar and lexicon: This consists of a syntactic representation of documents (for example, Scala programming language)
  • Inference logic: This consists of rules (for example, IF (stock vol > 1.5 * average) AND (rsi > 80) THEN …)
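The last category, an inference rule, can be sketched in Scala as a simple predicate over a market snapshot (the case class and its field names are illustrative, not from any particular library):

```scala
// Hypothetical market snapshot; the field names are illustrative
case class Stock(volume: Double, avgVolume: Double, rsi: Double)

// IF (stock vol > 1.5 * average) AND (rsi > 80) THEN flag as overbought
def isOverbought(s: Stock): Boolean =
  s.volume > 1.5 * s.avgVolume && s.rsi > 80.0

println(isOverbought(Stock(volume = 200.0, avgVolume = 100.0, rsi = 85.0)))
```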

Model versus design

The confusion between model and design is quite common in computer science, the reason being that these terms have different meanings for different people depending on the subject. The following metaphors should help with your understanding of these two concepts:

  • Modeling: This is describing something you know. A model makes an assumption that becomes an assertion if proven correct (for example, the US population p increases by 1.2% a year: dp/dt = 0.012.p).
  • Designing: This is manipulating representations of things you don't know. Designing can be regarded as the exploration phase of modeling (for example, which features contribute to the growth of the US population? Birth rate? Immigration? Economic conditions? Social policies?).

Selecting features

The selection of a model's features is the process of discovering and documenting the minimum set of variables required to build the model. Scientists assume that data contains many redundant or irrelevant features: redundant features do not provide information beyond that already conveyed by the selected features, and irrelevant features provide no useful information.

Feature selection consists of two consecutive steps:

  1. Search for new feature subsets.
  2. Evaluate these feature subsets using a scoring mechanism.

The process of evaluating each possible subset of features to find the one that maximizes the objective function or minimizes the error rate is computationally intractable for large datasets: a model with n features requires 2^n – 1 evaluations!
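The two steps above can be sketched as a naive exhaustive search in Scala; the scoring function is a placeholder for any objective (the toy score below is purely illustrative), and the enumeration makes the 2^n – 1 cost explicit:

```scala
// Exhaustive feature selection: enumerate every non-empty subset of
// features and keep the one that maximizes an arbitrary scoring function.
// With n features this performs 2^n - 1 evaluations of score.
def selectFeatures[T](features: List[T])(score: Set[T] => Double): Set[T] = {
  val candidates = for {
    k <- 1 to features.size
    subset <- features.combinations(k)
  } yield subset.toSet
  candidates.maxBy(score)
}

// Toy score (illustrative): reward subsets containing "age", penalize size
val best = selectFeatures(List("age", "income", "zip")) { s =>
  (if (s("age")) 10.0 else 0.0) - s.size
}
println(best)  // Set(age)
```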

Extracting features

An observation is a set of indirect measurements of hidden, also known as latent, variables, which may be noisy or contain a high degree of correlation and redundancies. Using raw observations in a classification task would very likely produce inaccurate results. Using all features in each observation also incurs a high computation cost.

The purpose of feature extraction is to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features. The features are extracted by transforming the original set of observations into a smaller set at the risk of losing some vital information embedded in the original set.
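A simple instance of such a transformation is a linear projection: each n-dimensional observation is mapped onto k < n basis vectors, producing k extracted features. The sketch below hard-codes the basis for illustration; in practice the basis would be computed from the data, for example by principal components analysis:

```scala
// Feature extraction as a linear projection: each observation of
// dimension n is reduced to k < n extracted features, one per basis
// vector (dot product of the observation with that vector).
type Obs = Vector[Double]

def project(basis: Seq[Obs])(x: Obs): Obs =
  basis.map(b => b.zip(x).map { case (u, v) => u * v }.sum).toVector

// Illustrative basis: 3 raw features reduced to 2 extracted features
val basis = Seq(Vector(1.0, 1.0, 0.0), Vector(0.0, 0.0, 1.0))
val reduced = project(basis)(Vector(2.0, 3.0, 4.0))
println(reduced)  // Vector(5.0, 4.0)
```

The reduction is lossy by construction: any component of the observation orthogonal to the basis is discarded, which is the "vital information at risk" mentioned above.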
