So, what is Nokogiri?

Nokogiri (http://nokogiri.org/) is the most popular open source Ruby gem for HTML and XML parsing. It parses HTML and XML documents into node sets and allows for searching with CSS3 and XPath selectors. It may also be used to construct new HTML and XML objects.

The Nokogiri homepage is shown in the following screenshot:

Nokogiri is fast and efficient. It combines the raw power of the native C parser libxml2 (http://www.xmlsoft.org/) with the intuitive parsing API of Hpricot (https://github.com/hpricot/hpricot).

The primary use case for a parsing library is data scraping. Data scraping is the process of extracting data intended for humans and structuring it for input into another program. Data by itself is meaningless without structure. Software imposes a rigid structure on data, and that structure is referred to as a format.

The same can be said of spoken language. We do not yell out random sounds and expect them to have meaning. We use words to form sentences, and sentences to convey meaning. This is our format. It is a loose structure. You could learn ten words in a foreign language, combine those with a few hand gestures, and add in a little amateur acting to convey fairly advanced concepts to people who don't speak your native tongue. This interpretive prowess is not shared by computers. Computer communication must follow protocols; fail to follow the protocol and no communication will take place.

The goal here is to bridge the two. Take the data intended for humans, get rid of the superfluous, and parse it into a structured data format for a computer. Data intended for humans is inherently fickle, as its structure frequently changes. Data scraping should be used as a last resort and is generally appropriate in two scenarios: interfacing systems with incompatible data formats, and third-party sources lacking an API. If you aren't solving one of these two problems, you probably shouldn't be scraping.

An example of this is the most common scrape and parse use case in tutorials on the Internet: Amazon price searching. The scenario is this: you have a database of products and you want up-to-date pricing information. The tutorials inevitably lead you through the process of scraping and parsing Amazon's search results to extract prices. The problem is that Amazon already provides all of this information, and more, through the Amazon Product Advertising API.

It is important to remember that you are using someone else's server resources when scraping. This is why the preferred method of accessing information should always be a developer-approved API. An API will generally provide faster, cleaner, and more direct access to data without placing an undue toll on the provider's servers.

A wealth of information sits waiting on the Internet. Only a small fraction is made easily accessible to developers via APIs. Nokogiri bridges that gap with its slick, fast HTML and XML parsing engine bundled in an easy-to-use Ruby gem.
