
Sparse data formats

This dataset is in a sparse format. Each row of the file can be thought of as a cell in a large feature matrix of the type used in previous chapters, where rows are users and columns are individual movies. The first column would be each user's review of the first movie, the second column would be each user's review of the second movie, and so on.

There are around 1,000 users and 1,700 movies in this dataset, which means that the full matrix would be quite large (about 1.7 million entries). Storing the whole matrix in memory could be a problem, and computing on it would be troublesome. However, most cells in this matrix are empty; that is, most users have not reviewed most movies. For example, there is no review of movie number 675 by user number 213, and the same holds for most other combinations of user and movie.

The format given here represents the full matrix, but in a more compact way. The first row indicates that user number 196 reviewed movie number 242, giving it a rating of 3 (out of 5) on December 4, 1997.
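As a minimal sketch of how one such row could be read, assume each line holds four tab-separated fields: user ID, movie ID, rating, and a Unix timestamp. The sample line, the `Rating` type, and the `parse_rating` helper are illustrative assumptions, not part of the dataset's own tooling.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    when: datetime


def parse_rating(line: str) -> Rating:
    """Parse one tab-separated line into a Rating (assumed field order)."""
    user_id, movie_id, rating, timestamp = line.strip().split("\t")
    # Assumption: the last field is seconds since the Unix epoch, read as UTC.
    when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), int(rating), when)


# Hypothetical sample line matching the row described in the text.
sample = "196\t242\t3\t881250949"
first = parse_rating(sample)
print(first.user_id, first.movie_id, first.rating, first.when.date())
```

Reading the timestamp as UTC is a choice made here for reproducibility; a different time zone could shift the date by a day for reviews made close to midnight.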

Any combination of user and movie that isn't in this dataset is assumed not to exist. This saves significant space compared with storing a large number of zeroes in memory. This type of format is called a sparse matrix format. As a rule of thumb, if you expect about 60 percent or more of your dataset to be empty or zero, a sparse format will take less space to store.
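One simple way to picture this idea is a dictionary keyed by (user, movie) pairs, where only existing reviews are stored and any other lookup reports a missing value. This is a sketch in plain Python rather than a dedicated sparse matrix library. The `ratings` entries and the `get_rating` helper are hypothetical, and the matrix dimensions are the rounded figures quoted above.

```python
# Only the reviews that exist are stored, keyed by (user_id, movie_id).
# Apart from the first entry, which mirrors the row described in the text,
# these values are invented for illustration.
ratings = {
    (196, 242): 3,
    (186, 302): 3,
    (22, 377): 1,
}

n_users, n_movies = 1000, 1700  # approximate sizes quoted in the text


def get_rating(user_id, movie_id):
    """Return the stored rating, or None when no review exists."""
    return ratings.get((user_id, movie_id))


dense_cells = n_users * n_movies   # cells a full matrix would need
stored_cells = len(ratings)        # cells the sparse form actually keeps

print(get_rating(196, 242))        # a stored review
print(get_rating(213, 675))        # a missing review, reported as None
print(dense_cells, stored_cells)
```

Returning `None` rather than zero keeps "no review" distinct from a genuinely low score, which matters when ratings themselves are small integers.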

When computing on sparse matrices, the focus is usually not on the data we don't have, that is, on comparing all of the zeroes. We usually focus on the data we do have and compare those values.
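To illustrate focusing on the data we have, the following sketch compares two users using only the movies that both of them reviewed. The ratings and the `mean_abs_difference` helper are made up for this example, and the mean absolute difference is just one possible comparison, chosen here for simplicity.

```python
# Each user is a dict of movie ID -> rating; unreviewed movies are absent.
# All values here are invented for illustration.
user_a = {242: 3, 302: 4, 377: 1, 51: 5}
user_b = {242: 4, 377: 1, 346: 2}


def mean_abs_difference(a, b):
    """Average rating gap over movies both users reviewed, or None if none."""
    shared = a.keys() & b.keys()  # only movies present for both users
    if not shared:
        return None
    return sum(abs(a[m] - b[m]) for m in shared) / len(shared)


print(mean_abs_difference(user_a, user_b))
```

The work done is proportional to the number of reviews the two users actually have, not to the total number of movies in the dataset.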
