Google's Recommender System: "Wide & Deep Learning for Recommender Systems" — Paper Translation & Commentary

Wide & Deep Learning for Recommender Systems

Abstract

Generalized linear models, augmented with nonlinear feature transformations, are commonly employed for large-scale regression and classification tasks involving sparse inputs. The memorization of feature interactions through a wide array of cross-product feature transformations is effective and interpretable, though it tends to demand more extensive feature engineering efforts compared to other approaches. Deep neural networks, while capable of generalizing better to unseen feature combinations through low-dimensional dense embeddings learned from sparse features, can overgeneralize and recommend less pertinent items when the user-item interactions are sparse and high-rank. This paper introduces Wide & Deep learning—a joint training framework combining wide linear models and deep neural networks—aimed at balancing memorization and generalization benefits for recommendation systems. We implemented and evaluated this system on Google Play, a commercial mobile app store with over one billion active users and over one million apps globally. Experimental results demonstrate that the Wide & Deep approach significantly outperforms standalone wide-only or deep-only models in terms of app acquisitions. Additionally, we have made our open-source implementation available in TensorFlow.

Two concepts are central throughout this paper: memorization and generalization, and we will refer to them repeatedly in the discussion that follows. Memorization means the model learns how feature combinations that actually appeared in the training data affect the outcome; generalization means the model learns to infer the effect of feature combinations it has rarely or never seen. Taking Google Play as an example: a linear model memorizes how observed interactions between known user features (age, occupation) and app features (download count, category) predict whether the user will install an app, while a deep model generalizes, using learned feature representations to make the same prediction for combinations that never appeared in the training data.

1. Introduction

A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank the items based on certain objectives, such as clicks or purchases.
One challenge in recommender systems, similar to the general search ranking problem, is to achieve both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper, we focus on the apps recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.
For massive-scale online recommendation and ranking systems in an industrial setting, generalized linear models such as logistic regression are widely used because they are simple, scalable and interpretable. The models are often trained on binarized sparse features with one-hot encoding. E.g., the binary feature “user_installed_app=netflix” has value 1 if the user installed Netflix. Memorization can be achieved effectively using cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and then is later shown Pandora. This explains how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using features that are less granular, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.
Embedding-based models, such as factorization machines [5] or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering. However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal. In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these “exception rules” with much fewer parameters.
In this paper, we present the Wide & Deep learning framework to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component as shown in Figure 1.
The main contributions of the paper include:
• The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear model with feature transformations for generic recommender systems with sparse inputs.
• The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
• We have open-sourced our implementation along with a high-level API in TensorFlow.
While the idea is simple, we show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.

[Figure 1: Wide linear model (left), deep neural network (right), and the joint Wide & Deep model (center)]

2. Recommender System Overview

An overview of the app recommender system is shown in Figure 2. A query, which can include various user and contextual features, is generated when a user visits the app store. The recommender system returns a list of apps (also referred to as impressions) on which users can perform certain actions such as clicks or purchases. These user actions, along with the queries and impressions, are recorded in the logs as the training data for the learner.
Since there are over a million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After reducing the candidate pool, the ranking system ranks all items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, including user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). In this paper, we focus on the ranking model using the Wide & Deep learning framework.

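To make the two-stage structure concrete, here is a minimal Python sketch of the retrieve-then-rank flow described above. This is a toy illustration, not the production system; the names retrieve_candidates, rank, rules, and score_fn are all assumptions.

```python
def retrieve_candidates(query, all_apps, rules, max_candidates=100):
    """Retrieval step: cheaply shrink the pool of ~1M apps to a short list."""
    candidates = [app for app in all_apps if rules(query, app)]
    return candidates[:max_candidates]

def rank(query_features, candidates, score_fn):
    """Ranking step: score P(y|x) for each surviving candidate and sort."""
    scored = [(app, score_fn(query_features, app)) for app in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Only the ranking step needs the full Wide & Deep model; retrieval exists precisely so the expensive model never has to score the whole database.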

[Figure 2: Overview of the app recommender system: query, retrieval, ranking, and logging pipeline]

A note on terminology: rendering "impressions" too literally can obscure its meaning here. In this context the impressions are simply the list of apps shown to the user, and "impression features" are the features of those displayed apps.

3. The Wide & Deep Model

3.1 The Wide Component

The wide component is a generalized linear model of the form y = w^T x + b, as illustrated in Figure 1 (left). y is the prediction, x = [x_1, x_2, ..., x_d] is a vector of d features, w = [w_1, w_2, ..., w_d] are the model parameters and b is the bias. The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:

\phi_k(x) = \prod^d_{i=1} x_i^{c_{ki}}, \quad c_{ki} \in \{0,1\}

where c_{ki} is a boolean variable that is 1 if the i-th feature is part of the k-th transformation \phi_k, and 0 otherwise. For binary features, a cross-product transformation (e.g., AND(gender=female, language=en)) is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.

The formula \phi_k(x) = \prod^d_{i=1} x_i^{c_{ki}}, c_{ki} \in \{0,1\} looks very abstract, but in practice it is simply a feature-crossing mechanism. For example, when modeling user attributes we can cross different dimensions: define a gender set S = \{male, female\} and a language set L = \{Chinese, English\}. Their Cartesian product S \times L gives all possible crossed attributes \{male \land Chinese, male \land English, female \land Chinese, female \land English\}. For a sample whose gender is "female" and language is "English", the crossed feature female \land English takes value 1, and every other combination takes value 0.
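As a concrete illustration of this feature-crossing mechanism, here is a minimal Python sketch; the cross_product helper and the feature strings are illustrative assumptions, not part of the paper.

```python
def cross_product(x, crossed_features):
    """Return 1 only if every feature named in the cross is active in x."""
    return int(all(x.get(name, 0) == 1 for name in crossed_features))

# One-hot style binary features for a single example.
x = {"gender=female": 1, "language=en": 1, "language=zh": 0}

print(cross_product(x, ["gender=female", "language=en"]))  # 1
print(cross_product(x, ["gender=female", "language=zh"]))  # 0
```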

3.2 The Deep Component

The deep component is a feed-forward neural network, as depicted in Figure 1 (right). For categorical features, the original inputs are represented as strings (e.g., "language=en"). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional and dense real-valued vector, commonly referred to as an embedding vector. The dimensionality of these embeddings is typically within the range of O(10) to O(100). These embedding vectors are initialized randomly and then their values are trained to minimize the final loss function during model training. These low-dimensional dense embedding vectors are subsequently fed into the hidden layers of a neural network during the forward pass. Specifically, each hidden layer performs the following computation:
a^{(l+1)}=f(W^{(l)}a^{(l)}+b^{(l)})
where l denotes the layer index and f represents the activation function, typically rectified linear units (ReLU). a^{(l)},b^{(l)},W^{(l)} denote respectively the activations, bias, and model weights at layer l.

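A toy numpy sketch of this forward pass, under assumed shapes (a vocabulary of 1,000 feature values, 32-dimensional embeddings, two hidden layers), might look like this:

```python
# Embedding lookup followed by a^{(l+1)} = ReLU(W^{(l)} a^{(l)} + b^{(l)}).
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 1000, 32
# Randomly initialized, then trained to minimize the final loss.
embeddings = rng.normal(0, 0.01, (vocab_size, embed_dim))

def forward(feature_ids, weights, biases):
    # Concatenate the embedding vectors of the active categorical features.
    a = embeddings[feature_ids].reshape(-1)
    for W, b in zip(weights, biases):
        a = np.maximum(0.0, W @ a + b)  # ReLU hidden layer
    return a

# Two categorical features -> 64-dim input, then two hidden layers.
dims = [2 * embed_dim, 128, 64]
weights = [rng.normal(0, 0.1, (dims[i + 1], dims[i])) for i in range(2)]
biases = [np.zeros(dims[i + 1]) for i in range(2)]

final_activations = forward(np.array([3, 517]), weights, biases)
print(final_activations.shape)  # (64,)
```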

3.3 Combining the Wide and Deep Components

The wide component and the deep component are combined using a weighted sum of their output log odds as the prediction, which is fed to one common logistic loss function for joint training. Note the key difference between joint training and ensembling: in an ensemble, individual models are trained separately without knowing each other, and their predictions are combined only at inference time, whereas joint training optimizes all parameters simultaneously, taking into account both components as well as the weights of their sum at training time. This has implications for model size: an ensemble typically requires each individual model to be larger (e.g., with more features and transformations) to achieve reasonable accuracy on its own, whereas with joint training the wide component only needs to complement the weaknesses of the deep component with a small number of cross-product feature transformations, rather than a full-size wide model. Joint training is carried out with mini-batch stochastic optimization; in the experiments, the Follow-The-Regularized-Leader (FTRL) algorithm with L1 regularization is used as the optimizer for the wide part of the model, and AdaGrad for the deep part. The combined model is shown in Figure 1 (center). For a logistic regression problem, the model's prediction is:

P(Y=1|x) = \sigma(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b)

where Y is the binary class label, \sigma(\cdot) is the sigmoid function, \phi(x) denotes the cross-product transformations of the original features x, and b is the bias term. w_{wide} is the vector of all wide model weights, and w_{deep} are the weights applied to the final activations a^{(l_f)} of the deep component.

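To make the prediction equation concrete, here is a toy numpy sketch of the combined wide + deep logit. All shapes and values below are illustrative stand-ins, not the production model; in TensorFlow, a high-level API along these lines is available as the DNNLinearCombinedClassifier estimator.

```python
# P(Y=1|x) = sigma(w_wide^T [x, phi(x)] + w_deep^T a^{(l_f)} + b)
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_wide, d_deep = 100, 64                    # raw + crossed features, final deep layer
x_and_crosses = rng.integers(0, 2, d_wide)  # binary vector [x, phi(x)]
a_final = rng.normal(size=d_deep)           # final activations a^{(l_f)} of the deep part

w_wide = rng.normal(scale=0.01, size=d_wide)
w_deep = rng.normal(scale=0.01, size=d_deep)
b = 0.0

p = sigmoid(w_wide @ x_and_crosses + w_deep @ a_final + b)
print(p)  # P(Y=1|x)
```

At training time both w_wide and the deep network's parameters receive gradients from this single logit, which is exactly what distinguishes joint training from an ensemble.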

4. System Implementation

The app recommendation pipeline consists of three stages: data generation, model training, and model serving, as detailed in Figure 3.

[Figure 3: Apps recommendation pipeline: data generation, model training, and model serving]

4.1 Data Generation

In this stage, user and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise.
Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution function P(X ≤ x), divided into n_q quantiles. The normalized value is \frac{i-1}{n_q-1} for values in the i-th quantile. Quantile boundaries are computed during data generation.
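As a concrete illustration, here is a small Python sketch of this quantile normalization. Using numpy.quantile to compute the boundaries is an assumption, but the mapping to (i-1)/(n_q-1) follows the formula above.

```python
import numpy as np

def quantile_boundaries(values, n_q):
    # n_q quantiles -> n_q - 1 interior boundaries, computed at data generation.
    probs = [i / n_q for i in range(1, n_q)]
    return np.quantile(values, probs)

def normalize(x, boundaries):
    n_q = len(boundaries) + 1
    # 1-based index i of the quantile that x falls into.
    i = int(np.searchsorted(boundaries, x, side="right")) + 1
    return (i - 1) / (n_q - 1)

train_values = np.random.default_rng(0).exponential(size=10_000)
b = quantile_boundaries(train_values, n_q=10)
print(normalize(train_values[0], b))  # a value in [0, 1]
```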

4.2 Model Training

The model structure we used in the experiment is shown in Figure 4. During training, our input layer takes in training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit.
The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be re-trained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model. To tackle this challenge, we implemented a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model.
Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic. We empirically validate the model quality against the previous model as a sanity check.

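A hedged Keras sketch of this architecture is given below. The number of categorical features, the vocabulary size, and the hidden-layer widths (1024, 512, 256) are assumptions chosen only to reproduce the shapes described in the text: 32-dimensional embeddings, a roughly 1200-dimensional concatenated vector, and three ReLU layers.

```python
import tensorflow as tf

n_categorical, vocab_size = 26, 10_000
n_dense = 1200 - n_categorical * 32  # pad concatenation to ~1200 dims

cat_inputs = [tf.keras.Input(shape=(1,), dtype="int32")
              for _ in range(n_categorical)]
dense_input = tf.keras.Input(shape=(n_dense,))

# One 32-dim embedding table per categorical feature.
embedded = [
    tf.keras.layers.Flatten()(tf.keras.layers.Embedding(vocab_size, 32)(inp))
    for inp in cat_inputs
]
x = tf.keras.layers.Concatenate()(embedded + [dense_input])  # ~1200 dims
for units in (1024, 512, 256):  # three ReLU layers (assumed widths)
    x = tf.keras.layers.Dense(units, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # logistic unit

deep_model = tf.keras.Model(cat_inputs + [dense_input], output)
```

The warm-starting scheme described above would correspond to initializing the embedding tables and linear weights of a freshly built model from the previous model's values rather than from random initialization.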
[Figure 4: Wide & Deep model structure for apps recommendation]

4.3 Model Serving

Once the model is trained and verified, it is loaded into the model servers. For each request, the servers receive a set of candidate apps from the retrieval system together with the user features, score each candidate app, and rank the apps from highest to lowest score before displaying them to the user. The scores are computed by running a forward inference pass over the Wide & Deep model. To ensure each request is served within 10 milliseconds, we optimized performance with multithreading parallelism, scoring smaller batches of candidates in parallel rather than scoring all candidate apps in a single batch.
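A minimal sketch of this serving-time optimization might look like the following; score_batch stands in for a forward pass over the Wide & Deep model, and the batch size and worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def score_all(candidates, score_batch, batch_size=50, workers=8):
    """Split candidates into small batches and score them in parallel."""
    batches = [candidates[i:i + batch_size]
               for i in range(0, len(candidates), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_batch, batches)  # preserves batch order
    scores = [s for batch_scores in results for s in batch_scores]
    # Rank apps from highest to lowest score.
    return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
```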

5. Experiment Results

6. Related Work

In machine learning, the combination of wide linear models with cross-product feature transformations and deep neural networks with dense embeddings has been influenced by prior work, such as factorization machines [5], which enhance linear models' generalization by factorizing interactions between two variables into a dot product of two low-dimensional embedding vectors. This paper significantly enhances the model's capacity by learning highly nonlinear interactions between embeddings through neural networks instead of dot products.
In language modeling, joint training of recurrent neural networks (RNNs) and maximum entropy models with n-gram features has been proposed to notably reduce RNN complexity (e.g., hidden layer sizes) by establishing direct weights between inputs and outputs [4]. In computer vision, deep residual learning [2] has been employed to alleviate the challenge of training deeper models and improve accuracy via shortcut connections that skip one or more layers. Additionally, joint training of neural networks with graphical models has been applied to human pose estimation from images [6]. In this work, we explored the joint training of feed-forward neural networks and linear models, establishing direct connections between sparse features and the output unit for generic recommendation and ranking issues involving sparse input data.
In the recommender systems literature, collaborative deep learning has been investigated by integrating deep learning for content information with collaborative filtering (CF) for rating matrices [7]. Previous studies on mobile app recommendations, such as AppJoy [8], utilized CF based on users' app usage records. Unlike prior CF or content-based approaches, this research jointly trains Wide & Deep models on user and impression data for mobile app recommendations.

7. Conclusion

Both memorization and generalization play a critical role in the functioning of recommender systems. While wide linear models excel at memorizing sparse feature interactions through cross-product transformations, deep neural networks demonstrate remarkable capacity to generalize by leveraging low-dimensional embeddings. Our proposed Wide & Deep learning framework integrates the strengths of both model types. Through extensive development efforts, we implemented this framework within Google Play's recommendation engine, a large-scale commercial platform. Our production implementation was rigorously tested across various scenarios to ensure robust performance. Our online experiments conclusively demonstrated that the Wide & Deep model outperforms both wide-only and deep-only approaches in terms of app acquisition metrics.

PS: This is a first pass at translating the paper; please bear with any mistakes, and corrections are very welcome. I plan to study the open-source implementation in more depth and will keep updating this post as I go. If I learn anything new along the way, I will share it here; looking forward to exchanging ideas with everyone!
