The evaluation is fully automated and performed at a large scale, on , users using 2 months of ad data and user histories consisting of 16 months of query activity.
We consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete and propose a set of LIMBO-based techniques for finding structural clues in an instance of data, which may contain errors, missing values, and duplicate records.
The majority of the algorithms in the software clustering literature utilize structural information in order to decompose large software systems. Other approaches, such as using file names or ownership information, have also demonstrated merit. However, there is no intuitive way to combine information obtained from these two different types of techniques.
The provider may be a search engine or may be a merchant or other type of provider. The provider may be implemented using one or more computing devices such as the computing system described with respect to FIG. The provider may store and access information about the items in what is referred to as item data. The item data may include information or other data about a variety of items.
In addition, each item may be organized into one or more item categories in the item data. In some implementations, the categories are hierarchical. The item data may be implemented as structured data, for example. The provider may further store and access item category data. The item category data may include a data structure representing the hierarchy of item categories. In some implementations, the hierarchy of item categories may be stored in the item category data as a tree with each node of the tree associated with a particular item category.
Each node of the tree may be a parent node, a child node, or both. A child node is associated with a sub-category of the item category of its parent node. Thus, the outermost nodes of the tree are associated with the most specific item categories and the internal nodes of the tree are associated with broader item categories.
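The tree of item categories described above can be sketched as follows. This is a minimal illustration, not the provider's actual data structure: the class name `CategoryNode`, its methods, and the sample category names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare nodes by identity; avoids recursive equality checks
class CategoryNode:
    """One node of a hypothetical item category tree."""
    name: str
    parent: Optional["CategoryNode"] = None
    children: list["CategoryNode"] = field(default_factory=list)

    def add_child(self, name: str) -> "CategoryNode":
        """Create a sub-category of this node and return it."""
        child = CategoryNode(name, parent=self)
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        """Outermost nodes carry the most specific categories."""
        return not self.children

    def path(self) -> list[str]:
        """Category names from the root (broadest) down to this node."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


# Sample hierarchy (category names are invented for illustration).
root = CategoryNode("All items")
electronics = root.add_child("Electronics")
cameras = electronics.add_child("Cameras")
dslr = cameras.add_child("Digital SLR cameras")
```

Here `electronics` is both a child node (of the root) and a parent node (of `cameras`), while `dslr` is an outermost node with the most specific category.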
The root or topmost node in the tree may be associated with the broadest item category. Other types of data structures may also be used. An example tree data structure is illustrated with respect to the item category data shown in FIG. As shown, the item category data includes several nodes.
The provider may further include a trainer. The trainer may generate training data. The training data may comprise a mapping or association between queries and item categories.
The training data may further include a count associated with each query and item category. The trainer may incorporate or combine the training data with the item category data. The trainer may associate the count and query of each tuple in the training data with the corresponding node in the item category data based on the item category associated with each node and each tuple. In some implementations, the trainer, when associating a query and count of the training data with a node, may further associate the query and count with any parent nodes of the node.
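One way to picture the trainer's combining step is the sketch below. It assumes the training data is a list of (query, item category, count) tuples and that the hierarchy is given as a child-to-parent mapping; the names `PARENT`, `combine`, and the sample data are hypothetical. Each tuple's count is added to its own node and to every ancestor of that node.

```python
from collections import Counter

# Hypothetical hierarchy: category -> parent category (None marks the root).
PARENT = {
    "All items": None,
    "Electronics": "All items",
    "Cameras": "Electronics",
    "Televisions": "Electronics",
}


def combine(training_data, parent=PARENT):
    """Attach each (query, category, count) tuple to its node and all ancestors."""
    node_counts = {name: Counter() for name in parent}
    for query, category, count in training_data:
        node = category
        while node is not None:
            # Counts for a query already present on the node are summed.
            node_counts[node][query] += count
            node = parent[node]
    return node_counts


# Invented sample tuples: (query, item category, count).
training_data = [
    ("zoom lens", "Cameras", 30),
    ("zoom lens", "Televisions", 2),
    ("hd", "Televisions", 10),
]
counts = combine(training_data)
```

After combining, the "Electronics" node holds the summed count for "zoom lens" from both of its children, which is what lets a classifier compare broad and specific categories on the same scale.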
Where a node is already associated with a query, the trainer may add the counts of queries.
The provider may receive a query from a user of the client device, and determine one or more intended categories using the combined item category data and training data. In some implementations, the provider may determine the one or more intended categories using a classifier. One or more classifiers may be stored in the classifier data. In some implementations, a classifier may take the received query and a node of the combined item category data and training data as an input, and output a probability that the query was intended to match an item associated with the item category corresponding to the node.
Alternatively, the classifier may return the probability that the query was intended to match an item of the item category corresponding to a child node of the node given that the query was intended to also match an item of the item category corresponding to the node. In some implementations, the classifier may determine the probability by taking the count associated with the node for the received query, and dividing the count by a number representing the total number of queries received in the training data. The provider may recursively apply a classifier to nodes of the combined item category data and training data until a calculated probability for a node is less than a threshold probability.
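The two probability calculations described here can be sketched as simple ratios. The function names, the nested-dictionary layout of the counts, and the sample numbers are assumptions for illustration. In particular, the conditional variant is computed here as the child's count divided by the parent's count, which is one plausible reading of the text rather than a stated formula.

```python
def node_probability(counts, node, query, total):
    """Count for (node, query) divided by the total number of training queries."""
    if total == 0:
        return 0.0
    return counts.get(node, {}).get(query, 0) / total


def child_given_parent(counts, child, parent, query):
    """Assumed conditional variant: count at the child over count at the parent."""
    parent_count = counts.get(parent, {}).get(query, 0)
    if parent_count == 0:
        return 0.0
    return counts.get(child, {}).get(query, 0) / parent_count


# Invented per-node query counts (parents already include their children).
counts = {
    "Electronics": {"zoom lens": 32},
    "Cameras": {"zoom lens": 30},
    "Televisions": {"zoom lens": 2},
}
total_queries = 40  # hypothetical total number of queries in the training data

p_cameras = node_probability(counts, "Cameras", "zoom lens", total_queries)
p_cameras_given_electronics = child_given_parent(
    counts, "Cameras", "Electronics", "zoom lens"
)
```

With these numbers `p_cameras` is 30/40 and `p_cameras_given_electronics` is 30/32. Both functions return 0.0 for an unseen query or node instead of raising an error.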
The threshold probability may be selected by a user or an administrator and may be selected based on a trade-off between a desire to provide more specific categories and a desire to not return incorrect results. For example, a low threshold probability may result in the provider reaching nodes corresponding to more specific item categories. However, such item categories may not in fact accurately represent the intention of the query. The threshold probability may also be automatically determined by the computing device responsive to an input value.
The provider may recursively apply the classifier to nodes of the item category data, resulting in a list of item categories and associated probabilities output by the classifier for each node whose probability was above the threshold probability.
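The recursive application can be sketched as a depth-first walk that stops descending as soon as a node's probability falls below the threshold. The nested-dictionary node layout, the function name `classify`, and the sample counts are hypothetical. The sketch also assumes that `total` is the number of times the query appears in the training data and that a probability equal to the threshold is kept.

```python
def classify(node, query, threshold, total, results=None):
    """Collect (category, probability) pairs, pruning subtrees below the threshold."""
    if results is None:
        results = []
    p = node["counts"].get(query, 0) / total if total else 0.0
    if p < threshold:
        return results  # skip this node and everything beneath it
    results.append((node["name"], p))
    for child in node["children"]:
        classify(child, query, threshold, total, results)
    return results


# Invented tree: each node carries its per-query counts and its children.
tree = {
    "name": "Electronics", "counts": {"zoom lens": 32}, "children": [
        {"name": "Cameras", "counts": {"zoom lens": 30}, "children": [
            {"name": "Digital SLR cameras", "counts": {"zoom lens": 12}, "children": []},
            {"name": "Lenses", "counts": {"zoom lens": 18}, "children": []},
        ]},
        {"name": "Televisions", "counts": {"zoom lens": 2}, "children": []},
    ],
}

strict = classify(tree, "zoom lens", threshold=0.5, total=32)
lenient = classify(tree, "zoom lens", threshold=0.05, total=32)
```

With the threshold at 0.5, only "Electronics", "Cameras", and "Lenses" survive. Lowering it to 0.05 also admits "Digital SLR cameras" and "Televisions", which illustrates the trade-off between specificity and accuracy discussed above.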
In some implementations, the provider may then provide the list of categories to a user. For example, the provider may provide the user a list of the matching item categories and the user may select the matching item category that they believe is correct. Alternatively, the provider may rank the item categories based on their closeness to the true intent of the received query. The closeness of categories to the true intent of the received query may be evidenced by the probability output of the classifier, for example.
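Ranking by closeness can be sketched as a sort on the classifier's probability output. The function name and the sample scores are assumptions for illustration.

```python
def rank_categories(results):
    """Order (category, probability) pairs from most to least probable."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Invented classifier output, in the order the nodes happened to be visited.
results = [("Electronics", 1.0), ("Lenses", 0.5625), ("Cameras", 0.9375)]
ranked = rank_categories(results)
```

Note that when parent nodes accumulate the counts of their children, a parent's raw probability is never lower than a child's, so this ordering places broader categories first. An implementation that prefers specific categories might instead rank by depth among the nodes that cleared the threshold; that choice is not specified in the text.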
In addition, rather than provide the determined item categories to the user, the provider may include a comparator or matcher that may determine items that match the received query that are also associated with one or more of the item categories in the list of categories.
The matcher may determine items that match the item categories in the list of categories and the received query in the item data. In some implementations, the matcher may only match items associated with the highest ranked categories. The matcher may then provide indicators of items associated with the item category that match the received query. The indicators may be URLs (uniform resource locators), for example.
Alternatively, the matcher may match items associated with some subset of the highest ranked categories. The matcher may then provide indicators of the matching items grouped by associated item category.
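A matcher of this kind might look like the sketch below. The item records, the `example.com` URLs, the `top_n` parameter, and the rule that an item matches when every query term occurs in its title are all assumptions; the text does not say how items are matched against the query.

```python
from collections import defaultdict

# Hypothetical item data: title, URL indicator, and associated categories.
ITEMS = [
    {"title": "70-200mm zoom lens", "url": "https://example.com/items/1",
     "categories": {"Lenses"}},
    {"title": "Compact camera with zoom lens", "url": "https://example.com/items/2",
     "categories": {"Cameras"}},
    {"title": "Zoom lens cleaning kit", "url": "https://example.com/items/3",
     "categories": {"Accessories"}},
    {"title": "50mm prime lens", "url": "https://example.com/items/4",
     "categories": {"Lenses"}},
]


def match_items(query, ranked_categories, items=ITEMS, top_n=2):
    """Return URLs of matching items, grouped by the top_n ranked categories."""
    terms = query.lower().split()
    grouped = defaultdict(list)
    for category in ranked_categories[:top_n]:
        for item in items:
            title = item["title"].lower()
            # Assumed match rule: right category and every query term in the title.
            if category in item["categories"] and all(t in title for t in terms):
                grouped[category].append(item["url"])
    return dict(grouped)


matches = match_items("zoom lens", ["Lenses", "Cameras", "Electronics"])
```

The cleaning kit matches the query text but belongs to a category outside the ranked list, so it is excluded; this is the filtering effect the intended categories are meant to provide.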
The method may be implemented by the provider, for example. A plurality of nodes is received. The plurality of nodes may be received by the provider. Each node may be associated with an item category and a plurality of queries. In addition, each query may be associated with a count that represents the number of times that the query was submitted and resulted in a purchase or selection of an item having the same category as the node.
In some implementations, the nodes may have been generated by the trainer of the provider by combining nodes representing a hierarchy of item categories and training data collected over some period of time, for example. A threshold probability is received. The threshold probability may be received by the provider from a user or an administrator.
In some implementations, the threshold probability may also be automatically determined by the computing device responsive to an input value. The threshold probability may represent a minimum probability under which child nodes of the plurality of nodes may no longer be considered by a classifier. A low probability threshold may cause a classifier to return very specific categories for a received query, while a high probability threshold may cause the classifier to return more general categories for a received query.
A query is received. The query may be received by the provider from a user of a client device. The provider may then receive the query through the network.
A classifier is received. The classifier may be received by the provider from the classifier data. In some implementations, the classifier may output a probability that a received query was intended to match an item associated with the item category corresponding to a node when applied to the node.
The node may be part of the plurality of nodes that represent the item categories. Alternatively, the classifier may return the probability that the received query was intended to match a category corresponding to a child node of the node given that the query was intended to also match the node. In some implementations, the classifier may determine the probability for a node by determining the count associated with the node and the received query and dividing the count by a number representing the total number of queries received in the training data.
The classifier is recursively applied to the plurality of nodes, resulting in a list of item categories and a probability for each of the item categories in the list. In an implementation, the provider may recursively apply the classifier to the plurality of nodes using the received query until a generated probability for a node is below the threshold probability.
The categories are ranked based on the closeness of the categories to an intent of the received query. The categories may be ranked by the provider. The closeness of each category may be evidenced by the probability output by the classifier at the node associated with that category. A subset of the ranked categories is determined. The subset may be determined by the provider. The subset may include some number of the top ranked categories. For example, in some implementations, only the top ranked category may be in the subset.
In another implementation, the top five ranked categories may be in the subset.
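Selecting the subset amounts to taking the top k categories by probability. The sketch below uses the standard library's `heapq.nlargest`; the function name `top_categories` and the sample scores are assumptions for illustration.

```python
import heapq


def top_categories(scored, k=1):
    """Return the k (category, probability) pairs with the highest probability."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Invented classifier output for one query.
scored = [("Electronics", 1.0), ("Lenses", 0.5625), ("Cameras", 0.9375)]

top_one = top_categories(scored, k=1)
top_five = top_categories(scored, k=5)
```

When fewer than k categories cleared the threshold, as with `k=5` here, all of them are returned in ranked order.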