Most existing active learning studies focus on designing sample selection algorithms. However, several fundamental problems deserve investigation to provide deep insight into active learning. In this article, we conduct an in-depth investigation of active learning for classification from the perspective of model change. We derive a general active learning framework for classification called maximum model change (MMC), which aims to query the most influential examples. The model change is quantified as the difference between the model parameters before and after training with the expanded training set. Inspired by the stochastic gradient update rule, the gradient of the loss with respect to a given candidate example is adopted to approximate the model change.
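The MMC idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes logistic regression trained by SGD, and approximates the model change for a candidate by the norm of the SGD update, taking the expectation over the unknown label under the current model. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_model_change(w, x, eta=0.1):
    """Approximate the model change for candidate x as the norm of the
    SGD update, averaged over the unknown label y in {0, 1} weighted by
    the model's current predictive probabilities."""
    p = sigmoid(w @ x)                       # P(y=1 | x, w)
    change = 0.0
    for y, prob in ((1, p), (0, 1.0 - p)):
        grad = (sigmoid(w @ x) - y) * x      # gradient of the logistic loss
        change += prob * eta * np.linalg.norm(grad)
    return change

def mmc_query(w, pool):
    """Return the index of the pool example with maximum expected model change."""
    scores = [expected_model_change(w, x) for x in pool]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
w = rng.normal(size=3)
pool = rng.normal(size=(20, 3))
idx = mmc_query(w, pool)
```

Note that an all-zero candidate yields zero gradient and hence zero expected change, so MMC naturally ignores examples that cannot move the model.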

With the large number of entities appearing on the web, entity linking has recently become increasingly useful. It assigns an entity from a knowledge resource to a name mention to help users grasp the meaning of that mention. Unfortunately, many candidate entities can be assigned to a single name mention. Co-occurring name mentions, however, are usually related and can be considered together to determine their best assignments. This approach, called collective entity linking, is often conducted over an entity graph. However, traditional collective entity linking methods either consume too much time due to the large scale of the entity graph or obtain low accuracy due to graph simplification.

Sentiment classification aims to automatically predict the sentiment polarity (e.g., positive or negative) of user-generated sentiment data (e.g., reviews, blogs). In real applications, such data can span so many different domains that it is difficult to label training data for all of them. Therefore, we study the sentiment classification adaptation task in this article: a system is trained to label reviews from one source domain but is meant to be used on a target domain. One of the biggest challenges in this task is how to cope with the fact that the data distributions of the source and target domains can differ significantly.

This article explores a method for more accurately estimating the main effect of the system in a typical test-collection-based evaluation of information retrieval systems, thus increasing the sensitivity of system comparisons. Randomly partitioning the test document collection allows for multiple tests of a given system and topic (replicates). Bootstrap ANOVA can use these replicates to extract system-topic interactions—something not possible without replicates—yielding a more precise value for the system effect and a narrower confidence interval around that value. Experiments using multiple TREC collections demonstrate that removing the topic-system interactions substantially reduces the confidence intervals around the system effect as well as increases the number of significant pairwise differences found.
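The role of replicates can be made concrete with a small sketch. Under standard two-way ANOVA (not the paper's bootstrap procedure), a score y[i, j, k] for system i, topic j, replicate k decomposes as grand mean + system effect + topic effect + interaction + residual; without the replicate axis, the interaction and residual terms cannot be separated. The decomposition below is a textbook identity, shown here only to illustrate why replicates matter.

```python
import numpy as np

def decompose(y):
    """Two-way ANOVA decomposition with replicates.
    y has shape (systems, topics, replicates)."""
    mu = y.mean()
    sys_eff = y.mean(axis=(1, 2)) - mu            # system main effects
    top_eff = y.mean(axis=(0, 2)) - mu            # topic main effects
    cell = y.mean(axis=2)                         # per (system, topic) mean
    interaction = cell - mu - sys_eff[:, None] - top_eff[None, :]
    resid = y - cell[:, :, None]                  # only estimable with replicates
    return mu, sys_eff, top_eff, interaction, resid

rng = np.random.default_rng(1)
y = rng.normal(size=(4, 10, 5))                   # 4 systems, 10 topics, 5 replicates
mu, s, t, st, e = decompose(y)
```

The effects sum to zero by construction, and the five components reconstruct y exactly; removing `st` from the error term is what tightens the confidence interval on the system effect.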

Information Retrieval (IR) is well known for its great number of evaluation measures, with new ones being proposed ever more frequently. In this context, correlation analysis is the tool used to study evaluation measures: it lets us understand whether two measures rank systems similarly, whether they grasp different aspects of system performance or actually reflect different user models, and whether a new measure is well motivated. To this end, the two most commonly used correlation coefficients are Kendall's τ and the AP correlation τAP. The goal of this article is to investigate the properties of the tool itself, that is, of the correlation analysis we use to study evaluation measures.
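For readers unfamiliar with the two coefficients, here is a minimal sketch of both. Kendall's τ scores all pairs equally, while τAP (Yilmaz et al.'s AP correlation) weights disagreements near the top of the ranking more heavily. These are standard textbook formulations, not code from the article; no tie handling is included.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau between two rank vectors (a[i] is the rank of
    system i under measure A). Assumes no ties."""
    n = len(a)
    concordant = sum(
        1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1
        for i, j in combinations(range(n), 2)
    )
    return concordant / (n * (n - 1) / 2)

def tau_ap(ref, cand):
    """AP correlation: ref and cand are orderings (best first) of the
    same items; errors near the top are penalised more heavily."""
    n = len(cand)
    ref_pos = {item: i for i, item in enumerate(ref)}
    total = 0.0
    for i in range(1, n):
        # items above position i in cand that ref also places above cand[i]
        c_i = sum(1 for x in cand[:i] if ref_pos[x] < ref_pos[cand[i]])
        total += c_i / i
    return 2.0 * total / (n - 1) - 1.0
```

Both coefficients equal 1 for identical rankings and -1 for reversed ones, but they diverge when swaps are concentrated among the top systems, which is exactly the behaviour the article's analysis probes.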

We propose the Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) probabilistic framework, a novel methodology for dealing with multiple crowd assessors that may be contradictory and/or noisy. By modeling relevance judgements and crowd assessors as sources of uncertainty, AWARE takes the expectation of a generic performance measure, such as Average Precision, composed with these random variables. In this way, it approaches the problem of aggregating different crowd assessors from a new perspective, that is, directly combining the performance measures computed on the ground truth generated by each crowd assessor instead of adopting some classification technique to merge the labels they produce.
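The key shift AWARE describes, aggregating measures rather than labels, can be sketched as follows. This is a simplified illustration with uniform assessor weights, not the AWARE estimators themselves; the function names are hypothetical.

```python
def average_precision(ranking, relevant):
    """AP of a ranked list of document ids against a set of relevant ids."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def aware_score(ranking, assessor_judgments, weights=None):
    """AWARE-style aggregation sketch: compute AP against each assessor's
    ground truth, then take a (weighted) expectation over assessors."""
    k = len(assessor_judgments)
    if weights is None:
        weights = [1.0 / k] * k          # uniform weights for illustration
    return sum(w * average_precision(ranking, rel)
               for w, rel in zip(weights, assessor_judgments))
```

Contrast this with label merging (e.g., majority voting), which would first collapse the assessors into a single ground truth and only then compute one AP value.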

The interplay between the response latency of web search systems and users’ search experience has only recently started to attract research attention, despite the important implications of response latency on monetisation of such systems. In this work, we carry out two complementary studies to investigate the impact of response latency on users’ searching behaviour in web search engines. We first conduct a controlled user study to investigate the sensitivity of users to increasing delays in response latency. This study shows that the users of a fast search system are more sensitive to delays than the users of a slow search system.

Cold-start recommendation is one of the most challenging problems in recommender systems. An important approach to cold-start recommendation is to conduct an interview for new users, called the interview-based approach. Among the interview-based methods, Representative-Based Matrix Factorization (RBMF) [24] provides an effective solution with appealing merits: it represents users over selected representative items, which makes the recommendations highly intuitive and interpretable. However, RBMF only utilizes a global set of representative items to model all users. Such a representation is somewhat too rigid and may not be flexible enough to capture users’ varying interests. To address this problem, we propose a novel interview-based model that dynamically creates meaningful user groups using decision trees and then selects local representative items for each group.

Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics is a critical and fundamental task in these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, due to their limited length, short texts are much sparser in terms of word co-occurrences. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, even though it seems reasonable to assume that each short text has only one topic because of its shortness, the definition of “shortness” is subjective and the length of short texts is dataset dependent.
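The single-topic assumption that distinguishes DMM from, say, LDA can be seen in a compact collapsed Gibbs sampler: each document draws exactly one topic indicator, rather than one per word. The sketch below is a simplified illustration (it omits the standard correction for repeated words within a document), not the model variant studied in the article.

```python
import random
from collections import Counter

def dmm_gibbs(docs, K=2, iters=100, alpha=0.1, beta=0.1, seed=0):
    """Simplified collapsed Gibbs sampling for the Dirichlet Multinomial
    Mixture: every document is assigned exactly ONE topic."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [rng.randrange(K) for _ in docs]          # one topic per document
    n_docs = [0] * K                              # documents per topic
    n_word = [Counter() for _ in range(K)]        # word counts per topic
    n_len = [0] * K                               # total words per topic
    for d, k in zip(docs, z):
        n_docs[k] += 1
        n_word[k].update(d)
        n_len[k] += len(d)
    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                              # remove doc i from its topic
            n_docs[k] -= 1
            n_word[k].subtract(d)
            n_len[k] -= len(d)
            weights = []
            for t in range(K):                    # p(z_i = t | rest), simplified
                p = n_docs[t] + alpha
                for j, w in enumerate(d):
                    p *= (n_word[t][w] + beta) / (n_len[t] + V * beta + j)
                weights.append(p)
            z[i] = rng.choices(range(K), weights=weights)[0]
            n_docs[z[i]] += 1
            n_word[z[i]].update(d)
            n_len[z[i]] += len(d)
    return z
```

Replacing the per-document topic indicator with per-word indicators would recover an LDA-style sampler; the article's concern is precisely when the one-topic shortcut stops being justified.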

Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to efficiently estimate similarities based on both common neighbors and structural contexts in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error bound ϵ, the confidence level (1-Δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network with 1,000,000,000 edges.
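The sampling idea can be illustrated with a toy estimator: run R random walks of length T from a source node and use visit frequencies as a similarity score. This is only a sketch of the general path-sampling principle, with hypothetical names; the article's estimator, its treatment of structural contexts, and its (ϵ, Δ, T) bound are considerably more involved.

```python
import random
from collections import defaultdict

def path_similarity(graph, source, R=2000, T=3, seed=0):
    """Estimate similarity of `source` to other nodes as the fraction of
    the R length-T random walks from `source` that visit each node.
    `graph` maps a node to its adjacency list."""
    rng = random.Random(seed)
    visits = defaultdict(int)
    for _ in range(R):
        node = source
        for _ in range(T):
            nbrs = graph.get(node, [])
            if not nbrs:
                break                      # dead end: stop this walk early
            node = rng.choice(nbrs)
            visits[node] += 1
    return {v: c / R for v, c in visits.items()}
```

The appeal of such estimators is that cost scales with R and T rather than with the size of the network, which is what makes billion-edge graphs tractable.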

Given a social network, the Influence Maximization (InfMax) problem seeks a seed set of k people that maximizes the expected influence for a viral marketing campaign. However, a solution for a particular seed size k is often not enough to make an informed choice regarding budget and cost-effectiveness. In this article, we propose the computation of the Influence Spectrum (InfSpec), the maximum influence at each possible seed set size k within a given range [klower,kupper], thus enabling informed decision making under any budget or influence requirement. As none of the existing methods for InfMax are efficient enough for the task in large networks, we propose LISA (sub-Linear Influence Spectrum Approximation), an efficient approximation algorithm for InfSpec (and also InfMax) with the best-known worst-case guarantees for billion-scale networks.
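To make InfSpec concrete, here is a naive baseline (not LISA): classic Monte Carlo greedy under the independent cascade model. Because greedy seed sets are nested, one greedy pass yields the whole spectrum for sizes 1..k_upper. This brute-force sketch is exactly the kind of method the abstract says does not scale; all names are hypothetical.

```python
import random

def simulate_spread(graph, seeds, p=0.1, rng=None):
    """One independent-cascade simulation; returns the number of activated nodes."""
    rng = rng or random.Random()
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def influence_spectrum(graph, k_upper, runs=200, p=0.1, seed=0):
    """Naive spectrum via Monte Carlo greedy: expected spread for every
    seed-set size 1..k_upper (greedy sets are nested, so one pass suffices)."""
    rng = random.Random(seed)
    seeds, spectrum = [], []
    for _ in range(k_upper):
        best, best_val = None, -1.0
        for v in graph:
            if v in seeds:
                continue
            val = sum(simulate_spread(graph, seeds + [v], p, rng)
                      for _ in range(runs)) / runs
            if val > best_val:
                best, best_val = v, val
        seeds.append(best)
        spectrum.append(best_val)
    return seeds, spectrum
```

Each spectrum entry costs O(|V| · runs) cascade simulations, which is why sub-linear approximations such as LISA are needed at billion-node scale.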

Compressing the textstreams generated by social networks can both reduce storage consumption and improve efficiency, for example by enabling fast searching. However, compression is challenging due to the large scale of these textstreams. In this article, we propose a textstream compression framework based on compressed sensing theory and design a series of matching parallel procedures. The new approach uses a linear projection technique in the compression process, achieving fast compression speed and a low compression ratio. Both compression and decompression of large-scale textstreams are accelerated by carefully designed parallel procedures. The decompression process is implemented by approximately solving underdetermined linear systems.
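The linear-projection core of compressed sensing can be sketched in a few lines: compress a sparse n-dimensional term-frequency vector x into m << n measurements y = Φx, then recover an approximation by solving the underdetermined system. For brevity the sketch uses the minimum-norm least-squares solution as a stand-in for a proper sparse-recovery solver (which would be needed to recover x itself); the measurement matrix and dimensions are arbitrary choices for illustration.

```python
import numpy as np

def compress(x, phi):
    """Compressed-sensing style compression: project a high-dimensional
    (sparse) term-frequency vector x to m << n measurements."""
    return phi @ x

def decompress(y, phi):
    """Minimum-norm solution of the underdetermined system phi @ x = y;
    a simple stand-in for a sparse-recovery solver."""
    return np.linalg.pinv(phi) @ y

rng = np.random.default_rng(42)
n, m = 200, 80                              # 200-dim vectors, 80 measurements
phi = rng.normal(size=(m, n)) / np.sqrt(m)  # random Gaussian measurement matrix
x = np.zeros(n)
x[[3, 17, 99]] = [2.0, 1.0, 3.0]            # sparse "document" vector
y = compress(x, phi)
x_hat = decompress(y, phi)
```

Compression is a single matrix-vector product, which is why it parallelizes well; the recovered x_hat reproduces the measurements exactly even though it is not the sparse original.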