Two important papers from Yahoo! Research deal with the challenges of distributed information retrieval and with using web topology to detect web spam. The first, 'Challenges in Distributed Information Retrieval', is by Flavio Junqueira, Ricardo Baeza-Yates, Fabrizio Silvestri, Vassilis Plachouras and Carlos Castillo. The web is growing rapidly, with over 20 billion indexed pages, and the centralized systems behind search engines will not be able to handle that much data; fully distributed search engines will be required. In this paper the researchers bring together recent research results and discuss the challenges that a distributed Web retrieval system faces.
The other paper, 'Know your Neighbors: Web Spam Detection using the Web Topology', is by Vanessa Murdock, Carlos Castillo, Fabrizio Silvestri, Debora Donato and Aristides Gionis. In it, the authors “present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to large-scale Web data.”
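To make two of the three ideas in that abstract concrete, here is a minimal Python sketch of method (i), majority voting within a cluster, and of the feature computation behind method (iii). It is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy host names, the cluster assignments and the choice of "fraction of neighbors predicted as spam" as the new feature are all hypothetical, and the base classifier's predictions are simply given as a dictionary.

```python
from collections import Counter, defaultdict


def majority_vote_by_cluster(predicted, cluster_of):
    """Method (i), sketched: give every host the majority label of its cluster.

    Ties go to whichever label was seen first in the cluster
    (an arbitrary choice made for this sketch).
    """
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    return {
        host: votes[cluster_of[host]].most_common(1)[0][0]
        for host in predicted
    }


def neighbor_spam_fraction(predicted, neighbors):
    """Method (iii), sketched: fraction of a host's neighbors predicted as spam.

    The value could be appended to each host's feature vector before
    retraining the classifier. Hosts without neighbors get 0.0 (an assumption).
    """
    feature = {}
    for host in predicted:
        linked = neighbors.get(host, [])
        spam = sum(1 for n in linked if predicted[n] == "spam")
        feature[host] = spam / len(linked) if linked else 0.0
    return feature


# Hypothetical toy data: base-classifier labels, clusters, and links.
predicted = {
    "a.example": "spam",
    "b.example": "spam",
    "c.example": "nonspam",
    "d.example": "nonspam",
}
cluster_of = {"a.example": 0, "b.example": 0, "c.example": 0, "d.example": 1}
neighbors = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example"],
    "c.example": ["a.example", "b.example"],
    "d.example": [],
}

relabeled = majority_vote_by_cluster(predicted, cluster_of)
features = neighbor_spam_fraction(predicted, neighbors)
print(relabeled)
print(features)
```

In this toy graph, c.example sits in a cluster where two of three hosts are predicted as spam, so the vote flips its label to spam, while the isolated d.example keeps its original label. This mirrors the observation quoted above that linked hosts tend to belong to the same class.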