Twitter's history is about to become far more accessible. The social media giant has announced that users will be able to search the complete archive of public Tweets. Infrastructure engineer Yi Zhuang stated in a blog post:
“Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published.”
Because Twitter previously focused on real-time feeds, search had felt incomplete: only about a week of public content was available to users. The Twitter blog post goes into extensive detail about “how we built a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.”
The change will roll out to the mobile and desktop versions of Twitter within the next few days. Twitter will now index every Tweet since 2006. Results will become more comprehensive for complete TV and sports seasons, major conferences and industry gatherings, places and businesses, and hashtag conversations on important topics.
The most critical factors in the design of the indexing system were modularity, scalability, cost effectiveness, and a simple interface. The system was developed incrementally; the goal of indexing every Tweet was not achieved in a single quarter. The complete index builds on earlier foundational projects. In 2012, a small historical index of approximately 2 billion top Tweets was created; according to the Twitter blog, this involved “developing an offline data aggregation and preprocessing pipeline. In 2013, we expanded that index by an order of magnitude, evaluating and tuning SSD performance. In 2014, we built the full index with a multi-tier architecture, focusing on scalability and operability.”
The main parts of the system are as follows:
- batched data aggregation
- preprocessing pipeline
- inverted index builder
- Earlybird shards
- Earlybird roots
Batched data aggregation and preprocessing work as follows. The real-time index processes individual Tweets one at a time. The complete index instead employs a batch processing pipeline, with each batch comprising a day’s worth of Tweets. The real-time ingestion code was packaged into Pig user-defined functions so that it could be reused in Pig jobs, creating a pipeline of Hadoop jobs for aggregating and preprocessing Tweets on Hadoop.
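The idea of reusing the same per-Tweet ingestion code in both the real-time and batch paths can be sketched in Python (a hypothetical, simplified stand-in; Twitter's actual pipeline packaged this logic as Pig UDFs running in Hadoop jobs, and the function and field names here are illustrative):

```python
def tokenize_tweet(tweet_text):
    """Shared ingestion logic: the same tokenization used by the
    real-time path, reused verbatim in the batch pipeline."""
    return tweet_text.lower().split()

def preprocess_day(tweets_for_day):
    """Batch step: apply the shared per-Tweet ingestion code to one
    day's worth of Tweets, producing one preprocessed record each."""
    return [{"id": t["id"], "tokens": tokenize_tweet(t["text"])}
            for t in tweets_for_day]

# One batch = one day of Tweets (hypothetical sample data).
day_batch = [{"id": 1, "text": "Hello world"},
             {"id": 2, "text": "Search every Tweet"}]
records = preprocess_day(day_batch)
print(records[0]["tokens"])  # ['hello', 'world']
```

Packaging the per-Tweet logic as a pure function is what lets the same code serve both one-at-a-time real-time ingestion and day-sized batches.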
The daily data aggregation and preprocessing pipeline includes components such as:
- Engagement aggregator
The pipeline is designed to run against a single day of Tweets, and it has been set up to process data incrementally.
This set up has the following benefits according to the Twitter blog:
“It allowed us to incrementally update the index with new data without having to fully rebuild too frequently. And because processing for each day is set up to be fully independent, the pipeline could be massively parallelizable on Hadoop. This allowed us to efficiently rebuild the full index periodically (e.g. to add new indexed fields or change tokenization).”
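Because each day's processing is fully independent, a full rebuild is just a parallel map over days, while an incremental update touches only the newest day. A minimal Python sketch of that property (illustrative only; Twitter ran these as Hadoop jobs, and `process_day` here is a hypothetical stand-in for the per-day job):

```python
from concurrent.futures import ThreadPoolExecutor

def process_day(day):
    # Stand-in for the per-day aggregation job; each day is fully
    # independent, so jobs can run in any order, or all at once.
    return (day, f"index-segment-for-{day}")

days = ["2006-03-21", "2014-11-18", "2014-11-19"]

# Incremental update: process only the newest day.
latest = process_day(days[-1])

# Full rebuild: because days are independent, the whole corpus can be
# reprocessed as one massively parallel map (Hadoop, in Twitter's case).
with ThreadPoolExecutor() as pool:
    rebuilt = list(pool.map(process_day, days))
print(len(rebuilt))  # 3
```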
Inverted Index Building: Segment Partitioner and Segment Indexer
The daily data aggregation and preprocessing job outputs one record per Tweet. Since this output is already tokenized but not yet inverted, Twitter’s next step was to build single-threaded, stateless inverted index builders running on Mesos.
The inverted index builder consists of the following components:
- Segment partitioner
- Segment indexer
The inverted index builders are deliberately simple. Because they are small and stateless, they can be massively parallelized on Mesos, coordinating through locks on ZooKeeper so that no two builders build the same segment. The actual bottleneck for the inverted index builders turned out to be the Hadoop namenode.
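The lock-based coordination can be sketched as follows (a simplified, single-process Python sketch; the in-memory dictionary is a hypothetical stand-in for ZooKeeper, and the builder names are illustrative):

```python
import threading

# In-memory stand-in for ZooKeeper: builders race to claim a segment
# lock; only the winner goes on to build that segment.
_locks = {}
_lock_guard = threading.Lock()

def try_acquire(segment_id, builder_id):
    """Atomically claim a segment lock; return False if already held."""
    with _lock_guard:
        if segment_id in _locks:
            return False          # another builder already owns it
        _locks[segment_id] = builder_id
        return True

def build_segment(segment_id, builder_id, built):
    if try_acquire(segment_id, builder_id):
        built.append((segment_id, builder_id))  # invert postings here

built = []
for builder in ("builder-a", "builder-b"):
    # Both builders attempt the same segment; only one succeeds.
    build_segment("segment-42", builder, built)
print(len(built))  # 1
```

Because the builders themselves hold no state, any builder can pick up any unclaimed segment, which is what makes the fleet easy to scale out.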
The inverted index builders produced many inverted index segments, which were distributed onto machines known as Earlybirds. Since each Earlybird could hold only a small portion of the complete Twitter corpus, sharding had to be introduced. Previously, segments had been distributed across hosts via a hash function, which worked well for the real-time index.
Twitter has now “created a two-dimensional sharding scheme to distribute index segments onto serving Earlybirds. With this two-dimensional sharding, we can expand our cluster without modifying existing hosts in the cluster:
- Temporal sharding: The Tweet corpus was first divided into multiple time tiers.
- Hash partitioning: Within each time tier, data was divided into partitions based on a hash function.
- Earlybird: Within each hash partition, data was further divided into chunks called Segments. Segments were grouped together based on how many could fit on each Earlybird machine.
- Replicas: Each Earlybird machine is replicated to increase serving capacity and resilience.”
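The routing implied by the two dimensions can be sketched in Python (an illustrative simplification: the tier boundaries are hypothetical, and a plain modulo stands in for Twitter's actual hash function):

```python
def route(tweet_id, timestamp, tier_boundaries, partitions_per_tier):
    """Map a Tweet to a (time tier, hash partition) pair under the
    two-dimensional scheme described above (simplified sketch)."""
    # Temporal sharding: find the tier whose time range covers the Tweet.
    tier = sum(1 for b in tier_boundaries if timestamp >= b) - 1
    # Hash partitioning: spread Tweets within the tier by id
    # (modulo as a stand-in for the real hash function).
    partition = tweet_id % partitions_per_tier
    return tier, partition

# Hypothetical tiers starting at timestamps 0, 1000, and 2000.
boundaries = [0, 1000, 2000]
tier, part = route(tweet_id=123, timestamp=1500,
                   tier_boundaries=boundaries, partitions_per_tier=4)
print(tier, part)  # 1 3
```

The key property is that adding a new time tier for fresh data never changes where existing Tweets live, so the cluster can grow without reshuffling old hosts.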
The Twitter blog illustrates this sharding scheme with a diagram.
Time tiers were added to grow data capacity, while existing time tiers remained unchanged. To grow serving capacity, more replicas were added. This avoided re-hashing existing partitions, and cluster size was reduced by:
- Packing more segments onto each Earlybird (reducing hash partition count).
- Increasing the amount of QPS each Earlybird could serve (reducing replicas).
Packing more segments onto each Earlybird required a different storage medium. For this reason, SSDs were used rather than RAM, though this reduced QPS capacity considerably. Multiple optimizations were then applied to increase serving capacity.
To prevent API clients from having to scatter-gather across hash partitions and time tiers to serve a single query, roots were introduced to abstract away the internal details of partitioning and tiering in the full index.
As per the Twitter blog, “The roots perform a two level scatter-gather as shown in the below diagram, merging search results and term statistics histograms. This results in a simple API, and it appears to our clients that they are hitting a single endpoint. In addition, this two level merging setup allows us to perform additional optimizations, such as avoiding forwarding requests to time tiers not relevant to the search query.”
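The two-level scatter-gather can be sketched in Python (an illustrative simplification with hypothetical static data standing in for real index lookups; scores and tier boundaries are made up):

```python
def search_partition(partition, query):
    # Leaf level: each Earlybird partition returns its own scored hits.
    return partition["hits"].get(query, [])

def search_tier(tier, query):
    # First merge level: scatter to every hash partition in the tier,
    # then gather and merge the partial results.
    hits = []
    for partition in tier["partitions"]:
        hits.extend(search_partition(partition, query))
    return hits

def search_root(tiers, query, since=None):
    # Second merge level: scatter only to relevant time tiers
    # (skipping tiers outside the query's time range), then merge.
    results = []
    for tier in tiers:
        if since is not None and tier["end"] < since:
            continue  # optimization: skip irrelevant tiers
        results.extend(search_tier(tier, query))
    return sorted(results, reverse=True)  # merge by score

tiers = [
    {"end": 2010, "partitions": [{"hits": {"cat": [0.2]}}]},
    {"end": 2014, "partitions": [{"hits": {"cat": [0.9, 0.5]}}]},
]
print(search_root(tiers, "cat", since=2011))  # [0.9, 0.5]
```

From the client's point of view only `search_root` exists, which is the “single endpoint” effect the blog describes; the tier-skipping check is the kind of optimization the two-level merge makes possible.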
Looking to the Future
Complete results from the full index will now appear in the All tab of search results on the Twitter web client and on Twitter for iOS and Android. More Tweets from the index will also surface in the Top tab of search results and in new product experiences powered by this index.
The full index is a major infrastructure investment that aims to improve Twitter’s ongoing search and discovery experience. Optimization through smarter caching is the next step for the social media major, and the company is inviting engineers to join the flock and help with that next change.