Research and teaching in information system security
Twitter released this statement about their practice of shadow banning, in which they essentially claim that shadow banning on the Twitter Online Social Network (OSN) is a bug.
The motivation behind this work is to demonstrate that the shadow banning practice on this OSN is unlikely to be a bug.
We decided to go for a statistical demonstration. To do so, we had to massively crawl Twitter accounts and execute shadow banning oracles from the shadowban.eu project.
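To make the notion of an oracle concrete, here is a minimal sketch of what querying such a test could look like from our crawler; the endpoint URL and the shape of the returned report are illustrative assumptions, not the actual shadowban.eu API.

import aiohttp

# Hypothetical oracle endpoint -- illustrative only, not the real
# shadowban.eu API.
ORACLE_URL = "https://example.org/api/test/{screen_name}"

async def shadow_banning_test(screen_name):
    """Ask a shadow banning oracle whether screen_name is affected."""
    async with aiohttp.ClientSession() as session:
        async with session.get(
                ORACLE_URL.format(screen_name=screen_name)) as resp:
            report = await resp.json()
    # The report is assumed to contain one boolean per tested ban type
    # (search ban, search suggestion ban, reply deboosting, ...).
    return report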
Our data model is a collection of ego graphs, and a collection is associated with a population. To build a data model, we go through a collection of landmarks, either randomly selected or drawn from a specific population. We then execute a depth-limited Breadth First Search (BFS) starting from every selected landmark to compute an ego graph. The next listing gives an abstract representation of the BFS we have developed.
BFS(landmark):
    q = Queue()
    l = AdjList()
    Enqueue(q, [landmark])
    Loop until Empty(q) or Depth(Head(q)) > 3:
        account = Dequeue(q)
        r = ShadowBanningTest(account)
        h = FetchHistory(account, [2019; now])
        siblings = FetchSiblings(h, 33)
        AdjListAdd(l, account, siblings, r)
        Enqueue(q, siblings)
    return l
The output of this BFS is a graph, the ego graph of a specific landmark, in which the vertices represent accounts and the edges represent interactions such as likes, replies and reposts. To understand our data model, you can check our website whosban.eu.org, where you will find a graphical representation of the ego graph built from a landmark provided by the user.
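For illustration, a single vertex of an ego graph can be pictured as a small record; the user_id, screen_name and neighbours fields appear in the crawler code below, while the shadow_banned field and the exact layout are assumptions made for this sketch.

# One vertex of an ego graph: the tested account, its shadow banning
# test result and its neighbours (accounts it interacted with).
vertex = {
    "user_id": "12345",
    "screen_name": "some_account",
    "shadow_banned": False,   # result of the oracle (assumed field name)
    "neighbours": {           # user_id -> screen_name
        "67890": "a_sibling",
        "13579": "another_sibling",
    },
}
# An ego graph is then a list of such vertices, i.e. an adjacency list.
ego_graph = [vertex]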
The BFS is executed for all the landmarks of a given population. We define the model of a population as the union of all the ego graphs generated from its landmarks.
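Assuming ego graphs are stored as adjacency lists of vertices like the one sketched above, the union defining a population model could be computed as follows; population_model is a hypothetical helper written for illustration, not part of the crawler.

def population_model(ego_graphs):
    """Union of ego graphs: one vertex per account, merged neighbours."""
    model = {}
    for ego_graph in ego_graphs:
        for vertex in ego_graph:
            known = model.setdefault(
                vertex["user_id"],
                {**vertex, "neighbours": dict(vertex["neighbours"])})
            # Same account reached from several landmarks:
            # merge its neighbour sets.
            known["neighbours"].update(vertex["neighbours"])
    return list(model.values())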
Let us note that the ego graphs of the different populations contain several million account vertices, for which we have often downloaded the entire contents. This is why performance is key.
Here are the essential parts of the crawling code, starting with the depth-limited BFS implementation.
"""
Build a graph from landmark screen_name using breadth
first search algorithm.
It spots when max_depth is reached.
Year is the oldest year of relevant tweets.
"""
async def tree_screen_name_root(self, screen_name):
# Adjacency list
adj_list = []
# Get profile id from screen_name
profile = await self._session.profile_raw(screen_name)
if twitter_session.is_error(profile):
self._l.note(
"Could not get root profile for %s, skipping" % \
(screen_name))
return adj_list
user_id = str(profile["id"])
# Breadth-first search
marks = []
queue = []
# Enqueue first job
await self.job_enqueue(queue, user_id, screen_name)
# Depth tracking
current_depth = 0
current_depth_vertices_index = 0
current_depth_vertices_count = 1
next_depth_vertices_count = 0
# Means not enough siblings regarding current landmark
while len(queue) > 0:
# Dequeue head
current_user = await self.job_dequeue(queue,
current_depth_vertices_index,
current_depth_vertices_count)
current_user_id = current_user["user_id"]
current_screen_name = current_user["screen_name"]
# Mark the vertex, not visited twice
marks.append(current_user_id)
# Execute the treatment : shadow banning test,
# history fetch and siblings computation
vertex = await self.job_vertex(current_user_id,
current_screen_name)
# Add the vertex to the adjacency list
adj_list.append(vertex)
# Mangagement of depth
# Add the child nodes to the next depth number of
# nodes and enqueue them iff current depth is not the
# max depth -> no queue management for nothing
if current_depth + 1 < self._max_depth:
for i, s in vertex["neighbours"].items():
if i not in marks:
if not in_queue(queue, i):
self._l.debug("Queuing @%s" % s)
# Enqueue this job !
await self.job_enqueue(queue, i, s)
next_depth_vertices_count = \
next_depth_vertices_count + 1
else:
self._l.debug("@%s already in queue" % s)
else:
self._l.debug("@%s already marked" % s)
# Check current depth
current_depth_vertices_index = \
current_depth_vertices_index + 1
if current_depth_vertices_index == \
current_depth_vertices_count:
current_depth = current_depth + 1
current_depth_vertices_count = \
next_depth_vertices_count
next_depth_vertices_count = 0
current_depth_vertices_index = 0
if current_depth == self._max_depth:
self._l.debug("Max depth of %s reached" % \
(self._max_depth))
# Save adjacency list to json format
return adj_list
# Breadth first search ended (unlikely :))
return adj_list
This implementation elides the concurrency details. We use a master / slave RPC architecture, in which the master sends jobs to be executed by slaves through an execution queue. The production release executes the job_vertex method on a slave as soon as it is enqueued by the master, which runs the tree_screen_name_root method. In order to preserve the consistency of the search, the BFS synchronises the search depth-wise. Concurrency is implemented using the arq library.
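As an illustration of this architecture, here is a minimal sketch of how such a job could be declared and enqueued with arq; the job body and the Redis settings are assumptions, only the create_pool, enqueue_job and WorkerSettings parts follow the library's documented API.

from arq import create_pool
from arq.connections import RedisSettings

# Slave side: the function run by arq workers for each dequeued job.
# Its body (shadow banning test, history fetch, siblings computation)
# is only sketched here.
async def job_vertex(ctx, user_id, screen_name):
    # ... run the shadow banning oracle, fetch the history,
    # compute the siblings and return the resulting vertex ...
    return {"user_id": user_id, "screen_name": screen_name,
            "neighbours": {}}

class WorkerSettings:
    functions = [job_vertex]
    redis_settings = RedisSettings()

# Master side: enqueue a job for a freshly discovered account.
async def enqueue_vertex_job(user_id, screen_name):
    redis = await create_pool(RedisSettings())
    return await redis.enqueue_job("job_vertex", user_id, screen_name)

In the actual crawler, the master waits for all the jobs of the current depth to complete before enqueuing the next depth, which is how the search is synchronised depth-wise.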