Research and teaching in information system security
Twitter released this statement about their practice of shadow banning, in which they essentially claim that shadow banning on the Twitter Online Social Network (OSN) is a bug.
The motivation behind this work is to demonstrate that the shadow banning practice on this OSN is unlikely to be a bug.
We decided to go for a statistical demonstration. To do so, we had to massively crawl Twitter accounts and execute shadow banning oracles from the shadowban.eu project.
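To make the notion of an oracle concrete, here is a minimal sketch of what querying such a test could look like from our crawler; the endpoint URL and the shape of the returned report are illustrative assumptions, not the actual shadowban.eu API.

import aiohttp

# Hypothetical oracle endpoint -- illustrative only, not the real
# shadowban.eu API.
ORACLE_URL = "https://example.org/api/test/{screen_name}"

async def shadow_banning_test(screen_name):
    """Ask a shadow banning oracle whether screen_name is affected."""
    async with aiohttp.ClientSession() as session:
        async with session.get(
                ORACLE_URL.format(screen_name=screen_name)) as resp:
            report = await resp.json()
    # The report is assumed to contain one boolean per tested ban type
    # (search ban, search suggestion ban, reply deboosting, ...).
    return report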
Our data model is a collection of ego graphs, and a collection is associated with a population. To build a data model, we go through a collection of landmarks, either randomly selected or drawn from a specific population. We then execute a depth-limited Breadth First Search (BFS) starting from every selected landmark to compute an ego graph. The next listing gives an abstract representation of the BFS we have developed.
BFS(landmark):
    q = Queue()
    l = AdjList()
    Enqueue(q, [landmark])
    Loop until Empty(q) or Depth(Head(q)) > 3:
        account = Dequeue(q)
        r = ShadowBanningTest(account)
        h = FetchHistory(account, [2019; now])
        siblings = FetchSiblings(h, 33)
        AdjListAdd(l, account, siblings, r)
        Enqueue(q, siblings)
    return l
The output of this BFS is a graph, the ego graph of a specific landmark, in which the vertices represent accounts and the edges represent interactions such as likes, replies and reposts. To understand our data model, you can check our website whosban.eu.org, where you will find a graphical representation of the ego graph built from a landmark provided by the user.
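For illustration, a single vertex of an ego graph can be pictured as a small record; the user_id, screen_name and neighbours fields appear in the crawler code below, while the shadow_banned field and the exact layout are assumptions made for this sketch.

# One vertex of an ego graph: the tested account, its shadow banning
# test result and its neighbours (accounts it interacted with).
vertex = {
    "user_id": "12345",
    "screen_name": "some_account",
    "shadow_banned": False,   # result of the oracle (assumed field name)
    "neighbours": {           # user_id -> screen_name
        "67890": "a_sibling",
        "13579": "another_sibling",
    },
}
# An ego graph is then a list of such vertices, i.e. an adjacency list.
ego_graph = [vertex]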
The BFS is executed for all the landmarks of a given population. We define the model of a population as the union of all the ego graphs generated from its landmarks.
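Assuming ego graphs are stored as adjacency lists of vertices like the one sketched above, the union defining a population model could be computed as follows; population_model is a hypothetical helper written for illustration, not part of the crawler.

def population_model(ego_graphs):
    """Union of ego graphs: one vertex per account, merged neighbours."""
    model = {}
    for ego_graph in ego_graphs:
        for vertex in ego_graph:
            known = model.setdefault(
                vertex["user_id"],
                {**vertex, "neighbours": dict(vertex["neighbours"])})
            # Same account reached from several landmarks:
            # merge its neighbour sets.
            known["neighbours"].update(vertex["neighbours"])
    return list(model.values())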
Let us note that the ego graphs of the different populations contain several million account vertices, for which we have often downloaded the entire contents. This is why performance is key.
Here are the essential parts of the crawling code, starting with the depth-limited BFS implementation.
"""
Build a graph from landmark screen_name using breadth
first search algorithm.
It spots when max_depth is reached.
Year is the oldest year of relevant tweets.
"""
async def tree_screen_name_root(self, screen_name):
# Adjacency list
adj_list = []
# Get profile id from screen_name
profile = await self._session.profile_raw(screen_name)
if twitter_session.is_error(profile):
self._l.note(
"Could not get root profile for %s, skipping" % \
(screen_name))
return adj_list
user_id = str(profile["id"])
# Breadth-first search
marks = []
queue = []
# Enqueue first job
await self.job_enqueue(queue, user_id, screen_name)
# Depth tracking
current_depth = 0
current_depth_vertices_index = 0
current_depth_vertices_count = 1
next_depth_vertices_count = 0
# Means not enough siblings regarding current landmark
while len(queue) > 0:
# Dequeue head
current_user = await self.job_dequeue(queue,
current_depth_vertices_index,
current_depth_vertices_count)
current_user_id = current_user["user_id"]
current_screen_name = current_user["screen_name"]
# Mark the vertex, not visited twice
marks.append(current_user_id)
# Execute the treatment : shadow banning test,
# history fetch and siblings computation
vertex = await self.job_vertex(current_user_id,
current_screen_name)
# Add the vertex to the adjacency list
adj_list.append(vertex)
# Mangagement of depth
# Add the child nodes to the next depth number of
# nodes and enqueue them iff current depth is not the
# max depth -> no queue management for nothing
if current_depth + 1 < self._max_depth:
for i, s in vertex["neighbours"].items():
if i not in marks:
if not in_queue(queue, i):
self._l.debug("Queuing @%s" % s)
# Enqueue this job !
await self.job_enqueue(queue, i, s)
next_depth_vertices_count = \
next_depth_vertices_count + 1
else:
self._l.debug("@%s already in queue" % s)
else:
self._l.debug("@%s already marked" % s)
# Check current depth
current_depth_vertices_index = \
current_depth_vertices_index + 1
if current_depth_vertices_index == \
current_depth_vertices_count:
current_depth = current_depth + 1
current_depth_vertices_count = \
next_depth_vertices_count
next_depth_vertices_count = 0
current_depth_vertices_index = 0
if current_depth == self._max_depth:
self._l.debug("Max depth of %s reached" % \
(self._max_depth))
# Save adjacency list to json format
return adj_list
# Breadth first search ended (unlikely :))
return adj_list
This implementation elides the concurrency details. We use a master / slave RPC architecture, in which the master sends jobs to be executed by slaves through an execution queue. The production release executes the job_vertex method on a slave as soon as it is enqueued by the master, which runs the tree_screen_name_root method. In order to preserve the consistency of the search, the BFS synchronises the search depth-wise. Concurrency is implemented using the arq library.
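As an illustration of this architecture, here is a minimal sketch of how such a job could be declared and enqueued with arq; the job body and the Redis settings are assumptions, only the create_pool, enqueue_job and WorkerSettings parts follow the library's documented API.

from arq import create_pool
from arq.connections import RedisSettings

# Slave side: the function run by arq workers for each dequeued job.
# Its body (shadow banning test, history fetch, siblings computation)
# is only sketched here.
async def job_vertex(ctx, user_id, screen_name):
    # ... run the shadow banning oracle, fetch the history,
    # compute the siblings and return the resulting vertex ...
    return {"user_id": user_id, "screen_name": screen_name,
            "neighbours": {}}

class WorkerSettings:
    functions = [job_vertex]
    redis_settings = RedisSettings()

# Master side: enqueue a job for a freshly discovered account.
async def enqueue_vertex_job(user_id, screen_name):
    redis = await create_pool(RedisSettings())
    return await redis.enqueue_job("job_vertex", user_id, screen_name)

In the actual crawler, the master waits for all the jobs of the current depth to complete before enqueuing the next depth, which is how the search is synchronised depth-wise.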