This appeared as “Twitter Needs a Terms of Sharing” on Medium.
Recently, some colleagues (a wonderful team: @MCForelle, @saiphchen, and @andresmh) and I came together to study the impact of bots on political life in Venezuela. There have been multiple stories over the years about the government there using automated scripts to manipulate public conversation on Twitter. But after a query from an AP correspondent, I found I couldn’t point to any good studies of the impact of such bots in Venezuela. So we did one, but it was hard.
Looking around, I found surprisingly few systematic studies of political botnets. There are plenty of reports (several on Syria from one of my collaborators, but also Russia from Global Voices), but few systematic studies of the impact of botnets on public conversation in a particular country. We know that several governments, including the U.S., have mass surveillance programs that include the production of social media identities for the purpose of promoting political information. And we know that some of these programs do such promotion with automated scripts designed to manipulate public opinion. But we don’t know how much of this public conversation is affected, or in what ways. Do such efforts actually have an impact on public opinion? How would we know if they did?
Twitter Doesn’t Make it Easy
Tracking bots — especially political bots — requires careful understanding of how the design features of platforms may constrain the sampling strategy. For starters, it is impossible to report the total number of bots engaged in Venezuelan politics. Twitter itself only allows researchers to get information from the 100 most recent retweets of a tweet. Rather than trying to determine whether a specific tweet was generated by a bot, we looked at the platform used to create or retweet a message to estimate the probability that it was bot-generated. We couldn’t query or examine every account, so we assumed that accounts using platforms like Botize or Masterfollow are bots, because automation is what those platforms are designed to support, and the accounts that use them all retweet the same content at the same time. Botize advertises itself as a way to create your own bot tasks once you have set up a Twitter account. Indeed, many of the accounts that we identified as likely bots were suspended by Twitter shortly after we caught them — the company also considered them to be bots.
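The platform heuristic comes down to checking the client that posted each retweet. Here is a minimal sketch, assuming each retweet record carries a "source" field naming that client, as the Twitter API returns; the record shape and platform list are illustrative, not the study's actual code.

```python
# Platforms assumed (for illustration) to be automation services.
BOT_PLATFORMS = {"Botize", "MasterFollow"}

def is_likely_bot(retweet):
    # Flag a retweet as probably bot-generated if it was posted
    # through a known automation service.
    return retweet.get("source") in BOT_PLATFORMS

retweets = [
    {"user": "alice", "source": "Twitter for iPhone"},
    {"user": "bot123", "source": "Botize"},
]
flagged = [rt for rt in retweets if is_likely_bot(rt)]
```

The heuristic is deliberately coarse: it classifies by posting client rather than by account behavior, which is what made it workable under the API's query limits.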
To track political bots in Venezuela, we captured and analyzed all of the tweets that five key politicians generated between January 1st and May 31st, 2015. We collected a total of 11,796 tweets. We then collected the retweet information for a subset of these tweets — who was retweeting, and which platforms they used to retweet. This process generated 205,077 retweets. Some two percent of all of these retweets were bot-generated. For some politicians, as much as five percent of their retweet traffic was coming from bots.
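The per-politician percentages above come down to a simple aggregation. A hedged sketch, assuming each retweet record carries the original tweet's author and a precomputed bot flag (the field names are illustrative):

```python
from collections import Counter

def bot_share(retweets):
    # Fraction of each politician's retweet traffic flagged as bot-generated.
    totals, bots = Counter(), Counter()
    for rt in retweets:
        totals[rt["politician"]] += 1
        if rt["is_bot"]:
            bots[rt["politician"]] += 1
    return {p: bots[p] / totals[p] for p in totals}

# Toy data: 5 bot retweets out of 100 for one politician.
sample = (
    [{"politician": "A", "is_bot": False}] * 95
    + [{"politician": "A", "is_bot": True}] * 5
)
shares = bot_share(sample)  # {"A": 0.05}
```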
After taking an initial sample of tweets from our list of politicians, we gathered retweet information for the top 15 percent of the most retweeted tweets over the course of several days following each initial tweet — 1,721 tweets in all. Twitter also restricts how many queries researchers can make each hour, so we took the most aggressive approach possible in collecting the retweet information of the most noteworthy tweets. And it really took a mixed team of social and computer scientists to do this: @MCForelle is Venezuelan and helped with language and interpretation, @saiphchen did the computational work, @andresmh helped with the Twitter API sampling strategy, and I helped with context and framing.
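The query cap forces collection to proceed in rate-limited batches. A minimal sketch of that pattern, with a stand-in `fetch_retweets` function and illustrative limits (the real per-endpoint caps differ and have changed over time):

```python
import time

WINDOW_SECONDS = 15 * 60   # Twitter's limits reset on fixed windows
CALLS_PER_WINDOW = 75      # illustrative cap; actual limits vary by endpoint

def collect(tweet_ids, fetch_retweets, calls_per_window=CALLS_PER_WINDOW):
    # Query the retweet endpoint for each tweet, pausing whenever the
    # per-window call budget is spent.
    results, calls = {}, 0
    for tid in tweet_ids:
        if calls >= calls_per_window:
            time.sleep(WINDOW_SECONDS)
            calls = 0
        results[tid] = fetch_retweets(tid)
        calls += 1
    return results

# Stubbed usage: replace the lambda with a real API call.
collected = collect(["1", "2"], lambda tid: ["rt-of-" + tid])
```

Because each tweet's recent-retweet window keeps moving, polling the most retweeted tweets repeatedly over several days is what lets a budget like this accumulate a usable sample.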
To summarize our sampling caveats: we started with only the five most popular politicians, worked with the top 15 percent of their most retweeted tweets, and only got the 100 most recent of those retweets. To top it all off, by the time we put our findings on SSRN, many of the bot accounts had been suspended, so it was difficult to check by hand which accounts had been bots.
Small Data Sets: Uncertain Frequency and Sampling Distributions
Twitter is full of rumors, and it is certainly full of rumors about bots. There is good evidence that governments, politicians, and corporate lobbyists use bots. Political actors of many ideological stripes, operating in authoritarian or democratic contexts, seem interested in automating their civic engagement and political attacks.
There have been notable incidents of political bot activity, and analysts can say pointed things about the rhetorical positioning of a particular tweet or study the visual sociology of the tweet as a digital artifact. You can look at a particular account: if it is tweeting rapidly, has an unusual ratio of followers to followed accounts, is tweeting strange content, or is using a known bot service to release tweets, the account is probably a bot. But looking at accounts by hand is tedious and doesn’t help with the question of public impact. Even downloading all the tweets around a particular hashtag yields a constrained data set, and Twitter has no way of letting you know the details of your sampling frame. You might be able to compute a frequency distribution but have no way of knowing the sampling distribution — a relatively “old problem” in internet-based research.
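Those hand-check signals can be combined into a crude account-level score. A toy sketch, with field names and thresholds that are illustrative rather than calibrated:

```python
def bot_score(account):
    # Count how many of the hand-check signals an account trips.
    score = 0
    if account["tweets_per_day"] > 100:                   # tweeting rapidly
        score += 1
    if account["following"] > 10 * account["followers"]:  # skewed ratio
        score += 1
    if account["client"] in {"Botize", "MasterFollow"}:   # known bot service
        score += 1
    return score

suspect = {"tweets_per_day": 500, "followers": 3,
           "following": 2000, "client": "Botize"}
print(bot_score(suspect))  # 3: trips all three signals
```

Even a score like this only labels individual accounts; it says nothing about how much of the overall conversation those accounts shape, which is the harder question.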
Big Data Sets: Limited Sharing
For researchers who can play with big Twitter data sets, the terms of service mean that the nicely groomed data set can’t be shared with other researchers. In 2011, we did one of the first big data analyses of Arab Spring tweets, working through several million tweets in multiple languages. Other researchers wanted to play with the data too, but sharing it would have put us afoul of Twitter’s terms of service agreement. Twitter eventually clarified that specific tweet or user identifiers could be shared. So we were able to offer other researchers the code to reconstruct the data set — a laborious process still limited by Twitter’s API query restrictions.
This means that it is tough to do verification. Most research, if it survives peer review and comes with good methodology details, doesn’t need to be verified. But the constraint on sharing still has an impact on scientific progress because it discourages follow-on research. Significant findings from big data analysis always lead us to exciting new questions. Or other research teams get excited about plumbing the same data set with a slightly different statistical transformation. Or a student gets energized about testing a modified hypothesis. But after all the effort put into cleaning a data set, the original research team risks banishment if they encourage any follow-on research.
By the time a team has gone through the effort of cleaning up a Twitter data set, they usually want to squeeze several research papers out of it. Still, most principal investigators are quite happy to share data, to support their colleagues and involve graduate students. This share-and-share-alike ethic is one of the good norms in science. But Twitter doesn’t make it easy. I have asked colleagues whether anyone has ever been sanctioned by Twitter for sharing data. I came up with a few stories of researchers being asked to stop sharing, but nobody seems to have been fully cut off. Twitter has set up a quandary: if you promote good science by sharing with the research community, you take on the individual risk of being unable to do more good science yourself!
From Terms of Service to Terms of Sharing
Our findings on Venezuela were interesting, as systematically produced as possible, but not explosive. The findings would almost certainly be different during a political crisis. Bots that are too active get quickly caught and disabled by Twitter. Bots that are waiting to be activated may go to work during a political crisis when they may well do the most damage. Right now, we wouldn’t know if there were lots of un-activated bot accounts waiting for a political crisis.
I expect that if anyone else gets interested in Venezuela, they will want to see the data. Political bots will be active in Canada, the UK, and many other countries facing elections (rigged or otherwise), so this will be a problem for the sociology of media, communication, and politics going forward. I think it would be a good habit to start publishing Twitter object identifiers the same way other quantitative researchers publish their data sets and code books when they disseminate research. Maybe one day we’ll be able to share the nicely cleaned data that so many of us work hard to prepare and analyze. For now, I think political scientists are stuck with small samples from which we can’t easily generalize, or large samples from which generalizations can’t be easily verified.