UNM Computer Science graduate student Jeffrey Knockel and professors Jared Saia and Jed Crandall have presented an interesting challenge to the computing world. The challenge takes the form of a paper titled "Three Researchers, Five Conjectures: An Empirical Analysis of TOM Online-Skype Censorship and Surveillance," presented at the FOCI '11 workshop held recently in San Francisco.
In the paper they present the results of research they did last spring. Knockel was looking online at an encrypted file that intrigued him. "It's reverse engineering, so this is basically me playing around in a program called a debugger," he says. "You normally use it to find bugs in programs, but in the case of a program you didn't write, you can also use it to try to figure out what other programmers are doing."
He was able to download a specific key file from the TOM Online-Skype server. TOM Online-Skype is an internet telephony and chat program that is a joint venture between TOM Online, a mobile internet company in China, and Skype Ltd. The file contained the word "fuck," a word he knew was censored in China. It was a trigger that eventually led him to the entire encryption algorithm, which allowed the researchers to decrypt all the words in the file. Working with his advisors, he determined that about 15 percent of the words were of prurient interest. More than 35 percent were political; others were religious or concerned government officials involved in scandal, locations of protests or neighborhood dislocations, or information about spying.
Knockel discusses research
The research led them to a series of conjectures about Internet censorship that they think warrant further investigation by the computing community.
1. Effectiveness Conjecture: "Censorship is effective, despite attempts to evade it."
Knockel: "We also found for political key words, we found some specific phrases that appear to be taken from documents such as hold the microphone to indicate whether – that seems to be a way that Chinese protestors can in some way protest. Show that they are for liberty and at the same time it's such an ambiguous thing, they can't actually be arrested by the police and this seems to be something – you just hold an invisible microphone to your mouth or have a microphone attached to your clothing or something like this."
2. Spread Skew Conjecture: "Censored memes spread differently than uncensored memes."
Knockel: "So the second conjecture is the spread skew conjecture and this is that censored memes spread differently than uncensored memes. The inspiration for this conjecture was two girls one cup on Google trends. This was referring to the name of an obscene video. We looked it up. We looked at the Google trends data, which is essentially the popularity of the search term, in both English and in Chinese. When we did it in English we saw that the tail of the popularity – it initially was very popular, but then it started becoming very searchable. It had a very shallow tail. But when we looked up the Chinese phrase for this, we found out that the tail seemed to be much larger in comparison to like the initial surge of popularity. So we conjecture that this is the result of the phrase being censored."
3. Interactions of Secrecy and Surveillance Conjecture: "Keyword based censorship is more effective when the censored keywords are unknown and the online activity is, or is believed to be, under constant surveillance."
Knockel: "The third conjecture is the Interactions of Secrecy and Surveillance Conjecture: and this is that censorship is more effective when the list of keywords and the actual surveillance message is being sent through triggering. This is censorship in secret. And this is essentially inspired by – it seems like TOM Online-Skype and others that have been analyzed in the past like Green Dam – they needed to go to a lot of trouble to encrypt the actual key words that trigger censorship."
4. Peer to peer vs. Client-Server Conjecture: "The types of keywords censored in peer-to-peer communications are fundamentally different than the types of keywords censored in client-server communications."
Knockel: "The fourth conjecture is the peer to peer vs. client server conjecture which is the types of keywords censored in a peer to peer communications such as different people talking to each other such as the case we have in TOM Online-Skype is different than the types of keywords censored in a client-server type of communications such as because we have a website and a server then a bunch of people accessing the website as clients. So what inspired this conjecture was the amount of proper nouns that we found in the list of keywords that TOM Online-Skype is censoring and it seemed to be a lot compared to lists that we have for GET-request filtering which is a way of censoring the way people browse the web."
5. Neologism Conjecture: "Neologisms are an effective technique in evading keyword based censorship, but censors frequently learn of their existence."
Knockel: "Our final conjecture is the neologism conjecture. And that's that neologisms are commonly used by people to try to evade censorship, but the censors to some extent seem to be able to cope with this. This was inspired by our finding of the word 64 in Chinese, which is by itself a neologism for June 4 and we also found another word which was a homophonic way of saying 64 in Chinese, which was itself a neologism of the neologism. But at the same time, while we found those on the lists we did not find more abstract neologisms. So we had seen for instance 32 + 32 and 8 are used in web forums, but we didn't see those neologisms censored in these lists."
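To make the neologism conjecture concrete, here is a minimal sketch of keyword filtering. It assumes the filter does plain literal substring matching, which the article does not confirm for TOM Online-Skype. The keyword list, the sample messages and the function name `triggers` are all hypothetical and are not taken from the decrypted files.

```python
# Hypothetical keyword list for illustration only; NOT the decrypted list.
# "六四" is "64" written in Chinese, the shorthand for June 4.
KEYWORDS = {"六四"}


def triggers(message, keywords=KEYWORDS):
    """Return, in sorted order, every keyword found as a literal substring."""
    return sorted(k for k in keywords if k in message)


# A neologism that is already on the list is caught.
caught = triggers("纪念六四")

# A more abstract neologism (arithmetic that adds up to 64) is not caught
# until someone adds that exact string to the list.
missed = triggers("remember 32 + 32")
caught_later = triggers("remember 32 + 32", KEYWORDS | {"32 + 32"})
```

Under this assumption, a filter only catches the exact strings its maintainers have already listed. That is consistent with the observation that the censors had listed "64" and a homophone of it but not the arithmetic variants seen in web forums.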
The researchers hope the research community will test these and other conjectures with more data and appropriate computational and analytic techniques. They do not expect every conjecture to be true, but they point out that each one can be tested empirically to determine whether it is true or false.
Their analysis shows that the list of keywords that triggers both censorship and surveillance breaks down into the general categories shown in Figure 1. The political terms refer to the Tiananmen Square protests that occurred on June 4, 1989. Religious terms on the list include "Falun" and "Quan Yin Method," a Buddhist meditation method.
Location refers to the locations of planned events such as protests. News and information sources on the list include Wikipedia and the Canadian Broadcasting Corporation. Government officials include Liu Yandong, the highest-ranking woman in the Communist Party, who is currently caught up in a scandal involving her son-in-law. Information about spying refers to a free phone-tapping download and a product name at sunlips.com that appears to be a remote microphone for spying.
On April 22, 2011, they downloaded key files from different URLs. The files were identical, and each contained 442 words in either English or Chinese. After April 22, 2011, the lists changed and some words were removed.
Figure 2 below shows the changes in categories after the words were removed. The researchers say they do not know the reason for the change, but they mention one possible reason: human rights talks between China and the United States were scheduled for April 27-28, 2011.
The words themselves are available at the end of this article under "Supplemental Materials."
A separate chart shows the distribution of the 158 keywords on the list that triggers only surveillance. Most of the words on this list are specific demolition sites in Beijing or references to the demolition of neighborhoods, where people have been forced from their homes and their houses demolished to make way for new construction.
Crandall and Saia jointly advise Knockel. Both Computer Science professors conduct research and work with research groups in related areas.
Knockel's web page
Crandall's Home Page
Saia's Home Page