A series of media leaks in recent months has put a spotlight on Chinese firms engaged in global social media data collection, one of several prerequisites to realizing the Chinese Communist Party (CCP)’s vision of a data-driven propaganda apparatus (China Brief, May 15). In August and September, Western media outlets reported on such data collection efforts, including documents reportedly hacked from three Chinese firms that suggest those companies conduct social media monitoring and data collection for People’s Republic of China (PRC) security organs, and a reported database that allegedly contained identifying information, social media accounts, and personal history profiles for more than 2 million individuals, including American, Australian, and European politicians, military personnel, academics, and business executives.
This article seeks to contextualize bulk social media data collection in relation to the CCP’s goals for future propaganda work, and to evaluate Chinese firms’ ability to exploit bulk data for actionable insight. To do this, the author investigated the outputs of a particular Chinese company active in this space: TRS Information Technology Company Ltd. (北京拓尔思信息技术股份有限公司, Beijing Tuoer Si Xinxi Jishu Gufen Youxian Gongsi). 
Bulk Data Collection in Context
Systematically accessing the opinions, interests, and behavior of social media users both in China and abroad is critical to the future of the CCP’s “public opinion guidance” (舆论引导, yulun yindao) work. The Party hopes to create an early warning system by monitoring public opinion and sentiment to pre-empt the destabilizing effects of so-called “black swan” or “gray rhino” events (China Brief, February 20, 2019; China Brief, May 15). The CCP’s other stated goals for China’s future propaganda work, which include automated content creation and targeted distribution capabilities, likewise demand broad access to online behavioral data (China Brief, May 15). Plans for the application of such an agile, responsive propaganda apparatus are not limited to China’s borders: this system will also be critical to ensuring that the CCP can “improve [its] ability to engage in international communication so as to tell China’s stories well, make the voice of China heard, and present a true, multi-dimensional, and panoramic view of China to the world” (Xinhua, August 22, 2018).
If reports that Chinese firms are selling bulk social media data (and analysis derived from that data) to the CCP and state security organs are true, the data may be facilitating propaganda campaigns and online influence operations. One company named in recent reporting, OneSight, has allegedly held a contract to amplify the state-operated China News Service on Twitter (ProPublica, March 26). Knowlesys, another company featured in recent reports, has previously held demonstrations focused on using its services to “monitor public opinion for election” (sic) (Freedom House, 2019).
Using bulk social media data and other open-source information for the kind of propaganda work envisioned by the CCP requires breakthroughs in a host of interconnected technologies. Raw data collection is just the first step; moving toward actionable insight requires enriching the data, including through natural language processing methods like sentiment analysis, named-entity extraction, and event extraction. These techniques will be key to building a warning system that alerts the CCP to looming security crises. Since at least the early 2000s, China’s state-funded research enterprise and technology firms have been working towards many of the technologies needed to exploit massive open-source data for propaganda purposes. While these technologies have many legitimate applications in business, they are also critical to enriching bulk data to make it useful for state surveillance.
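To make the idea of "enriching" raw posts concrete, the following is a minimal sketch, not a description of any company's or lab's actual method. It assumes a toy word-list approach to sentiment and a simple regular expression for hashtags; the lexicons, the `enrich` and `summarize` function names, and the sample posts are all hypothetical. Production systems would rely on trained language models rather than word lists.

```python
import re
from collections import Counter

# Hypothetical toy lexicons (assumption): real systems use trained models.
POSITIVE = {"support", "good", "great"}
NEGATIVE = {"oppose", "bad", "crisis"}

# Matches Twitter-style hashtags such as "#Budget".
HASHTAG = re.compile(r"#(\w+)")


def enrich(post: str) -> dict:
    """Attach a coarse sentiment label and extracted hashtags to one raw post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"text": post, "sentiment": label, "hashtags": HASHTAG.findall(post)}


def summarize(posts):
    """Aggregate enriched posts into sentiment counts and hashtag counts."""
    records = [enrich(p) for p in posts]
    sentiment = Counter(r["sentiment"] for r in records)
    tags = Counter(tag for r in records for tag in r["hashtags"])
    return sentiment, tags


if __name__ == "__main__":
    sample = [
        "I support the new plan #Budget",
        "This is a bad idea #Budget",
        "Meeting at noon",
    ]
    sentiment_counts, tag_counts = summarize(sample)
    print(sentiment_counts)
    print(tag_counts.most_common(3))
```

Even this toy version shows why collection alone is insufficient: the aggregate counts only become meaningful once each post has been labeled and its entities or topics extracted.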
The State Key Lab of Intelligent Technology and Systems at Tsinghua University (清华大学智能技术与系统国家重点实验室, Qinghua Daxue Zhineng Jishu yu Xitong Guojia Zhongdian Shiyan Shi) and the Key State Lab of Pattern Recognition within the Chinese Academy of Sciences’ Institute of Automation (中国科学院自动化研究所模式识别国家重点实验室, Zhongguo Kexueyuan Zidonghua Yanjiusuo Moshi Shibie Guojia Zhongdian Shiyan Shi) are two such government-funded programs that have produced research applicable to these ends. A non-exhaustive list of relevant topics researched by these institutions follows in the table below:
Table: Topics of selected papers by researchers at the State Key Lab of Intelligent Technology and Systems and the Key State Lab of Pattern Recognition
|Year|Research Topic|Source|
|---|---|---|
|2005|Automated Image Annotation|(ACL Anthology)|
|2007|Speech-based Emotion Recognition|(ACL Anthology)|
|2008|Song-based Sentiment Classification|(ACL Anthology)|
|2008|Keyphrase Extraction|(ACL Anthology)|
|2010|Word Relation-based Sentiment Classification|(ACL Anthology)|
|2012|Opinion Target Extraction|(ACL Anthology)|
|2012|Machine Translation|(ACL Anthology)|
|2012|Social Media Misinformation Identification|(ACL Anthology)|
|2014|New Sentiment Word Identification|(ACL Anthology)|
|2018|Multi-lingual Relationship Extraction|(ACL Anthology)|
|2019|Automated Text Generation|(ACL Anthology)|
|2019|Automated Response Generation|(ACL Anthology)|
|2019|Named-entity Recognition|(ACL Anthology)|
Source: Compiled by author.
As shown in the following case study on TRS Information Technology, advances in core data management and natural language processing capabilities are not limited to China’s state labs; some Chinese companies are working hard to ensure their data collection is useful to the CCP, civil government, security services, and the military.
TRS Information Technology Company’s NetInsight
TRS Information Technology Co. Ltd. is a publicly traded software company whose business strategy revolves around various CCP initiatives, including “media fusion” (媒体融合, meiti ronghe) and “military-civil fusion” (军民融合, junmin ronghe) strategies (Sina Finance, March 30, 2016, March 30, 2017, March 30, 2018, March 30, 2019, April 23, 2020). Platforms developed by TRS and its subsidiaries like Keyun Big Data (科韵大数据, Keyun Da Shuju) serve a wide array of purposes that include: conducting “full-web” (全网, quanwang) data collection and monitoring; improving the Party-state’s external communication (对外宣传, duiwai xuanchuan); and performing sentiment analysis on large datasets (TRS.com, undated; TRS-DSJ.com, undated; Sina Finance, March 30, 2019, March 30, 2018). The company claims that more than 8,000 organizations use its products, including “80 percent of national ministries and commissions (国家部委, guojia bu wei) and 60 percent of provincial government organs, more than 300 new and traditional media outlets… and public security [organizations], military units, and other users involved in security” (Sina Finance, March 30, 2018; TRS.com, undated). One client appears to be the PRC Ministry of State Security (TRS.com, undated).
Among TRS Information Technology’s core platforms for monitoring online public sentiment is TRS NetInsight (TRS网察舆情大数据分析平台, TRS Wang Cha Yuqing Da Shuju Fenxi Pingtai) (TRS.com, undated). Three internet data centers continually collect data from “traditional media outlets, Weibo, WeChat, and other new media sources,” including mobile applications, to provide real-time early warning alerts (TRS.com, undated; TRS.com, November 7, 2019; NLPIR, November 2, 2019). Presentation materials suggest that data is also collected from e-commerce platforms, international news outlets, Twitter, and Facebook (NLPIR, November 2, 2019). One public report from NetInsight’s website at the height of China’s fight against COVID-19 claimed that NetInsight collected and analyzed more than 9 million articles and online comments related to the virus within a 24-hour period (NetInsight, February 18).
NetInsight’s public reports often distinguish news outlet activity from netizen activity, enabling comparisons between the two spheres of communication. For each category, the reports may provide an overview of hot topics or articles by locality; a list of viral headlines and summaries of their content; word clouds to highlight heavily discussed topics; and trending hashtags on social media, both in general and in response to specific events (NetInsight, May 26; February 17; June 17). Some in-depth reports also quantify the percentage of online posts that agree with certain sentiments (NetInsight, November 4, 2019). It is unclear to what extent the issue summaries and analyses presented in NetInsight’s public reports are automatically generated. The level of detail achieved by TRS Information Technology, and its mix of analytical outputs, appears similar to public products produced by Knowlesys (Knowlesys, November 14, 2014; Sina Blog, December 24, 2018).
Examples of the platform’s insights include:
- On May 28, during the 2020 “Two Sessions,” Chinese netizens focused most heavily on a proposal to increase protection for companion animals and ban consumption of cats and dogs. Articles on these topics attracted 210 million views and generated 33,000 posts (NetInsight, May 28).
- Chinese netizens support the Hong Kong National Security Law, as assessed through trending hashtags like “#Hong Kong National Security Law Takes Effect#” (#香港国安法正式生效#) and “#More Than 70 Countries in the United Nations Express Support for Hong Kong National Security Law#” (#70余国在联合国发言支持香港国安立法#) (NetInsight, July 6).
- American netizens criticize the “double standard” (双标, shuangbiao) seen in U.S. politicians’ reaction to protests in Hong Kong and the Black Lives Matter demonstrations (NetInsight, June 17).
A review of available public products suggests that TRS Information Technology’s NetInsight platform is capable of filtering incoming bulk data to answer granular questions like “What is attracting the most attention in different parts of the country?” or “How are netizens engaging or not engaging with certain news reporting?” However, on any given issue, NetInsight’s public products heavily favor Chinese media sources. Even in a special report on Black Lives Matter protests in the United States, 62.16 percent of more than 23.5 million messages and articles analyzed by NetInsight were pulled from Weibo or WeChat. In contrast, just 3.43 percent of the messages analyzed came from non-Chinese media sources. Indeed, only 0.03 percent and 0.02 percent of the data came from Twitter and Facebook, respectively (NetInsight, June 17). This kind of imbalance may suggest that NetInsight’s ability to collect and process information from non-Chinese sources is weak, with implications for whether the platform could effectively support the CCP’s international propaganda work.
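The scale of this imbalance is easier to see in absolute terms. The short calculation below is a sketch that converts the reported shares into approximate message counts. It assumes a total of exactly 23.5 million items, which is a lower bound because the report cites "more than" that figure, so the results are rough estimates only. The variable and function names are illustrative.

```python
# Approximate message counts implied by the shares cited in the report.
# TOTAL is an assumed lower bound: the source says "more than 23.5 million".
TOTAL = 23_500_000

shares_percent = {
    "Weibo or WeChat": 62.16,
    "non-Chinese media": 3.43,
    "Twitter": 0.03,
    "Facebook": 0.02,
}


def implied_count(total: int, percent: float) -> int:
    """Convert a percentage share of the total into a rounded item count."""
    return round(total * percent / 100)


counts = {source: implied_count(TOTAL, pct) for source, pct in shares_percent.items()}

for source, n in counts.items():
    print(f"{source}: ~{n:,}")
```

Under that assumption, roughly 14.6 million items came from Weibo or WeChat, about 806,000 from non-Chinese media, and only about 7,000 and 4,700 from Twitter and Facebook, respectively.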
Weibo and WeChat appear to be NetInsight’s primary data sources across TRS Information Technology’s public analysis. Furthermore, these products are largely oriented toward understanding events in China. However, the formulaic approach to NetInsight’s analysis (word clouds, hashtags, and issue summaries that distinguish between location and type of media) could easily be replicated using foreign sources. If TRS Information Technology decides to invest in foreign language processing and in improving collection from non-Chinese media, NetInsight could facilitate propaganda campaigns that seek to amplify societal rifts and divisive narratives, crowd out dissenting opinions, and reshape unfavorable views. Notably, these tactics have all been seen in China’s international propaganda efforts on issues such as Taiwan, Hong Kong, and COVID-19 (China Brief, September 16, 2019; ProPublica, March 26; Belfer Center, July 2020). In theory, NetInsight could also enable the CCP to evaluate whether its current propaganda campaigns are having the desired effect, and to adjust existing campaigns or improve future ones.
Achieving TRS Information Technology’s current capability has been a years-long pursuit. Founded in 1993, the company developed alongside the rest of China’s public sentiment analysis industry across three discrete eras, according to the company’s own histories (TRS.com, undated). Between 2006 and 2010, technology was focused on “monitoring” (监测, jiance) news, fora, and blogs on the “traditional internet” (传统互联网, chuantong hulianwang) (NLPIR, November 2, 2019). Between 2011 and 2016, technologies began “analyzing” (分析, fenxi) mobile and social media information through applications like Weibo and Weixin during the “big data era” (大数据时代, dashuju shidai) (NLPIR, November 2, 2019). Since 2017, TRS Information Technology has seen itself and the industry moving toward “situational awareness” (态势感知, taishi ganzhi) across multi-media formats, and operation at a more personal level in the “era of intelligentization.” In other words, TRS Information Technology is working toward achieving the capability to answer highly granular questions about an individual’s behavior across platforms (NLPIR, November 2, 2019; Sina Finance, April 23).
The CCP has been attempting to maximize social control through online public opinion and sentiment monitoring since at least 2004 (PRC State Council Information Office, May 20, 2013). Safeguarding the Party’s rule requires massive amounts of data from as many sources as possible, both within China and abroad. However, collecting this data is the easy part. The real difficulty is generating actionable insight through advanced data management, classification, and natural language processing techniques, which are technologies that numerous Chinese firms are striving to develop for legitimate purposes. Although TRS Information Technology’s current public products are not very detailed, if the company’s current development efforts are successful, it will aid the Party in eventually realizing its vision of next-generation propaganda. A combination of state-sponsored research and corporate contracting is facilitating the CCP’s realization of data-empowered “thought management.”
Devin Thorne is a former Senior Analyst at the Center for Advanced Defense Studies (C4ADS) in Washington, DC. All views expressed are his own, and do not reflect those of any current or former employer. Follow his research on Twitter @D_Thorne.
 There are many legitimate business reasons for collecting open source data in bulk, including for advertising, and such activity is not unique to Chinese firms. Indeed, some of the companies discussed in recent reports allegedly buy portions of their data from North American data providers. The key questions for investigators are to what extent, and by which companies, such data is being provided to Chinese government and security organizations.
 The reporting on hacked documents related to social media monitoring and data collection for Chinese security organs was published by Vice in August (Vice, August 21). The reported database containing profiles on foreign officials, academics, and other persons was reported on by the Washington Post and Australian Broadcasting Corporation News in mid-September (Washington Post, September 14; ABC, September 14). Exposed companies include Knowlesys (深圳市乐思软件技术有限公司), Yunrun Big Data (云润大数据), OneSight (一网互通(北京)科技有限公司), Zhenhua Data (深圳振华数据信息技术有限公司), and Global Tone Communications (中译语通科技股份有限公司) (Vice, August 21; Washington Post, September 14; ABC, September 14; ASPI, October 14, 2019).
 The “Two Sessions” (两会, Liang Hui) refers to the annual meetings of the National People’s Congress and the Chinese People’s Political Consultative Conference, which usually take place concurrently (or with overlapping dates) in the month of March. This year’s meetings were delayed until May as a result of the COVID-19 pandemic.
 The author acknowledges that public demonstration products may not reveal the full extent of any company’s capability. However, the public products assessed here provide one of the only windows into this market’s level of advancement.
 It is far from certain that TRS Information Technology has a weakness in foreign language processing and in collection from non-Chinese media. Even if it does not, other companies like Knowlesys that produce similar analytical products appear to have the stronger foreign language capability.