COSCUP x RubyConf TW 2021

如果「資訊作戰」研究也可以是資料科學研究
08-01, 15:00–15:30 (Asia/Taipei), RB105 - Main Track
Language: Chinese (漢語)


Translated Title

What if IO research is scientific and data-driven?

Talk Length

60

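Do you acknowledge and agree that, if you present remotely, you must provide a pre-recorded video? (You must agree for the conference to accept your submission.) – yes

hackmd url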

https://hackmd.io/@coscup/rymNETD0O/%2F%40coscup%2Frk8XEpDR_

slido url

https://app.sli.do/event/wbdeoysr

Abstract

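IORG uses publicly verifiable data science methods to study information manipulation and, through it, to expose information operations, so it needs many kinds of data. Facebook and Weibo are major social platforms in Taiwan and China, but obtaining their data is harder than one might expect. While building its scraper system, IORG ran into a range of challenges: obtaining target lists, countering blocking mechanisms, controlling scraping speed, defining data fields, and improving the efficiency of data storage and search. How did we solve these challenges and keep the system running? And how do you open-source a scraper system that integrates hardware and software?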

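For IORG, the open data that the g0v community has accumulated over the years is an extremely valuable research foundation. “Cofacts 真的假的” has suspicious messages from LINE, “鄉民看電視” has TV news data, and “0archive 零時檔案局” has static websites and PTT posts. To add posts from Facebook and Weibo, we had to extend 0archive's open data standard, link the various datasets, implement methods for storage, indexing, and search, and release the data publicly. What does this large, complex public database look like now?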

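How do you find and observe the life cycle and dissemination network of a single rumor within a huge body of data? Beyond copy-pasting and link-sharing, a rumor can also mutate or merge with others as it spreads. How do you tell which messages belong to the same rumor? IORG proposes a mathematical definition of “belonging to the same rumor” and an algorithm for quickly bundling messages into rumors. Once messages are bundled, we can try to map a rumor's dissemination network; we have prepared several cases to share.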

For information on the open licensing of IORG's research outputs, see https://iorg.tw/open

English Abstract

IORG studies information manipulation and identifies information operations using publicly verifiable data science methods, which requires many kinds of data. Facebook and Weibo are two important social platforms in Taiwan and China, and getting data from them is more difficult than we expected. Our scrapers have encountered numerous challenges: acquiring target lists, countering blocking mechanisms, controlling scraping speed, defining the data structure, and improving the efficiency of data storage and search. We would like to share our working solutions to these challenges, the lessons we learned about keeping the system running continuously, and how we open-sourced a hardware-software-integrated scraper system.

Over the years, the g0v community has launched open data projects that provide extremely valuable data for information manipulation researchers. “Cofacts” has suspicious LINE messages, “tvlogger” has TV news data, and “0archive” has pages from static websites and forum articles from PTT. We would like to share how we extended the 0archive open data standard to accommodate more sources and platforms. We will also share how we store, index, search, and publish this massive collection of data.

How do you find and observe the life cycle and dissemination network of a rumor in a vast sea of text messages? Aside from being copy-pasted and shared as links, a rumor can also “fork” itself or “merge” with others. Where can we draw the boundary of a rumor? We would like to share our proposed mathematical definition of messages belonging to a rumor, and an algorithm to group them efficiently. Lastly, we have mapped the dissemination networks of several rumors, and we will share those too.
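To make the grouping problem concrete, here is a minimal sketch in Python. It is not IORG's definition or algorithm, which this abstract does not spell out. It assumes, purely for illustration, that two messages are linked when the Jaccard similarity of their character 3-grams reaches a threshold, and that a rumor is a connected component of the resulting graph, so a chain of gradually altered copies still lands in one group. The function names, the 3-gram size, and the 0.5 threshold are all hypothetical choices.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_messages(messages: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Group message indices into connected components of the similarity graph.

    Assumption for illustration only: an edge links two messages whose
    shingle sets have Jaccard similarity >= threshold.
    """
    parent = list(range(len(messages)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = [shingles(m) for m in messages]
    # Naive all-pairs comparison: quadratic in the number of messages.
    for i, j in combinations(range(len(messages)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    groups: dict[int, list[int]] = {}
    for i in range(len(messages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    sample = [
        "The vaccine contains microchips, share now",
        "the vaccine contains microchips, share now!!",
        "Typhoon holiday announced for tomorrow",
    ]
    print(group_messages(sample))
```

The all-pairs loop is only workable for small collections; at the scale the abstract describes, a candidate-generation step such as locality-sensitive hashing would likely be needed before comparing pairs.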

For more information on IORG and open source, see https://iorg.tw/open.
