Content
Crowdsourcing
Outsourcing some tasks to a crowd -> Crowdsourcing
Improve the quality, timeliness and breadth of data
Key questions:
- What computational problems can/should be solved?
  Data augmentation, data processing
- What are the programming paradigms/platforms?
  A programming paradigm is a classification, style or way of programming. It is an approach to solving problems with programming languages.
- How do we guarantee that the solution is accurate, efficient and economical?
  Quality, cost and latency
- How do we motivate participation and leverage the unique expertise and interests of workers?
- How do we leverage the joint efforts of both automated and human computers as workers?
3 central aspects of crowdsourcing
- What
  - What tasks can be performed by machines
  - Decompose the macro and micro tasks
- Who
  - Expertise of workers (how to model workers' expertise)
  - Manage cultural aspects and language barriers
- How
  - How to design and execute tasks
  - Aggregate noisy and complex output (this determines how intelligent the aggregation techniques need to be, e.g. hierarchical, cluster-based aggregation)
Overall process
Process
- Arranging workers in parallel
  - Operations & Control: several pipelines run at the same time
  - Cost vs latency: high cost, low latency
- Arranging workers sequentially
  - Operations & Control: one worker after another
  - Cost vs latency: high latency, because each worker has to wait for the previous worker's result. However, if three workers are planned and two of them already agree on the result, the remaining HIT does not need to be executed, which saves cost.
- Operations & Control:
  - Repetition
    You repeat the tasks until you are satisfied
  - Selection
    You retrieve tasks using selection mechanisms
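The cost saving in the sequential arrangement can be sketched as follows. This is a minimal illustration, not a real crowdsourcing API: the function name `sequential_hits`, the `needed_agreement` parameter and the lambda "workers" are all hypothetical stand-ins for posting a HIT and waiting for its answer.

```python
from collections import Counter

def sequential_hits(workers, needed_agreement=2):
    """Ask workers one at a time and stop as soon as enough answers agree.

    `workers` is a list of zero-argument callables; each call stands in for
    posting one HIT and waiting for its answer (hypothetical interface).
    Returns (agreed_label_or_None, number_of_hits_actually_executed).
    """
    answers = []
    for ask in workers:
        answers.append(ask())  # blocks until this worker answers: latency adds up
        label, count = Counter(answers).most_common(1)[0]
        if count >= needed_agreement:
            return label, len(answers)  # remaining HITs are never posted
    return None, len(answers)  # all planned workers asked, no agreement

# Three planned workers; the first two agree, so the third HIT is skipped.
planned = [lambda: "cat", lambda: "cat", lambda: "dog"]
label, hits_used = sequential_hits(planned)
print(label, hits_used)  # cat 2
```

A parallel arrangement would post all three HITs at once: the answer arrives sooner, but all three are always paid for.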
Aggregating output
Challenges
- Outputs are noisy (lack of expertise)
- Humans are not always reliable (cheating)
- Cultural context may bias the answers
Goal
- Automatic procedure to merge HIT results
Assumptions
- There exists a “true” answer
- Redundancy helps
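Under the two assumptions above, one simple automatic merge procedure is majority voting over redundant answers. The sketch below is only one common choice and not necessarily the method these notes have in mind; the `(task_id, worker_id, label)` tuple layout and the function name `majority_vote` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def majority_vote(responses):
    """Merge redundant HIT results by majority vote.

    `responses` is an iterable of (task_id, worker_id, label) tuples
    (assumed layout). Returns {task_id: most frequent label}.
    Ties are broken in favour of the label that was seen first.
    """
    labels_by_task = defaultdict(list)
    for task_id, _worker_id, label in responses:
        labels_by_task[task_id].append(label)
    return {
        task_id: Counter(labels).most_common(1)[0][0]
        for task_id, labels in labels_by_task.items()
    }

# Made-up answers: three workers label two tasks, one worker is noisy per task.
responses = [
    ("t1", "w1", "pos"), ("t1", "w2", "pos"), ("t1", "w3", "neg"),
    ("t2", "w1", "neg"), ("t2", "w2", "neg"), ("t2", "w3", "pos"),
]
print(majority_vote(responses))  # {'t1': 'pos', 't2': 'neg'}
```

Majority voting treats every worker as equally reliable, so it only partly addresses the cheating and bias challenges listed above.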
Latent Class models
crowdsourcing
Benefits
- Capturing important information in a timely fashion
- Labeling datasets
- Quality of the results
- Breadth of data
Stock prediction
Investment factors
- Liquidity principle: the financial assets held can be converted to cash quickly
- Safety principle: the value of the financial asset and the ability to bear losses caused by accident risk
- Profit principle: the level of investment income from a financial asset
Background model
modern portfolio theory (MPT)
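MPT is used to select investments so as to maximize their overall return within an acceptable level of risk.
It uses different sets of returns (intraday, close, and adjusted close) and correlations (within an industry and with other markets) to predict future returns.
Investors can choose the best combination of the two based on an assessment of their risk tolerance, and so obtain the best outcome. This optimal combination forms the efficient frontier, which is the cornerstone of MPT and the baseline indicating the portfolios that offer the highest return for the lowest risk.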
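To make the risk/return trade-off concrete, the sketch below computes the expected return and volatility of a two-asset portfolio for several weightings, using the standard formulas E[R_p] = Σ w_i μ_i and Var(R_p) = Σ_i Σ_j w_i w_j Cov_ij. All numbers (mean returns, covariances) are made up for illustration, and `portfolio_stats` is a hypothetical helper name.

```python
import math

def portfolio_stats(weights, mean_returns, cov):
    """Return (expected return, volatility) of a portfolio.

    weights      -- fraction of capital in each asset (should sum to 1)
    mean_returns -- expected return of each asset
    cov          -- covariance matrix of asset returns (list of lists)
    """
    n = len(weights)
    expected = sum(w * m for w, m in zip(weights, mean_returns))
    variance = sum(
        weights[i] * weights[j] * cov[i][j] for i in range(n) for j in range(n)
    )
    return expected, math.sqrt(variance)

# Illustrative inputs: a risky asset (10% return, 20% volatility) and a
# safer one (4% return, 5% volatility) with correlation 0.1.
mean_returns = [0.10, 0.04]
cov = [[0.04, 0.001],
       [0.001, 0.0025]]

for w_risky in (0.0, 0.25, 0.5, 0.75, 1.0):
    ret, vol = portfolio_stats([w_risky, 1.0 - w_risky], mean_returns, cov)
    print(f"risky weight {w_risky:.2f}: return {ret:.3f}, volatility {vol:.3f}")
```

Sweeping the weights like this traces out the set of achievable risk/return pairs; the efficient frontier is the part of that set where no other mix gives a higher return for the same risk.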
efficient market hypothesis (EMH)
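EMH is a hypothesis in financial economics which states that asset prices reflect all available information. It holds that global financial markets are informationally efficient, meaning that a stock's price reflects all information related to the target company.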
Social media as a social sensor
Social media carries discussion of, and information about, stocks.
Stock-net
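Stock-net is a deep learning solution with a three-layer architecture. Its base layer is a market information encoder that encodes tweets and stock price data. The model tries to learn stock movements from tweets, using event-based sentiment analysis for stock prediction.
It makes more focused use of Twitter data.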
stock price prediction
Data collection
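- Trading-day data of a stock
  - Basic: Date, Open price, Close price, High, Low, Adjusted close, Volume
    (High and Low are the day's highest and lowest prices; Adjusted close is the closing price corrected for any corporate actions; Volume is the number of shares traded on the trading day.)
  - More: Twitter data for stocks
- Cleaning the data
  Clean the Twitter data so that only the text remains
- Data processing
  - Handle special symbols and unify timestamps
  - Merge stock prices and tweet text by trading day
  - Using the open price requires assuming that tweets may come from any time of day
  - The close price makes the trend easier to see and helps determine whether the tweets had any effect on the stock
- Trend representation
  - Take the difference between close price and open price; label a positive trend 1 and a negative trend 0
- Normalization of the dataset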
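The cleaning, merging and trend-labeling steps above can be sketched as follows. This is a minimal illustration under stated assumptions: tweet timestamps have already been reduced to ISO date strings, a day with no price change is labeled 0, trading days without tweets are skipped, and the function names `clean_tweet` and `build_dataset` are hypothetical.

```python
import re
from collections import defaultdict

def clean_tweet(text):
    """Keep only plain text: drop URLs and special symbols, lower-case."""
    text = re.sub(r"https?://\S+", " ", text)    # remove links
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove $, #, @, punctuation
    return " ".join(text.lower().split())        # collapse whitespace

def build_dataset(prices, tweets):
    """Merge prices and tweets by trading day and attach a trend label.

    prices -- {date: (open_price, close_price)}, dates as 'YYYY-MM-DD'
    tweets -- list of (date, raw_text)
    Returns a list of (date, merged_text, label) sorted by date, where
    label is 1 if close - open > 0 and 0 otherwise.
    """
    text_by_day = defaultdict(list)
    for date, raw_text in tweets:
        text_by_day[date].append(clean_tweet(raw_text))

    rows = []
    for date in sorted(prices):          # ISO dates sort chronologically
        if date not in text_by_day:      # assumption: skip days without tweets
            continue
        open_price, close_price = prices[date]
        label = 1 if close_price - open_price > 0 else 0
        rows.append((date, " ".join(text_by_day[date]), label))
    return rows

# Made-up sample data for a hypothetical ticker.
prices = {"2021-03-01": (100.0, 102.5), "2021-03-02": (102.0, 101.0)}
tweets = [
    ("2021-03-01", "$XYZ up!! https://example.com"),
    ("2021-03-01", "Strong earnings"),
    ("2021-03-02", "Weak guidance..."),
]
for row in build_dataset(prices, tweets):
    print(row)
```

Tweets dated on a non-trading day never match a key in `prices` and are dropped; whether to attach them to the next trading day instead is a design choice the notes leave open.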
Models
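The models use the tweet text to predict the trend, with time as the index.
- LSTM/BiLSTM
- BERT model
  BERT stands for Bidirectional Encoder Representations from Transformers. It is based on Transformers, a deep learning model in which every output element is connected to every input element, and the weights between them are computed dynamically based on their connection.
- Dense neural network
- Distilled BERT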
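The "dynamically computed weights" in the Transformer description can be illustrated with scaled dot-product attention. The sketch below is a deliberate simplification and not BERT itself: it uses plain Python lists, feeds the same token vectors in as queries, keys and values, and omits the learned projection matrices and multiple heads that a real Transformer layer has.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # One score per input element: every output looks at every input.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)  # recomputed for each input sequence
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Three made-up 2-dimensional token vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for out in attention(tokens, tokens, tokens):
    print([round(x, 3) for x in out])
```

Because the weights come from the dot products of the inputs themselves, a different tweet produces different weights, which is what distinguishes attention from the fixed weights of a dense layer.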