Monday, January 19, 2015

sentiment analysis


sentiment words

In Bar-Haim et al. (2011), expert investors in microblogs were identified and sentiment analysis of stocks was performed.

The Stock Sonar - Sentiment Analysis of Stocks Based on a Hybrid Approach.
Feldman et al. (2011) introduced a hybrid approach for stock sentiment analysis based on companies' news articles.
a nonparametric approach: the Dirichlet Process Mixture (DPM) model
First, we employ a DPM to estimate the number of topics in each day's streaming snapshot of tweets.


Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach.



In Wang et al. (2011), the authors proposed a graph-based hashtag approach to classifying Twitter post sentiments, and in Kouloumpis et al. (2011), linguistic features and features that capture information about the informal and creative language used in microblogs were also utilized.



PMI:
Turney (2002)  Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews,

The PMI-IR algorithm is employed to estimate the semantic orientation of a phrase (Turney, 2001). PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases. The semantic orientation of a given phrase is calculated by comparing its similarity to a positive reference word (“excellent”) with its similarity to a negative reference word (“poor”).

PMI between two words is defined as

PMI(word_1, word_2) = \log_2 \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}

Here, p(word1 & word2) is the probability that word1 and word2 co-occur. If the words are statistically independent, then the probability that they co-occur is given by the product p(word1) p(word2). The ratio between p(word1 & word2) and p(word1) p(word2) is thus a measure of the degree of statistical dependence between the words. The log of this ratio is the amount of information that we acquire about the presence of one of the words when we observe the other.
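As a toy illustration of PMI-IR's semantic orientation score, here is a minimal Python sketch; the `hits()` function is a hypothetical stand-in for the search-engine hit counts Turney obtained through IR, not a real API:

```python
import math

def semantic_orientation(phrase, hits, total):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").

    `hits(query)` is a hypothetical stand-in for a search-engine hit
    count (Turney's IR step); `total` is the number of documents.
    """
    def pmi(w1, w2):
        p_joint = hits(w1 + " NEAR " + w2) / total
        p1, p2 = hits(w1) / total, hits(w2) / total
        if p_joint == 0 or p1 == 0 or p2 == 0:
            return 0.0  # no co-occurrence evidence; treat as independent
        return math.log2(p_joint / (p1 * p2))

    return pmi(phrase, "excellent") - pmi(phrase, "poor")
```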

Pang et al. (2002)
Thumbs up? Sentiment Classification using Machine Learning Techniques
http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf


LDA(2013)


An unsupervised approach was also used by Xianghua and Guo [50] to automatically discover the aspects discussed in Chinese social reviews and the sentiments expressed on different aspects. They used an LDA model to discover multi-aspect global topics of social reviews, then extracted the local topic and associated sentiment based on a sliding-window context over the review text. They worked on social reviews extracted from a blog data set (2000-SINA) and a lexicon (300-SINA Hownet). They showed that their approach obtained good topic-partitioning results and helped to improve SA accuracy; it also helped to discover multi-aspect fine-grained topics and their associated sentiment.



There are other unsupervised approaches that depend on semantic orientation using PMI [82] or lexical association using PMI, semantic spaces, and distributional similarity to measure the similarity between words and polarity prototypes [83].


[82] Turney P. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL'02); 2002.

[83] Read J, Carroll J. Weakly supervised techniques for domain-independent sentiment classification. In: Proceedings of the 1st international CIKM workshop on topic-sentiment analysis for mass opinion; 2009. p. 45–52.

tweets

Sentiment classification aims to identify the sentiment (or polarity) of retrieved opinions. There are two categories of approaches for this task.
One approach is to develop the linguistic resources for sentiment orientation and the structures of sentiment expression, and then classify the text based on these developed resources [16]. Linguistic resource development aims to construct linguistic resources that provide subjectivity, orientation, and the strength of terms, and make it possible to perform further opinion mining tasks. WordNet expansion and statistical estimation [18], such as the point-wise mutual information method, are two major methods.
The second approach for analyzing sentiment is to train and deploy a sentiment classifier, which can be built with several methodologies, such as support vector machine (SVM), maximum entropy, and naïve Bayes [46].

Recently, several works on the sentiment analysis of microblog opinions have been conducted. In [8], the authors use a predefined lexicon of positive and negative words to classify Twitter posts and track how sentiment fluctuations relate to the results of polls, such as the consumer confidence survey and the job approval of President Obama in the US. The authors argue that time-intensive and expensive polls could be supplemented or supplanted by simply analyzing the text on microblogs. In [9], the authors develop an analytical methodology and visual representations that could help a journalist or public affairs manager better understand the temporal dynamics of sentiment in reaction to the debate video. The authors demonstrate visuals and metrics to detect the sentiment pulse, anomalies in that pulse, and indications of controversial topics that can be used to inform the design of visual analytic systems for social media events.

To classify sentiments on microblogs, machine learning should be adequate because many new sentiment words are invented and used widely on microblogs. It is difficult to determine the sentiment polarity of many exclamations and emoticons, such as arrrg and >__<, using the common approach of constructing sentiment linguistic resources. With large and up-to-date training data, machine learning methods are better able to handle those words. In our framework, an SVM classifier is used, and we apply several heuristic preprocessing steps and test different features to provide a more accurate classification.
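A minimal sketch of such a classifier (not the authors' exact pipeline): scikit-learn's LinearSVC over TF-IDF word n-grams; the tiny training set is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled tweets (illustrative only; a real system needs a large, fresh corpus).
tweets = ["great earnings, loving this stock :)", "arrrg, terrible service >__<",
          "best phone ever", "worst update, so slow"]
labels = ["pos", "neg", "pos", "neg"]

# Word n-grams let the SVM pick up newly coined sentiment words from training data.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), lowercase=True), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["this update is terrible"]))  # expected: ['neg']
```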














ASPECT EXTRACTION

An opinion always has a target. Common approaches to aspect extraction:
  1. Extraction based on frequent nouns and noun phrases (see the sketch after this list)
  2. Extraction by exploiting opinion and target relations
  3. Extraction using supervised learning
  4. Extraction using topic modeling
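As a minimal sketch of approach 1, the following assumes NLTK's default tokenizer and POS tagger (the `punkt` and `averaged_perceptron_tagger` resources must be downloaded) and keeps nouns that occur in enough reviews as candidate aspects:

```python
from collections import Counter

import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def frequent_aspects(reviews, min_support=0.1):
    """Return nouns occurring in at least `min_support` of the reviews."""
    counts = Counter()
    for review in reviews:
        tags = nltk.pos_tag(nltk.word_tokenize(review.lower()))
        nouns = {w for w, t in tags if t.startswith("NN")}
        counts.update(nouns)  # count document frequency, not raw term frequency
    return [w for w, c in counts.items() if c / len(reviews) >= min_support]

print(frequent_aspects(["The battery life is great", "Battery drains fast", "Nice screen"]))
```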

Zhu et al. (2009) proposed a method based on the C-value measure from Frantzi et al. (2000) for extracting multi-word aspects. The candidate set is then refined using a bootstrapping technique with a set of given seed aspects; the refinement is based on each candidate's co-occurrence with the seeds.
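For reference, a minimal sketch of the C-value measure itself, following the usual formulation in Frantzi et al. (2000): a candidate's frequency is weighted by log2 of its length, and candidates nested inside longer candidates are discounted by the average frequency of those longer terms.

```python
import math

def c_value(candidate, freq, candidates):
    """C-value from Frantzi et al. (2000) for a multi-word candidate term.

    candidate  : tuple of words, length >= 2
    freq       : dict mapping each candidate tuple to its corpus frequency
    candidates : iterable of all candidate tuples
    """
    n = len(candidate)
    # Longer candidates that contain this one as a contiguous subsequence.
    nesting = [t for t in candidates if len(t) > n and
               any(t[i:i + n] == candidate for i in range(len(t) - n + 1))]
    if nesting:
        adjusted = freq[candidate] - sum(freq[t] for t in nesting) / len(nesting)
    else:
        adjusted = freq[candidate]
    return math.log2(n) * adjusted

freq = {("battery", "life"): 10, ("long", "battery", "life"): 4}
print(c_value(("battery", "life"), freq, freq))  # log2(2) * (10 - 4) = 6.0
```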











linear regression

discriminative: SVM, LR,
generative:
smoothing,

Linear Regression
Regularized Linear Regression: Ridge regression, Lasso
Polynomial Regression
Kernel Regression
Gaussian Process Regression
Regression Trees, Splines, Wavelet estimators

Linear Regression

Learn to derive the least squares estimate by optimization.
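Setting the gradient of the squared-error objective to zero gives the closed-form least squares estimate (this is the Normal Equation discussed below):

```latex
J(w) = \lVert y - Xw \rVert^2,\qquad
\nabla_w J = -2\,X^\top (y - Xw) = 0
\;\Rightarrow\; X^\top X\,w = X^\top y
\;\Rightarrow\; \hat{w} = (X^\top X)^{-1} X^\top y
```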




(1) Another linear regression method: the Normal Equation;
(2) Pros and cons of Gradient Descent vs. the Normal Equation.
We performed linear regression with Gradient Descent above, but gradient descent has the following characteristics:

(1) a learning rate must be chosen in advance;
(2) it requires many iterations;
(3) it requires feature scaling.

It can therefore be somewhat cumbersome. Here is a method suited to problems with a small number of features: the Normal Equation.

Use the Normal Equation when the number of features is below 100,000;
use Gradient Descent when the number of features is above 100,000.

Characteristics of the Normal Equation: simple, convenient, and no feature scaling required.
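A minimal NumPy sketch of a Normal Equation fit on toy data (illustrative only); note the absence of a learning rate, iterations, and feature scaling:

```python
import numpy as np

# Toy data: y = 2*x + 1 plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])  # prepend a bias column
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# Normal Equation: theta = (X^T X)^(-1) X^T y, solved without an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1.0, 2.0]
```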

How to evaluate: 


While this largely depends on exactly what your goals are, a simple and standard way to do this would be measuring the mean squared error (MSE). So if you have a test dataset consisting of input/output pairs {(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)} and your parameters a and b, then the MSE can be calculated as

MSE_{a,b} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2 .

This is probably a sensible way to measure your error also since this is likely the criteria you used for finding the parameters a and b. If you want to get a better idea of how well your estimated parameters generalize, you should look into something like cross validation.
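A minimal sketch of the MSE computation on held-out pairs (toy numbers, purely illustrative):

```python
import numpy as np

def mse(a, b, x_test, y_test):
    """Mean squared error of the fit y ≈ a*x + b on held-out pairs."""
    residuals = np.asarray(y_test) - (a * np.asarray(x_test) + b)
    return float(np.mean(residuals ** 2))

print(mse(2.0, 1.0, [0.0, 1.0, 2.0], [1.1, 2.9, 5.2]))  # -> 0.02 on this toy check
```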


Regularized Linear Regression

The big question is how do we choose the regularization coefficient, the width of the kernels, or the polynomial order?
Solution: cross-validation
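A minimal scikit-learn sketch of that recipe on toy data: RidgeCV selects the regularization coefficient by cross-validation over a candidate grid.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Toy regression problem (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation over a log-spaced grid picks the regularization strength.
model = RidgeCV(alphas=np.logspace(-4, 2, 20), cv=5).fit(X, y)
print(model.alpha_)  # the alpha that minimized cross-validated error
```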


 the L1 norm
This type of regularization is at the heart of a recent revolution in data
acquisition known as compressed sensing.

NB-naive bayes

We use Naïve Bayes in many cases anyway, and it often works pretty well – often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996]).

topic model

Sunday, January 18, 2015

probability


Tom Mitchell's notes



gradient ascent

Gradient ascent is the simplest of the optimization approaches
– alternatives include the Newton method, conjugate gradient ascent, and IRLS (see Bishop 4.3.3)

Gradient Ascent for Logistic Regression
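A minimal NumPy sketch of batch gradient ascent on the logistic regression conditional log-likelihood; the update w ← w + η·Xᵀ(y − σ(Xw)) is the standard gradient step, and the toy data is illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Maximize the conditional log-likelihood by batch gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - sigmoid(X @ w))  # gradient of the log-likelihood
        w += lr * grad / len(y)            # ascend (maximize), hence +=
    return w

# Toy linearly separable data (bias column first).
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
print(fit_logistic(X, y))
```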


Comparison of Machine Learning Algorithms

table of algs

generative and discriminative classifiers: NB and LR

Tom-NB&LR

• We can use Bayes rule as the basis for designing learning algorithms (function approximators), as follows: given that we wish to learn some target function f : X → Y, or equivalently, P(Y|X), we use the training data to learn estimates of P(X|Y) and P(Y). New X examples can then be classified using these estimated probability distributions, plus Bayes rule. This type of classifier is called a generative classifier, because we can view the distribution P(X|Y) as describing how to generate random instances X conditioned on the target attribute Y.

• Learning Bayes classifiers typically requires an unrealistic number of training examples (i.e., more than |X| training examples, where X is the instance space) unless some form of prior assumption is made about the form of P(X|Y). The Naive Bayes classifier assumes all attributes describing X are conditionally independent given Y. This assumption dramatically reduces the number of parameters that must be estimated to learn the classifier. Naive Bayes is a widely used learning algorithm, for both discrete and continuous X.
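To make the parameter-count point concrete: with n boolean features, an unrestricted P(X|Y) table needs about 2(2^n − 1) parameters, while Naive Bayes needs only 2n (plus the class prior). A from-scratch sketch assuming binary features and Laplace smoothing:

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """Estimate P(Y=1) and per-feature P(X_j = 1 | Y) with Laplace smoothing.

    With n binary features this fits 2n + 1 numbers, versus ~2(2^n - 1)
    for an unrestricted P(X|Y) table: the independence assumption at work.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    prior = y.mean()  # P(Y = 1)
    theta = {c: (X[y == c].sum(axis=0) + 1) / (len(X[y == c]) + 2)  # P(X_j=1 | Y=c)
             for c in (0, 1)}
    return prior, theta

def predict(x, prior, theta):
    """Posterior P(Y = 1 | x) via Bayes rule under the NB assumption."""
    def lik(c):
        p = theta[c]
        return np.prod(np.where(np.asarray(x) == 1, p, 1 - p))
    num = prior * lik(1)
    return num / (num + (1 - prior) * lik(0))

X = [[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]]
y = [1, 1, 0, 0]
prior, theta = train_bernoulli_nb(X, y)
print(predict([1, 0, 1], prior, theta))  # -> 0.9 on this toy data
```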
• When X is a vector of discrete-valued attributes, Naive Bayes learning algorithms can be viewed as linear classifiers; that is, every such Naive Bayes classifier corresponds to a hyperplane decision surface in X. The same statement holds for Gaussian Naive Bayes classifiers if the variance of each feature is assumed to be independent of the class (i.e., if σ_ik = σ_i).

• Logistic Regression is a function approximation algorithm that uses training data to directly estimate P(Y|X), in contrast to Naive Bayes. In this sense, Logistic Regression is often referred to as a discriminative classifier because we can view the distribution P(Y|X) as directly discriminating the value of the target value Y for any given instance X.

• Logistic Regression is a linear classifier over X. The linear classifiers produced by Logistic Regression and Gaussian Naive Bayes are identical in the limit as the number of training examples approaches infinity, provided the Naive Bayes assumptions hold. However, if these assumptions do not hold, the Naive Bayes bias will cause it to perform less accurately than Logistic Regression, in the limit. Put another way, Naive Bayes is a learning algorithm with greater bias, but lower variance, than Logistic Regression. If this bias is appropriate given the actual data, Naive Bayes will be preferred; otherwise, Logistic Regression will be preferred.

• We can view function approximation learning algorithms as statistical estimators of functions, or of conditional distributions P(Y|X). They estimate P(Y|X) from a sample of training data. As with other statistical estimators, it can be useful to characterize learning algorithms by their bias and expected variance, taken over different samples of training data.





Saturday, January 17, 2015

text classification


Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification.

While several machine learning methods have been applied to this task, use of SVMs has been prominent. Support vector machines are not necessarily better than other machine learning methods (except perhaps in situations with little training data), but they perform at the state-of-the-art level and have much current theoretical and empirical appeal.

It is frequently the case that greater performance gains can be achieved from exploiting domain-specific text features than from changing from one machine learning method to another.
Understanding the data is one of the keys to successful categorization.
This process is generally referred to as feature engineering. At present, feature engineering remains a human craft, rather than something done by machine learning. Good feature engineering can often markedly improve the performance of a text classifier. [http://nlp.stanford.edu/IR-book/html/htmledition/features-for-text-1.html]


semi-supervised training methods. This includes methods such as bootstrapping or the EM algorithm, which we will introduce in Section 16.5.


 how to adjust the weights of an SVM without destroying the overall classification accuracy.


It may be best to choose a classifier based on the scalability of training or even runtime efficiency.





Using SVM




If you're still not entirely sure whether to use this algorithm or that algorithm, that's actually okay.
When I face a machine learning problem, you know, sometimes it's actually just not clear whether a given algorithm is the best one to use, but as you saw in the earlier videos, the algorithm does matter, but what often matters even more is things like how much data you have.
And how skilled you are at error analysis and debugging learning algorithms, figuring out how to design new features, and figuring out what other features to give your learning algorithm, and so on.

And often those things will matter more than whether you are using logistic regression or an SVM.
But having said that, the SVM is still widely perceived as one of the most powerful learning algorithms, and there is a regime in which it's a very effective way to learn complex non-linear functions.
And so, together with logistic regression, neural networks, and SVMs,
using these learning algorithms I think you're very well positioned to build state-of-the-art machine learning systems for a wide range of applications, and this
is another very powerful tool to have in your arsenal.
One that is used all over the place, in Silicon Valley, in industry, and in academia, to build many high-performance machine learning systems.






error analysis

First, build a quick-and-dirty implementation, then see where to spend your time.


If n is large and m is small (many features, few training examples), just use a simple linear SVM, not some complicated nonlinear SVM.
Without enough data, a nonlinear kernel may risk overfitting.


advice for applying ML