
Posts

Showing posts from 2017

Recommender System Using Collaborative Filtering

A recommender system based on collaborative filtering uses users' past behavior to predict which items the current user would like. We create a UxM matrix, where U is the number of users and M is the number of distinct items or products; the entry in row i, column j is the rating expressed by user i for product j. In the real world, not every user expresses an opinion about every product. For example, suppose five users, including Bob, have rated four movies (movie1 through movie4) as shown in Table 1 below, where each of users 2-4 left one movie unrated: Table 1: user1: 1, 3, 3, 5; user2: 2, 4, 5; user3: 3, 2, 2; user4: 1, 3, 4; Bob: 3, 2, 5, ?  Our goal is to predict which movies to recommend to Bob; or, put another way, should we recommend movie4 to Bob, knowing the ratings for the four movies from the other users and from Bob himself? Traditionally, we could do item-to-item comparison, meaning that if a user liked item1 in the past, that user may like other items similar to item1. Another way to recommend...
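The prediction step can be sketched with user-based collaborative filtering and cosine similarity. The ratings dictionary below is illustrative (the positions of the missing ratings are not fully specified in the post's table), and the helper names are mine:

```python
# Sketch: user-based collaborative filtering with cosine similarity.
# Ratings are illustrative; None marks a missing rating (columns movie1..movie4).
import math

ratings = {
    "user1": [1, 3, 3, 5],
    "user2": [2, 4, None, 5],
    "user3": [3, 2, 2, None],
    "user4": [1, None, 3, 4],
    "Bob":   [3, 2, 5, None],
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 0.0
    num = sum(a * b for a, b in pairs)
    den = (math.sqrt(sum(a * a for a, _ in pairs))
           * math.sqrt(sum(b * b for _, b in pairs)))
    return num / den if den else 0.0

def predict(target, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for user, row in ratings.items():
        if user == target or row[item] is None:
            continue
        s = cosine(ratings[target], row)
        num += s * row[item]
        den += abs(s)
    return num / den if den else None

print(round(predict("Bob", 3), 2))  # predicted rating of movie4 for Bob
```

Since the users most similar to Bob rated movie4 highly, the weighted average lands near the top of the rating scale, suggesting movie4 should be recommended.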

Text Analytics: Creating a Word Cloud

In a word cloud, words that appear most frequently in the documents are plotted bolder and bigger than less frequent words; the size, and sometimes the color, of a word is determined by its frequency in the documents. The R package 'wordcloud' provides functions that generate such word clouds. A collection of documents is converted into a character vector: sentence <- c("sentence1", "sentence2", "sentenceN") We then use the gsub() function to remove punctuation, control characters, and numbers from these sentences: sentence <- gsub('[[:punct:]]', ' ', sentence) sentence <- gsub('[[:cntrl:]]', ' ', sentence) sentence <- gsub('\\d+', ' ', sentence) The stopwords() function from the text-mining package 'tm' provides a list of English stop words. We will use this list to remove stop words from the sentences later. stopwords <-  stopw...
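The same cleaning steps can be sketched in Python with re.sub(), assuming only the three substitutions the post performs with gsub() (the sample sentences are illustrative):

```python
# Sketch: the post's gsub() cleaning pipeline, translated to Python's re.sub().
import re

sentences = ["Sentence #1, with punctuation!", "Sentence 2 has\tcontrol chars."]

cleaned = []
for s in sentences:
    s = re.sub(r"[^\w\s]", " ", s)      # punctuation -> space  ([[:punct:]])
    s = re.sub(r"[\x00-\x1f]", " ", s)  # control characters    ([[:cntrl:]])
    s = re.sub(r"\d+", " ", s)          # digits                (\\d+)
    cleaned.append(re.sub(r"\s+", " ", s).strip().lower())

print(cleaned)  # ['sentence with punctuation', 'sentence has control chars']
```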

Unsupervised Sentiment Analysis

We use a method based on a list of positive words and a list of negative words: these words yield a sentiment score based on whether a sentence contains more positive or more negative words. The code is in R. Lists of positive and negative words: positive_words <- c('abounded', 'contentment','exceed') negative_words <- c('abolish', 'baseless','caustic') sentence <- c("manufacturing is abounded in St. Louis and exceed the expectation though it abolished traditional industries", "Acme has to deal with baseless and caustic arguments") Normally, we would read these sentences from a file, but here we create sample text to show the process of categorizing text based on its sentiment score. Convert the sentences into a list of words using the str_split() function: word_list = str_split(sentence, '\\s+') words = unlist(word_list) The object 'words' ...
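The scoring idea can be sketched in Python; one common rule, assumed here, is score = (# positive words) - (# negative words), and sentiment_score() is a hypothetical helper name:

```python
# Sketch: word-list sentiment scoring, score = #positive - #negative words.
import re

positive_words = {"abounded", "contentment", "exceed"}
negative_words = {"abolish", "baseless", "caustic"}

def sentiment_score(sentence):
    words = re.split(r"\s+", sentence.lower())
    pos = sum(w in positive_words for w in words)
    neg = sum(w in negative_words for w in words)
    return pos - neg

print(sentiment_score("manufacturing is abounded and will exceed expectations"))  # 2
print(sentiment_score("baseless and caustic arguments"))                          # -2
```

Note that exact matching misses inflected forms such as "abolished" in the post's sample; in practice the words are stemmed first.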

Sentiment Analysis Using Scikit-Learn and a Neural Network

Using Scikit-Learn and NLTK for Sentiment Analysis Sentiment analysis is a way of categorizing text into subgroups based on the opinion or sentiment expressed in the text. For example, we may want to categorize people's reviews of or comments about a movie to determine how many liked it and how many didn't. In supervised sentiment analysis, we have training data that is already categorized, or sub-grouped, into different categories, for example 'positive' or 'negative' sentiment. We use these training data to train our model to learn what makes a text belong to a specific group; by text I mean a sentence or a paragraph. Using these labeled sentences, we are going to build a model. So, let us say we have the following training text: training_positive = list() training_positive.append("bromwell high is a nice cartoon comedy perfect for family") training_positive.append(" homelessness or houselessness as george carlin s...
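The supervised pipeline (vectorize labeled text, then fit a classifier) can be sketched from scratch with a tiny multinomial Naive Bayes; only the first training sentence is from the post, the rest are illustrative, and in practice scikit-learn's CountVectorizer plus a classifier would replace this hand-rolled version:

```python
# Sketch: count words per class, then classify with Naive Bayes
# (add-one smoothing). Training data beyond the first sentence is made up.
import math
from collections import Counter

train = [
    ("bromwell high is a nice cartoon comedy perfect for family", "positive"),
    ("a wonderful heartfelt movie with a perfect cast", "positive"),
    ("a dull plot and terrible acting", "negative"),
    ("boring dull and a waste of time", "negative"),
]

counts = {"positive": Counter(), "negative": Counter()}
docs = Counter()
for text, label in train:
    counts[label].update(text.split())
    docs[label] += 1

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    """Pick the class with the highest smoothed log-probability."""
    best, best_lp = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        lp = math.log(docs[label] / sum(docs.values()))  # class prior
        for w in text.split():
            lp += math.log((c[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("a perfect family comedy"))  # positive
print(predict("dull and boring plot"))     # negative
```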

Apache Spark: ETL

At a very high level, the architecture of Apache Spark consists of a Driver Program, a Cluster Manager, and Worker Nodes.  An application program for Apache Spark is submitted to the driver program, where a spark context is created. Through this spark context, a user program can access various services through the API. The driver program requests resources through the cluster manager, distributes parallel operations over those resources, and returns the output back to the user's program. Data in Apache Spark is represented through the Resilient Distributed Dataset (RDD), an abstract handle to the data.  Using an RDD, a user can transform the data into a different dataset or apply an action on the data. Similar to RDDs, Apache Spark allows data to be represented as a DataFrame or a Dataset, both of which impose some structure on the distributed data so that a higher-level program can manipulate the data without worrying about the optimizations in the Spark framework. With Dataset,...
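The transform-versus-action split that RDDs provide can be illustrated with a toy stand-in: transformations only record a plan, and an action evaluates it. This is not Spark's API, just the pattern it follows:

```python
# Sketch: a toy "RDD" where transformations are lazy and actions are eager.
class MiniRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded plan of transformations

    def map(self, f):                 # transformation: returns a new handle
        return MiniRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # transformation: nothing runs yet
        return MiniRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                # action: evaluate the whole plan now
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = MiniRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 2)
print(rdd.collect())  # [4, 6, 8]
```

Deferring execution until an action is what lets the real Spark framework optimize the whole plan before touching the distributed data.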

Mixed-Initiative Planning and Dialog Management

Mixed-initiative planning (MIP) is one approach to involving the user, through dialog management, in the planning process, that is, in solving the user's problem.  Planning is the process of finding a sequence of actions that will achieve a goal from a given initial state.  PDDL (Planning Domain Definition Language) or one of its variants is used to describe the problem. Once the problem is fully described in PDDL, we know four things about it: the initial state, the actions that can be taken in any state, the result of taking each action, and the goal state.  A planning graph is a special data structure that represents those four things.
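Those four pieces can be sketched with a minimal, hypothetical PDDL domain (a single light switch); the names are illustrative, not from the post:

```lisp
;; Domain: the predicates that describe states, and what each action does.
(define (domain switch)
  (:predicates (on) (off))
  (:action turn-on
    :parameters ()
    :precondition (off)
    :effect (and (on) (not (off)))))

;; Problem: the initial state and the goal.
(define (problem light)
  (:domain switch)
  (:init (off))
  (:goal (on)))
```

A planner reading this knows the initial state ((off)), the action available and when it applies (turn-on, requiring (off)), its result, and the goal ((on)); the one-step plan here is turn-on.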

Dialog Management

Dialogue management is a sequential decision-making process.  We can represent dialogue management with a dynamic Bayesian network (DBN) under two assumptions: the network is stationary (that is, the probability P(X_t | X_{t-1}) is identical for all values of t) and the Markov assumption holds. The DBN must also be able to calculate the relative utility of the various actions possible in the current state. Considering that nodes in our DBN may be affected by the previous values of the same or other nodes, dialogue management is well represented by a dynamic decision network (DDN). Representing the DDN as a probabilistic graphical model, we can use generalized variable elimination or likelihood weighting as two approaches for deriving inferences.  Using these inference algorithms, the dialogue manager can update the dialogue state on receiving new observations and select an appropriate action based on the updated state. Finding initial distribution...
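The state update a dialogue manager performs can be sketched as one forward step of a two-state DBN: predict with the transition model, then weight by the new observation. The states, transition, and observation probabilities below are illustrative, not from the post:

```python
# Sketch: belief update in a toy dialogue DBN (all numbers illustrative).
states = ["user_satisfied", "user_confused"]
transition = {  # P(next state | current state)
    "user_satisfied": {"user_satisfied": 0.8, "user_confused": 0.2},
    "user_confused":  {"user_satisfied": 0.4, "user_confused": 0.6},
}
observation = {  # P(observed utterance type | state)
    "user_satisfied": {"ok": 0.7, "repeat": 0.3},
    "user_confused":  {"ok": 0.2, "repeat": 0.8},
}

def update(belief, obs):
    """One forward step: P(X_t | e_1:t) ∝ P(e_t|X_t) * sum_x P(X_t|x) b(x)."""
    new = {}
    for s in states:
        prior = sum(belief[p] * transition[p][s] for p in states)
        new[s] = observation[s][obs] * prior
    z = sum(new.values())
    return {s: v / z for s, v in new.items()}

belief = {"user_satisfied": 0.5, "user_confused": 0.5}
belief = update(belief, "repeat")  # user asked us to repeat ourselves
print({s: round(p, 3) for s, p in belief.items()})
# {'user_satisfied': 0.36, 'user_confused': 0.64}
```

After observing "repeat", the belief shifts toward user_confused, and the dialogue manager would select the action with the highest expected utility under that updated belief.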

Time Series Analysis

In a time series, observations are not independent, as opposed to cross-sectional data, where one observation has no bearing on any other.  The goal of time-series analysis is to find the relationship between the current observation and its past observations and thus help in predicting future values. The response Y in a time series is composed of: level, trend, seasonality, cycle, auto-correlation, and noise. Thus Yt = level + trend + season/cycle + noise.  This noise, even after removing level, trend
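The additive model Yt = level + trend + season + noise can be sketched numerically; the component values below are synthetic, and averaging over one full seasonal period removes the seasonal term:

```python
# Sketch: build a series from level + trend + seasonality + noise,
# then smooth over one seasonal period to recover level + trend.
import math
import random

random.seed(0)
level, slope, period, amp = 10.0, 0.5, 4, 2.0

series = []
for t in range(12):
    trend = slope * t
    season = amp * math.sin(2 * math.pi * t / period)  # sums to 0 over a period
    noise = random.gauss(0, 0.3)
    series.append(level + trend + season + noise)

# Moving average over one full period cancels the seasonal component.
smooth = [sum(series[t:t + period]) / period for t in range(len(series) - period + 1)]
print([round(x, 1) for x in smooth])  # roughly level + trend, rising by ~0.5 per step
```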

Information Retrieval System

The information retrieval (IR) task deals with finding all documents relevant to a user query. Central concepts in IR are removing stop words from the corpus (the collection of all documents) and the query, stemming, lemmatization, representing the documents and the query as vectors, and using some measure of proximity or distance to determine which documents may be relevant to the query. Each word in a document, after going through stemming and lemmatization, is called a term. Each unique term in the corpus is represented as one dimension in the document space. Thus the vectors representing the documents can have more than 10,000 dimensions and so suffer from high dimensionality. Since not all words occur in each document, document vectors are very sparse, and word frequencies seem to follow Zipf's distribution. The value of each term in a document vector can be binary (whether the term occurs in the document), a frequency (how often that term is found in the document), or a weight computed using term frequency-inverse document frequen...
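The tf-idf weighting mentioned above can be sketched as tf(t, d) * log(N / df(t)); the three toy documents below are illustrative:

```python
# Sketch: tf-idf weight of a term in a document, tf * log(N / df).
import math

docs = [
    "information retrieval finds relevant documents",
    "stop words are removed from documents",
    "query and documents become vectors",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

df = {}  # document frequency: in how many documents each term appears
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tfidf(term, toks):
    tf = toks.count(term)
    return tf * math.log(N / df[term]) if term in df else 0.0

# "documents" appears in every document, so its weight is zero;
# "retrieval" is rare, so it gets a positive weight.
print(round(tfidf("documents", tokenized[0]), 3))  # 0.0
print(round(tfidf("retrieval", tokenized[0]), 3))  # 1.099
```

This is why tf-idf down-weights terms that occur everywhere and emphasizes terms that discriminate between documents.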

Sensing Room Occupancy

Many businesses are interested in knowing the utilization of the conference rooms at their premises.  Audio devices in these conference rooms can be used to measure that utilization. When plotted, the audio activity detected by the microphone shows a bimodal characteristic, with one mode representing when the room is occupied and the other when it is not. Here, I show an image of a bimodal graph generated through R code. Here is the R code: library(ggplot2) x <- c(rnorm(5000, 1, 1), rnorm(10000, 9, 1))  ggplot(data.frame(x = x)) + geom_density(aes(x = x)) Our goal is to identify the mean and standard deviation of each mode. The following R code, using normalmixEM() from the 'mixtools' package, reports the mean, standard deviation, and proportion of data belonging to each mode: library(mixtools) set.seed(50) > bimodal <- normalmixEM(x, k = 2) number of iterations= 9  > bimodal$mu [1] 0.9992427 9.0070292 > bimodal$sigma   [1] 0.9922105 0.9964136 > bimodal$lambda [1] 0.3332021 0.6667979 > ...
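What normalmixEM() computes can be sketched as plain EM for a two-component one-dimensional Gaussian mixture; the Python below uses smaller samples than the post's R code to keep it quick, and the variable names are mine:

```python
# Sketch: EM for a 2-component 1-D Gaussian mixture, recovering each
# mode's mean, standard deviation, and mixing proportion (lambda).
import math
import random

random.seed(50)
x = [random.gauss(1, 1) for _ in range(1000)] + [random.gauss(9, 1) for _ in range(2000)]

def norm_pdf(v, m, s):
    return math.exp(-((v - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

mu, sd, lam = [0.0, 10.0], [1.0, 1.0], [0.5, 0.5]  # initial guesses
for _ in range(30):
    # E-step: responsibility of component 0 for each point.
    r0 = []
    for v in x:
        p0 = lam[0] * norm_pdf(v, mu[0], sd[0])
        p1 = lam[1] * norm_pdf(v, mu[1], sd[1])
        r0.append(p0 / (p0 + p1))
    # M-step: re-estimate parameters from the responsibilities.
    n0 = sum(r0)
    n1 = len(x) - n0
    mu[0] = sum(r * v for r, v in zip(r0, x)) / n0
    mu[1] = sum((1 - r) * v for r, v in zip(r0, x)) / n1
    sd[0] = math.sqrt(sum(r * (v - mu[0]) ** 2 for r, v in zip(r0, x)) / n0)
    sd[1] = math.sqrt(sum((1 - r) * (v - mu[1]) ** 2 for r, v in zip(r0, x)) / n1)
    lam = [n0 / len(x), n1 / len(x)]

print([round(m, 2) for m in mu])   # means near the two modes (about 1 and 9)
print([round(l, 2) for l in lam])  # mixing proportions near 1/3 and 2/3
```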