Wednesday, 4 December 2019

Explainability in Data Science: Data, Model & Prediction

XAI (Explainable AI) is grabbing the limelight in machine learning. How can we be sure that an image classification algorithm is learning faces and not the background? A customer wants to know why a loan was disapproved. A globally important variable might not be responsible for an individual prediction. This is where XAI comes to the rescue-

We have taken data from classification_data

This has some sensor values and an output class. 

A) Data Explainability- what basic understanding is required from the data perspective?

1)      Identify missing values, collinear features, feature interactions, zero-importance features, low-importance features and single-value features, then handle missing values and remove/handle features accordingly.
2)      Missing values- there are no missing values, as per the data description.
3)      No strong correlation between variables- this can be seen from the correlation plots.
4)      Feature interaction- tree-based models (CART) approximate feature interactions through their splits.
5)      Zero importance, low importance, single-value features- handled through RFE and by the models (RF, XGBoost) themselves.
6)      The distribution and sampling of both the class and the features are also checked, as the choice of model depends on the data distribution. Data with a lot of categorical variables is likely to be more suitable for tree-based models.
7)      Box plots themselves can identify important features for classification. We can see that sensors 3, 8 and 6 look important, whereas 5 and 7 may not have good predictive power.
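The checks in point 1 can be sketched in a few lines. This is only an illustration in plain Python on made-up columns (the names sensor0..sensor3, the toy values and the 0.9 correlation cut-off are assumptions, not the sensor data used in this post).

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_a * var_b)

def basic_checks(columns, corr_threshold=0.9):
    """columns: dict of feature name -> list of values (None marks a missing value)."""
    missing = {name: sum(v is None for v in vals) for name, vals in columns.items()}
    single_value = [name for name, vals in columns.items()
                    if len({v for v in vals if v is not None}) <= 1]
    # only complete, non-constant columns are compared pairwise
    names = [n for n in columns if missing[n] == 0 and n not in single_value]
    collinear = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                 if abs(pearson(columns[a], columns[b])) >= corr_threshold]
    return missing, single_value, collinear

# Hypothetical toy data, not the dataset from the post
toy = {
    "sensor0": [1.0, 2.0, 3.0, 4.0],
    "sensor1": [2.1, 4.0, 6.2, 8.1],   # nearly 2 * sensor0
    "sensor2": [5.0, 5.0, 5.0, 5.0],   # single value
    "sensor3": [0.3, None, 0.9, 0.1],  # one missing value
}
missing, single_value, collinear = basic_checks(toy)
print(missing)
print(single_value)
print(collinear)
```

On real data the same three questions are usually answered with pandas (isnull, nunique, corr), but the logic is the same.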

B) Other approaches- feature selection/engineering-

1)      Univariate feature selection using the chi-square test (select k best).
2)      Recursive Feature Elimination (RFE)- selects n specific features based on the underlying model (used).
3)      PCA- reduces correlated features by linear transformation (not needed).
4)      Autoencoders- non-linear transformation of features if needed (it would be overkill here).
5)      Feature importance from Random Forest, DT (in terms of rules) and other tree ensemble models like CatBoost and XGBoost- used in our scenario.

C) Feature importance on sensor data (global)- In practice I take feature importance from the domain/business people. In our scenario, sensor 7 (one of the least important features) might be the electric current in a steel mixture plant, and to see the impact of current on anomalies/faults it has to be sampled at a higher rate (micro/milliseconds), unlike temperature. Thus we would be missing an important feature because the data collection rate is not correct. Such understanding can only come from domain experts. So business understanding and ML are equally important for feature engineering.

There are white box models like DT and Random Forest that give feature importance from the model itself. In our case we took the coefficients of logistic regression in the beginning (see the comparison of all the algorithms on GitHub- link). Here we rely on the models that have maximum accuracy- RF and XGBoost.

Thus overall we can say that features 8, 6, 4, 0, 1 and 3 look important for the classification model. Feature 7 seems to have no importance in XGBoost, as its classification power is captured by other features. This importance of features was visible in the box plot also.

Recursive Feature Elimination is useful for selecting a subset of features, as it tells us the top features to keep for modeling.

D) Feature Importance on sensor data ( Local)-

With the advancement of ML and deep learning, global importance alone is not enough. Businesses and data scientists are looking for local explanations too. In our analysis, we have used the IBM AIX 360 framework to get the importance of rules on the features (importance of a feature based on the values of the feature and the output value). The options for packages/frameworks are-

MS Azure Explainability

The above image shows that feature 8 is the most important overall, but when it comes to specific predictions, a subset of feature 6 seems more important for many predictions. We can get good insights from such rules, e.g. sensor 6 in the 1st and 4th quadrants has less importance compared to its very strong importance in quadrants 2 and 3. If we know the exact feature names we can get a lot of valuable insights.

F) SHAP Values explanation-

In the above plot, 10 data points from class 1 are selected. We can clearly see that for these data points feature 6 is more important and the importance of feature 8 changes based on the feature values. At the same time, features 1, 2, 3, 5 and 7 are almost not useful at all for the prediction. (1 series represents 1 observation.)

The above plot has 10 observations from class -1. It shows that for class -1, feature 0 is also important for a few predictions, and instead of 8 and 6, features 7 and 9 are more important.

Such findings are more important in scenarios like multiple fault prediction and anomaly classification in industrial applications. Once we know the actual names of the signals we will get very insightful information.

The above plot shows how signal 6 is mostly useful in prediction, but there are many instances when it has no importance for the predicted value. Also, feature 6 has more classifying power for class 1 than for class -1. A similar analysis can be done on other features for a better and more exhaustive understanding of feature importance.
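To see why local and global importance can disagree, here is a small sketch for the simplest possible case: a linear model with independent features, where the SHAP value of feature i for one observation reduces to w_i * (x_i - mean_i). The weights and data below are made up, and the models in this post (RF, XGBoost) need the SHAP package rather than this shortcut.

```python
def linear_shap(weights, background, x):
    """Per-feature contributions w_i * (x_i - mean_i) for one observation x."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]

# Hypothetical linear scorer f(x) = 3*x0 - 1*x1 + 0*x2 and made-up observations
weights = [3.0, -1.0, 0.0]
background = [[1.0, 2.0, 7.0], [3.0, 2.0, 5.0], [2.0, 5.0, 9.0]]

# local explanation: one list of contributions per observation
local = [linear_shap(weights, background, row) for row in background]
# global importance: mean absolute contribution of each feature
global_importance = [sum(abs(c[i]) for c in local) / len(local) for i in range(3)]

for row, contrib in zip(background, local):
    print(row, contrib)
print(global_importance)
```

Feature 0 has the highest global importance here, yet for the third observation only feature 1 moves the prediction, which is the same pattern seen with features 8 and 6 above.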

Detailed code is present on Github- link to github code

Saturday, 23 November 2019

Automation of customer-care tickets resolution using NLP

When we call customer care, they keep connecting us to different departments such as the technical department, the billing department etc. What if they could suggest some quick fixes even though they are not experts in providing the solution?

Keeping this in mind, let's build a simple solution recommendation system for internet service providers based on the cosine similarity of earlier questions with the present question. The higher the similarity, the more likely the solution is the same. The assumption is that questions with similar titles and content will have similar solutions.

The following steps are required to implement the solution in Python 3-

1) load nlp libraries.

2) create dummy data with some questions and answers.

3) create a function that calculates cosine similarity of new ticket with all existing ticket titles.

4) show the solution/answer of the ticket that has maximum similarity with present ticket.

Step 1 - import required libraries-

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt') # tokenizer data used by word_tokenize to split text into tokens
nltk.download('stopwords') # list of stop words such as articles, prepositions etc.

Step 2 -create a dummy dataset-

question_ans_data= pd.DataFrame()

question_ans_data['question']= ['there is no internet','no ineternet since last 2 days','net speed is slow','wrong bill','too much charge']

question_ans_data['answer']= ['restart router, check if lights blinking','technician will be sent, check lights, restart router','technician will be sent','will get back to you','will get back to you'] 

have a look at data-

 Step 3- create a function ( set_con)  to do text pre-processing and calculate cosine similarity between 2 strings-

def set_con(X, Y):
    X_list = word_tokenize(X)  # convert strings into word tokens
    Y_list = word_tokenize(Y)
    sw = stopwords.words('english')
    l1 = []; l2 = []
    X_set = {w for w in X_list if w not in sw} # remove stop words
    Y_set = {w for w in Y_list if w not in sw}
    rvector = X_set.union(Y_set)

    # form a set containing the keywords of both strings as a pre-processing step for cosine similarity (it can also be calculated with sklearn.metrics)
    for w in rvector:
        if w in X_set: l1.append(1) # create a binary vector
        else: l1.append(0)
        if w in Y_set: l2.append(1)
        else: l2.append(0)
    c = 0

    # cosine formula
    for i in range(len(rvector)):
        c += l1[i] * l2[i]
    denominator = float((sum(l1) * sum(l2)) ** 0.5)
    cosine = c / denominator if denominator else 0.0 # guard against strings with no keywords
    return cosine
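Because the vectors built above contain only 0s and 1s, the cosine reduces to (number of shared keywords) / sqrt(keywords in X * keywords in Y). A compact sketch of that identity follows. It uses a tiny hand-written stop-word set and whitespace splitting as stand-ins (assumptions) so that it runs without the NLTK downloads.

```python
from math import sqrt

# Tiny stand-in stop-word list (an assumption; the post uses NLTK's English list)
STOP_WORDS = {"there", "is", "no", "not", "too"}

def keyword_cosine(a, b):
    """Cosine similarity of the binary keyword vectors of two strings."""
    set_a = {w for w in a.lower().split() if w not in STOP_WORDS}
    set_b = {w for w in b.lower().split() if w not in STOP_WORDS}
    if not set_a or not set_b:
        return 0.0  # no keywords left, so nothing to compare
    return len(set_a & set_b) / sqrt(len(set_a) * len(set_b))

new_ticket = "broadband internet not working"
print(round(keyword_cosine("there is no internet", new_ticket), 3))  # 0.577
print(keyword_cosine("wrong bill", new_ticket))                      # 0.0
```

One shared keyword ("internet") out of 1 and 3 keywords gives 1 / sqrt(1 * 3), roughly 0.577.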

Step 4- create the subject/title of the input ticket as a string-

input_ticket= 'broadband internet not working'  # input ticket

Step 5- find the most similar existing ticket title(s)-

question_ans_data['cosine_similiarity']= [set_con(x ,input_ticket) for x in question_ans_data['question']]  # calculating cosine similarity with existing tickets

sorted_main_df=question_ans_data.sort_values(by=['cosine_similiarity'], ascending=False)

output_dataset= sorted_main_df[sorted_main_df['cosine_similiarity'] == max(sorted_main_df['cosine_similiarity'])]   # most similar tickets based on similarity of questions


So if the most similar existing question is 'there is no internet', the suggested solution is to restart the router and check whether the lights are blinking. Given a large dataset with many tickets and possible solutions, this can be a great help for customer care executives.

product recommendation approach in retail industry-

Product Recommendation using MBA

Wednesday, 25 September 2019

Sentiment Analysis using NLTK and Sklearn in Python

Data can be downloaded from -

Step 1 - loading required libraries

import os # to check working path
from sklearn.datasets import load_files # load_files automatically labels classes when input data is present in different folders
import re # for regular expressions
import nltk  # for nlp
from nltk.stem import WordNetLemmatizer # uses the WordNet dataset for lemmatization
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer # get tf-idf values
from sklearn.model_selection import train_test_split # to split test and train dataset
from sklearn.ensemble import RandomForestClassifier # for classification
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pickle # to save model

Step 2 - loading data

movie_data = load_files("C:\\D\\Learning\\Sentiment Analysis usinf sklearn\\txt_sentoken")
X, y = movie_data.data, movie_data.target # documents and their class labels

Step 3- data preprocessing and converting into tf-idf values (documents are converted into an array over all the words, holding the tf-idf value of every word in every document)
new_X= []
stemmer = WordNetLemmatizer()
for data in X:
    data1= str(data)
    data2= re.sub(r'[^\w]', " ", data1) # replaces all special characters with spaces
    data3= re.sub(r'[\s+\W+\s]', " ", data2) # replaces any remaining whitespace or non-word character with a space
    data4= re.sub(r'[ ][ ]+', " ", data3) # collapses multiple spaces
    data5 = re.sub(r'^b\s+', '', data4) # removes the leading b left by the bytes prefix
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', data5) # removes single-letter words
    document = document.lower()
    document_splitted= document.split() # lemmatization is done word by word
    stemmed_doc= [stemmer.lemmatize(word) for word in document_splitted]
    stemmed_str= " ".join(stemmed_doc) # converting list back to str
    new_X.append(stemmed_str) # creating list of documents
vectorizer = TfidfVectorizer()
X= vectorizer.fit_transform(new_X)
X_arr= X.toarray()
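To make the tf-idf array less of a black box, here is a hand-rolled sketch of what TfidfVectorizer computes with its default settings: raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document row. The three short documents are made up, and plain whitespace splitting is used instead of the vectorizer's own tokenizer (an assumption).

```python
from math import log, sqrt

def tfidf(docs):
    """Return (vocabulary, rows) where rows[i][j] is the tf-idf of word j in doc i."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}  # document frequency
    idf = {w: log((1 + n) / (1 + df[w])) + 1 for w in vocab}     # smoothed idf
    rows = []
    for doc in tokenized:
        raw = [doc.count(w) * idf[w] for w in vocab]             # term count * idf
        norm = sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])                     # L2-normalized row
    return vocab, rows

vocab, rows = tfidf(["good movie", "bad movie", "good good plot"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document, "bad" (present in one document) gets a larger weight than "movie" (present in two), which is the whole point of the idf term.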

Step 4- Getting train and test set and fitting classification 

X_train, X_test, y_train, y_test = train_test_split(X_arr, y, test_size= .2)
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

Step 5- model Evaluation-

# model evaluation on train data
y_predicted= classifier.predict(X_train)
cf= confusion_matrix(y_train, y_predicted)

print(classification_report(y_train, y_predicted)) 
# model evaluation on test data
y_test_predicted= classifier.predict(X_test)
print(confusion_matrix(y_test, y_test_predicted))

print(classification_report(y_test, y_test_predicted))
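For reading the output of confusion_matrix and classification_report, it helps to compute the same numbers once by hand. The sketch below does that for a binary problem in plain Python. The labels are made up, and the matrix layout (rows = actual class, columns = predicted class) follows the scikit-learn convention.

```python
def binary_report(y_true, y_pred, positive=1):
    """Confusion-matrix counts and the headline metrics for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "matrix": [[tn, fp], [fn, tp]],  # rows = actual class, columns = predicted class
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }

# Made-up labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
report = binary_report(y_true, y_pred)
print(report)
```

A large gap between these numbers on the train set and on the test set is the usual sign of overfitting, which is why both are printed in Step 5.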

Step 6- storing and loading model again-

with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)
with open('text_classifier', 'rb') as mfile:
    model= pickle.load(mfile)

Step 7- test on new document

file1 = open("nerw_review.txt","r")
data_file= file1.readlines()

X1= vectorizer.transform(data_file) # vectorizer.transform is used to convert new doc into tf-idf
predict_review= classifier.predict(X1)


Tuesday, 9 July 2019

Deep Learning with H2O in Python

H2O.ai is focused on bringing AI to businesses through software. Its flagship product is H2O, the leading open source platform that makes it easy for financial services, insurance companies, and healthcare companies to deploy AI and deep learning to solve complex problems. More than 9,000 organizations and 80,000+ data scientists depend on H2O for critical applications like predictive maintenance and operational intelligence. The company, which was recently named to the CB Insights AI 100, is used by 169 Fortune 500 enterprises, including 8 of the world’s 10 largest banks, 7 of the 10 largest insurance companies, and 4 of the top 10 healthcare companies. Notable customers include Capital One, Progressive Insurance, Transamerica, Comcast, Nielsen Catalina Solutions, Macy’s, Walgreens, and Kaiser Permanente.

Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. To make it easier for non-engineers to create complete analytic workflows, H2O’s platform includes interfaces for R, Python, Scala, Java, JSON, and CoffeeScript/JavaScript, as well as a built-in web interface, Flow. H2O is designed to run in standalone mode, on Hadoop, or within a Spark Cluster, and typically deploys within minutes.

H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naïve Bayes, principal components analysis, k-means clustering, and word2vec. H2O implements best-in-class algorithms at scale, such as distributed random forest, gradient boosting, and deep learning. H2O also includes a Stacked Ensembles method, which finds the optimal combination of a collection of prediction algorithms using a process known as "stacking." With H2O, customers can build thousands of models and compare the results to get the best predictions.

Here is an example of using H2O deep learning in Python-

Step 1-  First of all, we need to install the H2O package in Python.

on anaconda prompt
pip install h2o

Step 2-  Initialize and start the cluster-

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
h2o.init()

Step 3-  load train and test data set-

train = h2o.import_file("")

Step 4-  Creating test and train data set using split-

splits = train.split_frame(ratios=[0.75], seed=1234)

Step 5-  Configuring the model-

model = H2ODeepLearningEstimator(distribution = "AUTO",activation = "RectifierWithDropout",hidden = [32,32],input_dropout_ratio = 0.2,l1 = 1e-5,epochs = 10)

Step 6-  train(fit the model)-

model.train(x=["petal_len"], y="sepal_len", training_frame=splits[0]) # x takes a list of predictor columns, y the name of the response column

Step 7-  predicting using trained model and creating a new column in test data-


One can compare sepal_len ( actual) and predicted_sepal_len ( forecasted )  values.

Thursday, 4 July 2019

How to survive in data science and the first steps

A few years ago I read this article, and it made sense in 2012-2017-

Gone are the days when IT organizations looked for a core data science profile, which involves doing research and completing the POC. There is a lot of hype around data science, and in the very near future this profile will become obsolete. People in the data science profile know it. It looks fancy to other IT profiles because a lot of material is pushed out by training institutes and start-ups. The current demand is short term (organizations are in an exploration phase, working out what to do with their data and delivering POCs). Most organizations are now looking for an ML engineer profile, which is a combination of 3 profiles- data engineer, data scientist and someone who can deploy in production (in the cloud most of the time).

The sooner the better. So-called data scientists should move into data engineering and embrace the cloud. Here I give a small introduction on how to start working on Azure Databricks so that people like me can become better hiring material.

Step-1 Create Azure trial account, Databricks Workspace and launch the workspace

Step-2 Data bricks quick start-

Step-3 Why not try Keras-

        A)   The Sequential model is a data structure provided by Keras. One needs to add layers to it according to the NN model-

from keras.models import Sequential
model = Sequential()

        B)      add the layers according to structure of neural network-

from keras.layers import Dense
model.add(Dense(units=4, activation='relu', input_dim=2))
model.add(Dense(units=1, activation='linear'))

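C)      configure the model by passing arguments-

# the loss and optimizer arguments were lost from the original snippet; the values used here are assumptions
model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=['mae', 'mape'])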

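D)   creating X and Y values-

import numpy as np
import pandas as pd
x1 = np.random.randn(10000, 2)
dataframe_X= pd.DataFrame(x1)
dataframe_X.columns =['x1','x2']
Y1 = np.random.randn(10000, 1)
x_train, x_test, y_train, y_test = x1[:8000], x1[8000:], Y1[:8000], Y1[8000:] # assumed 80/20 split; the original split was not shown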

E)      fitting the model by calling fit-

model.fit(x_train, y_train, epochs=5, batch_size=32)

F)      model evaluation-

evaluation_metrics= model.evaluate(x_test, y_test)

G)    use model for prediction-

predicted_value = model.predict(dataframe_X)

H)   testing on the same data-

predicted_vals = model.predict(x_test, batch_size=32)
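To see what the two Dense layers added in step B actually compute, here is a plain-Python sketch of the same 2 -> 4 -> 1 architecture doing a forward pass. The weights are random placeholders (Keras picks and trains its own values), and the helper names dense, relu, linear and predict are made up for this illustration.

```python
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(inputs . weights + biases)."""
    outputs = []
    for j in range(len(biases)):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(activation(z))
    return outputs

def relu(z):
    return max(0.0, z)

def linear(z):
    return z

random.seed(0)
# Random placeholder parameters (illustrative only)
w1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 2 inputs -> 4 units
b1 = [0.0] * 4
w2 = [[random.uniform(-1, 1)] for _ in range(4)]                     # 4 units -> 1 output
b2 = [0.0]

def predict(x):
    hidden = dense(x, w1, b1, relu)       # Dense(units=4, activation='relu', input_dim=2)
    return dense(hidden, w2, b2, linear)  # Dense(units=1, activation='linear')

n_params = sum(len(r) for r in w1) + len(b1) + sum(len(r) for r in w2) + len(b2)
print(n_params)  # 17
print(predict([0.5, -1.2]))
```

The 17 parameters (2*4 + 4 in the first layer, 4*1 + 1 in the second) are what the fit step adjusts over the epochs.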

Although this code is written in Python, we have now run our first ML program on Databricks. One should start replacing Python commands with PySpark commands and make it a habit over time.
In production, this notebook will read run-time data through a scheduled job (how to schedule a job in Databricks), and from the notebook one can save the predicted values in any database, which can then be read by a visualization tool or another application.

Data scientists should come out of the pure research, statistics, R/Python profile to stay relevant in the IT industry. Remember the golden words by Charles Darwin-

Friday, 14 December 2018

Easiest and most effective way of detecting abnormality/outlier in time-series data

We have read many blogs on various anomaly detection algorithms. Many times, we don't need any special algorithm to detect abnormality in a system.

Different machine learning approaches to detect abnormality in a system.

Data scientists use everything from multi-angle PCA to autoencoders to detect abnormality in time-series data. There are other complex techniques like ABOD, used in high-dimensional data, and CBOF, used when density-based algorithms fail. These techniques are effective only if you know the properties of the expected abnormality in the system.

The most effective approach, as mentioned in Anomaly detection approaches, is to build an expected rule from the variables involved; any deviation from this rule is an indication of abnormality in the time series. One can use an autoencoder, PCA or regression to build such rules. We use regression so that the audience understands the concept and doesn't get bogged down by the related algorithms.

We can take any home appliance as an example, such as an electric fan. Let's say we know the temperature of the fan's motor and the current going into it.

from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# take data values from normal running scenario, hopefully there is no issue in motor now. Generally this is the time when fan is just installed -

# creating a dummy data
data = [[352,88],[350,90],[350,89],[400,95],[400,94], [390,92], [400,93], [352,88],[350,90],[352,91],[400,95],[400,94], [390,92], [400,93],[350,90],[350,89],[400,95]]

df = pd.DataFrame(data,columns=['Current','Temp'],dtype=float)

# taking independent ( current)  and dependent variable ( temperature)  for relation ( to build using regression )
X= df['Current']
X1= X.values.reshape(X.size,1)
Y= df['Temp']
Y1 = Y.values.reshape(Y.size,1)

# fitting the regression model
regr = linear_model.LinearRegression()
regr.fit(X1, Y1)
predictions =regr.predict(X1)

# plotting error and analyzing  it
error =Y1- predictions
plt.plot(error)

The plot shows that the errors lie randomly around y=0 and stay within about ±1.5, which seems a good fit. Thus we get a relation between the current and the temperature of the motor. If we know the actual current, we can predict the temperature with some accuracy. Now the concept is- if the actual temperature is far higher than what it should be (predicted from the current values), then there might be some thermal abnormality in the motor. Let's extend our example further-

Now taking run-time data (run-time values of current and temperature) from the fan-

test_X= np.array([400,380,370,355, 370,370,350, 360, 355,352,350,350,400,400,390,400,400,380,400,380,390,400,350,350])
test_Y= np.array([96,94,93, 92, 93,98,97, 98,97,88,90,89,95,94,92,94,96,94,96,93,92,94,90,90])

# predicting temperature for the present values of current ( at run time)
test_X1= test_X.reshape(test_X.size,1)
test_Y1 = test_Y.reshape(test_Y.size,1)
run_time_predictions =regr.predict(test_X1)

# plotting the errors
plt.plot(test_Y1- run_time_predictions)

The error seems high for a few minutes (between 5 and 9). Let's combine both the train and test error values-

# combining train and test errors to include longer period of time in analysis
X_values= np.concatenate((X1, test_X1), axis=0)
Y_values= np.concatenate((Y1, test_Y1))
prections_values= np.concatenate((predictions,run_time_predictions))
Error_values= Y_values- prections_values

The Y values (errors) near x=20 show that the temperature is far higher than expected for the given amount of current flow. This has to be investigated further (the coolant might not be working, sparking might be happening etc.). After 23-24, the motor is running fine again, as the error is randomly distributed around y=0.

Thus at run time a high error (positive, i.e. actual more than expected) is an indicator of an abnormal system. I don't know how the plot came out looking like somebody showing the middle finger, but exactly the middle finger is abnormal here. haha!!
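The eyeballing of the error plot can also be turned into an explicit rule. The sketch below refits the same straight line in plain Python on the data from this post and flags run-time points whose positive error exceeds 3 standard deviations of the training error. The 3-sigma cut-off is an assumption chosen for illustration, not something tuned here.

```python
from math import sqrt

# Training data (current, temperature) and run-time data, copied from the post
train = [(352, 88), (350, 90), (350, 89), (400, 95), (400, 94), (390, 92), (400, 93),
         (352, 88), (350, 90), (352, 91), (400, 95), (400, 94), (390, 92), (400, 93),
         (350, 90), (350, 89), (400, 95)]
test_x = [400, 380, 370, 355, 370, 370, 350, 360, 355, 352, 350, 350,
          400, 400, 390, 400, 400, 380, 400, 380, 390, 400, 350, 350]
test_y = [96, 94, 93, 92, 93, 98, 97, 98, 97, 88, 90, 89,
          95, 94, 92, 94, 96, 94, 96, 93, 92, 94, 90, 90]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)
train_err = [y - (slope * x + intercept) for x, y in train]
sigma = sqrt(sum(e * e for e in train_err) / len(train_err))
threshold = 3 * sigma  # assumed cut-off for "far more than expected"

# only positive errors (actual temperature above expected) are flagged
flagged = [i for i, (x, y) in enumerate(zip(test_x, test_y))
           if y - (slope * x + intercept) > threshold]
print(flagged)  # [5, 6, 7, 8]
```

The flagged positions are the same run-time minutes (5 to 8) where the error plot shoots up.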

The Github link for the same is present at - Python_Regression_Anomaly_Detection

Read about the mother of all time series algorithms here- ForecastHybrid

Saturday, 6 October 2018

Religious demographics of India in future: A Machine Learning View

According to the Sachar Committee (ref 1) report in 2005, the religious demographics of India for the next 100 years are below-

We took a machine learning approach and built different time series to show the demographics (of the 2 major religions) in the coming years. The data is taken from Wikipedia (2011 Census of India; ref 2). The data used is given below-

The above image clearly shows that Hinduism is the major religion, followed by Islam. Let's create a new variable, the ratio of 'Hinduism to Islam', for these 7 censuses-

For 1951 the ratio is 84.1/9.8, which is 8.581633; similarly for the other decades-

8.581633, 7.806361, 7.380018, 7.004255, 6.465504, 5.991065, 5.607871,

So Hinduism, which had 8.5 times as many followers as Islam in 1951, has 5.6 times as many in 2011.

Now, let's build an ARIMA time series on the ratio variable-

library(forecast)
ratio <- c(8.581633, 7.806361, 7.380018, 7.004255, 6.465504, 5.991065, 5.607871)
comman_ratio <- auto.arima(ratio)
forecasted_ratio <-forecast(comman_ratio, 10)

The above table and image show that around 2100, Islam and Hinduism will have an equal number of followers. Is this forecast correct?

Let's build another time series with a different ratio; now the variable is the ratio of the Islam to the Hinduism population. This variable gives the Islam population as a fraction of the Hinduism population in India.

0.1165279, 0.1281007 ,0.1355010, 0.1427704, 0.1546670, 0.1669152, 0.1783208 ( ratio1)

In 1951, Islam is about 11.7% of Hinduism, and in 2011 it is about 17.8% of Hinduism in India.

comman_ratio1 <- auto.arima(ratio1)
forecasted_ratio <-forecast(comman_ratio1, 80)

qq <- c(ratio1, forecasted_ratio$mean)
year= seq(from = 1951, to=2811, by=10)
df <- data.frame(percentage_of_islam_compare_to_hinduism= qq, year =year )
ggplot2::ggplot(df, ggplot2::aes(year, percentage_of_islam_compare_to_hinduism)) + ggplot2::geom_line()

So this forecast says that Islam is not going to be equal to Hinduism but about 28% of it, and with the current growth rate it would take about 800 years for Islam to equal Hinduism in terms of followers.

So what is the correct composition of demographics in 2100? Machine learning gives different results based on the variable taken. Plus, 7 data points are not sufficient to forecast the next 70 values. ☺☺ The results might be different if we had taken only the populations of the religions, not the ratios.
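The contradiction does not depend on ARIMA. As a rough cross-check, the sketch below swaps in a plain straight-line trend (least squares, not the auto.arima models used above) for each of the two ratio series from this post and asks when each line would reach 1, i.e. equal numbers of followers.

```python
def linear_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
             / sum((t - mean_t) ** 2 for t in range(n)))
    return slope, mean_v - slope * mean_t

def year_ratio_reaches_one(values, first_year=1951, step=10):
    """Year in which the fitted straight line crosses a ratio of 1."""
    slope, intercept = linear_trend(values)
    return first_year + step * (1 - intercept) / slope

# The two series from the post (one value per census, 1951 to 2011)
hindu_to_islam = [8.581633, 7.806361, 7.380018, 7.004255, 6.465504, 5.991065, 5.607871]
islam_to_hindu = [0.1165279, 0.1281007, 0.1355010, 0.1427704, 0.1546670, 0.1669152, 0.1783208]

print(round(year_ratio_reaches_one(hindu_to_islam)))
print(round(year_ratio_reaches_one(islam_to_hindu)))
```

The first series crosses 1 shortly after 2100, the second only after roughly 2800, so the same data gives answers about 700 years apart depending on which way round the ratio is taken. A falling ratio and its rising reciprocal cannot both be straight lines, which is why extrapolating either one far beyond 7 points is not trustworthy.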


1) Sachar_Committee
2) 2011_Census_of_India

Tuesday, 25 September 2018

Connectivity Based Outlier Detection and its implementation in R

Identifying abnormality in any industrial process, banking fraud, ad clicks etc. is one of the major challenges for data scientists. There are many ways of detecting an abnormality.

different ways of detecting abnormalities through machine learning

There are many outlier detection techniques. One of these is the connectivity-based outlier factor (COF). It is an improved version of the LOF (local outlier factor) technique.

A data point lying away from a linear set of data points should have been picked as an outlier.

The idea of the connectivity-based outlier algorithm is to assign a degree of outlierness to each data point. This degree is called the connectivity-based outlier factor (COF) of the data point. A high COF value represents a high probability of the data point being an outlier.
Let’s understand COF step by step with an example.

The diagram below shows 9 data points in the plane. As we can see, there are 2 data points, P1 and P2, which are away from the trend line and seem to be outliers. The COF values for P1 and P2 should be higher than those of the other data points on the trend line. Here we take k=5 nearest neighbors for the COF calculation.
The following steps compute the COF value for the data point P1.

1)   Find the k nearest neighbors (k-NN) of the data point P1 (k=5).

N5 (P1) = {P2, P5, P4, P7, P6}, the set of the 5 data points nearest to P1.

2)   Find the set based nearest (SBN) path: it represents the k nearest data points in order, s = {P1, P2, ..., Pk}.
SBN path = {P1, P2, P5, P4, P6, P7}. Arrange the data points so that they form a path: P2 is the nearest data point to P1, then P5 is the nearest data point to P2, then either P6 or P4 can be chosen as the nearest data point to P5, and then P7 is the nearest data point to P6. All chosen data points must belong to the nearest-neighbor set N5 (P1).

3)    Find the set based nearest (SBN) trail: it represents the sequence of edges based on the SBN path, e = {e1, e2, ..., ek}. SBN trail = {(P1, P2), (P2, P5), (P5, P4), (P5, P6), (P6, P7)}, i.e. the edges e1, e2, e3, e4, e5 respectively.

4)    Find the cost of the SBN trail: it represents the distance between the 2 data points of each edge (edge value). Cost description = {3, 2, 1, 1, 1}, the weight of each edge.

5)   Find the average chaining distance (ac-dist) of the data point.
dist(ei) denotes the distance between the 2 data points of an edge ei.

Like P1, find the average chaining distance for all 5 nearest neighbors P2, P4, P5, P6 and P7.
Formula explanation:
Total number of edges = {(P1,P2),(P1,P4),(P1,P5),(P1,P6),(P1,P7),(P2,P4),(P2,P5),(P2,P6), (P2,P7),(P4,P5),(P4,P6),(P4,P7),(P5,P6),(P5,P7),(P6,P7)} =15

k(k+1)/2 = 5(5+1)/2=15

Sum of the cumulative edge weights during the traversal of the nearest data points:
- Edge weight from P1 to P2 = 3
- Edge weight from P1 to P5 = 3+2 =5
- Edge weight from P1 to P4 = 3+2+1 =6
- Edge weight from P1 to P6 = 3+2+1+1 =7
- Edge weight from P1 to P7 = 3+2+1+1+1 =8
Total edge weight =(3+5+6+7+8) = 29

ac-dist(P1) = 29/15 = 1.933
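The calculation above can be written as a weighted sum over the trail edges: ac-dist = sum over i = 1..k of [2(k+1-i) / (k(k+1))] * dist(ei), so earlier edges count more than later ones. The sketch below (function names are made up for illustration) computes it both ways, with the weighted formula and with the running totals used in the worked example, to show that they agree.

```python
def ac_dist(trail_costs):
    """Average chaining distance from the ordered SBN-trail edge costs."""
    k = len(trail_costs)
    return sum(2 * (k + 1 - i) / (k * (k + 1)) * d
               for i, d in enumerate(trail_costs, start=1))

def ac_dist_cumulative(trail_costs):
    """Same quantity via the running totals used in the worked example."""
    running, total = 0, 0
    for d in trail_costs:
        running += d      # 3, 5, 6, 7, 8 for the example
        total += running  # ends at 29
    k = len(trail_costs)
    return total / (k * (k + 1) / 2)

costs = [3, 2, 1, 1, 1]  # cost description of the SBN trail for P1
print(round(ac_dist(costs), 3))             # 1.933
print(round(ac_dist_cumulative(costs), 3))  # 1.933
```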

6)     Find the COF value of the data point-

COF is the ratio of the average chaining distance of the data point to the average of the average chaining distances of its k nearest neighbors.

Like COF(P1), find the COF for all the data points in the diagram; the data points with high COF values are considered outliers.
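The ratio in step 6 is a one-liner once the ac-dist values are known. In the sketch below, ac-dist(P1) = 29/15 comes from the worked example, while the five neighbor values are hypothetical placeholders, since the post does not list them.

```python
def cof(ac_dist_p, neighbour_ac_dists):
    """COF = ac-dist of the point / mean ac-dist of its k nearest neighbours."""
    k = len(neighbour_ac_dists)
    return ac_dist_p / (sum(neighbour_ac_dists) / k)

ac_dist_p1 = 29 / 15                           # from the worked example
neighbour_values = [1.6, 1.0, 1.0, 1.1, 1.1]   # hypothetical ac-dist of P2, P4, P5, P6, P7
print(round(cof(ac_dist_p1, neighbour_values), 3))
```

A COF close to 1 means the point is chained to its neighbors about as tightly as they are to theirs; values well above 1, as here, point to an outlier.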

The darker data points are the strongest outliers. One can compare CBOF with angle-based outlier detection (ABOD) techniques.