XAI( Explainable AI ) is grabbing lime-light in machine learning. How can we be sure that image classification algo is learning faces not background ? Customer wants to know why loan is disapproved? Globally important variable might not be responsible/ imp for individual prediction. Here XAI comes to rescue-
We have taken data from classification_data
This has some sensor values and an output class.
A) Data Explainability- what are the basic understanding required from data perspective.
1) Identify missing values, co-linear feature, feature interaction, zero importance feature, low important feature, single value feature and handle missing values, remove/ handle features accordingly.
2) Missing values- no missing values from data description
3) No good correlation between variables- can be seen from correlation plots
4) Feature interaction- tree based models would approximate integration interation in CART
5) Zero importance, low importance, single feature value- handles through RFE and models( RF, XGboost) itself.
6) Distribution and sampling of both the class and features is also seen as selection of model will depend of data distribution. Chances are data with lot of categorical variables is more suitable for tree based model.
7) Box plot itself can identify important feature for classification. We can see sensor 3, 8, 6 looks important whereas 5, 7 may not have good prediction power.
B) Other Approaches- Feature selection/engineering-
1) univariate feature selection using chi square test. ( select k best)-
2) Recursive feature Engineering RFE- select n specific features based on underlying model used. ( used)
3) PCA PCA- to reduce corelated feature by linear transformation ( not needed)
4) Autoencoders- non linear transformation of features if needed ( it will be over-kill here)
5) Feature importance by Random forest, DT( In terms of rules), other tree ensemble models like Catboost and Xgboost.- used on our scenario
C) Feature Importance on sensor data ( Global)- In practical I take features importance from the domain / business people, as in our scenario sensor 7 ( one of the least important feature) might be electric current in steel mixture plant and to see impact of current in anomalies/fault it has to be on higher sampling( micro/ mili seconds) unlike temperature. Thus we will be missing an important feature as data collection rate is not correct. Such understanding can only come from domain experts. So business understanding and ML both are equally important for feature engineering.
There are white box models like DT and Random Forest to get feature importance from model itself. In our case we have taken coefficient of logistic regression in the beginning.( see all the algos comparison at github- link Here we are relying on the models that have maximum accuracy - RF and xgboost.
Thus over all we can say that feature- 8,6, 4, 0, 1,3 looks important for classification model. Feature 7 seems having no importance in xgboost as its classification power is captured by other feature. This Important of features was visible in box-plot also.
Recursive feature Elimination is useful in selecting subset of features as it tells top feature to keep for modeling.
D) Feature Importance on sensor data ( Local)-
With the advancement of ML and Deep learning, just global importance is not useful. Business, Data scientist are looking for local explanation too. In our analysis, we have used IBM AIX 360 framework to get importance of rules on the features( importance of feature based on the values of feature and output value). The options to use different packages/framework are-
MS Azure Explainability
The above image shows feature 8 is most important over-all but when it comes to specific predictions. Subset of feature 6 seems more importance for many predictions. We can get good insights from such rules like- sensor 6 in 1 st and 4th quadrant has less importance compare to very strong importance in quadrant 2 and 3. If we know the exact feature name we can get lot of valuable insights.
F) SHAP Values explanation-
In above plot 10 data points from class 1 is selected, we can clearly see for these data points 6 is more important and importance of 8 is changing based on values of features. At the same time feature 1,2,3,5,7, are almost not useful at all for the prediction. ( 1 Series represents 1 observation)
Above plot has 10 observations from class -1. It shows that for class -1 , feature 0 is also important for few predictions and instead of 8 and 6, 7 and 9 are more important.
Such finding are more important when we have scenarios like multiple fault prediction, anomalies classification I industrial applications. Once we know the actual name of signals we will get very insightful information.
Above plot shows how signal 6 is mostly useful in prediction but there are many instances when it has no importance on predicted value. Also feature 6 has more classifying power for class 1 rather than -1. Similar analysis can be done on other features for better and exhaustive understanding of features- importance.
Detailed code is present on Github- link to github code