Thursday, 4 July 2019

How to survive in data science and the first steps


A few years ago I read this article, and it made sense in 2012-2017-



The days are gone when IT organizations looked for a core data science profile, which meant doing research and completing POCs. There is a lot of hype around data science, and in the very near future this profile will become obsolete; people in the data science profile already know it. It looks fancy to other IT profiles because training institutes and start-ups bombard the market with material. The current demand is short term (organizations are in an exploration phase, figuring out what to do with their data and delivering POCs). Most organizations are now looking for an ML-engineer profile, which is a combination of three profiles: data engineer, data scientist, and someone who can deploy to production (most of the time in the cloud).


The sooner we adapt, the better. So-called data scientists should move into data engineering and embrace the cloud. Below I have given a small introduction on how to start working with Azure Databricks, so that people like me can become better hiring material.

Step-1 Create an Azure trial account, create a Databricks workspace, and launch the workspace



Step-2 Databricks quick start-
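
To check that the workspace is alive, here is a minimal first notebook cell. The spark session comes pre-created in every Databricks notebook; the sample file path is an assumption based on the datasets Databricks ships under /databricks-datasets.

     # `spark` is pre-created by Databricks in every notebook.
     # The sample path is an assumed Databricks-shipped dataset.
     df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv",
                         header=True, inferSchema=True)
     df.show(5)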



Step-3 Why not try Keras-



        A)   The Sequential model is the basic model container in Keras; one adds layers to it according to the neural-network architecture-

     from keras.models import Sequential
     model = Sequential()

        B)      Add the layers according to the structure of the neural network (here 2 inputs, a hidden layer of 4 ReLU units, and 1 linear output)-

     from keras.layers import Dense
     model.add(Dense(units=4, activation='relu', input_dim=2))
     model.add(Dense(units=1, activation='linear'))
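
To verify the wiring, Keras can print the resulting architecture; for the two layers above it reports 12 parameters in the hidden layer (2 inputs x 4 units + 4 biases) and 5 in the output layer (4 weights + 1 bias).

     # Inspect the architecture: 2 inputs -> 4 ReLU units -> 1 linear output,
     # 17 trainable parameters in total (12 + 5)
     model.summary()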

C)      Configure the learning process by passing a loss function, an optimizer, and the metrics to track-

     model.compile(loss='mean_squared_error',
                   optimizer='sgd',
                   metrics=['mae', 'mape'])


D)   Create X and Y values, plus a train/test split (the later steps use x_train and x_test)-

     import numpy as np
     import pandas as pd

     # 10,000 random samples with 2 features and random targets
     x1 = np.random.randn(10000, 2)
     dataframe_X = pd.DataFrame(x1, columns=['x1', 'x2'])
     Y1 = np.random.randn(10000, 1)

     # simple 80/20 split so the fit/evaluate/predict steps below run
     x_train, y_train = dataframe_X[:8000], Y1[:8000]
     x_test, y_test = dataframe_X[8000:], Y1[8000:]

   E)      Fit the model by calling model.fit-

     model.fit(x_train, y_train, epochs=5, batch_size=32)




      F)      Model evaluation-

     evaluation_metrics = model.evaluate(x_test, y_test)
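
evaluate() returns the loss followed by each metric passed to compile(), so with the configuration above the list is [loss, mae, mape]-

     # unpack in compile() order: loss first, then the metrics
     loss, mae, mape = evaluation_metrics
     print('test MSE:', loss, ' MAE:', mae, ' MAPE:', mape)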



        G)    Use the model for prediction on the full dataset-

     predicted_value = model.predict(dataframe_X)

    H)   Predicting on the same test data, in batches-

     predicted_vals = model.predict(x_test, batch_size=32)


Although this code is written in plain Python, we have now run our first ML program on Databricks. One should start replacing Python/pandas commands with PySpark commands and make it a habit over time.
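
As a minimal sketch of that habit (assuming the notebook runs on Databricks, where a spark session is pre-created), the pandas DataFrame from step D can be moved into Spark and back:

     # pandas -> Spark: distribute the data across the cluster
     spark_df = spark.createDataFrame(dataframe_X)
     spark_df.describe('x1', 'x2').show()   # distributed summary statistics
     # Spark -> pandas: collect back to the driver for Keras
     pandas_again = spark_df.toPandas()
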
In production, this notebook would read runtime data through a scheduled job (Databricks lets you schedule a notebook as a job), and from the notebook one can save the predicted values into a database, which can then be read by a visualization tool or another application.
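
Here is a minimal sketch of that last step, assuming we simply persist the predictions as a Databricks table (the table name predicted_values is illustrative):

     # Persist predictions as a table that a BI tool or another job can query.
     # The table name 'predicted_values' is illustrative.
     pred_df = spark.createDataFrame(
         pd.DataFrame(predicted_value, columns=['prediction']))
     pred_df.write.mode('overwrite').saveAsTable('predicted_values')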

Data scientists should come out of the pure research/statistics/R-Python profile to stay relevant in the IT industry. Remember the golden words of Charles Darwin-





