• Sameer Darekar

    I have no special talent. I am only passionately curious

    - Albert Einstein


    Indiana University Bloomington (School Of Informatics and Computing)

    MS in Data Science (Computational and Analytical Track)

    January 2016 to December 2017

    GPA: 3.62

    University of Pune (Sinhgad Academy of Engineering)

    Bachelor of Engineering in Information Technology

    May 2008 to May 2012

    Grade: First Class with Distinction


    QxBranch LLC

    Duration - January 2017 to August 2017

    Title - Data Scientist Intern


    Modelling Cyber Risk

    • Worked on building a fully integrated solution that lets reinsurance analysts assess an organization's cyber risk so they can write policies and build portfolios effectively.
    • Implemented Positive Unlabeled Learning and Domain Infused Machine Learning techniques for predicting external breaches.
    • Designed and implemented a change data capture framework for augmenting the master dataset over time from multiple data sources.
    • Scraped Google News and identified keywords using TF-IDF and TextRank, increasing the AUC score by 10%.
    • Found temporal patterns preceding cyber breaches by merging the Censys and ARIN datasets with other sources.

    Stochastic Model for Fantasy League Prediction

    • Used Monte Carlo sampling to estimate posterior distributions from likelihoods and priors for the different outcomes of each player pair.
    • Created hierarchical logistic regression models and infused domain knowledge to generate ball-by-ball simulations of a match.
    • Predicted the fantasy points a player will score in a cricket match by running many simulations, with an RMSE of 15 points.
    • Generated equipoised matchups for maximizing profits in bets by extending the Elo rating system from chess to cricket.
    • Wrote scrapers to gather all the data from ESPN and Cricsheet and to update it after every match.
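
    As a rough illustration of the Elo idea mentioned above, here is the standard chess formulation only. How it was adapted to cricket matchups is not described here, and the K-factor of 32 and the 400-point scale are the conventional chess defaults, assumed for this sketch.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one contest.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k (assumed value) controls how fast ratings move.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated sides: the winner gains exactly what the loser drops.
new_a, new_b = update_ratings(1500.0, 1500.0, score_a=1.0)
```

    An "equipoised" matchup in this framing is one where `expected_score` is close to 0.5 for both sides.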


    Accenture India Pvt. Ltd.

    Duration - June 2012 to November 2016 (3 years and 6 months)

    Title - Software Engineering Analyst


    CID Rationalization and Inventory Alignment

    • Performed root cause analysis of the anomalous data generated by the faulty design of the BFG (Big Friendly Giant) DB.
    • Created data-cleansing scripts in Python and PL/SQL.

    SONAR-AML (Anti-Money Laundering)

    • Migrated historical customer data to HDFS and used Hive to clean inconsistencies in the data.
    • Received the Accenture Stellar Award for generating reports on potential money-laundering customers using UDFs.
    • Completed Accenture’s Banking Generalist Certification Program, a fundamental banking domain certification program.

    Enterprise Service Delivery System (ESDS)

    • Examined the backend flow between the application and its upstream and downstream applications while working in the Orchestration Engine team.
    • Automated the manual validation of data in the system with Python and VBA scripts.

    Sands Technologies Pvt. Ltd.

    Duration - February 2012 to April 2012 (3 months)

    Title - Project Intern

    Worked on the Multi Tracking System


  • My Encounters with Data


    San Francisco Crime Analysis - Kaggle Competition

    Python(Scikit Learn, Numpy), R(ggmap), Tableau

    Found meaningful patterns by plotting spatial and temporal data that give insight into crime in San Francisco, such as an increase in the crime rate in October, a decrease in December, and major crime hotspots in the Tenderloin and Southern regions. Took part in the Kaggle competition for classifying crimes and finished in the top 40% of 2335 teams. Used R (ggmap) and Tableau for visualization and Python (Scikit Learn, Numpy) for classification. Worked with one other team member on the visualization.



    Tetris Game Playing AI Bot


    Created an evaluation function that measured how good a game state was, considering parameters such as structure height, smoothness, lines cleared, and blockages. These parameters needed carefully selected weights to set the relative importance of each one in the final measure. Used a 2-stage look-ahead, estimating the probability distribution over pieces and taking the most probable next piece. In a class of 100 students consisting of Ph.D. and master's students, my group placed 2nd in a tournament held by our professor.
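
    A minimal sketch of such a weighted evaluation function is shown below. It is not the tournament code: the board encoding (rows of 0/1 cells, top row first), the exact feature definitions, and the weight values are all assumptions made for illustration.

```python
def column_heights(board):
    """Height of each column, measured from the floor to its topmost block."""
    rows = len(board)
    heights = []
    for col in range(len(board[0])):
        height = 0
        for row in range(rows):
            if board[row][col]:
                height = rows - row
                break
        heights.append(height)
    return heights


def count_holes(board):
    """Empty cells that have at least one filled cell above them (blockages)."""
    holes = 0
    for col in range(len(board[0])):
        seen_block = False
        for row in range(len(board)):
            if board[row][col]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes


def evaluate(board, weights):
    """Weighted sum of board features; a higher score means a better state."""
    heights = column_heights(board)
    features = {
        "height": sum(heights),
        # Smoothness, expressed as total height difference between neighbors.
        "bumpiness": sum(abs(a - b) for a, b in zip(heights, heights[1:])),
        "holes": count_holes(board),
        "lines": sum(1 for row in board if all(row)),
    }
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical weights: penalize height, roughness, and holes; reward full lines.
WEIGHTS = {"height": -0.5, "bumpiness": -0.2, "holes": -0.7, "lines": 1.0}
```

    A bot built this way would score every legal placement of the current piece with `evaluate` and play the highest-scoring one.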

    POS Tagging using Hidden Markov Model


    Implemented part-of-speech tagging using a Hidden Markov Model (HMM decoded with the Viterbi algorithm) and a higher-order HMM, with an accuracy of 94% at the word level and 90% at the sentence level on the Brown Corpus.
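
    For illustration, a first-order Viterbi decoder can be sketched as follows. The toy probabilities are invented for the example and are not estimated from the Brown Corpus, and the small probability floor for unseen events is an assumed stand-in for proper smoothing.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def logp(p):
        # Floor zero probabilities so unseen events do not produce log(0).
        return math.log(max(p, floor))

    best = [{t: logp(start_p.get(t, 0.0)) + logp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Best previous tag for reaching tag t at position i.
            prev, score = max(
                ((p, best[i - 1][p] + logp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + logp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model (made-up numbers).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.1, "VERB": 0.8},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}
tagged = viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)
```

    A higher-order HMM conditions each transition on more than one previous tag, which this sketch does not cover.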

    Statistical Analysis of Indian Cricket Team

    Python(Pandas, Scikit Learn, Numpy, Seaborn), Data Scraping(Beautiful Soup, Requests)

    Compared high-profile Indian batsmen using kernel density estimates and analysed their performances in wins and losses. Used linear regression to predict the number of matches Virat Kohli will take to break Sachin Tendulkar's record.



    Topic Modelling using Expectation Maximization


    Used the EM algorithm to cluster articles from the 20 Newsgroups data set, in which each article belongs to a particular topic. Stop words were removed and regular expressions were applied to clean the individual words. A parameter controlled the fraction of labels we were allowed to observe. When the parameter was 1 (100% of labels observed), the accuracy was 81%. With EM, the accuracy stayed at around 72% even when the parameter was 0.1 (10% of labels observed).

    Image Orientation Classification

    Python, Numpy

    For this task I converted each image to a vector of length 192 with a class label for one of 4 rotations: 0, 90, 180, or 270 degrees. I implemented the neural network using Numpy arrays. This involved several design decisions, such as the activation function, the number of hidden layers and the nodes in them, bias variables, and batch versus online learning. I went with the tanh activation function, one hidden layer with 90 nodes, an output layer with 4 nodes (one per class label), a bias for each neuron, and an online learning approach with 50 iterations.

    I was able to reach an accuracy of 74%.
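
    The forward pass of a network with this shape (192 inputs, 90 tanh hidden nodes, 4 outputs) can be sketched as below. This version uses plain Python lists rather than Numpy to stay dependency-free, the weights are random rather than trained, and the softmax on the output layer is an assumption, since the output activation is not stated above.

```python
import math
import random

ROTATIONS = [0, 90, 180, 270]


def init_layer(n_in, n_out, rng):
    """Small random weights plus one bias per neuron (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum plus bias for every neuron in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixels, hidden, output):
    """192 inputs -> 90 tanh hidden units -> 4 class probabilities."""
    hidden_out = [math.tanh(v) for v in dense(pixels, *hidden)]
    return softmax(dense(hidden_out, *output))


rng = random.Random(0)
hidden_layer = init_layer(192, 90, rng)
output_layer = init_layer(90, 4, rng)
pixels = [rng.random() for _ in range(192)]  # stand-in for a scaled image vector
probs = forward(pixels, hidden_layer, output_layer)
predicted_rotation = ROTATIONS[probs.index(max(probs))]
```

    Online learning would then update the weights by backpropagation after each individual training image, which is omitted here.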

    Web Traffic Forecasting using Time Series

    Time Series, ARIMA, LSTM, Keras, Statsmodels

    In this project, I dealt with the problem of forecasting future values of the time series of Wikipedia page traffic. I first made the series stationary and then applied an ARIMA model, estimating its p, d, q parameters from ACF and PACF plots. I later trained an LSTM network and ensembled the two to get a better estimate.

    Feedback to Business Entities using Topic Modelling and Sentiment Analysis

    Python(NLTK), D3.js, Yelp Dataset

    Implemented topic modeling on the Yelp Dataset using Python (NLTK) to gather feedback for business entities and derive which services or products are good, bad, or average. Worked in a team with two other members.


    Movie Recommendation Engine (K Nearest Neighbor)

    Python, Movielens Dataset

    Created a movie recommendation engine using the MovieLens dataset. Used the k-nearest neighbor approach for clustering and compared distance functions such as Euclidean, Manhattan, and Lmax by their mean absolute difference. Implemented the KNN approach from scratch in Python.
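
    The three distance functions named above can be sketched as follows, together with a brute-force neighbor lookup. The rating vectors and the helper name `nearest_neighbors` are hypothetical and only serve the example.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def lmax(a, b):
    """Largest single coordinate difference (L-infinity, or Chebyshev)."""
    return max(abs(x - y) for x, y in zip(a, b))


def nearest_neighbors(query, rows, k, distance):
    """Return the indices of the k rows closest to `query` under `distance`."""
    order = sorted(range(len(rows)), key=lambda i: distance(query, rows[i]))
    return order[:k]


# Hypothetical rating vectors: one row per user, one column per movie.
ratings = [
    [5, 4, 1, 1],
    [4, 5, 2, 1],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
]
similar_users = nearest_neighbors([5, 5, 1, 1], ratings, k=2, distance=euclidean)
```

    Swapping the `distance` argument is enough to compare how each metric changes the neighborhood, and therefore the predicted ratings.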


    Decision Tree Implementation on UCI Datasets

    Python, UCI Datasets

    Implemented a greedy algorithm that learns a classification tree from a data set using Gini and entropy (information gain). Evaluated it with cross-validation and implemented overfitting-prevention methods from scratch in Python, without decision tree libraries.
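
    As a sketch of the two split criteria, the functions below compute Gini impurity, entropy, and the impurity reduction of a candidate split. The function names are illustrative and not taken from the project code.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes.

    With `impurity=entropy` this is the information gain of the split.
    """
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]  # a perfect split
gain = impurity_reduction(parent, split, entropy)
```

    The greedy learner evaluates every candidate split at a node this way and keeps the one with the largest reduction.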


    Association Rules Mining Ranking Nursery Application

    Python, UCI Datasets

    Implemented association rule mining with the Apriori algorithm, first determining frequent itemsets using the Fk−1 × F1 and Fk−1 × Fk−1 candidate generation methods and then identifying association rules. Implemented confidence-based pruning, with lift as the measure of rule interestingness, to enumerate all association rules for a given set of frequent itemsets, from scratch in Python without any libraries.
    Further extended the same algorithm to car safety classification and contraceptive method classification using UCI datasets.
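
    The pieces named above can be sketched briefly: the Fk−1 × Fk−1 candidate join, plus the support, confidence, and lift measures used to rank rules. The toy transactions and function names are made up for the example, and the sketch assumes every antecedent and consequent occurs at least once (otherwise the divisions would fail).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support.

    Values above 1 suggest a positive association, below 1 a negative one.
    """
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


def merge_candidates(frequent):
    """Fk-1 x Fk-1 join: merge two sorted (k-1)-itemsets sharing their first k-2 items."""
    frequent = sorted(tuple(sorted(itemset)) for itemset in frequent)
    candidates = []
    for i, first in enumerate(frequent):
        for second in frequent[i + 1:]:
            if first[:-1] == second[:-1]:
                candidates.append(first + (second[-1],))
    return candidates


# Toy transactions (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_lift = lift({"bread"}, {"milk"}, transactions)
three_itemsets = merge_candidates([("a", "b"), ("a", "c"), ("b", "c")])
```

    Confidence-based pruning then discards any rule whose confidence falls below a chosen threshold before the remaining rules are ranked by lift.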


    Data Gathering, Analysis of Voter Data and Sentiment Analysis

    Python(Beautiful Soup, Requests, NLTK), Tableau

    Part of the data gathering team for a Member of the Legislative Assembly of Maharashtra during his campaign for the 2014 Lok Sabha Elections in India. Scraped voter data from the Election Commission of India website using Beautiful Soup (a Python parsing library) and Requests (an HTTP library for Python). Created a sentiment analysis engine using SVM and NLTK for data collected from his Twitter feeds, then visualized the feedback in Tableau.

    Multi Tracking System (Sands Technologies)

    C#.Net, .Net Framework, MS SQL

    Implemented a web-based GPS tracking system in fulfilment of the Bachelor’s degree requirement. Used socket programming in C# to receive data from the tracking device and store it on the server. Created SQL jobs to handle the large flow of data, built a live track using KML for Google Maps, and implemented the algorithm for the geofencing module.

    Presented papers on the Multi Tracking System, a web-based GPS tracking system, at an international conference and in a journal.


    Tracking System using GPS and GSM: Practical Approach

    International Journal of Scientific and Engineering Research Volume 3, Issue 5, May-2012, ISSN 2229-5518


    Multi Tracking System using GPS and GSM

    Proceedings of Advances in Computer and Communication Technology (ACCT - 2012), organized by the Institution of Electronics and Telecommunication Engineers (IETE), Mumbai.
