Sunanda Bansal

Artificial Intelligence Solution Architect
Data Scientist
Machine Learning Engineer
Natural Language Processing Specialist

GitHub: github.com/sunandabansal
LinkedIn: linkedin.com/in/sunandabansal

Contact me: Google form


Summary


Work History

DATA SCIENCE CONSULTANT, 4.5 years
Deloitte / Dataperformers (acquired by Deloitte) (Jun. 2019 – Nov. 2023)

Skills/Technologies: ML, NLP, Cloud (GCP), Web Development, UX/UI, scikit-learn, Gensim, spaCy, Keras, Docker, Flask, React, Django.

RESEARCH ASSISTANT, 2 years
Computational Linguistics Lab at Concordia (CLaC) (Sep. 2017 – Apr. 2019)

Skills/Technologies: Emotion Analysis, Document Classification, Topic Modelling, LDA, LSA, Doc2Vec, BERTopic, Word Embeddings, Document Embeddings, Word2Vec, GloVe.

TEACHING ASSISTANT, 2 years
Concordia University (Sep. 2017 – Apr. 2019)

FOUNDING ENGINEER / DIGITAL DESIGN ARCHITECT, 2 years
Poplify / Intuzion Technologies (Jan. 2013 – Nov. 2014)

Skills/Technologies: HTML, CSS, SCSS, JavaScript, jQuery, CMS, Ruby, Ruby on Rails, PHP, CodeIgniter (PHP).


Education

Master of Computer Science (Thesis), Concordia University, Montreal (Sep. 2017 – Aug. 2021)
Thesis – Vector Representation of Documents using Word Clusters

Bachelor of Engineering in Computer Science (Hons.), Panjab University (Jul. 2009 – Jun. 2013)


Publication

Vector Representation of Documents using Word Clusters
Graduate Thesis · Concordia University · Aug 2021

Processing textual data with statistical methods such as Machine Learning (ML) typically requires representing the data as vectors. With the rise of the internet, the amount of textual data has exploded, and, partly owing to its size, most of this data is unlabeled. Sorting and analyzing text documents therefore often requires representing them in an unsupervised way, i.e. with no prior knowledge of expected outputs or labels. Most existing unsupervised methodologies do not factor in the similarity between words, and those that do leave room for improvement. This thesis presents Word Cluster based Document Embedding (WcDe), in which documents are represented in terms of clusters of similar words, and compares its performance in representing documents at two levels of topical similarity: general and specific. The thesis shows that WcDe outperforms existing unsupervised representation methodologies at both levels. Furthermore, it analyzes variations of WcDe with respect to its components and identifies the combination of components that performs consistently well across both topical levels. Finally, it analyzes the document vectors generated by WcDe on two fronts: whether they capture the similarity of documents within a class, and whether they capture the dissimilarity of documents belonging to different classes. The analysis shows that WcDe encodes both aspects of document representation well at both topical levels.
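The core idea of the abstract — clustering similar words and describing each document by how its words distribute over those clusters — can be sketched in a few lines. This is only an illustrative toy, not the thesis implementation: the function name, the use of KMeans, the cluster count, and the count-normalization scheme are all assumptions made here for demonstration.

```python
# Illustrative sketch of a word-cluster-based document embedding,
# loosely following the idea in the thesis abstract. KMeans, the
# cluster count, and the normalization are assumptions, not the
# actual WcDe method.
import numpy as np
from sklearn.cluster import KMeans


def cluster_based_doc_vectors(docs, word_vectors, n_clusters=2):
    """Represent each document as a normalized histogram over word clusters.

    docs: list of strings; word_vectors: dict mapping word -> embedding.
    """
    vocab = sorted(word_vectors)
    X = np.array([word_vectors[w] for w in vocab], dtype=float)

    # Step 1: group similar words into clusters of their embeddings.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    cluster_of = dict(zip(vocab, km.labels_))

    # Step 2: each document becomes a (normalized) count vector over clusters.
    doc_vecs = []
    for doc in docs:
        vec = np.zeros(n_clusters)
        for tok in doc.lower().split():
            if tok in cluster_of:
                vec[cluster_of[tok]] += 1
        total = vec.sum()
        doc_vecs.append(vec / total if total else vec)
    return np.array(doc_vecs)
```

With toy 2-D "embeddings" placing animal words near each other and vehicle words near each other, the documents "cat dog" and "car bus" land in different clusters and so receive clearly dissimilar vectors, which is the behavior the abstract highlights.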