Data science master’s student at Harvard, former competitive violinist, amateur tennis enthusiast

About me


Hi! I'm Luc, a master’s student at Harvard University studying Data Science. I'm passionate about applying data science to solve real-world problems and make an impact. I have experience working in healthcare through internships and projects, and I'm currently doing research on anomaly detection. Fun fact: I spent many years playing the violin competitively and now enjoy it purely as a hobby!

Let's connect! Email me at lucchenn@gmail.com or find me on LinkedIn

Harvard Engineering

Master's, Data Science

University of California, Davis

Bachelor's, Statistics & Data Science & Economics

SELECTED PROJECTS


Below is a collection of data science projects I’ve worked on, spanning fields like autonomous vehicles, healthcare, and education. Through these projects, I’ve gained hands-on experience and honed my skills in solving real-world problems with data. Each project title links to the full paper.

I’m always open to collaborating on new project proposals - feel free to reach out!

This project enhances cross-lingual information retrieval (CIR) for low-resource languages through two main strategies: query augmentation, which generates diverse queries using large language models to enrich training data, and a weighted loss function that prioritizes underrepresented languages for balanced model training. Our methods significantly improve retrieval quality, showcasing the potential for effective CIR systems even in the face of limited resources.
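To give a feel for the weighted-loss idea, here is a minimal sketch in PyTorch. This is my illustration rather than the project's actual code, and names like lang_counts and weighted_retrieval_loss are hypothetical: it simply up-weights the per-query loss for languages with fewer training examples.

    # Minimal sketch of an inverse-frequency weighted retrieval loss (illustrative only).
    import torch
    import torch.nn.functional as F

    def language_weights(lang_counts):
        """Map {language: number of training queries} to inverse-frequency weights."""
        total = sum(lang_counts.values())
        return {lang: total / count for lang, count in lang_counts.items()}

    def weighted_retrieval_loss(scores, labels, langs, weights):
        """Cross-entropy over candidate passages, re-weighted by query language.

        scores: (batch, num_candidates) similarity scores
        labels: (batch,) index of the relevant candidate
        langs:  language code for each query in the batch
        """
        per_example = F.cross_entropy(scores, labels, reduction="none")
        w = torch.tensor([weights[lang] for lang in langs])
        return (w * per_example).sum() / w.sum()

    # Example: Swahili queries are scarce, so each one counts for more.
    weights = language_weights({"en": 10_000, "sw": 500})
    scores, labels = torch.randn(4, 8), torch.tensor([0, 3, 1, 7])
    loss = weighted_retrieval_loss(scores, labels, ["en", "sw", "sw", "en"], weights)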

This study investigates how the sequence of training environments affects performance on simple driving tasks for an autonomous driving agent. By training agents solely through interaction with maps of varying difficulty, we demonstrate that transfer learning enhances performance within single-environment driving scenarios. However, we find that agents struggle to master advanced driving capabilities and fail to generalize well to new environments, regardless of the sequence of training data. We conclude by looking at ways to build on this work, such as combining imitation learning with curriculum learning and developing curriculum-specific MDPs.
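The core training pattern, one agent carried across environments ordered from easy to hard, looks roughly like the sketch below. It uses stable-baselines3 PPO with placeholder Gymnasium environment IDs; the project itself used driving maps of increasing difficulty (which share observation and action spaces), not these toy tasks.

    # Curriculum-style sketch: keep the same agent and swap in harder environments.
    import gymnasium as gym
    from stable_baselines3 import PPO

    # Placeholder IDs; in the project these were driving maps of increasing difficulty.
    curriculum = ["CartPole-v1", "CartPole-v1", "CartPole-v1"]

    model = None
    for env_id in curriculum:
        env = gym.make(env_id)
        if model is None:
            model = PPO("MlpPolicy", env, verbose=0)   # fresh agent on the easiest map
        else:
            model.set_env(env)                         # transfer learned weights to the next map
        model.learn(total_timesteps=50_000)
        env.close()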

This study examines the causal impact of being placed on the Dean’s List, a positive educational incentive, on future student performance using a regression discontinuity design. The results suggest that for students with low prior academic performance who are native English speakers, being on the Dean’s List increases the probability of making the Dean’s List again the following year. However, being on the Dean’s List does not appear to have a statistically significant effect on subsequent GPA, total credits taken, dropout rates, or the probability of graduating within four years. These findings suggest that a place on the Dean’s List may not be a strong motivator for students to improve their academic performance and achieve better outcomes.
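The regression discontinuity setup can be sketched in a few lines with statsmodels; the cutoff, bandwidth, and column names below are illustrative assumptions, not the study's actual values.

    # Local linear RD sketch: compare students just above vs. just below the cutoff.
    import statsmodels.formula.api as smf

    def rdd_estimate(df, cutoff=3.5, bandwidth=0.3):
        """RD estimate of the Dean's List effect on next-year GPA (illustrative).

        df: pandas DataFrame with assumed columns "gpa" and "next_year_gpa".
        """
        d = df[(df["gpa"] - cutoff).abs() <= bandwidth].copy()
        d["running"] = d["gpa"] - cutoff                # centered running variable
        d["treated"] = (d["running"] >= 0).astype(int)  # above the cutoff -> Dean's List
        # Separate slopes on each side; the coefficient on `treated` is the RD effect.
        fit = smf.ols("next_year_gpa ~ treated + running + treated:running", data=d).fit()
        return fit.params["treated"], fit.bse["treated"]

    # Usage: effect, se = rdd_estimate(students_df)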


This study explores machine learning algorithms to predict movie ratings (positive or negative) using 2,500 IMDB reviews. The data was preprocessed into a TF-IDF matrix with 17,000 features, reduced via PCA and heuristic term selection. Logistic regression with gradient descent achieved a 20% test error rate (5.35% cross-validation error) using 1,000 terms. K-nearest neighbors (K-NN) reached a 26.2% test error rate (23.7% cross-validation error) with 100 terms and k=8. Logistic regression outperformed K-NN in accuracy but required more computational resources. The findings highlight trade-offs between accuracy and efficiency in text classification, suggesting potential for advanced methods like neural networks or support vector machines.
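A rough scikit-learn equivalent of the logistic regression pipeline is sketched below. The reported numbers came from a from-scratch gradient descent implementation, and TruncatedSVD stands in here for the PCA step because scikit-learn's PCA does not accept sparse TF-IDF matrices.

    # Approximate pipeline: TF-IDF -> dimensionality reduction -> logistic regression.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=17_000, stop_words="english")),
        ("reduce", TruncatedSVD(n_components=100, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1_000)),
    ])

    # reviews: list of raw review strings; labels: 1 = positive, 0 = negative
    # errors = 1 - cross_val_score(clf, reviews, labels, cv=5)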


This study applies Quadratic Discriminant Analysis (QDA) to predict diabetes status using the Pima Indians Diabetes dataset, comprising 392 observations and 8 predictors. Exploratory analysis revealed significant covariance differences between diabetes-positive and diabetes-negative groups, supporting the use of QDA over Linear Discriminant Analysis. QDA achieved an average accuracy of 79% with a 70-30 train-test split and 77.69% via 5-fold cross-validation. Principal Component Analysis (PCA) was used to reduce dimensionality, with the five-component model achieving 78.5% accuracy while retaining 80% of explained variance. The study demonstrates QDA’s utility in diabetes classification, with PCA balancing model complexity and performance.
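The QDA-plus-PCA workflow maps onto a short scikit-learn pipeline, sketched below; the CSV path and outcome column name are assumptions about the dataset layout, not the study's exact code.

    # QDA on the first five principal components of the standardized predictors.
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    qda_pca = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),          # ~80% of variance retained per the study
        ("qda", QuadraticDiscriminantAnalysis()),
    ])

    # df = pd.read_csv("pima_diabetes.csv")    # 392 complete cases, 8 predictors (path assumed)
    # X, y = df.drop(columns="Outcome"), df["Outcome"]
    # accuracy = cross_val_score(qda_pca, X, y, cv=5).mean()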