User Data and User-Generated Apps
User data is what you get when you instrument an existing product. Classic examples include the MovieLens dataset, the Netflix Prize dataset, the Amazon Product dataset, and Microsoft's MIND dataset. The advantages are powerful: user data reflects actual behavior (not what people say they do), it's granular, and it's continuous as long as users keep using the platform.
The limitations are serious: no existing users means no user data (a chicken-and-egg problem for startups); ethical and legal considerations around consent apply; and your user base is not the general population, so anything learned from it carries that selection bias.
[Interactive demo: a sample product page (Electronics: Wireless Noise-Cancelling Headphones) with a "Rate this item" control. Interacting with the page shows the different data that can be collected when you interact with a website.]
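To make the idea of collected signals concrete, here is a minimal Python sketch of how a product page like the one in the demo might log interactions. Everything in it is an assumption made for illustration: the `InteractionEvent` schema, the event names (`view`, `click`, `rating`), and the `summarize` helper are hypothetical and not taken from any particular product. It separates explicit feedback (a rating) from implicit feedback (views and clicks).

```python
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InteractionEvent:
    """One logged user action (hypothetical schema, for illustration only)."""
    user_id: str
    item_id: str
    event_type: str                  # assumed names: "view", "click", "rating"
    value: Optional[float] = None    # rating value; None for implicit events
    timestamp: float = field(default_factory=time.time)


def summarize(events: list[InteractionEvent]) -> dict:
    """Split an event log into implicit counts and explicit ratings per item."""
    implicit: Counter = Counter()
    ratings: dict[str, list[float]] = {}
    for event in events:
        if event.event_type == "rating" and event.value is not None:
            ratings.setdefault(event.item_id, []).append(event.value)
        else:
            implicit[(event.item_id, event.event_type)] += 1
    return {"implicit": dict(implicit), "ratings": ratings}


# A tiny made-up session: one user views, clicks, then rates an item.
log = [
    InteractionEvent("u1", "headphones", "view"),
    InteractionEvent("u1", "headphones", "click"),
    InteractionEvent("u1", "headphones", "rating", value=4.0),
]
summary = summarize(log)
```

A real product would typically send events like these to a logging service or database rather than keep them in memory; the in-memory list here only keeps the sketch self-contained.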
Web apps for data collection are a related but distinct approach. Instead of collecting data from existing users of an existing product, you build an app specifically to collect data for a research project. The advantages: customization, accessibility from any device, and real-time analysis. The limitations: you have to build it, users must actively engage, and you take on a security responsibility for whatever data you collect.
For rapid prototyping, Streamlit or Gradio let you spin up data collection apps in Python quickly. Pair them with a cloud bucket (S3, GCS, or Azure Blob) for storage and host on Streamlit Cloud, Hugging Face Spaces, or Vercel.
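As a rough illustration of what such a prototype could look like, the sketch below uses Streamlit to draw a one-question rating form and appends each submission to a local JSON Lines file. The file name `responses.jsonl`, the `save_response` and `render_form` helpers, and the form fields are all hypothetical choices made for this example. The local file stands in for the cloud bucket mentioned above; swapping in an upload to S3, GCS, or Azure Blob is left out to keep the sketch dependency-light.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

try:
    import streamlit as st  # third-party: pip install streamlit
except ImportError:
    st = None  # the storage helper below still works without Streamlit

# Hypothetical local file; a deployed app would more likely write to a bucket.
RESPONSES_PATH = Path("responses.jsonl")


def save_response(record: dict, path: Path = RESPONSES_PATH) -> dict:
    """Append one submission to a JSON Lines file and return the stored row."""
    row = {**record, "submitted_at": datetime.now(timezone.utc).isoformat()}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")
    return row


def render_form() -> None:
    """Draw a one-question rating form; Streamlit reruns this on each interaction."""
    st.title("Rate this item")
    with st.form("rating_form"):
        rating = st.slider("Your rating", min_value=1, max_value=5, value=3)
        comment = st.text_area("Comment (optional)")
        submitted = st.form_submit_button("Submit")
    if submitted:
        save_response({"rating": rating, "comment": comment})
        st.success("Thanks, your response was recorded.")


if st is not None:
    render_form()
```

Saved as, say, `app.py`, this would be started with `streamlit run app.py`. Files written to local disk on a hosted platform may not persist across restarts, which is one reason to pair the app with a bucket as suggested above.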
Building the Collection Tool Is Part of the Project
Some of the best student projects I've seen involved building a polished web app, using it to collect a nice dataset, and then turning around and using that dataset to power the actual model. The "data collection" step and the "model" step are not separate; they inform each other. Building a tool to collect your own data is often the fastest path to a strong portfolio project.
A startup wants to build a recommendation model. They plan to train it on user behavior data collected from their app. What is the most significant risk with this approach?