A small text-corpus visualiser built as a teaching exercise on top of nltk, scikit-learn, and the wordcloud library. Tokens are weighted by TF-IDF rather than raw frequency so that domain-specific terms surface above generic high-frequency words. The renderer respects an arbitrary alpha mask (a Pokémon Arceus silhouette, a fall-leaf shape, a Mexico outline), and an animated variant updates the layout frame-by-frame so the cloud "grows" as the corpus is fed in. This is an exploration / demonstration project, not a research result.
The pipeline strips stop words using nltk.corpus.stopwords, lemmatises tokens with WordNetLemmatizer, then runs TfidfVectorizer on the cleaned token stream against a background corpus (an English Wikipedia sample). The score for term t in document d is:
tf-idf(t, d) = tf(t, d) · log(N / df(t))
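The formula can be checked by hand. The numbers below are illustrative, not taken from the project's corpus: a rare domain term with moderate frequency outscores a generic word that appears far more often.

```python
from math import log

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # tf-idf(t, d) = tf(t, d) * log(N / df(t))
    return tf * log(n_docs / df)

# Hypothetical counts: "arceus" appears 5 times in d but in only 2 of 1000
# background documents; "the" appears 50 times but in 990 of them.
rare = tf_idf(tf=5, df=2, n_docs=1000)       # amplified (~31.1)
generic = tf_idf(tf=50, df=990, n_docs=1000)  # dampened (~0.5)
```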
where tf(t, d) is the frequency of t in d, N is the number of documents in the background corpus, and df(t) is the number of documents containing t. This dampens generic English words and amplifies domain-specific ones.

Masks are supplied via WordCloud(mask=...). The layout engine packs words inside the mask; pixels with alpha = 0 are forbidden. Animation frames are assembled into a GIF with imageio.mimsave. Each mask is a binary alpha image. The static cloud (PNG) shows the final layout; the animated cloud (GIF) shows the per-chunk evolution as the corpus fills in.
Built collaboratively with Connor Carpenter, Ryan Lay, Samyak Karnavat, and Yash Shah.