Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.
Gensim discovers the semantic structure of documents by examining the patterns of words (or higher-level structures such as entire sentences or documents). It does this by taking a corpus, a collection of text documents, and producing a vector representation of the text in the corpus. The vector representation can then be used to train a model, an algorithm that transforms the data into a different representation, usually a more semantic one.
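The corpus-to-vector-to-model pipeline described above can be sketched in plain Python. This is a minimal illustration using only the standard library, not Gensim's own API: the names `word_to_id`, `to_bow` and `to_tfidf` are hypothetical, and the inverse-document-frequency weighting is just one simple example of a "model" trained on the vectors.

```python
import math
from collections import Counter

# A tiny corpus: each document is a list of lowercase tokens.
corpus = [
    "human machine interface for computer applications".split(),
    "a survey of user opinion of computer system response time".split(),
    "graph minors a survey".split(),
]

# Map each distinct word to an integer id (hypothetical stand-in for a dictionary).
word_to_id = {}
for doc in corpus:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))


def to_bow(doc):
    """Return a sparse bag-of-words vector: sorted (word_id, count) pairs."""
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in corpus]

# "Train" a simple model: one inverse-document-frequency weight per word id.
num_docs = len(bow_corpus)
doc_freq = Counter(word_id for bow in bow_corpus for word_id, _ in bow)
idf = {word_id: math.log2(num_docs / df) for word_id, df in doc_freq.items()}


def to_tfidf(bow):
    """Re-weight a bag-of-words vector by idf, dropping zero weights."""
    return [(wid, count * idf[wid]) for wid, count in bow if idf[wid] > 0]


print(bow_corpus[1])
print(to_tfidf(bow_corpus[1]))
```

The first print shows raw word counts per word id; the second shows the same document after the model has re-weighted it, so that words confined to fewer documents score higher.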
Gensim makes heavy use of Python’s built-in generators and iterators for streamed data processing.
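As a rough sketch of what streamed processing with iterators looks like, the class below yields one document at a time from a file, so the whole corpus never has to sit in memory. The `StreamedCorpus` class, the `count_tokens` helper and the one-document-per-line file layout are assumptions made for illustration, not part of Gensim.

```python
import os
import tempfile


class StreamedCorpus:
    """Iterate over a text file one document (one line) at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Opening the file inside __iter__ lets the corpus be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


def count_tokens(corpus):
    """Consume the stream once; only one document is held in memory at a time."""
    return sum(len(doc) for doc in corpus)


# Demo: a small temporary file stands in for a corpus too large for RAM.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\nGraph minors a survey\n")
    demo_path = tmp.name

corpus = StreamedCorpus(demo_path)
total = count_tokens(corpus)   # first pass over the stream
first_pass = list(corpus)      # second pass: the iterable is reusable
os.remove(demo_path)

print(total)
print(first_pass)
```

Making the corpus an iterable object rather than a one-shot generator matters for algorithms that need several passes over the data.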
Features include:
- Processes large, web-scale corpora using incremental online training algorithms.
- All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core). They use highly optimized math routines.
- Distributed versions of several algorithms to speed up processing and retrieval on machine clusters.
- Intuitive interfaces:
  - Easy to plug in your own input corpus/datastream (trivial streaming API).
  - Easy to extend with other Vector Space algorithms (trivial transformation API).
- Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) and word2vec deep learning.
- Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
- Converters and I/O formats: memory-efficient implementations for several popular data formats, including Matrix Market, SVMlight, Blei’s LDA-C and more.
- Fast indexing of documents in their semantic representation, and retrieval of topically similar documents.
- Extensive documentation and Jupyter Notebook tutorials.
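To make the "retrieval of topically similar documents" feature more concrete, here is a minimal sketch of similarity retrieval over sparse vectors using cosine similarity. It is written in plain Python and does not use Gensim's own index classes; the `cosine` and `most_similar` functions and the example vectors are hypothetical.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {id: weight} dicts."""
    dot = sum(weight * vec_b.get(idx, 0.0) for idx, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, top_n=2):
    """Rank indexed documents by similarity to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical document vectors (feature id -> weight).
index = {
    "doc_a": {0: 1.0, 1: 1.0},
    "doc_b": {1: 1.0, 2: 1.0},
    "doc_c": {3: 2.0},
}
query = {0: 1.0, 1: 1.0}

results = most_similar(query, index)
print(results)
```

Here `doc_a` ranks first because it shares both features with the query, `doc_b` shares one, and `doc_c` shares none. A production index would avoid scoring every document on each query.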
Dependencies:
- Python >= 2.7 (tested with versions 2.7, 3.5 and 3.6).
- NumPy >= 1.11.3.
- SciPy >= 0.18.1.
- Six >= 1.5.0.
- smart_open >= 1.2.1.
Website: radimrehurek.com/gensim
Support: QuickStart, GitHub Code Repository, Mailing List, Gitter
Developer: RaRe Technologies / Radim Řehůřek
License: GNU LGPLv2.1
Gensim is written in Python.