perl-Algorithm-VSM

Perl module for retrieving files and documents from a software

*Algorithm::VSM* is a _perl5_ module for constructing a Vector Space Model (VSM) or a Latent Semantic Analysis (LSA) model of a collection of documents, usually referred to as a corpus, and then retrieving documents in response to the search words in a query.

VSM and LSA models have been around for a long time in the Information Retrieval (IR) community. More recently, such models have been shown to be effective in retrieving files and documents from software libraries. For an account of this research, presented by Shivani Rao and the author of this module at the 2011 Mining Software Repositories conference, see http://portal.acm.org/citation.cfm?id=1985451

VSM modeling consists of:
(1) Extracting the vocabulary used in a corpus.
(2) Stemming the words so extracted and eliminating the designated stop words from the vocabulary. Stemming means that closely related words like 'programming' and 'programs' are reduced to the common root word 'program'. Stop words are the non-discriminating words that can be expected to exist in virtually all the documents.
(3) Constructing document vectors for the individual files in the corpus. Taken together, the document vectors constitute what is usually referred to as a 'term-frequency' matrix for the corpus.
(4) Normalizing the document vectors to factor out the effect of document size and, if desired, multiplying the term frequencies by the IDF (Inverse Document Frequency) values for the words to reduce the weight of the words that appear in a large number of documents.
(5) Constructing a query vector for the search query after the query is subjected to the same stemming and stop-word elimination rules that were applied to the corpus.
(6) Using a similarity metric to return the set of documents that are most similar to the query vector. The commonly used similarity metric is based on the cosine of the angle between two vectors.
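The six VSM steps can be sketched in a few lines of Python. This is only an illustration of the general technique and not the module's own code or API: the stop-word list, the crude `stem` function (a stand-in for a real stemmer), the `log(N/df)` IDF variant, the toy corpus, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}

def stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:  # "programm" -> "program"
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

def tokenize(text):
    """Steps (1) and (2): extract words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def vectorize(tokens, vocab, idf):
    """Steps (3) and (4): IDF-weighted term frequencies, scaled to unit length."""
    counts = Counter(tokens)
    vec = [counts[t] * idf[t] for t in vocab]  # words outside vocab are ignored
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def build_model(corpus):
    """corpus maps a document name to its text."""
    tokens = {name: tokenize(text) for name, text in corpus.items()}
    vocab = sorted({t for toks in tokens.values() for t in toks})
    df = Counter(t for toks in tokens.values() for t in set(toks))
    # One common IDF variant (an assumption): log(N / document frequency).
    idf = {t: math.log(len(corpus) / df[t]) for t in vocab}
    vectors = {name: vectorize(toks, vocab, idf) for name, toks in tokens.items()}
    return vocab, idf, vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, vocab, idf, vectors):
    """Steps (5) and (6): build the query vector, rank documents by cosine."""
    q = vectorize(tokenize(query), vocab, idf)
    scores = {name: cosine(q, vec) for name, vec in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus, made up for this example.
corpus = {
    "sort.txt": "Sorting programs: sorting a list of numbers in place",
    "parse.txt": "Parsing the input files for a compiler",
    "net.txt": "Programming network sockets and sending files",
}
vocab, idf, vectors = build_model(corpus)
ranking = retrieve("programs for sorting", vocab, idf, vectors)
```

Here `ranking` is a list of (document name, score) pairs, best match first. Every vector has one entry per vocabulary word, so document and query vectors can be compared directly.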
Note that all the vectors mentioned here are of the same size, the size of the vocabulary. An element of a vector is the frequency of occurrence of the word corresponding to that position in the vector.

LSA modeling is a small variation on VSM modeling: it takes VSM modeling one step further by subjecting the term-frequency matrix for the corpus to singular value decomposition (SVD). By retaining only a subset of the singular values (usually the N largest for some value of N), you can construct reduced-dimensionality vectors for the documents and the queries.

In VSM, as mentioned above, the size of the document and the query vectors is equal to the size of the vocabulary. For large corpora, this size may involve tens of thousands of words, which can slow down the VSM modeling and retrieval process. So you are very likely to get faster performance with retrieval based on LSA modeling, especially if you store the model once constructed in a database file on the disk and carry out retrievals using the disk-based model.
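The dimensionality reduction behind LSA can be sketched as follows. This is a minimal illustration that assumes NumPy is available; it is not the module's implementation. The toy term-frequency matrix, the choice of k = 2, and the function names are all made up for this example. Documents and queries are both projected onto the k leading left singular vectors, so they end up in the same reduced space.

```python
import numpy as np

# Toy term-frequency matrix (rows = vocabulary words, columns = documents).
# The counts are invented for illustration.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 1.0, 0.0],
])

def build_lsa(A, k):
    """Keep the k largest singular values and project every document."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    U_k = U[:, :k]
    docs_k = (U_k.T @ A).T  # one k-dimensional row per document
    return U_k, docs_k

def fold_in(q, U_k):
    """Map a vocabulary-sized query vector into the same k-dimensional space."""
    return U_k.T @ q

def cosine_scores(q_k, docs_k):
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return (docs_k @ q_k) / np.where(norms == 0, 1.0, norms)

U_k, docs_k = build_lsa(A, k=2)
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # query containing only word 0
scores = cosine_scores(fold_in(query, U_k), docs_k)
```

In this sketch each document is represented by 2 numbers instead of 5. On a real corpus the saving is from tens of thousands of vocabulary entries down to N, which is where the faster retrieval comes from.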

There is no official package available for openSUSE Leap 15.3

Distributions

openSUSE Tumbleweed