Cosine Similarity

Getting to Know Your Data

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

2.4.7 Cosine Similarity

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector. For example, in Table 2.5, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric.

Table 2.5. Document Vector or Term-Frequency Vector

Document team coach hockey baseball soccer penalty score win loss season
Document1 5 0 3 0 2 0 0 2 0 0
Document2 3 0 2 0 1 1 0 1 0 1
Document3 0 7 0 2 1 0 0 3 0 0
Document4 0 1 0 0 1 2 2 0 3 0

Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values). Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping. The traditional distance measures that we have studied in this chapter do not work well for such sparse numeric data. For example, two term-frequency vectors may have many 0 values in common, meaning that the corresponding documents do not share many words, but this does not make them similar. We need a measure that will focus on the words that the two documents do have in common, and the occurrence frequency of such words. In other words, we need a measure for numeric data that ignores zero-matches.

Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison. Using the cosine measure as a similarity function, we have

(2.23) sim(x, y) = (x · y) / (||x|| ||y||),

where ||x|| is the Euclidean norm of vector x = (x_1, x_2, …, x_p), defined as √(x_1² + x_2² + ⋯ + x_p²). Conceptually, it is the length of the vector. Similarly, ||y|| is the Euclidean norm of vector y. The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between vectors. Note that because the cosine similarity measure does not obey all of the properties of Section 2.4.4 defining metric measures, it is referred to as a nonmetric measure.

Example 2.23

Cosine similarity between two term-frequency vectors

Suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y? Using Eq. (2.23) to compute the cosine similarity between the two vectors, we get:

x^t y = 5 × 3 + 0 × 0 + 3 × 2 + 0 × 0 + 2 × 1 + 0 × 1 + 0 × 0 + 2 × 1 + 0 × 0 + 0 × 1 = 25
||x|| = √(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = 6.48
||y|| = √(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = 4.12
sim(x, y) = 0.94

Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.
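This computation can be reproduced in a few lines of base R; the sketch below simply restates Eq. (2.23) for the two term-frequency vectors of Example 2.23 (the function name cosine_sim is ours):

# term-frequency vectors for Document1 and Document2 (Table 2.5)
x <- c(5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y <- c(3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

# Eq. (2.23): sim(x, y) = (x . y) / (||x|| ||y||)
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cosine_sim(x, y)   # approximately 0.94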

When attributes are binary-valued, the cosine similarity function can be interpreted in terms of shared features or attributes. Suppose an object x possesses the ith attribute if x_i = 1. Then x^t y is the number of attributes possessed (i.e., shared) by both x and y, and ||x|| ||y|| is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus, sim(x, y) is a measure of relative possession of common attributes.

A simple variation of cosine similarity for the preceding scenario is

(2.24) sim(x, y) = (x · y) / (x · x + y · y − x · y),

which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is often used in information retrieval and biology taxonomy.
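For binary attribute vectors, Eq. (2.24) can be sketched in base R as follows (the example vectors and the function name tanimoto are ours, for illustration only):

# two binary attribute vectors (illustrative values)
x <- c(1, 1, 0, 1, 0, 1)
y <- c(1, 0, 0, 1, 1, 1)

# Eq. (2.24): sim(x, y) = (x . y) / (x . x + y . y - x . y)
tanimoto <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x * x) + sum(y * y) - xy)
}

tanimoto(x, y)   # 3 shared attributes / 5 attributes possessed by x or y = 0.6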

URL:

https://www.sciencedirect.com/science/article/pii/B9780123814791000022

Deep similarity learning for disease prediction

Vagisha Gupta, ... Neha Dohare, in Trends in Deep Learning Methodologies, 2021

3.4.2 Similarity learning

Similarity learning [22] deals with measuring the similarity between a pair of images and objects, and has application in tasks related to classification and regression. The aim is to learn a similarity function that finds an optimal relation between two relatable or similar objects in a quantitative way. Some applications of finding similarity measures are handwritten text recognition, face identification, search engines, signature verification, etc. Typically, similarity learning involves giving a pair of images as input and discovering how similar they are to each other. The output can be a nonnegative similarity score between 0 and 1: 1 if the two images are completely similar to each other, otherwise 0. Fig. 8.4 shows the calculation of the similarity score between two images. The images are embedded into vector representations using a deep learning architecture that learns the representation of the image features, followed by passing them to the similarity metric learning part, which measures the similarity score between the two images, usually a value between 0 and 1.

Figure 8.4. Calculating the similarity score. DL, Deep learning.

Consider two vectors of features, x and y; some of the similarity calculation measures are:

1.

Cosine similarity: This measures the similarity using the cosine of the angle between two vectors in a multidimensional space. It is given by:

(8.2) similarity(x, y) = cos(θ) = (x · y) / (|x| |y|)

2.

Euclidean distance: This is the most common similarity distance measure and measures the distance between any two points in a Euclidean space. It is given by:

(8.3) d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )

3.

Manhattan distance: This is a similarity distance measure in which the distance between two points is calculated by the sum of the absolute differences of the points. It is given by:

(8.4) d(x, y) = Σ_{i=1}^{n} |x_i − y_i|

Some research [23] shows disease prediction using traditional similarity learning methods (cosine, Euclidean) that directly measure the similarity on the input feature vectors without learning parameters on the input vector. These methods do not perform well on original data, which is highly dimensional, noisy, and sparse. Therefore we follow an approach used in [28] to measure the similarity between patients by first learning the parameters on the input values. We call this novel approach the softmax-based framework for calculating the similarity between a pair of vectors. (A short R sketch of the classical measures in items 1–3 is given after this list.)

4.

Softmax-based supervised classification framework: This method performs classification on the learned representation by measuring a similarity probability between pairs of objects and using it as a score for ranking the similarity. To ensure that pairwise labels are classified correctly, a fully connected softmax layer is added. The similarity is calculated by using a bilinear distance given by:

(8.5) sim(x, y) = W_k x_i ⊗ W_k y_i,

where W_k ∈ ℝ and ⊗ is a bitwise addition.
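As a quick reference for the classical measures (8.2)–(8.4), the sketch below computes all three for a pair of feature vectors in base R (the vectors are ours; the learned bilinear similarity of Eq. (8.5) additionally needs a trained weight matrix and is not shown):

# two illustrative feature vectors
x <- c(0.2, 0.8, 0.5, 0.1)
y <- c(0.3, 0.7, 0.4, 0.3)

cosine_similarity  <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # Eq. (8.2)
euclidean_distance <- function(x, y) sqrt(sum((x - y)^2))                            # Eq. (8.3)
manhattan_distance <- function(x, y) sum(abs(x - y))                                 # Eq. (8.4)

cosine_similarity(x, y); euclidean_distance(x, y); manhattan_distance(x, y)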

URL:

https://www.sciencedirect.com/science/article/pii/B9780128222263000088

Self-Supervised Learning from Web Data for Multimodal Retrieval

Raul Gomez, ... Dimosthenis Karatzas, in Multimodal Scene Understanding, 2019

9.8.1 Experiment Setup

To evaluate how the CNN has learned to map images to the text embedding space and the semantic quality of that space, we perform the following experiment: We build random image pairs from the MIRFlickr dataset and we compute the cosine similarity between both their image and their text embeddings. In Fig. 9.12 we plot the image embeddings distance vs. the text embedding distance of 20,000 random image pairs. If the CNN has learned correctly to map images to the text embedding space, the distances between the embeddings of the images and the texts of a pair should be similar, and points in the plot should fall around the identity line y = x. Also, if the learned space has a semantic structure, both the distance between image embeddings and the distance between text embeddings should be smaller for those pairs sharing more tags: The plot points' color reflects the number of common tags of the image pair, so pairs sharing more tags should be closer to the origin of the axis.

Figure 9.12

Figure 9.12. Text embeddings distance (X) vs. the image embeddings distance (Y) of different random image pairs for LDA, Word2Vec and GloVe embeddings trained with InstaCities1M. Distances have been normalized between [0,1]. Points are red if the pair does not share any tag, orange if it shares one, light orange if it shares two, yellow if it shares three and green if it shares more. R² is the coefficient of determination of images and texts distances.

As an example, take a dog image with the tag "dog", a cat image with the tag "cat" and one of a scarab with the tag "scarab". If the text embedding has been learned correctly, the distance between the projections of the dog and scarab tags in the text embedding space should be bigger than the one between the dog and cat tags, but smaller than the one between other pairs not related at all. If the CNN has correctly learned to embed the images of those animals in the text embedding space, the distance between the dog and the cat image embeddings should be similar to the one between their tag embeddings (and the same for any pair). So the point given by the pair should fall on the identity line. Furthermore, that distance should be closer to the coordinates origin than the point given by the dog and scarab pair, which should also fall on the identity line and nearer to the coordinates origin than another pair that has no relation at all.
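A minimal R sketch of this evaluation protocol, using random matrices as stand-ins for the image and text embeddings produced by the trained models (all names and values here are illustrative, not from the chapter):

set.seed(1)
# stand-in embeddings: row i holds the image and text embedding of item i
img_emb <- matrix(rnorm(1000 * 64), nrow = 1000)
txt_emb <- matrix(rnorm(1000 * 64), nrow = 1000)

cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# random item pairs and their image-space vs. text-space distances
pairs <- cbind(sample(1000, 20000, replace = TRUE), sample(1000, 20000, replace = TRUE))
d_img <- apply(pairs, 1, function(p) cosine_dist(img_emb[p[1], ], img_emb[p[2], ]))
d_txt <- apply(pairs, 1, function(p) cosine_dist(txt_emb[p[1], ], txt_emb[p[2], ]))

# normalize to [0, 1], plot against the identity line, and report R^2
d_img <- (d_img - min(d_img)) / (max(d_img) - min(d_img))
d_txt <- (d_txt - min(d_txt)) / (max(d_txt) - min(d_txt))
plot(d_txt, d_img, pch = ".", xlab = "text distance", ylab = "image distance")
abline(0, 1)
summary(lm(d_img ~ d_txt))$r.squared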

URL:

https://www.sciencedirect.com/science/article/pii/B9780128173589000159

Metrics, similarity, and sets

Leigh Metcalf, William Casey, in Cybersecurity and Applied Mathematics, 2016

2.7 Similarities

A metric defines the distance between two objects or how far apart two objects are. If we want to measure closeness in terms of similarity, we can use another function called a similarity measure or similarity coefficient, or sometimes just a similarity.

The similarity function operates on the cross product of a set, similar to the distance function of a metric. A similarity function is defined as s : X × X → ℝ. Such a function is often limited to the range [0,1] but there are similarities that return negative results. In the case of a metric we know that if d(x,y) = 0 then x = y. For a similarity function with a range of [0,1], if s(x,y) = 1 then x = y. This means that the larger the value of the similarity function, the closer the two objects are.

Similarity functions must also be symmetric, meaning s(x,y) = s(y,x). Depending on the definition of the function, there could be a variation of the triangle inequality, but a similarity function is not required to satisfy the triangle inequality axiom. As opposed to the distance function, a similarity is more vaguely defined.

If we have a similarity function whose range is [0,1], then we can at least derive a semimetric from the function. If s : X × X → [0,1] is the similarity function, then the semimetric will be given by d(x,y) = 1 − s(x,y). If it can be shown that the triangle inequality holds for the semimetric, then we've created a distance function from the similarity. If it is the case that a similarity doesn't fulfill the coincidence requirement of a distance function, the derived function is a quasimetric.

Example 2.7.1

We'll begin with the set ℝ^n and two vectors x, y ∈ ℝ^n. The dot product x · y is an operation on the vectors that returns a single number. It is the equation x · y = Σ_{i=1}^{n} x_i y_i. The norm of a vector x is ||x|| = √(Σ_{i=1}^{n} x_i²):

The cosine similarity is in Eq. (2.3).

(2.3) CS = (x · y) / (||x|| ||y||)

The cosine similarity is a number between 0 and 1 and is commonly used in plagiarism detection. A document is converted to a vector in ℝ^n where n is the number of unique words in the documents in question. Each element of the vector is associated with a word in the document and the value is the number of times that word is found in the document in question. The cosine similarity is then computed between the two documents.

False positives may occur if both documents heavily use common words. This can skew the computation and make it appear that the two documents are related when they are not. In this case it may be necessary to consider phrases or sentences in addition to single words in the documents.
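A small R sketch of this document-comparison procedure (the two example strings are ours; note how shared common words such as "the" push the score up):

doc1 <- "the quick brown fox jumps over the lazy dog"
doc2 <- "the quick brown dog sleeps near the lazy fox"

# word-count vectors over the union of unique words in both documents
w1 <- strsplit(doc1, " ")[[1]]
w2 <- strsplit(doc2, " ")[[1]]
words <- unique(c(w1, w2))
v1 <- sapply(words, function(w) sum(w1 == w))
v2 <- sapply(words, function(w) sum(w2 == w))

# Eq. (2.3)
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))   # about 0.82, although the topics differ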

URL:

https://www.sciencedirect.com/science/article/pii/B9780128044520000026

Recommendation Engines

Vijay Kotu, Bala Deshpande, in Data Science (2nd Edition), 2019

Step 1: Identifying Similar Users

The users are similar if their rating vectors are close according to a distance measure. Consider the rating matrix shown in Table 11.2 as a set of rating vectors. The rating for the user Amelia is represented as r_amelia = {5,1,4,4,1}. The similarity between two users is the similarity between their rating vectors. A quantifying metric is needed in order to measure the similarity between the users' vectors. Jaccard similarity, cosine similarity, and the Pearson correlation coefficient are some of the commonly used distance and similarity metrics. The cosine similarity measure between the two nonzero user vectors for the user Olivia and the user Amelia is given by Eq. (11.2)

(11.2) Cosine similarity(x, y) = (x · y) / (||x|| ||y||)
Cosine similarity(r_olivia, r_amelia) = (0 × 5 + 0 × 1 + 2 × 4 + 2 × 4 + 4 × 1) / (√(2² + 2² + 4²) × √(5² + 1² + 4² + 4² + 1²)) = 0.53

Note that the cosine similarity measure equates the lack of a rating with a zero value rating, which can also be considered a low rating. This assumption works fine for applications where the user has either purchased the item or not. In the movie recommendation case, this assumption can yield incorrect results because the lack of a rating does not mean that the user dislikes the movie. Hence, the similarity measure needs to be enhanced to take into consideration that the lack of a rating is different from a low rating for an item. Moreover, biases in the ratings should also be dealt with. Some users are more generous in giving ratings than others who are more critical. The users' bias in giving ratings skews the similarity score between users.

The centered cosine similarity measure addresses this problem by normalizing the ratings across all the users. To achieve this, the average rating of the user is subtracted from all the ratings given by that user. Thus, a negative number means a below average rating and a positive number means an above average rating given by the same user. The normalized version of the ratings matrix is shown in Table 11.3. Each value of the ratings matrix is normalized with the average rating of the user.

Table 11.3. Normalized Ratings Matrix

The Godfather 2001: A Space Odyssey The Hunt for Red October Fargo The Imitation Game
Josephine 2.3 1.3 −1.8 −1.8
Olivia −0.7 −0.7 1.3
Amelia 2.0 −2.0 1.0 1.0 −2.0
Zoe −0.2 −0.2 2.8 −1.2 −1.2
Alanna 2.0 2.0 −2.0 −2.0
Kim 1.0 −2.0 −1.0 2.0

The centered cosine similarity, or Pearson correlation, between the ratings for the users Olivia and Amelia is calculated by:

Centered cosine similarity(r_olivia, r_amelia) = (0 × 2.0 + 0 × (−2.0) + (−0.7) × 1.0 + (−0.7) × 1.0 + 1.3 × (−2.0)) / (√((−0.7)² + (−0.7)² + (1.3)²) × √((2.0)² + (−2.0)² + (1.0)² + (1.0)² + (−2.0)²)) = −0.65
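The same value can be reproduced in R from the raw ratings (a minimal sketch; Olivia's raw ratings are taken from the cosine computation in Eq. (11.2), with the unrated movies marked as NA):

olivia <- c(NA, NA, 2, 2, 4)   # raw ratings; NA = not rated
amelia <- c(5, 1, 4, 4, 1)

# center each user by her own average rating, then treat missing ratings as 0
center <- function(r) ifelse(is.na(r), 0, r - mean(r, na.rm = TRUE))
o <- center(olivia)
a <- center(amelia)

sum(o * a) / (sqrt(sum(o^2)) * sqrt(sum(a^2)))   # approximately -0.65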

The similarity score can be pre-computed between all the possible pairs of users and the results can be kept ready in a user-to-user matrix, shown in sample Table 11.4, for ease of calculation in the further steps. At this point the neighborhood or cohort size, k, has to be declared, similar to k in the k-NN algorithm. Assume k is 3 for this example. The goal of this step is to find three users similar to the user Olivia who have also rated the movie 2001: A Space Odyssey. From the table, the top three users similar to the user Olivia can be found: Kim (0.90), Alanna (−0.20), and Josephine (−0.20).

Table 11.4. User-to-User Similarity Matrix

Josephine Olivia Amelia Zoe Alanna Kim
Josephine 1.00 −0.20 0.28 −0.30 0.74 0.11
Olivia 1.00 −0.65 −0.50 −0.20 0.90
Amelia 1.00 0.33 0.13 −0.74
Zoe 1.00 0.30 −0.67
Alanna 1.00 0.00
Kim 1.00

URL:

https://www.sciencedirect.com/science/article/pii/B9780128147610000113

Classification

Vijay Kotu, Bala Deshpande, in Data Science (Second Edition), 2019

Cosine similarity

Continuing with the example of the document vectors, where attributes represent either the presence or absence of a word: it is possible to construct a more informative vector with the number of occurrences in the document, instead of just 1 and 0. Document datasets are usually long vectors with thousands of variables or attributes. For simplicity, consider the example of the vectors X (1,2,0,0,3,4,0) and Y (5,0,0,6,7,0,0). The cosine similarity measure for two data points is given by:

(4.12) Cosine similarity(X, Y) = (x · y) / (||x|| ||y||)

where x · y is the dot product of the x and y vectors with, for this example,

x · y = Σ_{i=1}^{n} x_i y_i and ||x|| = √(x · x)

x · y = 1 × 5 + 2 × 0 + 0 × 0 + 0 × 6 + 3 × 7 + 4 × 0 + 0 × 0 = 26
||x|| = √(1 × 1 + 2 × 2 + 0 × 0 + 0 × 0 + 3 × 3 + 4 × 4 + 0 × 0) = 5.5
||y|| = √(5 × 5 + 0 × 0 + 0 × 0 + 6 × 6 + 7 × 7 + 0 × 0 + 0 × 0) = 10.5
Cosine similarity(x, y) = (x · y) / (||x|| ||y||) = 26 / (5.5 × 10.5) = 0.45
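These values are quick to check in R (a sketch using the vectors above):

x <- c(1, 2, 0, 0, 3, 4, 0)
y <- c(5, 0, 0, 6, 7, 0, 0)
sum(x * y)                                       # dot product
sqrt(sum(x^2)); sqrt(sum(y^2))                   # norms, about 5.5 and 10.5
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))   # cosine similarity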

The cosine similarity measure is one of the most used similarity measures, but the determination of the optimal measure comes down to the data structures. The choice of distance or similarity measure can also be parameterized, where multiple models are created, each with a different measure. The model with the distance measure that best fits the data, with the smallest generalization error, can be the appropriate proximity measure for the data.

URL:

https://www.sciencedirect.com/science/article/pii/B9780128147610000046

Text Mining and Network Analysis of Digital Libraries in R

Eric Nguyen, in Data Mining Applications with R, 2014

4.5.1 Computing Similarities Between Documents

Let us determine how documents relate to each other in our corpus. Let t_1 and t_2 be two vectors, respectively, representing the topic associations of documents d_1 and d_2, where t_1(i) and t_2(i) are, respectively, the number of terms in d_1 and d_2 which are associated with topic i. One can then use the cosine similarity to derive a measure of document similarity:

( Σ_i t_1(i) · t_2(i) ) / ( ∥t_1∥ · ∥t_2∥ )

Here, ∥t_j∥ denotes the norm of vector t_j.

In the current example, we will use the rows of the matrix res$document_sums as the list of features. Hence, two documents are similar if they share a similar topic distribution. The following lines will compute and output the similarity matrix for the documents.

> mat <- t(as.matrix(res$document_sums)) %*% as.matrix(res$document_sums)

> d <- diag(mat)

> sim <- t(t(mat/sqrt(d))/sqrt(d))

> sim[1:5, 1:5]

          [,1]       [,2]       [,3]       [,4]       [,5]
[1,] 1.0000000 0.46389797 0.52916839 0.53162745 0.26788474
[2,] 0.4638980 1.00000000 0.84688328 0.90267821 0.06361709
[3,] 0.5291684 0.84688328 1.00000000 0.97052892 0.07256801
[4,] 0.5316274 0.90267821 0.97052892 1.00000000 0.07290523
[5,] 0.2678847 0.06361709 0.07256801 0.07290523 1.00000000

The resulting matrix is a symmetric matrix where the entry in row i and column j represents the cosine similarity measure between documents d_i and d_j. The larger the entries, the more similar the publications are in terms of topic associations. An entry of 1 indicates identical publications in terms of topic associations.

In our example, documents 3 and 5 are completely dissimilar and documents 2 and 3 are somewhat similar. As a matter of fact, document 3 relates to the analysis of partial differential equations and document 5 discusses quantum algebra. Document 2 in our corpus is a scientific paper discussing the analysis of partial differential equations as well.

Alternatively, one can use the res$document_sums matrix to compute distances between the documents, instead of using the cosine similarity measure. The dist function in R allows one to do so.

> as.matrix(dist(t(res$document_sums)))[1:5, 1:5]

         1        2        3        4        5
1  0.00000 38.11824 41.96427 36.27671 50.45790
2 38.11824  0.00000 26.49528 11.00000 46.03260
3 41.96427 26.49528  0.00000 20.12461 57.50652
4 36.27671 11.00000 20.12461  0.00000 46.28175
5 50.45790 46.03260 57.50652 46.28175  0.00000

Again, the distance between documents 2 and 3 is relatively small compared to other distance values, which reflects the fact that they are somewhat similar.

The dist function accepts many arguments, but the most important one is the method used for calculating distances. This makes it easier to adjust the distance calculation method to the underlying dataset and objectives. The provided options are the euclidean method, which happens to be the default one, and the maximum, manhattan, canberra, binary, and minkowski distance methods. The different distance methods are detailed in the dist function help page.
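For instance, a Manhattan-distance version of the matrix above could be obtained as follows (a sketch; the resulting values depend on the fitted topic model, so the output is omitted):

> as.matrix(dist(t(res$document_sums), method = "manhattan"))[1:5, 1:5]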

URL:

https://www.sciencedirect.com/science/article/pii/B9780124115118000049

FACE MODELING BY INFORMATION MAXIMIZATION

In Face Processing, 2006

7.4.1 Face-Recognition Performance: Architecture I

Face recognition performance was evaluated for the coefficient vectors b by the nearest neighbor algorithm, using cosines as the similarity measure. Coefficient vectors in each test set were assigned the class label of the coefficient vector in the training set that was most similar as evaluated by the cosine of the angle between them:

(12) c = (b_test · b_train) / (||b_test|| ||b_train||).

Face-recognition performance for the principal component representation was evaluated by an identical procedure, using the principal-component coefficients contained in the rows of R_200.

In experiments to date, ICA performs significantly better using cosines rather than Euclidean distance as the similarity measure, whereas PCA performs the same for both. A cosine similarity measure is equivalent to length-normalizing the vectors prior to measuring Euclidean distance when doing nearest neighbor:

(13) d²(x, y) = ||x||² + ||y||² − 2(x · y) = ||x||² + ||y||² − 2||x|| ||y|| cos α. Thus if ||x|| = ||y|| = 1, min_y d²(x, y) ⇔ max_y cos α.

Such normalization is consistent with neural models of primary visual cortex [27]. Cosine similarity measures were previously found to be effective for computational models of language [28] and face processing [55].
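This equivalence is easy to verify numerically; the sketch below uses random vectors as stand-ins for the coefficient vectors b and checks that cosine similarity and Euclidean distance on length-normalized vectors pick the same nearest neighbor:

set.seed(0)
b_test  <- rnorm(10)                        # one test coefficient vector (stand-in)
b_train <- matrix(rnorm(5 * 10), nrow = 5)  # five training coefficient vectors (stand-ins)

# cosine of the angle between the test vector and each training vector, Eq. (12)
cosines <- apply(b_train, 1, function(b) sum(b_test * b) / (sqrt(sum(b_test^2)) * sqrt(sum(b^2))))

# Euclidean distances after length-normalizing every vector
unit  <- function(v) v / sqrt(sum(v^2))
dists <- apply(b_train, 1, function(b) sqrt(sum((unit(b_test) - unit(b))^2)))

which.max(cosines) == which.min(dists)   # TRUE: both choose the same nearest neighbor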

Figure 7.6 gives face-recognition performance with both the ICA and the PCA based representations. Recognition performance is also shown for the PCA based representation using the first 20 principal component vectors, which was the eigenface representation used by Pentland, Moghaddam, and Starner [60]. Best performance for PCA was obtained using 200 coefficients. Excluding the first 1, 2, or 3 principal components did not improve PCA performance, nor did selecting intermediate ranges of components from 20 through 200. There was a trend for the ICA representation to give superior face-recognition performance to the PCA representation with 200 components. The difference in performance was statistically significant for test set 3 (Z = 1.94, p = 0.05). The difference in performance between the ICA representation and the eigenface representation with 20 components was statistically significant over all three test sets: (Z = 2.5, p < 0.05) for test sets 1 and 2, and (Z = 2.4, p < 0.05) for test set 3.

FIGURE 7.6. Percentage correct face recognition for the ICA representation using 200 independent components, the PCA representation using 200 principal components, and the PCA representation using 20 principal components. Groups are performances for test set 1, test set 2, and test set 3. Error bars are one standard deviation of the estimate of the success rate for a Bernoulli distribution.

Recognition performance using different numbers of independent components was also examined by performing ICA on 20 to 200 image mixtures in steps of 20. Best performance was obtained by separating 200 independent components. In general, the more independent components were separated, the better the recognition performance. The basis images also became increasingly spatially local as the number of separated components increased.

URL:

https://www.sciencedirect.com/science/article/pii/B978012088452050008X

Using clickstream data to enhance reverse engineering of Web applications

Marko Poženel, Boštjan Slivnik, in Advances in Computers, 2020

6 Metrics for clustering the ATG of a Web application

To cluster Web application pages and thus relevantly visualize the application's ATG, the distance between these pages must obviously be defined. Among many possible definitions of distance, two definitions are used:

1.

the one based on page names and

2.

the other based on application usage data.

The first definition relies on the names found in the source code and thus exposes the static information about the Web application, whereas the second definition uses Web server access log files and exposes the dynamic information about the behavior of a Web application.

6.1 The distance based on the component name

The first approach is based on the assumption that application developers used appropriate and systematic naming of pages depending on the part of the application to which an individual page belongs. This is based on a well-established finding that automated program analysis and comprehension can be based on identifiers and names of program entities in general [81, 82].

Therefore, one should be able to identify the central parts of an application just by observing that pages with similar names are likely to belong to the same application part while pages with different names are likely to belong to different parts of the application.

To cluster the set Γ of all Web application components, all components must be mapped into an appropriate vector space in order to define distances between any two components.

A component name is usually a multiword identifier which must be split into individual words first. Different naming conventions are used for constructing multiword identifiers, with letter-case separated words (also called medial capitals in the Oxford English Dictionary, or just CamelCase) and delimiter separated words (called snake case if the underscore serves as a delimiter) being most widely used.

Let the function split map the name of a Web application component into a multiset of all words appearing in its name (a multiset 𝒜 is a pair (A, m) where A is a set and m : A → ℕ; furthermore, let a ∈ 𝒜 iff a ∈ A). The actual function split depends heavily on the naming convention used in the Web application which is being reverse engineered. A set of all the words which component names consist of can thus be expressed as

W = { w | ∃ γ ∈ Γ : w ∈ split(γ) }.

Let w_i be the ith word in W according to some fixed ordering of W.

All components of Γ are mapped into an n dimensional vector space V over ℤ where n = |W|. Each word w_i ∈ W is mapped into a basis vector w_i = (a_1, a_2, …, a_n) where a_j = 1 if i = j and a_j = 0 if i ≠ j. Hence, if split(γ) = (A_γ, m_γ), a component γ ∈ Γ is mapped into a vector

v_γ = Σ_{i=1}^{n} a_i w_i, where a_i = 0 if w_i ∉ A_γ and a_i = m_γ(w_i) if w_i ∈ A_γ.

Once all components are mapped into vector space, the cosine similarity between vectors v_1 and v_2, defined as

sim(v_1, v_2) = cos φ = (v_1 · v_2) / (||v_1|| ||v_2||),

is used as the distance between two vectors.

As the component a_i of the vector v_γ represents the number of occurrences of the word w_i in the name of the Web component γ, it is always nonnegative, i.e., a_i ≥ 0, and therefore sim(v_1, v_2) ∈ [0, 1].

Based on the cosine similarity, the distance matrix D_n ∈ ℝ^{n×n} (index n means names) contains elements d_{i,j} for i, j ∈ {1, 2, …, n}, where d_{i,j} = sim(v_i, v_j).

The data about all application pages is also stored in a data Webhouse. We acquired 354 distinct application pages from a star schema page dimension representing application pages. The number of pages obtained from the page dimension (354) is greater than the number of pages obtained from clickstream data (318), since some application pages had never been accessed in the period when clickstream data were collected. However, these pages still exist in the system and have to be dealt with. Application page names were tokenized and mapped into vector space using cosine similarity between vectors (Section 6). Application page names were given by developers so that they could be easily broken into page name tokens, e.g., VZ63_STUDENT_EXAM_APPLICATION. The data about cosine similarity between page vectors was stored in a distance matrix D_n (index n denotes names) of size 354 × 354.
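A minimal R sketch of this name-based similarity for two snake-case component names (the second name is invented for illustration; only VZ63_STUDENT_EXAM_APPLICATION appears in the text):

# split: map a component name to the multiset (word counts) of its words
split_name <- function(name) table(strsplit(tolower(name), "_")[[1]])

name_sim <- function(n1, n2) {
  w1 <- split_name(n1)
  w2 <- split_name(n2)
  words <- union(names(w1), names(w2))               # fixed ordering of W
  v1 <- as.numeric(w1[words]); v1[is.na(v1)] <- 0    # word-count vector v_gamma
  v2 <- as.numeric(w2[words]); v2[is.na(v2)] <- 0
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2))) # cosine similarity
}

name_sim("VZ63_STUDENT_EXAM_APPLICATION", "VZ64_STUDENT_EXAM_LIST")   # 0.5: shares "student" and "exam"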

6.2 The distance based on Web application usage

After a session is reconstructed, a set of all pages for which at least one request is recorded in the log file(s), and a set of user sessions, become available. Let P and S denote the set of all pages and the set of all user sessions, respectively.

Each session in S is a sequence of Web page requests issued one after another. Formally, session s ∈ S consisting of n(s) consecutive requests for Web pages p_1, p_2, …, p_{n(s)} is

s = p_1, p_2, …, p_{n(s)} ∈ ×_{i=1}^{n(s)} P

and thus S ⊆ ×_{i=1}^{∞} P.

In order to carry out the clustering process, attributes or measures have to be defined. Web pages in a Web log file have few attributes which can be relied upon in the clustering process. Each Web page has attributes such as the click rate or the average number of distinct appearances in a session, which can be used for attribute based clustering, but clustering results based on those attributes were not very promising. In order to use the multidimensional scaling (MDS) clustering algorithm, which can employ the distance between two objects for the clustering process, we had to determine an appropriate measure which defines the similarity between two pages and which would be used as input to the distance based clustering algorithm. We first defined the similarity between two individual Web pages, which represents a basis for determining the distance between two individual pages.

The similarity between two Web pages p_i and p_j is based on the number of times both Web pages appear in the same Web session s and is defined as

sim(p_i, p_j) = 1 if p_i = p_j, and |{ s ∈ S | p_i ∈ s ∧ p_j ∈ s }| / |S| otherwise.

Further, the distance between two Web pages is derived from the similarity between these two Web pages and is defined as

dist(p_i, p_j) = 1 − sim(p_i, p_j) if sim(p_i, p_j) > 0, and ∞ otherwise.

With the defined distance between two pages, we can further define the distance matrix D_u ∈ ℝ^{|P|×|P|} (index u means usage) with elements dist_{i,j} for i, j ∈ {1, 2, …, |P|}, where p_i and p_j are the ith and jth page in P according to some fixed ordering of the pages in P. If two pages appear in the same sessions many times, it implies that these two pages should most probably belong to the same cluster, so their distance must be small. To summarize, two pages without direct transitions are at distance ∞, while two pages which both appear in every user Web session are at distance 0. Again, matrix D_u represents the input to the clustering algorithms computed by Orange and the custom implemented graph drawing program proposed by Kamada–Kawai [83].
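A small R sketch of this usage-based distance, with a handful of made-up sessions (the page names and sessions are invented; the code follows the similarity and distance definitions given above):

# toy sessions: each entry lists the pages requested in one user session
sessions <- list(c("home", "search", "exam"), c("home", "exam"), c("home", "profile"))
pages <- sort(unique(unlist(sessions)))

usage_sim <- function(p_i, p_j) {
  if (p_i == p_j) return(1)
  # fraction of sessions in which both pages appear
  sum(sapply(sessions, function(s) p_i %in% s && p_j %in% s)) / length(sessions)
}
usage_dist <- function(p_i, p_j) {
  s <- usage_sim(p_i, p_j)
  if (s > 0) 1 - s else Inf
}

# distance matrix D_u over all page pairs
D_u <- outer(pages, pages, Vectorize(usage_dist))
rownames(D_u) <- colnames(D_u) <- pages
D_u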

URL:

https://www.sciencedirect.com/science/article/pii/S0065245819300324

Recommendation Algorithms for Implicit Information

Xinxin Bai, ... Jin Dong, in Service Science, Management, and Engineering:, 2012

5.6.2.1 Results of the kNN Models

First, we implement the kNN models with different similarity measures [(5.2) and (5.7)] and rating strategies [(5.3) and (5.8)]. When the conventional cosine measure (5.2) is used to compute similarity between two users and the most frequently purchased product strategy (5.3) is used to produce ratings and then recommendations, kNN finds an average recommendation precision of 0.2328 in our five random splits of the whole data set if the number of neighbors is chosen to be 150. The numbers of neighbors for all the kNN models are chosen to obtain the best recommendation accuracies, respectively. The average precision increases to 0.2343 with the same number of neighbors if we use the weighted cosine similarity (5.7) and the log-rating strategy (5.8), where the shrinkage hyperparameter c = 1. Our new similarity measure and rating strategy improve the recommendation accuracy.

For comparison, we also use user features to make recommendations. The feature variables are first normalized to the same range [0, 1] and then used to calculate the similarity between different users. We use the cosine measure with the shrinkage technique (and the shrinkage parameter c = 2) because of the sparsity of the feature values. The final average recommendation precision is 0.2518 (the number of neighbors is still 150), which is much lower than 0.2343.

To improve the accuracy of our new kNN model [(3-5)+(3-8)], we linearly combine its similarity and the similarity calculated from the features. The combining proportion is determined by grid search from 1.0 to 0.0 with a step size of 0.05 to reach the best recommendation accuracy. Finally, we obtain the optimal combining proportion of 0.65, and the average recommendation precision increases from 0.2343 to 0.2376 with the same number of neighbors. We refer to the model as hybrid kNN.
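The grid search over the combining proportion can be sketched as follows; the two similarity matrices and the evaluation routine below are random/trivial stand-ins for the model's actual user-to-user similarities and precision evaluation:

set.seed(7)
n <- 50
sim_new  <- matrix(runif(n * n), n, n)   # stand-in for the new kNN similarity
sim_feat <- matrix(runif(n * n), n, n)   # stand-in for the feature-based similarity

# stand-in evaluation: in the chapter this would run the kNN recommender with
# 150 neighbors and return the average precision over the random splits
evaluate_precision <- function(sim) mean(sim)

alphas <- seq(1.0, 0.0, by = -0.05)      # proportion given to the new similarity
precisions <- sapply(alphas, function(a) evaluate_precision(a * sim_new + (1 - a) * sim_feat))
alphas[which.max(precisions)]            # the chapter reports an optimum of 0.65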

Table 5.1 shows the average recommendation precisions of the above kNN models and their parameter settings in detail.

Table 5.1. Average recommendation precisions of the kNN models in five random splits of the whole data set.

Similarity Measure Rating Strategy Shrinkage Parameter Precision
S_f^sc (3-8) γ = 2 0.2518
S^c (3-3) 0.2328
S^wc (3-3) 0.2337
S^swc (3-8) γ = 1 0.2343
0.35 S_f^sc + 0.65 S^swc (3-8) 0.2376

Different kNN models use different similarity measures and rating strategies to obtain recommendations. The number of neighbors is chosen to be 150, which produces the best recommendation accuracies for all the kNN models. The superscripts "s", "c", and "w" stand for "shrunk", "cosine", and "weighted", respectively. The subscript "f" indicates the corresponding similarity calculated based on the user features.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123970374000053