Commit beeb826a authored by Jim Hoekstra

Merge branch 'JBM_Intersection_terms' into 'master'

Code that expands the graph by intersection of concepts = GEIC

See merge request !10
parents 155eb9ca 8e8e9f07
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 31 12:09:38 2021

@author: Julian Bianco-Martinez

Graph Expansion by Intersection of Concepts (GEIC)
"""
import pandas as pd
import numpy as np
def concepts_extraction_list(lst1):
    """Extract concepts from word2vec output."""
    lst3 = [value[0] for value in lst1]
    return lst3
def GEIC(concepts, words_to_exclude=None, topX_intersect_concepts=5, topXsimilarConcepts=100, threshold=0.5):
    """
    From a set of concepts C0, collect and weight the similar concepts that
    are shared between pairs of concepts in C0.

    NOTE: relies on a module-level `model` exposing most_similar(word, topn=...);
    it is not defined in this file.

    Parameters:
    concepts: Original concepts (C0).
    words_to_exclude: New concepts that also appear in this list are removed.
        If it is empty or None, words_to_exclude = concepts.
    topX_intersect_concepts: Only collect the top X concepts found in the
        intersection of similar concepts between pairs of original concepts, i.e.
        C0(i) int C0(j) = IC(i,j)[:topX_intersect_concepts]
    topXsimilarConcepts: Retrieve the top X similar concepts per concept in C0.
    threshold: Cut-off threshold; keep only intersected concepts that appear in
        more than threshold * 100 percent of all C0 pairs.

    Returns:
    Data frame with 4 columns:
        Column 1: C0(i)
        Column 2: C0(j)
        Column 3: IC(i,j)[:topX_intersect_concepts]
        Column 4: Importance of the intersected concept, defined as the fraction
            of pairwise combinations of C0 concepts in which the concept appears.
    """
    if len(concepts) < 2:
        # No pairs can be formed; the original factorial formula also raised ValueError here.
        raise ValueError("GEIC needs at least two concepts")
    if not words_to_exclude:  # words_to_exclude = concepts
        words_to_exclude = list(concepts)

    # Query the model once per concept rather than once per pair.
    similar = {c: concepts_extraction_list(model.most_similar(c, topn=topXsimilarConcepts))
               for c in concepts}

    # Creation of triangular data (due to symmetry).
    rows = []
    for i in range(len(concepts) - 1):
        for j in range(i + 1, len(concepts)):
            concepts2 = similar[concepts[j]]
            inter = [v for v in similar[concepts[i]] if v in concepts2][:topX_intersect_concepts]
            rows.extend((concepts[i], concepts[j], v) for v in inter)
    # DataFrame.append was removed in pandas 2.0, so the frame is built in one step.
    df_temp = pd.DataFrame(rows, columns=['Concept 1', 'Concept 2', 'Intersection'])

    # Weight creation: share of the n * (n - 1) / 2 concept pairs in which each term appears.
    n_pairs = len(concepts) * (len(concepts) - 1) / 2
    weights = df_temp['Intersection'].value_counts().rename_axis('Intersection').reset_index(name='weight')
    weights['weight'] = weights['weight'] / n_pairs

    # Remove words that contain fewer than 4 characters and words listed in words_to_exclude.
    keep = (df_temp['Intersection'].str.len() > 3) & ~df_temp['Intersection'].isin(words_to_exclude)
    df_extension = df_temp[keep].merge(weights, on='Intersection')
    return df_extension.loc[df_extension['weight'] > threshold]
concepts = ['king', 'queen', 'prince']
GEIC(concepts, topX_intersect_concepts = 15)
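
The call above depends on a global `model` that this file never defines. The sketch below shows the pairwise intersection and weighting on their own, assuming only that the model offers a gensim-style `most_similar(word, topn=...)` method returning (word, score) pairs. `ToyModel` and its neighbour lists are invented for illustration and are not part of the commit. The top-X truncation, the length filter and the exclusion list are left out for brevity.

```python
from collections import Counter
from itertools import combinations


class ToyModel:
    """Hypothetical stand-in for a word-vector model (not part of the commit)."""

    def __init__(self, neighbours):
        self.neighbours = neighbours

    def most_similar(self, word, topn=10):
        # Mimic the (word, score) pairs that GEIC reads from `model`.
        return self.neighbours[word][:topn]


# Invented neighbour lists, for illustration only.
model = ToyModel({
    'king':   [('monarch', 0.9), ('throne', 0.8), ('crown', 0.7)],
    'queen':  [('monarch', 0.9), ('crown', 0.8), ('princess', 0.7)],
    'prince': [('princess', 0.9), ('monarch', 0.8), ('heir', 0.7)],
})

concepts = ['king', 'queen', 'prince']
similar = {c: [w for w, _ in model.most_similar(c, topn=100)] for c in concepts}

# Count, for each shared term, the number of concept pairs it appears in.
pairs = list(combinations(concepts, 2))
counts = Counter()
for a, b in pairs:
    counts.update(w for w in similar[a] if w in similar[b])

# Normalise by the number of pairs, n * (n - 1) / 2.
weights = {term: count / len(pairs) for term, count in counts.items()}
print(weights)
```

With these toy lists only 'monarch' is shared by every pair, so it is the only term whose weight exceeds the default threshold of 0.5.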