FoodInformatics / msx-tool
Commit b0ac3562
Authored Mar 31, 2021 by Bianco Martinez, Julian
Code that expand the graph by intersection of concepts = GEIC
Parent: 0c49510d
Changes: 1
JBM_Intersection/Graph_Expantion_by_Intersection.py (new file, mode 0 → 100644)
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 31 12:09:38 2021

@author: Julian Bianco-Martinez

Graph Expansion by Intersection of Concepts (GEIC)
"""
import pandas as pd
import numpy as np


def concepts_extraction_list(lst1):
    '''
    Extract concepts from word2vec output
    '''
    lst3 = [value[0] for value in lst1]
    return lst3


def GEIC(concepts, words_to_exclude=None, topX_intersect_concepts=5, topXsimilarConcepts=100, threshold=0.5):
    '''
    From a set of concepts C0, this algorithm collects and weights similar concepts that intersect with concepts in C0.

    Inputs:
        concepts: Original concepts (C0).
        words_to_exclude: New concepts that also appear in this list are removed. If it is empty or omitted, words_to_exclude = concepts.
        topX_intersect_concepts: Only collect the top X concepts found in the intersection of similar concepts between each pair of original concepts in C0, i.e.
            C0(i) int C0(j) = IC(i,j)[:topX_intersect_concepts]
        topXsimilarConcepts: Retrieve the top X similar concepts per concept in C0.
        threshold: Cut-off threshold; only intersected concepts that appear in more than threshold * 100 percent of all C0 pairs are retrieved.
    Output:
        Data frame with 4 columns:
            Column 1: C0(i)
            Column 2: C0(j)
            Column 3: IC(i,j)[:topX_intersect_concepts]
            Column 4: Importance of the intersected concept, defined as the fraction of pairwise combinations of C0 concepts in which the concept appears.
    '''
    if not words_to_exclude:
        # Default: exclude the original concepts themselves.
        words_to_exclude = list(concepts)
    # NOTE: `model` is not defined in this file; a word2vec-style model exposing
    # most_similar(word, topn=...) must already exist in the enclosing scope.
    frames = []
    # Only the upper triangle of pairs (i < j) is visited, since the intersection is symmetric.
    for i in range(len(concepts) - 1):
        concepts1 = concepts_extraction_list(model.most_similar(concepts[i], topn=topXsimilarConcepts))
        for j in range(i + 1, len(concepts)):
            concepts2 = concepts_extraction_list(model.most_similar(concepts[j], topn=topXsimilarConcepts))
            # Shared concepts, kept in the similarity order of concepts[i].
            inter = [v for v in concepts1 if v in concepts2][0:topX_intersect_concepts]
            if len(inter) > 0:
                frames.append(pd.DataFrame({'Concept 1': concepts[i],
                                            'Concept 2': concepts[j],
                                            'Intersection': inter}))
    # DataFrame.append was removed in pandas 2.0, so the per-pair frames are concatenated once.
    columns = ['Concept 1', 'Concept 2', 'Intersection']
    df_temp = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=columns)
    df_extension = df_temp
    # Remove words that contain fewer than 4 characters.
    df_extension = df_extension[df_extension['Intersection'].str.len() > 3]
    # Weight creation: fraction of the C0 pairs in which each intersected concept appears.
    weights = df_temp['Intersection'].value_counts().rename_axis(['Intersection']).reset_index(name='weight')
    # Number of unordered pairs, n * (n - 1) / 2, used for normalization.
    n_pairs = len(concepts) * (len(concepts) - 1) / 2
    weights['weight'] = weights['weight'] / n_pairs
    # Drop intersected concepts that appear in words_to_exclude.
    df_extension = df_extension[~df_extension['Intersection'].isin(words_to_exclude)]
    df_extension = df_extension.merge(weights, on="Intersection")
    df_extension = df_extension.loc[df_extension['weight'] > threshold]
    return df_extension


# Example
concepts = ['king', 'queen', 'prince']
GEIC(concepts, topX_intersect_concepts=15)
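
The pairwise-intersection and weighting steps can be tried without a trained model. The block below is a minimal sketch under stated assumptions: `ToyModel`, `pairwise_intersections`, and the neighbour lists are hypothetical illustrations and not part of this commit. The only thing assumed of the real model is a `most_similar(word, topn=...)` method that returns (concept, score) tuples, which is how the file above uses it.

```python
import pandas as pd


class ToyModel:
    """Hypothetical stand-in for a word2vec model; it only mimics most_similar()."""

    def __init__(self, neighbours):
        self.neighbours = neighbours

    def most_similar(self, word, topn=10):
        # Return (concept, score) tuples, best first.
        return self.neighbours[word][:topn]


def pairwise_intersections(model, concepts, top_x=5, topn=100):
    """Collect the shared neighbours of every unordered pair of concepts."""
    similar = {c: [w for w, _ in model.most_similar(c, topn=topn)] for c in concepts}
    rows = []
    for i in range(len(concepts) - 1):
        for j in range(i + 1, len(concepts)):
            shared = set(similar[concepts[j]])
            # Keep the similarity order of concepts[i], then truncate to top_x.
            inter = [w for w in similar[concepts[i]] if w in shared][:top_x]
            rows.extend((concepts[i], concepts[j], w) for w in inter)
    return pd.DataFrame(rows, columns=['Concept 1', 'Concept 2', 'Intersection'])


# Made-up neighbour lists, for illustration only.
toy = ToyModel({
    'king': [('royal', 0.9), ('throne', 0.8), ('crown', 0.7)],
    'queen': [('royal', 0.9), ('crown', 0.8), ('lady', 0.6)],
    'prince': [('royal', 0.9), ('heir', 0.8), ('throne', 0.7)],
})
concepts = ['king', 'queen', 'prince']
pairs = pairwise_intersections(toy, concepts)

# Normalize counts by the number of unordered pairs, n * (n - 1) / 2.
n_pairs = len(concepts) * (len(concepts) - 1) / 2
weights = pairs['Intersection'].value_counts() / n_pairs
print(weights)
```

With these made-up lists, 'royal' is shared by all three pairs and gets weight 1.0, while 'crown' and 'throne' each appear in one pair out of three, so only 'royal' would pass the default threshold of 0.5.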