a code

a code

public void m(){
	n(this);
}

public static n(Object obj){
	if(obj==null){
		doSomething();
	}
}

inlining

public void m(){
	if(this==null){
		doSomething();
	}
}

null check folding

public void m(){
	if(false){
		doSomething();
	}
}

dead code elimination

public void m(){

}

and last but not least

-XX:+UseFastEmptyMethods

address entry_point = __ pc();

// If we need a safepoint check, generate full interpreter entry.
Label slow_path;
__ cmp32(ExternalAddress(SafepointSynchronize::address_of_state()),
         SafepointSynchronize::_not_synchronized);
__ jcc(Assembler::notEqual, slow_path);

// do nothing for empty methods
// (do not even increment invocation counter)
// Code: _return
// _return
// return w/o popping parameters
__ pop(rax);
__ mov(rsp, r13);
__ jmp(rax);

WAT?

be a hero

5058896 6644313672 50292

of your data with Neo4j

about me

twitter:@j_palka

  • code mangler at Allegro.tech, analytics, user profile and behavior

  • JDD/4Developers conferences, dictator for life and trouble maker

  • anything about JVM, bytecode, parsers, compilers and graphs of course

  • former chief architect, manager, database admin and performance freak

graphs: the primer

a sip of theory

a graph is an ordered pair G = (V, E) comprising a set V of vertices, nodes or points together with a set E of edges, arcs or lines, which are 2-element subsets of V (i.e., an edge is related with two vertices,and the relation is represented as an unordered pair of the vertices with respect to the particular edge). To avoid ambiguity, this type of graph may be described precisely as undirected and simple
— Wikipedia

neo4j: the primer

neo4j in 3 sentences

A Graph — records data in › Nodes — which have › Properties
Nodes — are organized by › Relationships — which also have › Properties
Nodes — are grouped by › Labels — into › Sets

no fluff just stuff

Neo4j is a graph database,
it is not another layer of abstraction over existing database engine

it has its own storage mechanism for persisting nodes and edges

building blocks

  • node

    • id

    • labels

    • properties

node

diag d22ee3d9a405b645a19290c960d8a360

building blocks

  • relationship

    • relationship type

    • direction

    • properties

relationship

diag 3b3674aa54af4b122d8e3cac137fab77

building blocks

  • schema

    • index

    • constraints

properties

  • Java primitives (long, int, float, etc.)

  • String

  • arrays of the above

cypher: the primer

introduction to cypher

looks like SQL, feels like pattern-matching

MATCH (n:Label) RETURN n;
MATCH (n)-[:LIKES]->m return m;
MATCH (n)-[r:LIKES]->m return r;

pattern matching in cypher

(a)-->(b)
(a)-->(b)<--(c)
(a:User)-->(b)
(a:User:Admin)-->(b)
(a)-[r:REL_TYPE]->(b)
(a)-[*2]->(b)
(a)-[*3..5]->(b)

patterns are expressions too

patterns return collection of paths

MATCH a-[:REL]-b, (c:Label) WHERE NOT b-[:REL]-c RETURN b;
MATCH a-[:REL]-b, (c:Label) RETURN a,b-[:REL]-c;
MATCH a-[:REL]-b, (c:Label) WITH a, b-[:REL]-c RETURN a, count(distinct c);

collections

RETURN [0,1,2,3,4,5,6,7,8,9] AS collection
RETURN range(0,10)[3]
RETURN length(range(0,10)[0..3])
RETURN [x IN range(0,10) WHERE x % 2 = 0 | x^3] AS result

predicates

RETURN [x IN range(0,10) WHERE x % 2 = 0 | x^3] AS result;

WHERE a.name='Alice' AND b.name='Daniel' AND
ALL (x IN nodes(p) WHERE x.age > 30);

WHERE a.name='Eskil' AND ANY (x IN a.array WHERE x = "one");

WHERE n.name='Alice' AND NONE (x IN nodes(p) WHERE x.age = 25);

WHERE n.name='Alice' AND
SINGLE (var IN nodes(p) WHERE var.eyes = "blue");

MATCH (n)
WHERE EXISTS(n.name)
RETURN n.name AS name, EXISTS((n)-[:MARRIED]->()) AS is_married

paths

MATCH p = ((a)--(b)) RETURN nodes(p)
MATCH p = ((a)--(b)) RETURN relationships(p)
MATCH p = ((a)-[:RAILS]-(b)) UNWIND nodes(p) AS n return DISTINCT n;

pipes

WITH clause works like | (pipe) in *NIX,
it allows you to pipeline subsequent queries

MATCH (david { name: "David" })-[]->(otherPerson)
WITH otherPerson, count(*) AS foaf
WHERE foaf > 1
RETURN otherPerson

caught in a trap

because Cypher is syntactically, the younger brother of SQL,
you fall into the trap of writing relational queries,
which will use the Cartesian product.

This is not a blessed way of graph Gods!!!

AngryGod1
logo showmethecode

understanding graphs

diag 28ef8e1c29f4f6819f3609469b1870f9

universal law of graph modeling

sentences are patterns

nouns are vertexes

verbs are edges

feeding the graph

CREATE
	(t0:Training {title : "Code archeology and architecture" }),
      	(n0:Trainer { name : "Jarosław Pałka"}),
        (s0:Skill { name : "Git" }),
        (s1:Skill { name : "SonarQube"}),
	t0-[:AUTHORED_BY]->n0,
	t0-[:REQUIRES]->s0,
	t0-[:INTRODUCES]->s1;

the shortest path

MATCH p=shortestPath(
	(s0:Skill {name : "Java"})
	-[:REQUIRES|INTRODUCES*]-
	(s1:Skill {name : "REST"})
WHERE NOT ()<-[:REQUIRES]-s1
RETURN p;

the shortest path

MATCH (a:Skill {name:"Java"})<-[:REQUIRES]-(b:Training)
MATCH (c:Training)-[:INTRODUCES]->(d:Skill {name:"REST"})
MATCH p=allShortestPaths(b-[rels:REQUIRES|INTRODUCES*]-c)
WHERE length(p)%2=0 AND
ALL(idx in range(0, length(p)-2,2) WHERE
	type(rels[idx])="INTRODUCES" AND type(rels[idx+1]) = "REQUIRES")
RETURN p

show me your inbox

model first

first class citizens of my inbox

  • mails

  • people

asciidoctor diagram classes

you’ve got mail

there are hidden relationships in our graph

first question
where are email threads?

small hint
emails have Message-Id and In-Reply-To headers

materialize

MATCH (n:Message)
MATCH (m:Message)
WHERE n.`Message-Id`=m.`In-Reply-To`
CREATE n<-[:IN_REPLY_TO]-m;

but first, indexes

CREATE INDEX ON :Message(`In-Reply-To`);
CREATE INDEX ON :Message(`Message-Id`);

weapon of mass destruction

elephant talk

so, what’s the longest email thread in our inbox?

small hint
stop, and think for a moment, what is email thread?

diag 22a6f313f9024adb19d712f0d30a3afd
MATCH p=((n:Message)<-[:IN_REPLY_TO*]-(m:Message))
WHERE NOT ()<-[:IN_REPLY_TO]-n AND NOT (m)<-[:IN_REPLY_TO]-()
WITH n, length(nodes(p)) AS c
RETURN n.Subject, c
ORDER BY c DESC
LIMIT 5;

notes on performance

Cypher has EXPLAIN and PROFILE clauses

EXPLAIN
MATCH (p:Person { name:"Tom Hanks" })
RETURN p

PROFILE
MATCH (p:Person { name:"Tom Hanks" })
RETURN p

be specific, use labels and relationship types,
it will save hours of debugging and CPU cycles

evolution of model

asciidoctor diagram classes

evolution of model

diag f3aa1b239af5d29ad43a198493bbd158

let’s go social

or let’s go for a drink :)

small hint
if I know your email address, we are friends, aren’t we?

OPTIONAL MATCH (m:Message)<-[:IN_REPLY_TO]-(n:Message)-[:TO|CC|BCC]->r
WHERE NOT m-[:TO|CC|BCC]->r
WITH n,r
MATCH n-[:FROM]->f
MERGE r-[:KNOWS]-f;

evolution of model

asciidoctor diagram classes

evolution of model

diag f3aa1b239af5d29ad43a198493bbd158

evolution of model

diag ae123a418f7be74bd39c3122e5aa1bba

with a little bit of help from my friends

clusters - tightly knit groups characterized by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes
— Wikipedia

our graph is biased,
because owner of the inbox has relationships to every message

clustering coefficient

a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes
— Wikipedia

a tasty bite of theory

local clustering coefficient of a vertex (node) in a graph quantifies how close its neighbors are to being a clique (complete graph)

it is then given by the proportion of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them.

\$C_i=(|{e_(jk) : v_j,v_k in N_i,e_(jk) in E}|)/(k_i(k_i-1))\$

where neighborhood \$N_i\$ is:

\$N_i={v_j : e_(ij) in E vv e_(ji) in E}\$

and \$k_i\$ is number of neighboors of vertex \$i\$

MATCH (a:InternetAddress)-[:KNOWS]-(b:InternetAddress)
WITH a, count(distinct b) as neighbours
MATCH (a)-[:KNOWS]-()-[r:KNOWS]-()-[:KNOWS]-(a)
WHERE neighbours>1
WITH a, neighbours, count(distinct r) AS connected_neighbours
RETURN
a.address,
(2*toFloat(connected_neighbours))/(neighbours*(neighbours-1)) as coefficient
ORDER BY coefficient DESC;

go deeper

we can even go deeper and try to play with cliques

is a subset of vertices of an undirected graph such that its induced subgraph is complete; that is, every two distinct vertices in the clique are adjacent.
— Wikipedia

or strongly connected components

a graph is said to be strongly connected if every vertex is reachable from every other vertex
— Wikipedia

let’s go for a movie

we are now going to work with bigger model,
publicly available movies recommendation data set from MovieLens, created by GroupLens Research.

Data set contains movies recommendations, gathered through their website.

model

diag bbe862c978f0cc3b7db7de6f399c3649

how to grow your graph

LOAD CSV FROM 'file:data.csv' AS line  FIELDTERMINATOR '|'
WITH line[0] AS firstname, line[0] AS lastname
CREATE (:Person {firstname : firstname, lastname : lastname});
LOAD CSV WITH HEADERS FROM 'file:data.csv' AS line
CREATE (:Person {firstname : line.firstname, lastname : line.lastname});

skip jump

diag a1d75c1283236dd709f188c8132df1ca
MATCH (a:User)-[:RATED]->(m:Movie)<-[:RATED]-(b:User)-[:RATED]->(n:Movie)
WHERE a.userId="333" AND NOT (a-[:RATED]->n)
RETURN n.title LIMIT 50;

collaborative filtering

k-nearest neighbors and cosine similarity
this technique uses vector of ratings

\$cos(theta)=(A*B)/(|\ |A|\ |*|\ |B|\ |)=(sum_(i=1)^n A_i xx B_i)/(sqrt(sum_(i=1)^n (A_i)^2) * sqrt(sum_(i=1)^n (B_i)^2))\$

MATCH (p1:User)-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
WITH  SUM(x.rating * y.rating) AS xyDotProduct,
      SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
      SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
      p1, p2
MERGE (p1)-[s:SIMILARITY]-(p2)
SET   s.similarity = xyDotProduct / (xLength * yLength);
MATCH 	 (p1:User {userId:333})-[s:SIMILARITY]-(p2:User)
WITH 	 p2, s.similarity AS sim
ORDER BY sim DESC
LIMIT 	 5
RETURN 	 p2.userId AS Neighbor, sim AS Similarity

appendix

Q&A