Sanjeev – Abhi – Kartavya
Date: 18-11-03
Page: 1/9
Only 2–3 days back, I sent to Kartavya a mention from PC World magazine of an anti-spam software which works on the principle of: frequency of usage of words in an email.
If a human keeps telling this software:
(A) "This email is a spam."
(B) "This email is not a spam."
…it keeps learning! Simple.
The software simply keeps totaling up the words (and computes their usage frequencies) in each category, (A) and (B). Hence, the probability of occurrence of any given word keeps changing as the word population keeps growing in each category (of good/spam emails).
This is free software, and Abhi should download it for R&D.
Now, if you simply replace:
- "not-a-spam" = a resume email,
- "spam" = any other email,
then get a manual segregation of both types of populations (of emails), which, I think, Reema does on a daily basis, and then ask this software to "process" like-index words and compute their frequency of usage in both populations, then very soon you will get this software to learn to recognize:
- a "resume email" = not-a-spam
- an "ordinary email" = spam
PRESTO! You've got a ready-made solution (a free software, too!) which, as soon as it reads an incoming email, is able to decide accurately:
→ whether that email is a resume or not!
And what is more, this software keeps learning with the addition of each incoming email! Do try.
If you wish to learn how precisely this software works, visit 👉 www.paulgraham.com
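The word-frequency learning described above can be sketched as a tiny Bayes-style filter. This is a minimal sketch of the general technique, not the PC World product or Paul Graham's exact algorithm; all category names and training emails below are invented for illustration.

```python
import math
from collections import Counter

class WordFrequencyFilter:
    """Learns word-usage frequencies per category, as the memo describes."""

    def __init__(self):
        self.counts = {"resume": Counter(), "other": Counter()}
        self.totals = {"resume": 0, "other": 0}

    def train(self, text, category):
        # "keeps totaling up the words" in category (A) or (B)
        words = text.lower().split()
        self.counts[category].update(words)
        self.totals[category] += len(words)

    def word_prob(self, word, category):
        # add-one smoothing so an unseen word never gives zero probability
        return (self.counts[category][word] + 1) / (self.totals[category] + 2)

    def classify(self, text):
        # pick the category whose word frequencies best explain the email
        words = text.lower().split()
        return max(self.counts, key=lambda c:
                   sum(math.log(self.word_prob(w, c)) for w in words))

f = WordFrequencyFilter()
f.train("experience education skills objective references", "resume")
f.train("meeting agenda invoice payment schedule", "other")
print(f.classify("skills education experience"))  # → resume
```

Each new manually-labeled email simply updates the counters, which is exactly the "keeps learning with each incoming email" behaviour the memo praises.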
That brings me to the small news item pasted at the start of this note.
Once again, the researchers are falling back on the Theory of Probability (and ubiquitous Frequency Distribution Curves) to figure out:
- Good = plays written by Shakespeare
- Bad = plays written by someone else (but being palmed off as original Shakespearean!)
This they are trying to do by simply figuring out the patterns of usage of words in each of Shakespeare's plays.
At a very crude level, it is no more than computing:
a) probabilities of (usage of) keywords in all the plays written, or claimed to have been written, by Shakespeare (the entire population of plays);
b) probabilities of usage of keywords in some smaller (suspected) sub-set / sub-population of plays;
c) keywords used in a single play.
Then compute the Coefficient of Correlation (like a line of best fit).
- The higher the coefficient, the greater the probability that a given play was truly/genuinely written by Shakespeare himself.
- On the other hand, a low coefficient of correlation would indicate that the divergence is so great that the play is highly unlikely to have been written by Shakespeare.
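Steps (a)–(c) plus the correlation step could be sketched roughly as follows. The keyword list and the two texts are invented for illustration, and Pearson correlation stands in for whatever measure the researchers actually use.

```python
import math
from collections import Counter

def keyword_freqs(text, keywords):
    """Usage frequency of each keyword in a text (steps a/b/c of the memo)."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[k] / len(words) for k in keywords]

def correlation(x, y):
    """Pearson coefficient of correlation between two frequency vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

keywords = ["thee", "thou", "hath", "doth"]            # illustrative keywords
corpus = "thou art thee and thou hath doth thee thou"  # "all the plays"
play = "thee thou hath thee doth thou thou art"        # a single play
r = correlation(keyword_freqs(corpus, keywords),
                keyword_freqs(play, keywords))
print(round(r, 2))  # identical usage patterns give r close to 1
```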
In that case, the pattern (of usage of words) is at considerable variance (more than ±3σ) from the mean, quite like an SQC Control Chart.
(Sketch included: a fluctuating line with "Upper Control Limit (+3σ)", "Lower Control Limit (–3σ)", and "Mean" marked, and two outlier points beyond the limits labeled 99.9%.)
Obviously, these two readings could not have occurred due to chance variation (since they fall outside the control limits). So they are not part of the pattern; hence these two plays must have been written by someone other than Shakespeare!
What, if any, is the significance of:
- Probabilities
- Mean / Skewness
- Standard Deviations
- Upper Control Limit / Lower Control Limit
- Line of Best Fit
- Coefficient of Correlation
- Patterns
- etc., etc.
to our business?
Do all these have any practical use for us? Any specific competitive advantage that we can gain over our competitors (such as Monster / Naukri)? You bet!
Our Function Profile (with its Raw Score / Percentile / Population-Sample Size) is proof. With these profiles, a resume suddenly carries far more meaning compared to a plain email resume. The ImageBuilder, with its Function Profiles, makes much more sense. The profile of a jobseeker comes under much sharper focus; it is no longer diffused.
Now you do not need an experienced or trained HR professional to figure out the relative standing of a jobseeker in a long queue.
Yesterday we discussed how, in future, we will construct Function Profiles according to a Designation Level (within a function), a truly "Apple-for-Apple" comparison.
Today, when we construct function-wise profiles (Sales / Mktg / Production, etc.), all that we are doing is separating out:
→ Vegetables (Sales)
→ Fruits (Marketing)
…and then trying to say, "Mango is sweeter than Apple, and Apple is sweeter than Banana!"
When we succeed in creating Function Profiles designation-level-wise, we will have succeeded in separating Mangoes from Apples from Bananas!
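One way to sketch the designation-level idea: compute each percentile only within its (function, designation level) group, so that mangoes are compared with mangoes. All names, levels, and scores below are hypothetical.

```python
from collections import defaultdict

# Hypothetical jobseekers: (name, function, designation level, raw score)
resumes = [
    ("A", "Sales", "Manager", 62), ("B", "Sales", "Manager", 75),
    ("C", "Sales", "Executive", 58), ("D", "Marketing", "Manager", 70),
    ("E", "Sales", "Manager", 81), ("F", "Sales", "Executive", 64),
]

# Partition the population by (function, level): apples with apples
groups = defaultdict(list)
for name, func, level, score in resumes:
    groups[(func, level)].append(score)

def percentile(score, population):
    """Share of the group's population scoring below this raw score."""
    return 100 * sum(1 for s in population if s < score) / len(population)

for name, func, level, score in resumes:
    pct = percentile(score, groups[(func, level)])
    print(f"{name} {func}/{level}: raw {score}, percentile {pct:.0f}")
```

Note that D (Marketing/Manager) is never ranked against A, B, or E (Sales/Manager), even though all four are "Managers".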
So this has to be a definite goal. This will make more sense to a recruiter (subscriber), and he will be willing to pay a premium price for such a feature.
And then we will do the same thing with:
- Salary Profiles
- Designation-Level Profiles
- Tenure (Length of each job) Profiles
- Education Profiles
- Age Profiles
- Experience (Total Yrs) Profiles
- Actual Designation Profiles (Vacancy/Position Name)
- etc., etc.
That brings me to the question of "Duplicate Resumes" (and Duplicate Profiles?).
Initially, when a subscriber is trying to process/extract 50,000 resumes lying on his PC, it is understandable that he may come across identical resumes (same first name / last name / DOB) and may delete one of these (obviously the later-processed one) as a duplicate.
However, if we (i.e., Gururaj) come across such a duplicate resume after a week, a month, or a year, then:
→ we must process/extract it, and
→ replace the old ImageBuilder with the latest ImageBuilder under the same PEN.
It is obvious why we must replace the old ImageBuilder with the new (latest) ImageBuilder.
Even if the latest email resume (of the same candidate) is identical with the old one (i.e., not a single word changed), when we re-process it:
→ the Raw Score may remain the same,
→ but the Percentile will most likely change.
Because, during the intervening period, a lot of resumes (belonging to that function) got added, changing the population/sample size! So his function profile will be different (even if the keywords remain the same)!
And it is this new/revised profile that Gumline should store and make available to both the subscriber (during search) and the candidate (during bounce-back).
Then there is a good probability that, in the latest resume, some keywords have also changed. In that case, both the Raw Score and the Percentile would change, while the PEN remains the same.
It is only with this alive, ever-changing profile (sample size / raw score / percentile) that we can claim our software is self-learning in the most dramatic possible way! A subscriber can see this (self-learning/adaptive) behaviour for himself by simply re-processing the same resume every week for a week or two and seeing a different Function Profile each time. Then he will never tire of talking about Gumline to his friends!
(Signed / dated: 13-11-03)