Sanjeev – Abhi – Kartavya
Date: 18-11-03
Page: 1/9
Only 2–3 days back, I sent to Kartavya a mention from PC World magazine of an anti-spam software which works on the principle of: frequency of usage of words in an email.
If a human keeps telling this software:
(A) "This email is a spam."
(B) "This email is not a spam."
…it keeps learning! Simple.
The software simply keeps totaling up the words (and computes their usage frequencies) in each category, (A) and (B). Hence, the probability of occurrence of any given word keeps changing as the word population keeps growing in each category (of good/spam emails).
This is free software, and Abhi should download it for R&D.
Now, if you simply replace:
- "not-a-spam" = a resume email,
- "spam" = any other email,
then get a manual segregation of both types of populations (of emails), which, I think, Reema does on a daily basis, and then ask this software to "process" like-index words and compute their frequency of usage in both populations, then very soon you will get this software to learn to recognize:
- a "resume email" = not-a-spam
- an "ordinary email" = spam
PRESTO! You've got a ready-made solution (a free software, too!) which, as soon as it reads an incoming email, is able to decide accurately:
→ whether that email is a resume or not!
And what is more, this software keeps learning with the addition of each incoming email! Do try.
If you wish to learn how precisely this software works, visit 👉 www.paulgraham.com
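The word-frequency learning described above can be sketched as a tiny Bayes-style filter. This is a minimal sketch of the general technique, not the PC World product or Paul Graham's exact algorithm; all category names and training emails below are invented for illustration.

```python
import math
from collections import Counter

class WordFrequencyFilter:
    """Learns word-usage frequencies per category, as the memo describes."""

    def __init__(self):
        self.counts = {"resume": Counter(), "other": Counter()}
        self.totals = {"resume": 0, "other": 0}

    def train(self, text, category):
        # "keeps totaling up the words" in category (A) or (B)
        words = text.lower().split()
        self.counts[category].update(words)
        self.totals[category] += len(words)

    def word_prob(self, word, category):
        # add-one smoothing so an unseen word never gives zero probability
        return (self.counts[category][word] + 1) / (self.totals[category] + 2)

    def classify(self, text):
        # pick the category whose word frequencies best explain the email
        words = text.lower().split()
        return max(self.counts, key=lambda c:
                   sum(math.log(self.word_prob(w, c)) for w in words))

f = WordFrequencyFilter()
f.train("experience education skills objective references", "resume")
f.train("meeting agenda invoice payment schedule", "other")
print(f.classify("skills education experience"))  # → resume
```

Each new manually-labeled email simply updates the counters, which is exactly the "keeps learning with each incoming email" behaviour the memo praises.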
That brings me to the small news item pasted at the start of this note.
Once again, the researchers are falling back on the Theory of Probability (and ubiquitous Frequency Distribution Curves) to figure out:
- Good = plays written by Shakespeare
- Bad = plays written by someone else (but being palmed off as original Shakespearean!)
This they are trying to do by simply figuring out the patterns of usage of words in each of Shakespeare's plays.
At a very crude level, it is no more than computing:
a) probabilities of (usage of) keywords in all the plays written, or claimed to have been written, by Shakespeare (the entire population of plays);
b) probabilities of usage of keywords in some smaller (suspected) sub-set / sub-population of plays;
c) keywords used in a single play.
Then compute the Coefficient of Correlation (like a line of best fit).
- The higher the coefficient, the greater the probability that a given play was truly/genuinely written by Shakespeare himself.
- On the other hand, a low coefficient of correlation would indicate that the divergence is so great that the play is highly unlikely to have been written by Shakespeare.
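Steps (a)–(c) plus the correlation step could be sketched roughly as follows. The keyword list and the two texts are invented for illustration, and Pearson correlation stands in for whatever measure the researchers actually use.

```python
import math
from collections import Counter

def keyword_freqs(text, keywords):
    """Usage frequency of each keyword in a text (steps a/b/c of the memo)."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[k] / len(words) for k in keywords]

def correlation(x, y):
    """Pearson coefficient of correlation between two frequency vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

keywords = ["thee", "thou", "hath", "doth"]            # illustrative keywords
corpus = "thou art thee and thou hath doth thee thou"  # "all the plays"
play = "thee thou hath thee doth thou thou art"        # a single play
r = correlation(keyword_freqs(corpus, keywords),
                keyword_freqs(play, keywords))
print(round(r, 2))  # identical usage patterns give r close to 1
```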
In that case, the pattern (of usage of words) is at considerable variance (more than ±3σ) from the mean, quite like an SQC Control Chart.
(Sketch included: a fluctuating line with "Upper Control Limit (+3σ)", "Lower Control Limit (–3σ)", and "Mean" marked, and two outlier points beyond the limits labeled 99.9%.)
Obviously, these two readings could not have occurred due to chance variation (since they fall outside the control limits). So they are not part of the pattern; hence these two plays must have been written by someone other than Shakespeare!
What, if any, is the significance of:
- Probabilities
- Mean / Skewness
- Standard Deviations
- Upper Control Limit / Lower Control Limit
- Line of Best Fit
- Coefficient of Correlation
- Patterns
- etc., etc.
to our business?
Do all these have any practical use for us? Any specific competitive advantage that we can gain over our competitors (such as Monster / Naukri)? You bet!
Our Function Profile (with its Raw Score / Percentile / Population-Sample Size) is proof. With these profiles, a resume suddenly carries far more meaning compared to a plain email resume. The ImageBuilder, with its Function Profiles, makes much more sense. The profile of a jobseeker comes under much sharper focus; it is no longer diffused.
Now you do not need an experienced or trained HR professional to figure out the relative standing of a jobseeker in a long queue.
Yesterday we discussed how, in future, we will construct Function Profiles according to a Designation Level (within a function), a truly "Apple-for-Apple" comparison.
Today, when we construct function-wise profiles (Sales / Mktg / Production, etc.), all that we are doing is separating out:
→ Vegetables (Sales)
→ Fruits (Marketing)
…and then trying to say, "Mango is sweeter than Apple, and Apple is sweeter than Banana!"
When we succeed in creating Function Profiles designation-level-wise, we will have succeeded in separating Mangoes from Apples from Bananas!
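One way to sketch the designation-level idea: compute each percentile only within its (function, designation level) group, so that mangoes are compared with mangoes. All names, levels, and scores below are hypothetical.

```python
from collections import defaultdict

# Hypothetical jobseekers: (name, function, designation level, raw score)
resumes = [
    ("A", "Sales", "Manager", 62), ("B", "Sales", "Manager", 75),
    ("C", "Sales", "Executive", 58), ("D", "Marketing", "Manager", 70),
    ("E", "Sales", "Manager", 81), ("F", "Sales", "Executive", 64),
]

# Partition the population by (function, level): apples with apples
groups = defaultdict(list)
for name, func, level, score in resumes:
    groups[(func, level)].append(score)

def percentile(score, population):
    """Share of the group's population scoring below this raw score."""
    return 100 * sum(1 for s in population if s < score) / len(population)

for name, func, level, score in resumes:
    pct = percentile(score, groups[(func, level)])
    print(f"{name} {func}/{level}: raw {score}, percentile {pct:.0f}")
```

Note that D (Marketing/Manager) is never ranked against A, B, or E (Sales/Manager), even though all four are "Managers".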
So this has to be a definite goal. This will make more sense to a recruiter (subscriber), and he will be willing to pay a premium price for such a feature.
And then we will do the same thing with:
- Salary Profiles
- Designation-Level Profiles
- Tenure (Length of each job) Profiles
- Education Profiles
- Age Profiles
- Experience (Total Yrs) Profiles
- Actual Designation Profiles (Vacancy/Position Name)
- etc., etc.
That brings me to the question of "Duplicate Resumes" (and Duplicate Profiles?).
Initially, when a subscriber is trying to process/extract 50,000 resumes lying on his PC, it is understandable that he may come across identical resumes (same first name / last name / DOB) and may delete one of these (obviously the later-processed one) as a duplicate.
However, if we (i.e., Gururaj) come across such a duplicate resume after a week, a month, or a year, then:
→ we must process/extract it, and
→ replace the old ImageBuilder with the latest ImageBuilder under the same PEN.
It is obvious why we must replace the old ImageBuilder with the new (latest) ImageBuilder.
Even if the latest email resume (of the same candidate) is identical with the old one (i.e., not a single word changed), when we re-process it:
→ the Raw Score may remain the same,
→ but the Percentile will most likely change.
Because, during the intervening period, a lot of resumes (belonging to that function) got added, changing the population/sample size! So his function profile will be different (even if the keywords remain the same)!
And it is this new/revised profile that Gumline should store and make available to both the subscriber (during search) and the candidate (during bounce-back).
Then there is a good probability that, in the latest resume, some keywords have also changed. In that case, both the Raw Score and the Percentile would change, while the PEN remains the same.
It is only with this alive, ever-changing profile (sample size / raw score / percentile) that we can claim our software is self-learning in the most dramatic possible way! A subscriber can see this (self-learning/adaptive) behaviour for himself by simply re-processing the same resume every week for a week or two and seeing a different Function Profile each time. Then he will never tire of talking about Gumline to his friends!
(Signed / dated: 13-11-03)