23/12/2001
- www.hnc.com/innovation_05/cortronics_050105
- www.hnc.com/innovation_05/neuralnet_050101
- www.zdnet.com.au/newstech/enterprise/story/0,2000025001,20262400,00.htm
- www.aaai.org/AITopics/html/current.html
- www.dsslab.com/resources/textanal.htm
- www.ece.ogi.edu/~storm/research/org.htm
Yoesh – Cyril 9/6/1998
ARDIS
While discussing the "Data-Capture & Query" Module (Module # 1) a few days back, we also talked about the "knowledge-base" already available with us. This knowledge-base has been acquired/created over the last 8 years.
The knowledge-base comprises English-language
- Words
- Phrases
- Sentences
- Paragraphs
As far as "words" are concerned, I myself worked on "categorising" them into different "categories". This was nearly 12 months ago, using the software tool "TELL ME" developed by Cyril.
In this connection, I enclose Annexes A/B/C/D.
(A) Under "TELL ME", I have already categorised over 15,000 words into some 60 different categories. Some of these are shown in Annex C.
(B) In addition, Cyril had developed another simple method, under which I could quickly categorise:
- P = Person's Name (name of a person)
- C = Company Name
- Q = Educational Qualification of an individual
- L = Name of a Location (mostly a city)
As far as these 4 categories (out of the 60-odd categories of words) are concerned, I have already covered:

Frequency      No. of words covered
> 100          7,056
51 - 100       3,913
26 - 50        5,880
11 - 25        13,397
Total          30,246

(see Annex A)
These are ISYS-Indexed words.
So, under both the tools combined (A + B), I might have already categorised over 30,000 words.
Over the last 5/6 weeks, we have already scanned/OCR'ed and created txt files of some 13,573 pages of bio-datas. And this population (of txt files) is growing at the rate of some 300 pages/day.
We talked about a simple software which will pick out all the words (except for "common" words) on each of these pages.
Then,
Compare each such word with the knowledge-base of 30,000 words which I have already "categorised".
If a "match" is found, the word is transferred to its respective "category" and marked "Known".
If there is "no match", the word gets tagged as "NEW" and gets highlighted on the txt file.
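A minimal sketch of this matching step, assuming the knowledge-base is held as a simple word-to-category dictionary (the category codes, stop-word list and sample data below are illustrative only, not part of the original note):

```python
# A minimal sketch of the matching step described above. The knowledge-base is
# assumed to be a dict of already-categorised words (e.g. the TELL ME output);
# the category codes, stop-word list and sample data are illustrative only.

import re

COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "at", "for", "with"}

def tag_page(page_text, knowledge_base):
    """Return (known, new) words for one OCR'ed txt page."""
    known, new = {}, set()
    for word in re.findall(r"[A-Za-z]+", page_text):
        w = word.lower()
        if w in COMMON_WORDS:
            continue                      # skip "common" words
        if w in knowledge_base:
            known[w] = knowledge_base[w]  # transferred to its category, marked "Known"
        else:
            new.add(w)                    # tagged "NEW", to be highlighted for a consultant
    return known, new

kb = {"honavar": "P", "infosys": "C", "mba": "Q", "kolhapur": "L"}   # tiny sample
known, new = tag_page("Mr. Honavar, MBA, worked at Infosys in Kolhapur and Pune.", kb)
print(known)   # {'honavar': 'P', 'mba': 'Q', 'infosys': 'C', 'kolhapur': 'L'}
print(new)     # {'mr', 'worked', 'pune'} -> candidates for "TELL ME"
```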
Now, anytime any consultant is viewing that page on the screen and comes across a word marked "NEW", whose "meaning/category" he knows, he will have a simple "tool" (on that very screen) with which he can go ahead and "categorise" that word. This tool could perhaps be "TELL ME".
We should debate whether we also give the “rights” to any
Consultant to “add” a new category itself.
It should be possible for any no. of “consultants” to work
on this tool, simultaneously, from their own individual work-stations, whenever
time permits or whenever they are “viewing” a txt page for any reason.
This arrangement would "multiply" the effort several times as compared to my doing it "single-handedly"!
Plus, it has the advantage of using the knowledge of several persons having different academic/experience backgrounds.
We could also consider hiring "experts" from different "functional" areas, to carry out this categorisation in a dedicated manner.
Now that we have 13,573 pages ready (for this simple "match-making" process), we could seriously consider "hiring" such "experts".
We could even take “text-books” on various SUBJECTS/CATEGORIES
and prepare an INVENTORY of all words appearing in each book and put them in
the SUBJECT category.
Many innovations are possible – if only we could make a
“beginning”. Such a beginning is possible now.
Let us give this a serious thought and discuss soon.
Regards,
Hemen Parekh
1/12/1996
ARDIS = Automatic Resume Deciphering Intelligence Software
ARGIS = Automatic Resume Generating Intelligence Software
What are these softwares?
What will they do?
How will they help us?
How will they help our clients / candidates?
ARDIS
This software will break up / dissect a Resume into its different constituents, such as:
A. Physical information (data) about a Candidate (Executive)
B. Academic information (data) about a Candidate (Executive)
C. Employment Record (Industry – Function – Products/Services wise)
D. Salary
E. Achievements / Contributions
F. Attitudes / Attributes / Skills / Knowledge
G. His preferences w.r.t. Industry / Function / Location
In fact, if every candidate were to fill in our EDS, the info would automatically fall into the "proper" slots/fields, since our EDS forces a candidate to "dissect" himself into various compartments.
But,
Getting every applicant/executive to fill in our standardised EDS is next to impossible, and may not even be necessary. Executives (who have already spent a lot of time and energy preparing/typing their biodatas) are most reluctant to sit down once more and spend a lot of time once again to furnish us the SAME information/data in the neatly arranged blocks of our EDS. For them, this duplication is a WASTE OF TIME! The EDS is designed for our (information-handling / processing / retrieving) convenience, and that is the way he perceives it! Even if he is vaguely conscious that this (filling in of the EDS) would help him in the long run, he does not see any immediate benefit from filling it in, hence he is reluctant to do so.
We too have a problem, a "Cost/Time/Effort" problem.
If we are receiving 100 biodatas each day (this should happen soon), whom do we send our EDS to, and whom do we NOT?
This can be decided only by a Senior Executive/Consultant who goes through each and every biodata daily and reaches a conclusion as to
- Which resumes are of "interest" and need to be sent an EDS
- Which resumes are marginal or not of immediate interest, where we need not spend the time/money/energy of sending an EDS.
We may not be able to employ a number of Senior/Competent Consultants who can scrutinise all incoming bio-datas and take this decision on a DAILY basis! That, by itself, would be a costly proposition.
So,
On the ONE HAND
- We have the time/cost/energy/effort of sending the EDS to everyone.
On the OTHER HAND
- We have the time/cost of several Senior Consultants to separate the "chaff" from the "wheat".
NEITHER IS DESIRABLE
But
From each biodata received daily, we still need to decipher, and drop into relevant slots/fields, the relevant data/information which would enable us to meet
OUR REQUIREMENTS / NEEDS
- Match a Candidate's profile with a "Client Requirement Profile" against specific requests
- Match a Candidate's profile against the hundreds of recruitment advertisements appearing daily in the media (Jobs BBS)
- Match a Candidate's profile against "specific vacancies" that any corporate (client or not) may "post" on our Vacancy bulletin-board (unadvertised vacancies)
- Match a Candidate's profile against the "most likely companies who are likely to hire/need such an executive", using our CORPORATE DATABASE, which will contain info such as the PRODUCTS/SERVICES of each and every company
- Convert each biodata received into a RECONSTITUTED BIO-DATA (converted bio-data), to enable us to send it out to any Client/Non-Client organisation at the click of a mouse
- Generate (for commercial/profitable exploitation) such by-product services as
  o Compensation Trends
  o Organisation Charts
  o Job Descriptions, etc. etc.
- Permit a Candidate to log into our database and remotely modify/alter his bio-data
- Permit a Client (or a non-client) to log into our database and remotely conduct a SEARCH.
ARDIS is required on the assumption that, for a long time to come, "typed" bio-datas would form a major source of our database.
Other sources, such as
- Duly filled-in EDS (hard copy)
- EDS on a floppy
- Downloading the EDS over the Internet (or dial-up phone lines) and uploading it after filling in (like Intellimatch)
will continue to play a minor role in the foreseeable future.
HOW WILL ARDIS WORK?
(Flowchart outline from the original note: convert scanned characters to English words by comparison; compare with the key-words stored in a WORD-DIRECTORY of the most frequently used words in the 3,500 converted bio-datas (ISYS analysis); combine the most commonly used verbs/adverbs/adjectives/prepositions with each KEY-PHRASE, to create a directory of KEY-SENTENCES.)
To recapitulate, ARDIS will
- Recognise "characters"
- Convert them to "WORDS"
- Compare each word with the 6,258 key-words which we have found in 3,500 converted bio-datas (using ISYS). If a "word" has not already appeared (> 10 times) in these 3,500 bio-datas, then its chance (probability) of occurring in the next biodata is very, very small indeed.
But even then, the ARDIS software will store in memory each "occurrence" of each WORD (old or new, first time or a thousandth time) and will continuously calculate its "probability of occurrence" as

P(word) = (No. of occurrences of the given word so far) / (Total no. of occurrences of all the words in the entire population so far)
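A small sketch of this running calculation; the word extraction and storage details below are assumptions, since the note does not specify them:

```python
# A running "probability of occurrence", updated as each new bio-data is read.
# Word extraction and storage details below are assumptions, not part of the note.

import re
from collections import Counter

word_counts = Counter()     # occurrences of each word seen so far
total_occurrences = 0       # occurrences of all words in the entire population so far

def ingest_biodata(text):
    """Store every occurrence of every word (old or new) and keep the totals current."""
    global total_occurrences
    words = re.findall(r"[A-Za-z]+", text.lower())
    word_counts.update(words)
    total_occurrences += len(words)

def probability(word):
    """P = occurrences of the given word so far / occurrences of all words so far."""
    return word_counts[word.lower()] / total_occurrences if total_occurrences else 0.0

ingest_biodata("Major achievement attained: planned and attained the sales targets.")
print(round(probability("attained"), 3))   # gets more accurate with every biodata scanned
```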
So that,
- By the time we have SCANNED 10,000 bio-datas, we would have literally covered all words that have even a small PROBABILITY of OCCURRENCE!
So, with each new bio-data "scanned", the probability of occurrence of each "word" gets more and more accurate!
The same logic will hold for
- KEY PHRASES
- KEY SENTENCES
The "name of the game" is
- PROBABILITY OF OCCURRENCE
As someone once said, if you allow 1000 monkeys to keep on hammering the keys of 1000 typewriters for 1000 years, you will, at the end, find that between them they have reproduced the entire literary works of Shakespeare!
But today, if you store into a Super-Computer
- All the words appearing in the English language (incl. verbs/adverbs/adjectives etc.)
- The "logic" behind the construction of the English language
then I am sure the Super-Computer could reproduce the entire works of Shakespeare in 3 months!
And, as you would have noticed, ARDIS is a self-learning type of software. The more it reads (scans), the more it learns (memorises words, phrases and even sentences).
Because of this SELF-LEARNING / SELF-CORRECTING / SELF-IMPROVING capability, ARDIS gets better and better equipped to detect, in a scanned biodata:
- Spelling mistakes (wrong word)
- Context mistakes (wrong prefix or suffix) – wrong PHRASE
- Preposition mistakes (wrong phrase)
- Adverb/Verb mistakes – wrong SENTENCE
With minor variations, all thoughts, words (written), speech (spoken) and actions keep on repeating again and again and again.
It is this REPETITIVENESS of words, phrases and sentences in resumes that we plan to exploit.
In fact, by examining and memorising the several hundred (or thousand) "sequences" in which the words appear, it should be possible to "construct" the "grammar", i.e. the logic behind the sequences. I suppose this is the manner in which the experts were able to unravel the "meaning" of hieroglyphic inscriptions on Egyptian tombs. They learned a completely strange/obscure language by studying the "repetitive" and "sequential" occurrence of unknown characters.
How do we build directories of "phrases"?
From the 6,252 words, let us pick any word, say
ACHIEVEMENT
WORD = ACHIEVEMENT
Now we ask the software to scan the directory containing the 3,500 converted bio-datas, with the instruction that every time the word "Achievement" is spotted, the software will immediately spot/record the prefix. The software will record all the words that appeared before "Achievement", as also the "number of times" each of these prefixes appeared.
e.g.
Prefix            No. of occurrences    Probability
1. Major               10                10/55
2. Minor                9                 9/55
3. Significant          8                 8/55
4. Relevant             7                 7/55
5. True                 6                 6/55
6. Factual              5                 5/55
7. My                   4                 4/55
8. Typical              3                 3/55
9. Collective           2                 2/55
10. Approximate         1                 1/55

Total no. of occurrences (population size) = 55; the probabilities sum to 1.0000.
As more and more bio-datas are scanned:
- The number of "prefixes" will go on increasing
- The number of "occurrences" of each prefix will also go on increasing
- The overall "population size" will also go on increasing
- The "probability of occurrence" of each prefix will go on getting more and more accurate, i.e. more and more representative.
This process can go on and on and on (as long as we keep on scanning bio-datas). But the "accuracy improvements" will decline/taper off once a sufficiently large number of prefixes (to the word "ACHIEVEMENT") have been accumulated. Saturation takes place.
The whole process can be repeated with the words that appear
as “SUFFIXES” to the word ACHIEVEMENT, and the probability of occurrence of
each Suffix also determined.
Suffix            No. of occurrences    Probability
1. Attained            20                20/54
2. Reached             15                15/54
3. Planned             10                10/54
4. Targeted             5                 5/54
5. Arrived              3                 3/54
6. Recorded             1                 1/54

Total no. of occurrences (population size) = 54; the probabilities sum to 1.000.
Having figured out the "probabilities of occurrence" of each of the prefixes and each of the suffixes (to a given word, in this case "ACHIEVEMENT"), we could next tackle the issue of "a given combination of prefix and suffix",
e.g. what is the probability of
- "major" (prefix) ACHIEVEMENT "attained" (suffix)?
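A sketch of how the prefix/suffix tables and the combination probability could be computed; treating the prefix and the suffix as independent events in the combination step is my assumption, and the two sample sentences are illustrative only:

```python
# A sketch of the prefix/suffix bookkeeping for one key word (here "achievement").
# The independence assumption in the combination step and the sample data are mine.

import re
from collections import Counter

def prefix_suffix_tables(texts, key_word="achievement"):
    prefixes, suffixes = Counter(), Counter()
    for text in texts:
        words = re.findall(r"[A-Za-z]+", text.lower())
        for i, w in enumerate(words):
            if w == key_word:
                if i > 0:
                    prefixes[words[i - 1]] += 1    # the word appearing just before
                if i + 1 < len(words):
                    suffixes[words[i + 1]] += 1    # the word appearing just after
    return prefixes, suffixes

def combination_probability(prefixes, suffixes, prefix, suffix):
    """P(prefix KEY-WORD suffix), e.g. "major ACHIEVEMENT attained"."""
    p_pre = prefixes[prefix] / sum(prefixes.values())
    p_suf = suffixes[suffix] / sum(suffixes.values())
    return p_pre * p_suf                           # independence assumed

pre, suf = prefix_suffix_tables(["Major achievement attained in exports.",
                                 "Significant achievement recorded last year."])
print(pre)                                                      # Counter({'major': 1, 'significant': 1})
print(combination_probability(pre, suf, "major", "attained"))   # 0.25
```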
Why is all of this statistical exercise required?
If we wish to stop at merely deciphering a resume, then I don't think we need to go through all this.
For mere "deciphering", all we need is to create a KNOWLEDGE-BASE of
- Skills
- Functions
- Knowledge
- Edu. Qualifications
- Attitudes
- Products/Services
- Attributes
- Names
- Industries
- Companies, etc. etc.
Having created the knowledge-base, simply scan a bio-data, recognise the words, compare them with the knowledge-base, find CORRESPONDENCE/EQUIVALENCE and allot/file each scanned word into the respective "fields" against each PEN (Permanent Executive No.).
PRESTO!
You have dissected and stored the MAN in appropriate boxes.
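A sketch of this scan-compare-allot step, assuming the knowledge-base maps each word to a field name and each candidate record is keyed by a PEN; the field names, entries and PEN value are hypothetical examples:

```python
# A sketch of the "scan, compare, allot into fields" step. The field names, the
# knowledge-base entries and the PEN value are hypothetical examples.

import re
from collections import defaultdict

KNOWLEDGE_BASE = {                 # word -> field, built up manually over time
    "mba": "Edu. Qualification",
    "marketing": "Function",
    "cement": "Product",
    "pune": "Location",
}

def decipher(pen, biodata_text):
    """File each recognised word into its field, against the candidate's PEN."""
    record = defaultdict(set)
    record["PEN"] = pen
    for word in re.findall(r"[A-Za-z]+", biodata_text.lower()):
        field = KNOWLEDGE_BASE.get(word)
        if field:
            record[field].add(word)    # allot the word to its respective field
    return dict(record)

print(decipher("PEN-04711", "MBA (Marketing), 8 years in the cement industry, based in Pune."))
```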
Our EDS has these "boxes"; the problem is manual data-entry.
The D/E operator searches for the appropriate "word" in the appropriate "EDS box" and transfers it to the appropriate screen.
To eliminate this manual (time-consuming) operation, we need ARDIS.
We already have a DATABASE of 6,500 words.
All we need to do is to write down against each word whether it is a
- Skill
- Attribute
- Knowledge
- Location
- Edu. Qualification
- Industry
- Product
- Function
- Company, etc. etc.
The moment we do this, what was a mere "database" becomes a "KNOWLEDGE-BASE", ready to serve as a "COMPARATOR".
And as each new bio-data is scanned, it will throw up words for which there is no "clue". Each such new word will have to be manually "categorised" and added to the knowledge-base.
Then what is the advantage of calculating, for each word, and for
- each prefix
- each suffix
- each phrase
- each sentence
its probability of occurrence?
The advantages are:
# 1
Detect an "unlikely" prefix/suffix.
Suppose ARDIS detects "Manor Achievement".
ARDIS detects that the probability of
- "Manor" as a prefix to "Achievement" is NIL
- "Minor" as a prefix to "Achievement" is 0.00009 (say, nil)
Hence the correct prefix has to be
- "Major" (and not "Manor"), for which the probability is, say, 0.4056.
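A sketch of how such an "unlikely prefix" could be corrected: among known prefixes that are close in spelling to the OCR'ed word, pick the one with the highest probability of occurrence. The spelling-similarity test (difflib) and the probability figures are my assumptions, not something the note prescribes:

```python
# A sketch of the "unlikely prefix" correction: among known prefixes that are
# close in spelling to the OCR'ed word, pick the one with the highest probability
# of occurrence. The difflib similarity test and the numbers are assumptions.

import difflib

PREFIX_PROB = {"major": 0.4056, "minor": 0.00009, "significant": 0.15}   # illustrative

def correct_prefix(observed, cutoff=0.6):
    candidates = difflib.get_close_matches(observed.lower(), list(PREFIX_PROB), n=5, cutoff=cutoff)
    if not candidates:
        return observed                           # nothing plausible; leave it to the operator
    return max(candidates, key=PREFIX_PROB.get)   # the most probable of the plausible spellings

print(correct_prefix("Manor"))   # -> 'major', since "manor" itself has NIL probability
```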
# 2
ARDIS detects "Mr. HANOVAR".
It recognises this as a spelling mistake and corrects it automatically to "Mr. HONAVAR".
OR
It reads:
Place of Birth: KOLHAPURE
It recognises this as "KOLHAPUR".
Or, vice-versa, if it says: my name is KOLHAPUR.
# 3
Today, while scanning (using OCR), when a mistake is detected, it gets highlighted on the screen, or an asterisk/underline starts blinking.
This draws the attention of the operator, who manually corrects the "mistake" after consulting a dictionary or his knowledge-base.
Once ARDIS has calculated the probabilities of lakhs of words, and even the probabilities of their "most likely sequences of occurrence", then hopefully the OCR can self-correct any word or phrase without operator intervention.
So the scanning accuracy of OCR should eventually approach 100%, and not the 75% - 85% of present.
# 4
Eventually, we want that
- a bio-data is scanned
and automatically
- reconstitutes itself into our converted BIO-DATA FORMAT.
This is the concept of ARGIS (Automatic Resume Generating Intelligence Software).
Here again, the idea is to eliminate the manual data-entry of the entire bio-data, our ultimate goal.
But ARGIS is not possible without first installing ARDIS, and that too with the calculation of the "probability of occurrence" as the main feature of the software.
By studying, memorising and calculating the "probabilities of occurrence" of lakhs of words/phrases/sentences, ARDIS actually learns English grammar through "frequency of usage".
And it is this KNOWLEDGE-BASE which enables ARGIS to reconstitute a bio-data (in our format) in a GRAMMATICALLY CORRECT WAY.
1/12/1996
24/11/1996
BASIS FOR A WORD-RECOGNITION SOFTWARE
Any given word (a cluster of characters) can be classified (in English) into one of the following "categories":
- Verb
- Adverb
- Preposition
- Adjective
- Noun (Common Noun / Proper Noun)
So the first task is to create a "directory" for each of these categories. Then each "word" must be compared to the words contained in a given directory. If a match occurs, then that WORD would get categorised as belonging to that category. The process has to be repeated again and again, trying to match the word with the words contained in each of the categories, TILL a match is found. If no "match" is found, that word should be separately stored in a file marked "UNMATCHED WORDS". Every day, an expert would study all the words contained in this file and assign each of these words a definite category, using his "HUMAN INTELLIGENCE". In this way, over a period of time, human intelligence will identify/categorise each and every word contained in the ENGLISH LANGUAGE. This will be the process of transferring human intelligence to the computer.
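A sketch of this directory-matching loop, with an "UNMATCHED WORDS" file for the daily expert review; the category directories, sample words and file name are illustrative assumptions:

```python
# A sketch of the directory-matching loop described above, with an
# "UNMATCHED WORDS" file for the daily expert review. The directories, sample
# words and file name are illustrative assumptions.

CATEGORY_DIRECTORIES = {
    "Verb": {"managed", "achieved", "planned"},
    "Adverb": {"successfully", "rapidly"},
    "Preposition": {"in", "at", "with"},
    "Adjective": {"major", "significant"},
    "Common Noun": {"company", "factory"},
    "Proper Noun": {"pune", "honavar"},
}

def categorise(word, unmatched_path="unmatched_words.txt"):
    """Try each category directory in turn; log the word for an expert if no match."""
    w = word.lower()
    for category, directory in CATEGORY_DIRECTORIES.items():
        if w in directory:
            return category                          # match found
    with open(unmatched_path, "a", encoding="utf-8") as f:
        f.write(word + "\n")                         # left for the human expert to assign
    return None

print(categorise("managed"))   # 'Verb'
print(categorise("turbine"))   # None -> appended to unmatched_words.txt
```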
Essentially the trick lies in getting the computer (software) to MIMIC the process followed by a human brain while scanning a set of words (i.e. reading) and, by analysing the "sequence" in which these words are arranged, to assign a MEANING to each word or a string of words (a phrase or a sentence).
I cannot believe that no one has attempted this before (especially since it has so much commercial value). But we don't know who has developed such software or where to find it, so we may end up reinventing the wheel!
Our computer files contain some 900,000 words which have repeatedly occurred in our records, mostly converted bio-datas or words captured from bio-datas.
We have, in our files, some 3,500 converted bio-datas. It has taken us about 6 years to accomplish this feat, i.e.
- approx. 600 converted biodatas/year, or
- approx. 2 biodatas converted every working day!
Assuming that all those (converted) bio-datas which are older than 2 years are OBSOLETE, this means that perhaps no more than 1,200 are current/valid/useful!
So, one thing becomes clear.
The "rate of obsolescence" is faster than the "rate of conversion"!
Of course, we can argue:
"Why should we waste/spend our time in "converting" a bio-data? All we need to do is to capture the ESSENTIAL/MINIMUM DATA (from each biodata) which would qualify that person to get searched/spotted. If he gets short-listed, we can always, at that point of time, spend the time/effort to fully convert his bio-data."
In fact, this is what we have done so far, because there was a premium on the time of the data-entry operators. That time was best utilised in capturing the essential/minimum data.
But if the latest technology permits/enables us to convert 200 biodatas each day (instead of just 2) with the same effort/time/cost, then why not convert 200? Why be satisfied with just 2/day?
If this can be made to "happen", we would be in a position to send out / fax out / e-mail converted bio-datas to our clients in a matter of "minutes" instead of the "days" it takes today!
That is not all.
A converted bio-data has far more KEYWORDS (knowledge, skills, attributes, attitudes, etc.) than the MINIMUM DATA. So there is an improved chance of spotting the RIGHT MAN, using a QUERY which contains a large no. of KEYWORDS.
So, today, if the client "likes" only ONE converted bio-data out of the TEN sent to him (a huge waste of everybody's time/effort), then under the new situation he should be able to "like" 4 out of every 5 converted bio-datas sent to him!
This would vastly improve the chance of at least ONE executive getting appointed in each assignment. This should be our goal.
This goal can be achieved only if
Step # 1 – each biodata received every day is "scanned" on the same day
Step # 2 – converted to TEXT (ASCII)
Step # 3 – given a PEN serially
Step # 4 – WORD-RECOGNISED (a step beyond OCR, i.e. optical CHARACTER recognition)
Step # 5 – each word "categorised", indexed and stored in the appropriate FIELDS of the DATABASE
Step # 6 – the database "reconstituted" to create a "converted" biodata as per our standard format
Steps # 1/2/3 are not difficult.
Step # 4 is difficult.
Step # 5 is more difficult.
Step # 6 is most difficult.
But if we keep working on this problem, it can be solved:
50% accurate in 3 months
70% accurate in 6 months
90% accurate in 12 months
Even though there are about 900,000 indexed WORDS in our ISYS file, all of these do not occur (in a biodata/record) with the same frequency. Some occur far more frequently, some frequently, some regularly, some occasionally and some rarely.
Then, of course, (in the English language) there must be thousands of other words which have not occurred EVEN ONCE in any of the biodatas. Therefore, we won't find them amongst the existing indexed file of 900,000 words.
It is quite possible that some of these (so far missing) words may occur if this file (of words) were to grow to 2 million.
As this file of words grows and grows, the probabilities of
- a word having been left out, and
- such a left-out word being likely to occur (in the next biodata)
are "decreasing".
The frequency-distribution curve might look as follows:
(Hand-drawn curve omitted; x-axis marked 10 to 100 = % of words in the English language, or in ISYS – 900,000.)
Meaning:
Some 20% of the words (in the English language) make up maybe 90% of all the "occurrences".
This would become clear when we plot the frequency-distribution curve of the 900,000 words which we have already indexed.
And even when this population grows to 2 million, the shape (the nature) of the frequency-distribution curve is NOT likely to change! Only, with a much larger WORD-POPULATION, the "accuracy" will marginally increase.
So our search is to find:
Which are these 20% (20% x 9 lakh = 180,000) words which make up 90% of the "area under the curve", i.e. of the POPULATION?
Then we focus our efforts on "categorising" these 180,000 words in the first place.
If we manage to do this, 90% of our battle is won.
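A sketch of how that 20%/90% question could be answered from the indexed word counts; the sample counts below are illustrative, not the actual ISYS figures:

```python
# A sketch of the coverage question above: how many of the most frequent words
# are needed to account for 90% of all occurrences. The sample counts are
# illustrative; in practice they would come from the ISYS index of ~900,000 words.

from collections import Counter

def words_covering(counts, coverage=0.90):
    """Return the smallest head of the frequency distribution reaching the target coverage."""
    total = sum(counts.values())
    chosen, covered = [], 0
    for word, n in counts.most_common():
        chosen.append(word)
        covered += n
        if covered / total >= coverage:
            break
    return chosen

sample = Counter({"manager": 500, "sales": 300, "mba": 120, "turbine": 50, "kolhapur": 30})
top = words_covering(sample)
print(len(top), top)   # the short "head" of the distribution carries most of the occurrences
```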
Of course this pre-supposes that before we can attempt
“categorization”, we must be able to recognise each of them as a “WORD”.