Hi Friends,

Even as I launch this today (my 80th Birthday), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now, as I approach my 90th birthday (27 June 2023), I invite you to visit my Digital Avatar (www.hemenparekh.ai) and continue chatting with me, even when I am no longer here physically.

Tuesday, 7 February 2023

ARDIS ARGIS FILE

 

                                                                                                                                                                    23/12/2001

 

-          www.hnc.com/innovation_05/cortronics_050105

-          www.hnc.com/innovation_05/neuralnet_050101

-          www.zdnet.com.au/newstech/enterprise/story/0,2000025001,20262400,00.htm

-          www.aaai.org/AITopics/html/current.html

-          www.dsslab.com/resources/textanal.htm

-          www.ece.ogi.edu/~storm/research/org.htm


Yoesh – Cyril                                                                                                                                       9/6/1998

 

ARDIS

 

While discussing the "data-capture & Query" Module (Module # 1) a few days back, we also talked about the "knowledge-base" already available with us. This knowledge-base has been acquired/created over the last 8 years.

 

The knowledge-base comprises English-language

-          Words

-          Phrases

-          Sentences

-          Paragraphs

As far as "words" are concerned, I myself worked on "categorising" them into different "categories". This was nearly 12 months ago, using the software tool "TELL ME" developed by Cyril.

In this connection, I enclose Annexes A/B/C/D.

 

(A)    Under "TELL ME", I have already categorised over 15,000 words into some 60 different categories. Some of these are shown in Annex C.

(B)    In addition, Cyril had developed another simple method, under which I could quickly categorise:

P=  Persons Name (Name of a Person)

C=  Company Name

Q=  Edu. Quali. of an individual

L=  Name of a location (mostly a City)

 

As far as these 4 categories (out of the 60-odd categories of words) are concerned, I have already covered:

 

Frequency          No. of words covered
>100               7,056
51-100             3,913
26-50              5,880
11-25              13,397
Total              30,246

(see Annex A)

These are ISYS-Indexed words.

 

 

So,

under both tools combined (A + B), I might have already categorised over 30,000 words.

Over the last 5/6 weeks, we have already scanned, OCR'ed and created txt files of some 13,573 pages of bio-datas. And this population (of txt files) is growing at the rate of some 300 pages/day.

 

We talked about a simple software which will pick out all the words (except for "common" words) on each of these pages.

 

Then,

 

Compare each such word with the knowledge-base of 30000 words which I have already “categorised”.

 

If a “match” is found, the word is transferred to respective “category” & marked “Known”.

 

If there is “no match”, the word gets tagged as “NEW” and gets highlighted on the txt file.
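
A minimal sketch (in Python) of this match / no-match step, assuming the 30,000 categorised words are kept as a simple word-to-category mapping and the "common" words as a plain list; the file layout and the tokenisation here are illustrative assumptions, not part of the original design:

import re

def load_knowledge_base(path):
    # Each line is assumed to look like: word<TAB>category  (illustrative format)
    kb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, _, category = line.strip().partition("\t")
            if word and category:
                kb[word.lower()] = category
    return kb

def tag_page(page_text, kb, common_words):
    # Returns (known, new): 'known' maps each matched word to its category,
    # 'new' is the set of words to be highlighted as "NEW" on the txt page.
    known, new = {}, set()
    for word in set(re.findall(r"[A-Za-z][A-Za-z'-]*", page_text)):
        lw = word.lower()
        if lw in common_words:
            continue                      # skip "common" words
        if lw in kb:
            known[word] = kb[lw]          # "match" found: mark as Known
        else:
            new.add(word)                 # no match: tag as NEW
    return known, new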

 

Now, any time a consultant viewing that page on the screen comes across a word marked "NEW" whose meaning/category he knows, he will have a simple "tool" (on that very screen) with which he can go ahead and "categorise" that word. This tool could perhaps be "TELL ME".

 

We should debate whether we also give the “rights” to any Consultant to “add” a new category itself.

 

It should be possible for any number of "consultants" to work on this tool simultaneously, from their own individual work-stations, whenever time permits or whenever they are "viewing" a txt page for any reason.

 

This arrangement would "multiply" the effort several times as compared to my doing it "single-handedly"!

+

It has the advantage of using the knowledge of several persons having different academic/experience backgrounds.

 

We could also consider hiring "experts" from different "functional" areas, to carry out this categorisation in a dedicated manner.

Now that we have 13573 pages ready (for this simple “match-making” process), we could seriously consider “hiring” such “experts”.

 

We could even take “text-books” on various SUBJECTS/CATEGORIES and prepare an INVENTORY of all words appearing in each book and put them in the SUBJECT category.

 

Many innovations are possible – if only we could make a “beginning”. Such a beginning is possible now.

 

Let us give this a serious thought and discuss soon.

 


Regards,

 

Hemen Parekh


                                                                                                                                                                                1/12/1996

ARDIS = Automatic Resume Deciphering Intelligence Software

 

ARGIS = Automatic Resume Generating Intelligence Software

 

What are these Softwares?

What will they do?

How will they help us?

How will they help our Clients / Candidates?

 

ARDIS

 

-          This software will break-up/dissect a Resume into its different Constituents, such as

A.      Physical information (data) about a Candidate (Executive)

B.      Academic                                                            

C.      Employment Record (Industry – Function – Products/Services wise)

D.      Salary

E.       Achievements / Contributions

F.       Attitudes/Attributes/Skills/Knowledge

G.      His preferences w.r.t. Industry/Function/Location

In fact, if every candidate were to fill in our EDS, the info would automatically fall into the "proper" slots/fields, since our EDS forces a candidate to "dissect" himself into various compartments.

 

But,

Getting every applicant/executive to fill in our standardised EDS is next to impossible, and may not even be necessary. Executives (who have already spent a lot of time and energy preparing/typing their biodatas) are most reluctant to sit down once more and spend a lot of time again furnishing us the SAME information/data in the neatly arranged blocks of our EDS. For them, this duplication is a WASTE OF TIME! The EDS is designed for our (information-handling/processing/retrieving) convenience, and that is the way he perceives it! Even if he is vaguely conscious that filling in the EDS would help him in the long run, he does not see any immediate benefit from doing so, and hence is reluctant.

We too have a problem – a “Cost/Time/Effort” problem.

If we are receiving 100 biodatas each day (this should happen soon), to whom should we send our EDS and to whom should we NOT?

This can be decided only by a Senior Executive/Consultant who goes thru each & every biodata daily and reaches a conclusion as to

-          which resumes are of "interest" & warrant sending an EDS

-          which resumes are marginal or not of immediate interest, where we need not spend the time/money/energy of sending an EDS.

We may not be able to employ a number of Senior/Competent Consultants who can scrutinise all incoming bio-datas and take this decision on a DAILY basis! That, in itself, would be a costly proposition.

 

So,

On ONE HAND

-          We have time/cost/energy/effort of Sending EDS to everyone

On OTHER HAND

-          We have the time/cost of several Senior Consultants to separate out the "chaff" from the "wheat".

                NEITHER IS DESIRABLE

But

From each biodata received daily, we still need to decipher relevant data/information and drop it into the relevant slots/fields, which would enable us to meet:

OUR REQUIREMENTS / NEEDS

 
                                                                               

 


-          Match a Candidate’s profile with “Client Requirement Profile” against specific requests

-          Match a Candidate's profile against hundreds of recruitment advertisements appearing daily in the media (Jobs BBS)

-          Match a Candidate’s profile against “specific Vacancies” that any corporate (client or not) may “post” on our Vacancy bulletin-board (unadvertised Vacancies).

-          Match a candidate’s profile against “Most likely Companies who are likely to hire/need such an executive” using our CORPORATE DATA BASE, which will contain info. Such as

     PRODUCTS/SERVICES of each & every Company

-          Convert each biodata received into a RECONSTITUTED BIO-DATA (converted bio-data), to enable us to send it out to any Client/Non-Client organisation at the click of mouse.

-          Generate (for commercial/profitable exploitation) such by-product services as

o   Compensation Trends

o   Organisation charts

o   Job Descriptions    etc. etc.

-          Permit a Candidate to log-into our database and remotely modify/alter his bio-data

-          Permit a client (or a non-client) to log into our database and remotely conduct a SEARCH.

 

 

 

ARDIS is required on the assumption that for a long time to come “typed” bio-datas would form a major source of our database.

Other sources, such as

-          Duly filled-in EDS (hard-copy)

-          EDS on a floppy

-          Downloading EDS over Internet (or Dial-up phone lines) & uploading after filling-in (like Intellimatch)

will continue to play a minor role in the foreseeable future.

 

 

 

[Flowchart: HOW WILL ARDIS WORK? Typed BIO-DATAS → Convert to English characters (by comparison) → Compare with Key-words stored in the WORD-DIRECTORY of most frequently used words in 3,500 converted bio-datas (ISYS analysis) → Record the most commonly used VERBS/ADVERBS/ADJECTIVES/PREPOSITIONS with each KEY-PHRASE, to create a directory of KEY-SENTENCES.]


To recapitulate

 

ARDIS will,

-          Recognise “characters”

-          Convert to “WORDS”

-          Compare with 6258 Key-words which we have found in 3500 converted bio-datas (using ISYS). If a “word” has not already appeared (> 10 times) in these 3500 bio-datas, then its chance (probability) of occurring in the next biodata is very very small indeed.

But even then,

the ARDIS software will store in memory each "occurrence" of each WORD (old or new, first time or a thousandth time)

and

will continuously calculate its “probability of occurrence” as

 

P = (No. of occurrences of the given word so far) / (Total no. of occurrences of all the words in the entire population so far)

So that,

-          By the time we have SCANNED 10,000 bio-datas, we would have literally covered all words that have even a small PROBABILITY of OCCURRENCE!

 

So with each new bio-data "scanned", the probability of occurrence of each "word" gets more & more accurate!
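
A minimal sketch (in Python) of this running calculation; the tokenisation and the class shape are illustrative assumptions:

import re
from collections import Counter

class OccurrenceTracker:
    # P(word) = occurrences of the word so far / occurrences of all words so far,
    # updated as every new bio-data txt file is scanned.
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def scan(self, page_text):
        for word in re.findall(r"[A-Za-z][A-Za-z'-]*", page_text.lower()):
            self.counts[word] += 1
            self.total += 1

    def probability(self, word):
        return self.counts[word.lower()] / self.total if self.total else 0.0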

Same logic will hold for

-          KEY PHRASES

-          KEY SENTENCES

The “name of the game” is

-          PROBABILITY OF OCCURRENCE

As someone once said,

If you allow 1000 monkeys to keep on hammering the keys of 1000 typewriters for 1000 years, you will, at the end, find that between them they have reproduced the entire literary works of Shakespeare!

But to-day, if you store into a Super-Computer,

-          All the words appearing in English language (incl- verbs/adverbs/adj. etc.)

-          The “logic” behind Construction of English language

Then,

I am sure the Super-Computer could reproduce the entire works of Shakespeare in 3 months !

And, as you would have noticed, ARDIS is a self-learning type of software. The more it reads (scans), the more it learns (memorises words, phrases & even sentences).

Because of this SELF-LEARNING / SELF – CORRECTING/ SELF – IMPROVING Capability,

ARDIS gets better & better equipped, to detect, in a scanned biodata

-          Spelling mistakes            (wrong word)

-          Context mistakes            ( wrong prefix or suffix) – wrong PHRASE

-          Preposition mistakes      (wrong phrase)

-          Adverb/Verb mistakes  - wrong SENTENCE

With minor variations,

All thoughts, words (written), speech (spoken) and actions keep on repeating again and again and again.

It is this REPETITIVENESS of words, phrases & sentences in resumes that we plan to exploit.

In fact,

By examining & memorising the several hundred (or thousand) "sequences" in which the words appear, it should be possible to "construct" the "grammar", i.e. the logic behind the sequences. I suppose this is the manner in which the experts were able to unravel the "meaning" of hieroglyphic inscriptions on Egyptian tombs. They learned a completely strange/obscure language by studying the "repetitive" & "sequential" occurrence of unknown characters.

 

How to build directories of “phrases” ?

From the 6252 words, let us pick any word, say

ACHIEVEMENT

 
 


WORD = ACHIEVEMENT

 
Now we ask the software to scan the directory containing the 3500 converted bio-datas, with the instruction that every time the word "Achievement" is spotted, the software will immediately record the prefix. The software will record all the words that appeared before "Achievement", as also the number of times each of these prefixes appeared.

 

 

 

 


e.g.

       Prefix            No. of occurrences     Probability
 1     Major             10                     10/55
 2     Minor              9                      9/55
 3     Significant        8                      8/55
 4     Relevant           7                      7/55
 5     True               6                      6/55
 6     Factual            5                      5/55
 7     My                 4                      4/55
 8     Typical            3                      3/55
 9     Collective         2                      2/55
10     Approximate        1                      1/55

       Total no. of occurrences (population size) = 55; the probabilities sum to 1.0000

As more & more bio-datas are Scanned,

-          The number of “prefixes” will go on increasing

-          The number of “occurrences” of each prefix will also go on increasing

-          The overall “population – size “ will also go on increasing

-          The “probability of occurrence” of each prefix will go on getting more & more accurate i.e. more & more representative.

This process can go on & on & on (as long as we keep on scanning bio-datas). But the "accuracy-improvements" will decline/taper off once a sufficiently large number of prefixes (to the word "ACHIEVEMENT") have been accumulated. Saturation takes place.

The whole process can be repeated with the words that appear as “SUFFIXES” to the word ACHIEVEMENT, and the probability of occurrence of each Suffix also determined.
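
A minimal sketch (in Python) of this prefix/suffix tally for a chosen key-word, over a collection of converted bio-data texts; the tokenisation is an illustrative assumption:

import re
from collections import Counter

def prefix_suffix_probabilities(texts, keyword="achievement"):
    # Count the word immediately before (prefix) and immediately after (suffix)
    # every occurrence of the key-word, then turn the counts into probabilities.
    prefixes, suffixes = Counter(), Counter()
    for text in texts:
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", text.lower())
        for i, w in enumerate(words):
            if w != keyword:
                continue
            if i > 0:
                prefixes[words[i - 1]] += 1
            if i + 1 < len(words):
                suffixes[words[i + 1]] += 1
    p_prefix = {w: c / sum(prefixes.values()) for w, c in prefixes.items()}
    p_suffix = {w: c / sum(suffixes.values()) for w, c in suffixes.items()}
    return p_prefix, p_suffix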

 

 

 

 

 


       Suffix            No. of occurrences     Probability
 1     Attained          20                     20/54
 2     Reached           15                     15/54
 3     Planned           10                     10/54
 4     Targeted           5                      5/54
 5     Arrived            3                      3/54
 6     Recorded           1                      1/54

       Total no. of occurrences (population size) = 54; the probabilities sum to 1.000

 

Having figured out the "probabilities of occurrence" of each of the prefixes and each of the suffixes (to a given word – in this case "ACHIEVEMENT"), we could next tackle the issue of a given combination of prefix & suffix,

e.g.    what is the probability of

-          "major" (prefix)  ACHIEVEMENT  "attained" (suffix)?

Why is all of this statistical exercise required ?

If we wish to stop at merely deciphering a resume, then I don’t think we need to go thru this.

 

For mere “deciphering”, all we need is to create a

KNOWLEDGE-BASE

Of

-          Skills
-          Knowledge
-          Attitudes
-          Attributes
-          Industries
-          Companies
-          Functions
-          Edu. Qualifications
-          Products/Services
-          Names                     etc., etc.

 

Having created the knowledge-base, simply scan a bio-data, recognise the words, compare them with the knowledge-base, find CORRESPONDENCE/EQUIVALENCE and allot/file each scanned word into its respective "field" against each PEN (Permanent Executive No.).
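
A minimal sketch (in Python) of this "decipher and file" step; the category names, the shape of the PEN record and the tokenisation are illustrative assumptions:

import re

def decipher_biodata(pen, page_text, kb, common_words):
    # kb maps word -> category ("Skill", "Industry", "Edu. Qualification", ...).
    # Every recognised word is allotted to its category field against the PEN.
    fields = {}
    for word in set(re.findall(r"[A-Za-z][A-Za-z'-]*", page_text.lower())):
        if word in common_words:
            continue
        category = kb.get(word)
        if category is not None:
            fields.setdefault(category, set()).add(word)
    return {"PEN": pen, "fields": fields}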

 

PRESTO !

You have dissected & stored the MAN in appropriate boxes.

Our EDS has these "boxes"; the problem is manual data-entry.

The D/E operator

-          searches for the appropriate "word" in the appropriate "EDS box" and transfers it to the appropriate screen.

To eliminate this manual (time-consuming) operation, we need ARDIS.

We already have a DATA-BASE of 6500 words.

All we need to do is to write down, against each word, whether it is a

-          Skill
-          Attribute
-          Knowledge
-          Edu.
-          Product
-          Company
-          Location
-          Industry
-          Function                 etc., etc.

The moment we do this, what was a mere "data-base" becomes a "KNOWLEDGE-BASE", ready to serve as a "COMPARATOR".

And as each new bio-data is scanned, it will throw-up words for which there is no “Clue”. Each such new word will have to be manually “categorised” and added to the knowledge-base.

Then what is the advantage of calculating for – each word

-          Each prefix

-          Each suffix

-          Each phrase

-          Each sentence

Its probability of occurrence ?

The advantages are :

 

# 1

Detect “unlikely” prefix/suffix

Suppose ARDIS detects

“ Manor Achievement”

ARDIS detects that the probability of

-          "Manor" as prefix to "Achievement" is NIL

-          "Minor" as prefix is 0.00009 (say, nil)

Hence the correct prefix has to be

-          "Major" (and not "Manor"), for which the probability is, say, 0.4056.
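
A minimal sketch (in Python) of this check; the probability threshold and the crude spelling-similarity measure are illustrative assumptions, and the prefix probabilities are taken to be the ones ARDIS has already accumulated for that key-word:

def correct_prefix(observed, prefix_probs, threshold=0.001):
    # prefix_probs: e.g. {"major": 0.4056, "minor": 0.00009} for "ACHIEVEMENT".
    if prefix_probs.get(observed, 0.0) >= threshold:
        return observed                          # plausible as typed
    def similarity(a, b):                        # crude character-by-character overlap
        return sum(x == y for x, y in zip(a, b)) - abs(len(a) - len(b))
    likely = [w for w, p in prefix_probs.items() if p >= threshold]
    if not likely:
        return observed
    return max(likely, key=lambda w: similarity(observed, w))

# e.g. correct_prefix("manor", {"major": 0.4056, "minor": 0.00009}) returns "major"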

 

# 2

ARDIS detects

 

Mr. HANOVAR

It recognises this as a spelling mistake and corrects automatically to

Mr. HONAVAR

OR

        It reads.

Place of Birth: KOLHAPURE

It recognises it as “KOLHAPUR”

Or vice-versa, if it reads: "my name is KOLHAPUR".
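
The note does not spell out the mechanism for this; one plausible sketch (in Python) snaps a misspelt name or place to its nearest entry in the knowledge-base of already-categorised Person/Location words, with the edit-similarity cutoff being an illustrative assumption:

from difflib import get_close_matches

def correct_proper_noun(word, known_names, cutoff=0.6):
    # known_names: upper-case names/places taken from the knowledge-base.
    match = get_close_matches(word.upper(), known_names, n=1, cutoff=cutoff)
    return match[0] if match else word

# e.g. correct_proper_noun("HANOVAR", ["HONAVAR", "KOLHAPUR"]) returns "HONAVAR"
#      correct_proper_noun("KOLHAPURE", ["HONAVAR", "KOLHAPUR"]) returns "KOLHAPUR"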

 

# 3

Today, while scanning (using OCR), when a mistake is detected it gets highlighted on the screen, or an asterisk/underline starts blinking.

This draws the attention of the operator, who manually corrects the "mistake" after consulting a dictionary or his own knowledge-base.

Once ARDIS has calculated the probabilities of lakhs of words, and even the probabilities of their most likely sequences of occurrence, then hopefully the OCR can self-correct any word or phrase without operator intervention.

So the scanning accuracy of OCR should eventually become 100%, and not 75%-85% as at present.

 

# 4

Eventually, we want that

-          A bio-data is scanned

 

 


And automatically

-          reconstitutes itself into our converted BIO DATA FORMAT.

 

This is the concept of ARGIS (automatic resume generating intelligence software)

Here again the idea is to eliminate the manual data-entry of the entire bio-data – our ultimate goal.

But ARGIS is not possible without first installing ARDIS and that too with the calculation of the “probability of occurrence” as the main feature of the software.

By studying & memorising & calculating the “probabilities of occurrences” of Lakhs of words/phrases/sentences, ARDIS actually learns English grammar thru “frequency of usage”.

And it is this KNOWLEDGE-BASE which enables ARGIS to reconstitute a bio-data (in our format) in a GRAMMATICALLY CORRECT WAY.

 

 

 

 

 

 

 

                                                                                                                                                                                1/12/1996

 

 

 

 

 

                                                                                                                                                                                24/11/1996

BASIS FOR

A

WORD RECOGNITION SOFTWARE

 

Any given word (a cluster of characters) can be classified (in English) into one of the following “Categories”:

 

 

 


                                         

-          Verb

-          Adverb

-          Preposition

-          Adjective

-          Noun

                    -  Common Noun

                    -  Proper Noun

So the first task is to create a "directory" for each of these categories. Then each "word" must be compared with the words contained in a given directory. If a match occurs, then that WORD gets categorised as belonging to that category. The process has to be repeated again and again, trying to match the word with the words contained in each of the categories, TILL a match is found.

If no “match” is found, that word should be separately stored in a file marked

“UNMATCHED WORDS”.

Every day, an expert would study all the words contained in this file and assign each of these words a definite category, using his "HUMAN INTELLIGENCE".

In this way, over a period of time, human intelligence will identify/categorise each and every word contained in the ENGLISH LANGUAGE. This will be the process of transferring human intelligence to the computer.
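
A minimal sketch (in Python) of this loop; how the directories are stored, and the name of the unmatched-words file, are illustrative assumptions:

def categorise_word(word, directories, unmatched_path="unmatched_words.txt"):
    # directories: category name -> set of known words, e.g.
    # {"Verb": {...}, "Adverb": {...}, "Proper Noun": {...}, ...}
    w = word.lower()
    for category, known in directories.items():
        if w in known:
            return category                      # match found
    # no match in any directory: park it for the expert to classify later
    with open(unmatched_path, "a", encoding="utf-8") as f:
        f.write(word + "\n")
    return None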

 

 

Essentially the trick lies in getting the computer (Software) to MIMIC the process followed by a human brain while scanning a set of words (i.e. reading) and by analysing the “Sequence” in which these words are arranged, to assign a MEANING to each word or a string of words (a phrase or a sentence).

I cannot believe that no one has attempted this before (especially since it has so much commercial value). But we don't know who has developed such software or where to find it, so we must end up rediscovering the wheel!

Our computer-files contain some 900,000 words which have repeatedly occurred in our records, mostly converted bio-datas or words captured from bio-datas.

We have in our files, some 3500 converted bio-datas. It has taken us about 6 years to accomplish this feat i.e.

-          approx. 600 converted biodatas/year

 

OR

-          approx. 2 biodatas converted every working day !

Assuming that all those (converted) bio-datas which are older than 2 years are OBSOLETE, this means that perhaps no more than 1200 are current/valid/useful !

So, one thing becomes clear.

The “rate of obsolescence” is faster than the “rate of conversion” !

Of course, we can argue,

“ Why should we waste/spend our time in “converting” a bio-data? All we need to do is to capture the ESSENTIAL/MINIMUM DATA (from each biodata) which would qualify that person to get searched/spotted. If he gets short-listed, we can always, at that point of time, spend time/effort to fully convert his bio-data”.

In fact, this is what we have done so far, because there was a premium on the time of the data-entry operators. That time was best utilised in capturing the essential/minimum data.

But if the latest technology permits/enables us to convert 200 biodatas each day (instead of just 2) with the same effort/time/cost, then why not convert 200? Why be satisfied with just 2/day?

If this can be made to "happen", we would be in a position to send out / fax out / e-mail converted bio-datas to our clients in a matter of "minutes" instead of the "days" it takes today!

That is not all.

A converted bio-data has far more KEYWORDS (knowledge/skills/attributes/attitudes etc.) than the MINIMUM DATA. So there is an improved chance of spotting the RIGHT MAN, using a QUERY which contains a large no. of KEYWORDS.

So, today, if the client "likes" only ONE converted bio-data out of TEN sent to him (a huge waste of everybody's time/effort), then under the new situation he should be able to "like" 4 out of every 5 converted bio-datas sent to him!

This would vastly improve the chance of at least ONE executive getting appointed in each assignment. This should be our goal.
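
A minimal sketch (in Python) of why a richer keyword set improves the hit-rate: score every candidate by how many of the query's keywords his converted bio-data contains. The data shapes and the scoring rule are illustrative assumptions:

def rank_candidates(query_keywords, candidates):
    # candidates: PEN -> set of keywords extracted from the converted bio-data.
    query = {k.lower() for k in query_keywords}
    scored = []
    for pen, keywords in candidates.items():
        hits = query & {k.lower() for k in keywords}
        coverage = len(hits) / len(query) if query else 0.0
        scored.append((coverage, pen))
    # highest keyword coverage first
    return [pen for coverage, pen in sorted(scored, reverse=True)]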

 

 

This goal could be achieved only if,

Step

# 1 – each biodata received every day is "Scanned" on the same day

# 2 – Converted to TEXT (ASCII)

# 3 – PEN given serially

# 4 – WORD-RECOGNISED (a step beyond OCR – optical CHARACTER recognition)

# 5 – Each word "categorised", indexed and stored in the appropriate FIELDS of the DATABASE

# 6 – Database "reconstituted" to create a "converted" biodata as per our standard format

 

Steps # 1/2/3 are not difficult

Step   #4 is difficult

           # 5 is  more difficult

           # 6 is most difficult

But if we keep working on this problem, it can be solved

50% accurate in 3 months

70%  accurate in 6 months

90% accurate in 12 months

Even though there are about 900,000 indexed WORDS in our ISYS file, all of these do not occur (in a biodata/record) with the same frequency. Some occur far more frequently, some frequently, some regularly, some occasionally and some rarely.

Then of course (in the English language) there must be thousands of other words, which have not occurred EVEN ONCE in any of the biodatas.

Therefore, we won’t find them amongst the existing indexed file of 9000,000 words.

It is quite possible that some of these (so far missing words) may occur if this file (of words) were to grow to 2 million.

As this file of words grows and grows, the probabilities of

-          a word having been left out

      and

-          such a left-out word being likely to occur (in the next biodata), are "decreasing".

 

 

 

 

The frequency-distribution curve might look as follows:

 

 

 


  

 

 

 

 


[Sketch of the frequency-distribution curve; x-axis: % of words in the English language (or in the ISYS index of 900,000 words), marked 10 to 100.]

 

Meaning

Some 20% of the words (in the English language) make up maybe 90% of all the "occurrences".

This would become clear when we plot the frequency distribution-curve of the 900,000 words which we have already indexed.

And even when this population grows to 2 million, the shape (the nature) of the frequency-distribution curve is NOT likely to change! Only with a much larger WORD-POPULATION will the "accuracy" marginally increase.

So our Search is to find,

Which are these 20% (20% x 9 lakh = 180,000) words which make up 90% of the "area under the curve", i.e. of the POPULATION?

Then we focus our efforts on "categorising" these 180,000 words in the first place.

If we manage to do this, 90% of our battle is won.
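
A minimal sketch (in Python) of how that 20% could be identified from the index; the variable 'isys_counts', a word-to-occurrence-count dict built from the ISYS index, is a hypothetical input:

def words_covering(counts, target=0.90):
    # Walk the words from most to least frequent until their occurrences
    # add up to the target share (here 90%) of all occurrences.
    total = sum(counts.values())
    chosen, covered = [], 0
    for word, c in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(word)
        covered += c
        if total and covered / total >= target:
            break
    return chosen

# e.g. len(words_covering(isys_counts)) would show how many of the 900,000
# indexed words carry 90% of all occurrences.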

Of course this pre-supposes that before we can attempt “categorization”, we must be able to recognise each of them as a “WORD”.
